1. Go through the data sets and organize them in a frequency/relative frequency table. Check that your table satisfies the conditions of a probability distribution. (Round to 3 decimals 0.xxx)

Data Set 1

| Leading Digit | Frequency | Proportion | Percent (%) | Cumulative Freq. |
| --- | --- | --- | --- | --- |
| 1 | 28 | 0.280 | 28 | 28 |
| 2 | 23 | 0.230 | 23 | 51 |
| 3 | 10 | 0.100 | 10 | 61 |
| 4 | 9 | 0.090 | 9 | 70 |
| 5 | 9 | 0.090 | 9 | 79 |
| 6 | 7 | 0.070 | 7 | 86 |
| 7 | 8 | 0.080 | 8 | 94 |
| 8 | 4 | 0.040 | 4 | 98 |
| 9 | 2 | 0.020 | 2 | 100 |
| **Total** | 100 | 1.000 | 100 | — |

*(Graph of the leading-digit distribution for Data Set 1.)*

Data Set 2

| Leading Digit | Frequency | Proportion | Percent (%) | Cumulative Freq. |
| --- | --- | --- | --- | --- |
| 1 | 12 | 0.120 | 12 | 12 |
| 2 | 10 | 0.100 | 10 | 22 |
| 3 | 10 | 0.100 | 10 | 32 |
| 4 | 7 | 0.070 | 7 | 39 |
| 5 | 16 | 0.160 | 16 | 55 |
| 6 | 19 | 0.190 | 19 | 74 |
| 7 | 12 | 0.120 | 12 | 86 |
| 8 | 10 | 0.100 | 10 | 96 |
| 9 | 4 | 0.040 | 4 | 100 |
| **Total** | 100 | 1.000 | 100 | — |

*(Graph of the leading-digit distribution for Data Set 2.)*
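As a quick sketch of how such a table could be built programmatically (the list name `data_set_1` is hypothetical; the raw dollar amounts are not included here), the leading digits can be tallied with `collections.Counter`:

```python
from collections import Counter

def leading_digit_table(values):
    """Print frequency, proportion, percent, and cumulative frequency
    of the leading (first non-zero) digit of each positive value."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values]
    counts = Counter(digits)
    n = len(digits)
    cumulative = 0
    for d in range(1, 10):
        freq = counts.get(d, 0)
        cumulative += freq
        print(f"{d}\t{freq}\t{freq / n:.3f}\t{100 * freq / n:.0f}\t{cumulative}")

# Hypothetical usage once the raw amounts are loaded into a list:
# leading_digit_table(data_set_1)
```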

3. Determine the mean and standard deviation of each data set’s probability distribution. (Round to 3 decimal places)

| Dataset | Mean ($\overline{x}$) | Standard Deviation ($\sigma_x$) |
| --- | --- | --- |
| Data Set 1 | $3.330$ | $2.307$ |
| Data Set 2 | $4.840$ | $2.340$ |
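These values can be reproduced from the frequency tables above by treating each relative frequency as a probability, i.e. $\mu = \sum d \cdot p(d)$ and $\sigma = \sqrt{\sum (d - \mu)^2\, p(d)}$. A minimal sketch in Python:

```python
import math

def mean_sd(freqs):
    """Mean and standard deviation of a leading-digit distribution,
    where freqs maps digit -> number of returns with that leading digit."""
    n = sum(freqs.values())
    mean = sum(d * f for d, f in freqs.items()) / n
    var = sum((d - mean) ** 2 * f for d, f in freqs.items()) / n
    return round(mean, 3), round(math.sqrt(var), 3)

data_set_1 = {1: 28, 2: 23, 3: 10, 4: 9, 5: 9, 6: 7, 7: 8, 8: 4, 9: 2}
data_set_2 = {1: 12, 2: 10, 3: 10, 4: 7, 5: 16, 6: 19, 7: 12, 8: 10, 9: 4}

print(mean_sd(data_set_1))  # (3.33, 2.307)
print(mean_sd(data_set_2))  # (4.84, 2.34)
```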
Discuss/Explain:

Compare and contrast the graphs and the descriptive statistics to Benford’s Law and determine whether either set of these tax returns is suspected to be fraudulent. Describe shape, center, spread, and any unusual features.

The first data set has an obvious right skew, while the second has more of a left-skewed / bell-shaped distribution. The first set clearly follows Benford's Law; the second set has noticeably higher frequencies for the digits 5 and 6 than Benford's Law predicts, which could be a sign that something is suspect.

4. Discuss/Explain:

How would you convince someone, like a judge or jury, that the tax returns may be fraudulent based on your data analysis? Take into consideration values that fall above/below 1.0 standard deviation from the mean as being unusual.

I think the best way to explain Benford's Law is that when the leading digits show a left-skewed or bell-shaped distribution, it may be a sign that something is amiss. It is better to think of Benford's Law as a tool for spotting something unusual rather than as proof of fraud, and the data set needs to be reasonably large. To convince someone, I would probably explain why Benford's Law produces the distribution it does.
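For reference, Benford's Law predicts that a leading digit $d$ occurs with probability

$$P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9,$$

which works out to about 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, and 4.6% for digits 1 through 9. Showing a judge or jury this expected decay next to the observed bar chart makes the comparison concrete.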

5. Suppose that Charlie (from NUMB3RS TV series) has data sets submitted by three different people, and he suspects that one person has fabricated his data.

After counting the occurrences of leading digits in each number of each person’s data set, Charlie records the results in the table below.

| Leading Digit | Mark | Mary | John |
| --- | --- | --- | --- |
| 1 | 452 | 390 | 320 |
| 2 | 264 | 230 | 192 |
| 3 | 185 | 164 | 176 |
| 4 | 147 | 120 | 160 |
| 5 | 117 | 103 | 120 |
| 6 | 102 | 86 | 102 |
| 7 | 87 | 75 | 100 |
| 8 | 78 | 66 | 98 |
| 9 | 68 | 60 | 92 |

Use Benford’s Law and supporting statistical graphs and numerical summaries to determine which set of data is most likely to have been fabricated. Discuss/Explain your answer. (You may use the ArtOfStats website for creating graphs and numerical summaries.)

Using Benford's Law alone, John's data set has a leading-digit-1 proportion of 23.53%, whereas Mark's and Mary's are both roughly 30%. Benford's Law states that the digit 1 should appear as the leading digit roughly 30% of the time, so I would suspect John of tampering with his data.
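A rough sketch of how this comparison could be quantified (the digit counts are taken from the table above; scoring by the largest gap from the Benford proportions is just one simple choice):

```python
import math

counts = {
    "Mark": [452, 264, 185, 147, 117, 102, 87, 78, 68],
    "Mary": [390, 230, 164, 120, 103, 86, 75, 66, 60],
    "John": [320, 192, 176, 160, 120, 102, 100, 98, 92],
}
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

for name, c in counts.items():
    total = sum(c)
    observed = [f / total for f in c]
    # largest absolute gap between an observed proportion and Benford's prediction
    worst = max(abs(o - b) for o, b in zip(observed, benford))
    print(f"{name}: digit-1 share = {observed[0]:.1%}, "
          f"largest gap from Benford = {worst:.3f}")
```

Mark's and Mary's proportions track the Benford percentages closely, while John's digit-1 share of about 23.5% (and his relatively heavy counts for digits 7 through 9) produces by far the largest gap, consistent with the conclusion above.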


Topic 2: Assessing Normality

The validity of any sound statistical analysis is predicated on the relevant dataset meeting certain criteria. These prerequisites ensure that different features of the data, including but not limited to sample size, independence, and shape, are suitable for the statistical tests being applied. In other words, the types of analyses we can perform depend largely on the properties of the data. Of these qualities, perhaps the most consequential is the distribution pattern of the data. Ideally, the data will approximately follow a Normal distribution: a bell-shaped curve centered at the mean with nearly all values falling within three standard deviations of the center.

There are a variety of formal and informal methods for assessing the normality of a data set. Formal tests for normality generate a numerical summary of the dataset, compare this summary to the expected summaries for a normally distributed version, and identify regions containing significant deviations. Formal tests require more careful data analysis.

As a result, informal assessment methods, such as Q-Q plots, have become an increasingly popular and accessible way to evaluate normality. These visualizations are created by plotting the theoretical quantiles against the sample quantiles. Simply put, a standard Q-Q plot with a tight line of points at roughly a 45-degree angle suggests that the quantiles of the sample data match the quantiles of the theoretical normal distribution, and thus that the data is approximately normal. Q-Q plots also make it easy to see where a dataset deviates from the theoretical normal quantiles. The primary drawback of these tools is their subjective nature: as with most informal methods, the viewer must make the final judgement about the distribution of the data.
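A minimal sketch of building such a plot in Python (assuming SciPy and Matplotlib are available; the `sample` array here is synthetic stand-in data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic placeholder sample; replace with the data being assessed.
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.7, scale=0.9, size=50)

# probplot pairs the ordered sample values with theoretical normal quantiles;
# points hugging the reference line suggest approximate normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```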

Prompt: How Can We Assess Normality?

Questions:

  1. Clearly describe/explain at least three characteristics of all normally distributed data to a person who has little knowledge of normal distributions (bell curve).

    1. The graph tails off towards the horizontal axis on the left and right.
    2. The area under the curve to the left of the mean equals the area under the curve to the right of the mean; each is 0.5.
    3. The total area under the curve is equal to one.
  2. Is any data perfectly normal? Discuss/Explain why/why not.

    1. I think real data can come close to a normal distribution, but perfectly normal data is probably very rare in nature. Synthetic data can presumably be made perfectly normal, but I don't know that for certain.
  3. Given a set of data and your answer to questions 1 & 2 above, describe/explain at least 2 methods that you have learned in this class that you could use to determine if the data is approximately normal.

    1. I'd use the graphing tool provided for the class to look at the shape of the distribution.
    2. I'd use a boxplot to check for outliers and to see whether the quartiles are roughly symmetric.
  4. Assessing Normality of the United States Unemployment Rates: Use the data from the US Bureau of Labor Statistics, https://www.bls.gov/web/laus/laumstrk.htm#laumstrk.f.p which shows the most current unemployment rates in all 50 states and the District of Columbia (DC). Does the data appear to be approximately normal? Let’s find out.

    Relative frequency table: (Round to 1 decimal percent value, xx.x%)

| Class | Frequency | Relative Frequency |
| --- | --- | --- |
| 2.0 – 2.4 | 3 | 6.0% |
| 2.5 – 2.9 | 10 | 20.0% |
| 3.0 – 3.4 | 12 | 24.0% |
| 3.5 – 3.9 | 6 | 12.0% |
| 4.0 – 4.4 | 9 | 18.0% |
| 4.5 – 4.9 | 7 | 14.0% |
| 5.0 – 5.4 | 2 | 4.0% |
| 5.5 – 5.9 | 1 | 2.0% |
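A small sketch of how the binning could be done (the list `bls_rates` is a placeholder name for the state unemployment rates copied from the BLS page):

```python
import numpy as np

def relative_frequency_table(rates, start=2.0, width=0.5, n_classes=8):
    """Bin the rates into classes of the given width and print
    each class's frequency and relative frequency."""
    edges = [start + width * i for i in range(n_classes + 1)]
    counts, _ = np.histogram(rates, bins=edges)
    n = len(rates)
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:.1f} – {hi - 0.1:.1f}: {c}  ({c / n:.1%})")

# Hypothetical usage:
# relative_frequency_table(bls_rates)
```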

b. Graphical representations: *(histogram and boxplot of the unemployment rates)*

Histogram

  • The histogram looks roughly bimodal. There are two peaks, so the data is not normally distributed in the sense that its shape isn't a bell curve.

Boxplot

  • The boxplot doesn't appear to show any outliers, which is good news; however, the data is slightly right-skewed. The upper (third-quartile) portion of the box is noticeably wide.

c. Does the data follow the Empirical Rule? Using the calculated mean and standard deviation, complete the table values below and analyze your results against what you would expect from the Empirical Rule. Round to 3 decimal place value for all calculations.

No, it does not. The data is not bell-shaped, although it is roughly symmetric.

  • $\text{Mean} (\mu) =3.69$
  • $\text{Median(med)} = 3.5$
  • $\text{Std Dev}(\sigma)= 0.88$
| Interval | Low Value | High Value | Frequency of data | Percent of data | Empirical Rule |
| --- | --- | --- | --- | --- | --- |
| $\mu \pm 1\sigma$ | 2.810 | 4.570 | not sure what this is | 68.2% | no |
| $\mu \pm 2\sigma$ | 1.930 | 5.450 | not sure what this is | 95.4% | no |
| $\mu \pm 3\sigma$ | 1.050 | 6.330 | not sure what this is | 99.7% | no |
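The "Frequency of data" and "Percent of data" columns are meant to count how many of the observed rates actually fall inside each interval; with the raw rates in hand, a sketch like the following could fill them in (again, `bls_rates` is a placeholder name):

```python
import numpy as np

def empirical_rule_check(values):
    """Compare the share of data within 1, 2, and 3 standard deviations
    of the mean against the 68 / 95 / 99.7 benchmarks."""
    x = np.asarray(values)
    mu, sigma = x.mean(), x.std()  # population SD; use x.std(ddof=1) for the sample SD
    for k, expected in zip((1, 2, 3), (68.0, 95.0, 99.7)):
        lo, hi = mu - k * sigma, mu + k * sigma
        inside = np.sum((x >= lo) & (x <= hi))
        print(f"mu ± {k}σ: [{lo:.3f}, {hi:.3f}]  "
              f"{inside} of {x.size} values ({100 * inside / x.size:.1f}%)  "
              f"expected ≈ {expected}%")

# Hypothetical usage:
# empirical_rule_check(bls_rates)
```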

Comment on how closely this data matches the Empirical Rule.

d. Construct a Normal Probability Plot (Q-Q plot) for the Unemployment Rate data.

*(Normal Q-Q plot of the unemployment rate data.)*

e. Measuring skewness: The skewness of a distribution can also be calculated using the formula:

$S = \frac{3(\overline{x} - \text{median})}{\sigma}$
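Plugging in the mean, median, and standard deviation from part (c):

$$S = \frac{3(3.69 - 3.5)}{0.88} = \frac{0.57}{0.88} \approx 0.648$$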

$\text{Skew} = 0.648$

  1. Discuss/Explain: Do you think that a large set of data is more likely to be normal than a small set of data? For example, if we examined the unemployment rate of ten states and compared it to the unemployment rate for the United States, would we get a different distribution?

A small data set has higher odds of having its distribution distorted by an outlier; as the size of a data set increases, the impact of any single outlier shrinks.

So chances are that yes, the unemployment rates of only ten states would give a distribution different from the distribution for the entire country.
