This notebook is a copy of the posted solution for the second in-class exercise on the coin-flipping simulation, with four additional lines of code in the final cell.
The goal is to test whether the distribution of the number of heads obtained in repeated experiments of flipping a fair coin $N$ times (here $N = 20$) is a normal distribution.
import numpy as np
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
# Following is an IPython magic command that puts figures in the notebook.
%matplotlib notebook
# M.L. modification of matplotlib defaults
# Changes can also be put in matplotlibrc file,
# or effected using mpl.rcParams[]
plt.style.use('classic')
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts subplot params for new size
I am going to generate 20 random 0's and 1's, and let 1 represent a "heads" and 0 represent a "tails."
n_flips = 20 # Number of coin flips
data = stats.randint.rvs(0, 2, size=n_flips)
print(data)
I can count the "heads" by summing the array `data`.
n_heads = np.sum(data)
print(n_heads)
I store the number of heads in each trial in the array `results`.
n_expts = 200
results = np.zeros(n_expts) # Create array in which to store results of experiments
for i in range(n_expts):
    results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips))
Choose bin boundaries so that each bin includes a single integer, e.g., $[0.5, 1.5]$, $[1.5, 2.5]$, $[2.5, 3.5]$, etc.
low = 0.5
high = 19.5
nbins = int(high-low)
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);
n_expts = 10000
results = np.zeros(n_expts) # Create array in which to store results of experiments
for i in range(n_expts):
    results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips))
low = 0.5
high = 19.5
nbins = int(high - low)
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);
Following Hughes and Hase statement of the Central Limit Theorem at the top of p.33, we should look at the distribution of the sample mean:
$$ \langle x\rangle = \frac{1}{N}(x_1 + x_2 + \cdots + x_N). $$
It's the distribution of the sample mean that approaches the normal distribution.
The individual values $x_i$ are sampled from a discrete distribution with two values: $x_i = 0$ with a probability of 0.5, and $x_i = 1$ with a probability of 0.5. The mean of this parent distribution for the $x_i$'s is $\mu = 0.5$, and the standard deviation of this parent distribution is $\sigma = 0.5$.
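As a quick sanity check (not part of the original notebook), the parent distribution of a single fair-coin flip is a Bernoulli distribution with $p = 0.5$, and `scipy.stats.bernoulli` reproduces the mean and standard deviation quoted above:

```python
import numpy as np
from scipy import stats

# Parent distribution of one flip: Bernoulli with p = 0.5
p = 0.5
mean, var = stats.bernoulli.stats(p, moments='mv')
print("parent mean =", float(mean))          # 0.5
print("parent std  =", float(np.sqrt(var)))  # 0.5
```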
The histogram below is the same as the histogram immediately above, except that the results for the number of heads in each trial have been divided by $N$, the number of coin flips in a trial (in this example 20).
The Central Limit Theorem states that the total number of heads divided by the number of flips should tend toward a normal distribution with a mean of 0.5 and a standard deviation of $\sigma_\text{parent}/\sqrt{N}$ (here $N = 20$). Let's check.
low = 0.5/n_flips
high = 19.5/n_flips
# Don't change number of bins from previous histogram
plt.figure()
plt.xlabel("fraction of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high])
plt.xlim(low,high);
print("sample mean =", np.mean(results/n_flips), ", sample std =", np.std(results/n_flips))
sigma_parent = 1/2
print("predicted std =", sigma_parent/np.sqrt(n_flips) )
Use the CDF of the normal distribution and the bin boundaries to determine the expected occurrences for a normal distribution:
\begin{eqnarray*} P(x_1<x<x_2) &=& \int_{x_1}^{x_2} P_{DF}(x)\, dx \\ &=& \int_{-\infty}^{x_2} P_{DF}(x)\, dx - \int_{-\infty}^{x_1} P_{DF}(x)\, dx \\ &=& C_{DF}(x_2) - C_{DF}(x_1) \end{eqnarray*}
Note: Information about the occurrences and bin boundaries is part of the output of the `plt.hist()` function. (I gave this output the name `h_out` in the plotting cell above.) `h_out[0]` is an array with the counts in each bin, and `h_out[1]` is an array with the bin boundaries.
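To illustrate the CDF-difference formula on its own (with assumed values, not the notebook's bins): the probability that a standard normal variable lands between $x_1 = -1$ and $x_2 = 1$ is $C_{DF}(x_2) - C_{DF}(x_1)$, the familiar "68%" of the one-sigma interval:

```python
from scipy import stats

# P(x1 < x < x2) = CDF(x2) - CDF(x1) for a standard normal distribution
x1, x2 = -1.0, 1.0
p = stats.norm.cdf(x2) - stats.norm.cdf(x1)
print(p)  # approximately 0.6827
```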
h_out
low = 0.5/n_flips
high = 19.5/n_flips
# use nbins previously defined
plt.figure()
plt.xlabel("(# of heads)/100")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high], alpha = 0.5)
plt.xlim(low,high);
x = np.zeros(len(h_out[1])-1) # Create array for mid-points of histogram bins
y = np.zeros(len(h_out[1])-1) # Create array for expected occurrences from normal dist.
for i in range(len(x)):
    x[i] = (h_out[1][i+1] + h_out[1][i])/2
    y[i] = n_expts*(stats.norm.cdf(h_out[1][i+1], 0.5, 0.5/np.sqrt(n_flips))
                    - stats.norm.cdf(h_out[1][i], 0.5, 0.5/np.sqrt(n_flips)))
plt.scatter(x, y, c = 'red');
In a future class we will talk about how to make quantitative assessments of confidence about whether a sample of random variables comes from a given distribution.
We have to calculate the $\chi^2$ statistic:
$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, $$
where $E_i$ is the expected number of occurrences in some range of values, and $O_i$ is the observed number.
The observed values are part of the output of the `plt.hist` function, and the expected values are the elements of the array `y` that were calculated for plotting along with the histogram.
Note that there are no constraints, because no parameters were calculated from the data; the null hypothesis of a fair coin is enough information to determine everything about the expected distribution.
o = h_out[0] # Observed values
e = y # Expected values
dof = len(o) # Number of degrees of freedom
chisq = np.sum((e-o)**2/e)
print('chisq = ', chisq)
print('reduced chisq =', chisq/len(o))
print('probability =', 1 - stats.chi2.cdf(chisq, dof))
The reduced $\chi^2$ is about 1, and the probability of getting a value of $\chi^2$ this large is about 60%.
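As a cross-check (a sketch with made-up counts, not the notebook's data), `scipy.stats.chisquare` computes the same statistic as the manual sum above. One caveat: its p-value assumes $k-1$ degrees of freedom by default (it treats the total count as constrained), whereas this notebook used all $k$ bins as degrees of freedom, so the p-values will differ slightly:

```python
import numpy as np
from scipy import stats

# Hypothetical observed and expected counts (sums must agree for chisquare)
obs = np.array([18.0, 22.0, 30.0, 30.0])
exp = np.array([20.0, 20.0, 30.0, 30.0])

chisq, pval = stats.chisquare(obs, exp)
print("chisq =", chisq)  # 0.4 = (18-20)**2/20 + (22-20)**2/20
print("p-value =", pval)
```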
`version_information` is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation. `version_information` is installed on the linux network at Bucknell.
%load_ext version_information
version_information numpy, scipy, matplotlib