Question: I flip a fair coin 100 times. What is the probability distribution of the number of heads that I get.
import numpy as np
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook
# M.L. modification of matplotlib defaults
# Changes can also be put in matplotlibrc file,
# or effected using mpl.rcParams[]
mpl.style.use('classic')
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts supblot parameters for new size
I am going to generate 100 random 0's and 1's, and let 1 represent a "heads" and 0 represent a "tails."
n_flips = 100 # Number of coin flips
data = stats.randint.rvs(0, 2, size=n_flips)
print(data)
I can count the "heads" by summing the array data
.
n_heads = np.sum(data)
print(n_heads)
I store the number of heads in each trial in the array results
.
n_expts = 200 # umber of experiments to simulate
results = np.zeros(n_expts) # Create array for results of "experiments"
for i in range(n_expts):
results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips)) #Note that each experiment consists of 100 coin flips
Choose bin boundaries so that each bin includes a single integer, e.g., $[47.5, 48.5]$, $[48.5,49.5]$, $[49.5, 50.5]$, etc.
low = 35.5
high = 64.5
nbins = int(high-low)
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);
n_expts = 10000
results = np.zeros(n_expts) # Create array for results of "experiments"
for i in range(n_expts):
results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips))
low = 35.5
high = 64.5
nbins = int(high - low)
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);
Following Hughes and Hase statement of the Central Limit Theorem at the top of p.33, we should look at the distribution of the sample mean:
$$ \langle x\rangle = \frac{1}{N}(x_1 + x_2 + \cdots + x_N). $$It's the distribution of the sample mean that approaches the normal distribution.
The individual values $x_i$ are sampled from a discrete distribution with two values: $x_i = 0$ with a probability of 0.5, and $x_i = 1$ with a probability 0.5. The mean of this parent distribution for the $x_i$'s is $\langle x\rangle = 0.5$, and the standard deviation of this parent distribution is $\sigma = 0.5$.
The histogram below is the same as the histogram immediately above, except that the results for the number of heads in each trial have been divided by $N$, the number of coin-flips in a trial (in this example 100).
The Central Limit Theorem states that total number of heads divided by the number of flips (100) should tend toward a normal distribution with a mean of 0.5 and a standard deviation of $\sigma_\text{parent}/\sqrt{100}$. Let's check.
low = 35.5/n_flips
high = 64.5/n_flips
# Don't change number of bins from previous histogram
plt.figure()
plt.xlabel("fraction of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high])
plt.xlim(low,high);
print("sample mean =", np.mean(results/n_flips), ", sample std =", np.std(results/n_flips))
sigma_parent = 1/2
print("predicted std =", sigma_parent/np.sqrt(n_flips) )
Use the CDF of the normal distribution and the bin boundaries to determine the expected occurrences for a normal distribution:
\begin{eqnarray*} P(x_1<x<x_2) &=& \int_{x_1}^{x_2} P_{DF}(x)\, dx \\ &=& \int_{-\infty}^{x_2} P_{DF}(x)\, dx - \int_{-\infty}^{x_1} P_{DF}(x)\, dx \\ &=& C_{DF}(x_2) - C_{DF}(x_1) \end{eqnarray*}Note: Information about the occurrences and bin boundaries is part of the output of the plt.hist()
function. (I gave this output the name h_out
in the plotting cell above.) h_out[0]
is an array with the counts in each bin, and h_out[1]
is an array with the bin boundaries.
h_out
low = 35.5/n_flips
high = 64.5/n_flips
# use nbins previously defined
plt.figure()
plt.xlabel("(# of heads)/100")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high], alpha = 0.5)
plt.xlim(low,high);
x = np.zeros(len(h_out[1])-1) # Create empty array for mid-points of histogram bins
y = np.zeros(len(h_out[1])-1) # Create empty array for expected occurences from normal dist.
for i in range(len(x)):
x[i] = (h_out[1][i+1] + h_out[1][i])/2 # fill x with center-of-bin values
y[i] = n_expts*(stats.norm.cdf(h_out[1][i+1],0.5,0.5/np.sqrt(n_flips))\
- stats.norm.cdf(h_out[1][i],0.5,0.5/np.sqrt(n_flips))) # fill y with probabilities
# based on gaussian pdf
plt.scatter(x, y, c = 'red');
In a future class we will talk about how to make quantitative assessments of confidence about whether a sample of random variables comes from a given distribution.
version_information
is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation.
version_information
is installed on the linux network at Bucknell
%load_ext version_information
version_information numpy, scipy, matplotlib