Using the $\chi^2$ test of a distribution: Is this a fair die?

In this notebook I provide two data sets from the roll of a nominally standard 6-sided die. The question is: Is this a fair die, or not? The data is an array of integers chosen from the set {0,1,2,3,4,5}. (We could make this set be {1,2,3,4,5,6} to correspond to the numbers on a die, but with python it's easier to start at 0.)

Use the procedure outlined at top of p.112 in Hughes and Hase, 2and calculate the $\chi^2$ statistic:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

where $O_i$ represents the number of occurrences of something in the sample data, and $E_i$ gives the expected number of occurrences based on the null hypothesis.

Marty Ligare, August 2020

In [1]:
import numpy as np
from scipy import stats
from scipy import optimize

import matplotlib as mpl
import matplotlib.pyplot as plt
In [2]:
# Following is an Ipython magic command that puts figures in the  notebook.
%matplotlib notebook
        
# M.L. modifications of matplotlib defaults
# Changes can also be put in matplotlibrc file, 
# or effected using mpl.rcParams[]
mpl.style.use('classic') 
plt.rc('figure', figsize = (6, 4.5))      # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True)       # Adjusts supblot parameters for new size

Data set #1: 1000 rolls

In [3]:
data = np.loadtxt('die_1000.dat', dtype=int)
data
Out[3]:
array([4, 5, 1, 5, 1, 0, 3, 3, 5, 2, 5, 3, 5, 4, 2, 4, 0, 5, 0, 0, 3, 5,
       3, 1, 3, 1, 1, 0, 4, 2, 4, 5, 5, 3, 3, 0, 2, 1, 1, 0, 0, 5, 1, 1,
       1, 0, 5, 4, 4, 4, 3, 0, 3, 3, 5, 4, 5, 1, 5, 5, 0, 5, 1, 4, 1, 5,
       1, 4, 0, 1, 5, 5, 4, 1, 3, 2, 0, 4, 2, 2, 1, 4, 4, 4, 5, 0, 0, 4,
       4, 3, 1, 4, 1, 1, 2, 3, 2, 1, 5, 5, 3, 1, 3, 2, 0, 3, 3, 5, 3, 4,
       3, 5, 3, 5, 3, 3, 3, 2, 3, 0, 5, 3, 0, 3, 1, 2, 2, 4, 0, 1, 4, 5,
       2, 4, 2, 0, 3, 1, 0, 4, 5, 1, 5, 0, 2, 0, 2, 5, 0, 1, 4, 3, 4, 4,
       0, 5, 3, 4, 3, 5, 0, 3, 5, 5, 0, 0, 2, 0, 5, 2, 2, 3, 3, 3, 4, 0,
       4, 5, 0, 3, 1, 0, 5, 5, 3, 1, 0, 4, 1, 1, 2, 1, 2, 3, 4, 5, 2, 2,
       2, 5, 2, 0, 4, 2, 2, 2, 0, 3, 2, 4, 3, 5, 1, 0, 3, 0, 2, 2, 3, 3,
       5, 3, 0, 5, 0, 3, 2, 3, 2, 2, 0, 3, 5, 2, 3, 2, 5, 1, 3, 5, 5, 5,
       5, 2, 1, 5, 3, 4, 1, 2, 3, 3, 5, 3, 4, 5, 4, 5, 2, 1, 3, 3, 0, 0,
       5, 5, 0, 3, 5, 0, 1, 3, 0, 2, 2, 1, 2, 5, 5, 0, 5, 1, 2, 3, 1, 5,
       2, 0, 0, 5, 0, 3, 2, 1, 2, 3, 4, 5, 2, 4, 2, 5, 0, 2, 5, 2, 2, 4,
       2, 4, 4, 0, 3, 3, 5, 1, 1, 5, 1, 0, 5, 5, 5, 1, 0, 0, 3, 5, 2, 5,
       0, 5, 4, 0, 3, 0, 5, 0, 0, 5, 5, 2, 5, 2, 4, 4, 1, 5, 3, 3, 3, 0,
       5, 0, 5, 2, 0, 5, 3, 5, 5, 2, 5, 2, 5, 4, 1, 1, 1, 0, 2, 0, 0, 2,
       1, 3, 4, 3, 4, 2, 2, 0, 4, 3, 1, 2, 3, 2, 5, 4, 0, 1, 5, 4, 1, 3,
       1, 4, 1, 1, 4, 5, 2, 0, 3, 2, 1, 0, 1, 3, 5, 1, 3, 5, 4, 5, 3, 2,
       0, 2, 2, 5, 0, 5, 4, 5, 0, 4, 3, 0, 0, 5, 1, 3, 3, 1, 4, 5, 5, 4,
       3, 5, 5, 5, 3, 5, 5, 2, 4, 4, 0, 3, 1, 1, 2, 5, 5, 4, 2, 1, 4, 5,
       4, 2, 5, 3, 2, 4, 1, 3, 0, 4, 5, 1, 1, 2, 0, 4, 3, 5, 1, 5, 3, 2,
       4, 4, 3, 4, 5, 4, 4, 5, 2, 1, 4, 0, 5, 3, 0, 4, 1, 3, 5, 5, 3, 0,
       1, 0, 2, 5, 1, 2, 4, 3, 4, 0, 3, 0, 4, 3, 3, 2, 4, 1, 3, 3, 2, 1,
       2, 2, 5, 4, 1, 1, 0, 2, 4, 5, 2, 3, 0, 5, 5, 5, 1, 2, 3, 0, 4, 0,
       5, 2, 4, 2, 4, 1, 2, 5, 0, 3, 4, 0, 5, 3, 1, 2, 3, 1, 5, 5, 4, 2,
       4, 0, 4, 5, 0, 3, 2, 4, 0, 3, 2, 5, 5, 1, 5, 0, 5, 1, 4, 3, 0, 1,
       1, 1, 0, 5, 2, 0, 5, 3, 1, 5, 3, 1, 3, 1, 4, 4, 5, 0, 5, 4, 5, 4,
       0, 5, 5, 1, 2, 1, 2, 3, 4, 1, 5, 2, 3, 3, 1, 5, 3, 2, 3, 0, 3, 0,
       5, 3, 2, 0, 0, 1, 2, 5, 3, 2, 2, 5, 5, 3, 1, 0, 4, 4, 4, 0, 2, 2,
       3, 4, 3, 4, 2, 0, 4, 2, 2, 4, 0, 2, 1, 4, 3, 3, 1, 5, 3, 4, 2, 2,
       5, 0, 3, 0, 1, 1, 3, 1, 3, 3, 2, 2, 5, 5, 5, 4, 4, 2, 4, 3, 1, 2,
       2, 0, 4, 4, 3, 3, 1, 5, 5, 4, 4, 1, 2, 5, 1, 5, 5, 2, 2, 4, 5, 3,
       4, 3, 5, 4, 5, 0, 4, 4, 0, 5, 1, 1, 2, 3, 2, 5, 1, 3, 0, 2, 0, 1,
       5, 5, 1, 4, 2, 3, 5, 3, 1, 2, 1, 3, 5, 2, 0, 1, 0, 5, 3, 2, 0, 3,
       3, 1, 0, 2, 2, 5, 1, 4, 1, 2, 3, 0, 1, 5, 5, 3, 3, 1, 5, 2, 1, 2,
       5, 1, 5, 0, 5, 5, 4, 5, 2, 0, 4, 2, 1, 2, 4, 1, 0, 4, 0, 0, 5, 5,
       0, 4, 3, 5, 1, 2, 1, 2, 1, 2, 3, 0, 0, 0, 5, 1, 4, 4, 1, 3, 4, 5,
       1, 4, 5, 2, 1, 2, 1, 5, 0, 2, 3, 5, 4, 2, 2, 4, 1, 0, 2, 0, 2, 0,
       5, 2, 4, 4, 3, 0, 5, 5, 0, 1, 4, 5, 5, 1, 5, 1, 4, 1, 0, 5, 1, 5,
       5, 3, 3, 3, 4, 3, 2, 5, 0, 2, 5, 4, 0, 0, 0, 0, 4, 0, 4, 0, 0, 5,
       0, 0, 0, 3, 4, 2, 4, 3, 1, 4, 0, 0, 5, 1, 4, 4, 3, 1, 1, 5, 3, 0,
       5, 0, 0, 3, 0, 2, 2, 3, 0, 2, 0, 3, 1, 1, 4, 0, 5, 1, 0, 4, 0, 3,
       3, 2, 4, 0, 4, 3, 0, 3, 0, 4, 1, 4, 3, 1, 5, 2, 5, 5, 1, 4, 0, 4,
       3, 5, 1, 0, 0, 0, 1, 3, 3, 1, 4, 0, 3, 1, 2, 5, 3, 0, 0, 2, 4, 5,
       1, 1, 1, 5, 3, 3, 3, 2, 3, 3])

Formulate null hypothesis: the die is fair

That means that each value should occur with a probability of 1/6. for each value

Calculate expected number of occurrences based on null hypothesis

In [4]:
e  = len(data)*np.ones(6)/6.
print(e)
[166.66666667 166.66666667 166.66666667 166.66666667 166.66666667
 166.66666667]

Make a "histogram" of the observed data

Here that just means counting occurrences.

In [5]:
o = np.bincount(data, minlength=6)
print(o)
[166 152 156 173 150 203]
In [6]:
plt.figure()
x = np.linspace(0,5,6)
plt.scatter(x, e, label='expected')
plt.scatter(x, o, color='red', label='observed')
plt.xlabel('value')
plt.ylabel('occurrences')
plt.title('1000 Rolls')
plt.legend(loc='upper left');

Is the high number of occurrences of 5 anomalous?

Calculate $\chi^2$ and find probability that random sampling would result in a value of $\chi^2$ greater than the observed value.

In [7]:
chi2_data = np.sum((o-e)**2/e)
print('chi2 = ',chi2_data)
print('reduced chi2 = ',chi2_data/6)
chi2 =  11.804
reduced chi2 =  1.9673333333333334
In [8]:
p = 1 - stats.chi2.cdf(chi2_data, 6)
print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)
probability of getting a value of chisq greater than  11.804 with 6 bins is  0.06648690925652867

Not a high probability, but not totally unreasonable. Don't reject null hypothesis based on this data.

Data set #2: 50000 rolls

In [9]:
data = np.loadtxt('die_50000.dat', dtype=int)
data
Out[9]:
array([3, 2, 3, ..., 2, 2, 1])

Formulate null hypothesis: the die is fair

That means that each value should occur with a probability of 1/6. for each value

Calculate expected number of occurrences

In [10]:
e  = len(data)*np.ones(6)/6.
print(e)
[8333.33333333 8333.33333333 8333.33333333 8333.33333333 8333.33333333
 8333.33333333]

Make a "histogram" of the observed data

Here that just means counting occurrences.

In [11]:
o = np.bincount(data, minlength=6)
print(o)
[8228 8183 8414 8304 8121 8750]
In [12]:
plt.figure()
x = np.linspace(0,5,6)
plt.scatter(x, e, label='expected')
plt.scatter(x, o, color='red', label='observed')
plt.xlabel('value')
plt.ylabel('occurrences')
plt.title('1000 Rolls')
plt.legend(loc='upper left');

Is the high number of occurrences of 5 anomalous?

Calculate $\chi^2$ and find probability that random sampling would result in a value of $\chi^2$ greater than the observed value.

In [13]:
chi2_data = np.sum((o-e)**2/e)
print('chi2 = ',chi2_data)
print('reduced chi2 = ',chi2_data/6)
chi2 =  31.17112
reduced chi2 =  5.195186666666666
In [14]:
p = 1 - stats.chi2.cdf(chi2_data, 6)
print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)
probability of getting a value of chisq greater than  31.17112 with 6 bins is  2.351167146075195e-05

This is a very low probability. It's safe to reject null hypothesis based on this data; it's very probable that the die is NOT fair.

Version Information

version_information is from J.R. Johansson (jrjohansson at gmail.com). See Introduction to scientific computing with Python for more information and instructions for package installation.

version_information has been installed system-wide on Bucknell linux networked computers.

In [20]:
%load_ext version_information
In [21]:
%version_information numpy, scipy, matplotlib
Out[21]:
SoftwareVersion
Python3.7.8 64bit [GCC 7.5.0]
IPython7.17.0
OSLinux 3.10.0 1062.9.1.el7.x86_64 x86_64 with centos 7.8.2003 Core
numpy1.19.1
scipy1.5.0
matplotlib3.3.0
Thu Aug 20 16:29:33 2020 EDT
In [ ]: