In Part A of the Prelude to Chapter 8, we looked at the expected distribution of values of $\chi^2$ for the fit of a straight line to 5 data points. In Part B we repeat exactly the same process, but the model is an exponential decay, and we fit it to a larger set of 21 data points.
Marty Ligare, August 2020
import numpy as np
from scipy import stats
from scipy import optimize
import matplotlib as mpl
import matplotlib.pyplot as plt
# Following is an IPython magic command that puts figures in the notebook.
%matplotlib notebook
# M.L. modifications of matplotlib defaults
# Changes can also be put in matplotlibrc file,
# or effected using mpl.rcParams[]
mpl.style.use('classic')
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts subplot parameters for new size
Remember: $$ \chi^2 = \sum_i\frac{\left(y(x_i) - y_i\right)^2}{\alpha_i^2} $$
def f(x, q, r, s):
    '''Simple exponential decay function (with offset); nonlinear
    in the fit parameters q, r, and s.'''
    return q*np.exp(-r*x) + s
def chi2(x, y, u, q, r, s):
    '''Chi-square as a function of the data (x, y, and yerr=u) and the
    model parameters q, r, and s.'''
    return np.sum((y - f(x, q, r, s))**2/u**2)
data = np.array([[0. , 7.46129812, 0.4 ],
[0.2 , 5.85452157, 0.4 ],
[0.4 , 6.07717576, 0.4 ],
[0.6 , 5.4736062 , 0.4 ],
[0.8 , 4.5904504 , 0.4 ],
[1. , 4.79926158, 0.4 ],
[1.2 , 4.55971739, 0.4 ],
[1.4 , 3.60873722, 0.4 ],
[1.6 , 2.86732949, 0.4 ],
[1.8 , 3.54260122, 0.4 ],
[2. , 2.13946228, 0.4 ],
[2.2 , 2.22060587, 0.4 ],
[2.4 , 1.63133617, 0.4 ],
[2.6 , 2.46693153, 0.4 ],
[2.8 , 2.32689348, 0.4 ],
[3. , 1.5128004 , 0.4 ],
[3.2 , 1.20809552, 0.4 ],
[3.4 , 1.43849374, 0.4 ],
[3.6 , 1.36242689, 0.4 ],
[3.8 , 1.73797785, 0.4 ],
[4. , 1.72819865, 0.4 ]])
x, y, u = data.T
plt.figure()
plt.title('Data set #2')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim(-0.5,4.5)
plt.ylim(0,8)
plt.errorbar(x,y,yerr=u,fmt='o');
p0 = 5, 1, 1 # Initial estimates of q, r, and s
popt, pcov = optimize.curve_fit(f, x, y, p0, sigma=u, absolute_sigma=True)
chi2_data = chi2(x,y,u,*popt)
print("chi2_data =", chi2_data)
xc = np.linspace(0,6,201) # quasi-continuous set of x's for function plot
plt.figure()
plt.title("Data set #2 with best fit function")
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.axhline(0, color='magenta')
plt.xlim(-0.5,4.5) # Pad x-range on plot
plt.errorbar(x, y, yerr=u, fmt='o');
plt.plot(xc ,f(xc, *popt));
plt.figure()
plt.axhline(0,color='magenta')
plt.xlabel('$x$')
plt.ylabel('normalized residuals')
plt.title('Data set #2: normalized residuals')
plt.grid(True)
plt.errorbar(x,(f(x,*popt)-y)/u,1,fmt='o')
plt.xlim(-0.5,5.5);
The normalized residuals look OK according to the qualitative assessments mentioned in Part A, so we proceed to more quantitative measures.
n_sim = 10000 # Number of data sets to simulate
count = 0
chi2_sim = np.zeros(n_sim)                 # Array for chi-square values from simulated sets
for i in range(n_sim):
    y_sim = stats.norm.rvs(f(x, *popt), u) # Generate simulated data set
    popt_sim, pcov_sim = optimize.curve_fit(f, x, y_sim, popt, sigma=u, absolute_sigma=True)
    a = chi2(x, y_sim, u, *popt_sim)       # Calculate chi-square for simulated data
    chi2_sim[i] = a                        # Store value of chi-square
    if a > chi2_data:                      # Increment counter
        count += 1
print("% of simulated sets with chisq larger than that from orig. data =", 100*count/n_sim)
mean = np.mean(chi2_sim)
sigma = np.sqrt(np.var(chi2_sim))
print("mean of simulated values of chisq =", mean)
print("std of simulated values of chisq =", sigma)
nbins = 25 # number of bins
low = 0 # lower limit for histogram
#high = mean + 5*sigma
high = 50 # upper limit for histogram
plt.figure()
plt.xlabel("$\chi^2$")
plt.ylabel("occurrences")
plt.title("Histogram; equal sized bins")
out = plt.hist(chi2_sim, nbins, [low,high], label=r"$\chi^2$ from sim. data")
#out[0],out[1] # occurrences and bin boundaries (for use in future)
plt.axvline(chi2_data, label=r'$\chi^2 = 21.5$')
plt.legend();
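For a visual comparison (again assuming the analytic $\chi^2$ distribution with 18 degrees of freedom is a good approximation despite the nonlinear model), we can overlay the appropriately scaled $\chi^2$ pdf on the histogram:
dof = len(x) - len(popt)           # degrees of freedom
bin_width = (high - low)/nbins     # width of each histogram bin
xq = np.linspace(low, high, 201)
# Scale pdf by (number of trials) x (bin width) to match histogram counts
plt.plot(xq, n_sim*bin_width*stats.chi2.pdf(xq, dof), color='red',
         label=r'$\chi^2$ pdf, 18 d.o.f.')
plt.legend();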
Again, the histogram gives the probability distribution for the values of $\chi^2$ obtained from fits to data sets consistent with the model and uncertainties. With more data points, the probability distribution is more symmetric than that in Part A.
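The symmetry claim can be made quantitative: the skewness of a $\chi^2$ distribution with $k$ degrees of freedom is $\sqrt{8/k}$, which drops from about 1.6 for the 3 degrees of freedom of Part A (5 points, 2 line parameters) to about 0.67 for the 18 here. A quick check of the simulated values (a supplementary calculation, not in the original):
print("skewness of simulated chi2 values =", stats.skew(chi2_sim))
print("skewness of chi2 distribution, 18 d.o.f. =", np.sqrt(8/18))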
Again, the value of $\chi^2$ obtained by fitting the original data is in the range of values we expect from the model and uncertainties. To be more quantitative, there is about a 25% chance of getting a value of $\chi^2$ of 21.5 or bigger, so a value of 21.5 is entirely consistent with the model.
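The 25% figure can be cross-checked against the analytic $\chi^2$ survival function (assuming 18 degrees of freedom, i.e., treating the fit as effectively linear):
print("P(chi2 > chi2_data) =", stats.chi2.sf(chi2_data, len(x) - len(popt)))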
There is no reason to reject the exponential decay model based on the value of $\chi^2 = 21.5$.
On the other hand, what if our fit had returned a value of $\chi^2 = 40$? (This is more than 3 standard deviations above the mean of the simulated values in the histogram.) Zooming in on the histogram shows that only 17 of the 10,000 simulated data sets (0.17%) have a value of $\chi^2$ greater than 40. It's pretty unlikely that we would get a value of $\chi^2 = 40$ from our model and random fluctuations.
What if our fit had returned a value of $\chi^2 = 4$? Inspection of the histogram data shows that only 44 of the 10,000 simulated data sets returned a value less than 4; equivalently, 99.56% of the trials returned a value of $\chi^2$ greater than 4. This would be a good indication that our assumed error bars were too big.
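Both tail fractions quoted above can be read directly from the array of simulated values rather than from a zoomed histogram; this short sketch (not part of the original notebook) counts them with boolean means:
# Fraction of simulated chi-square values in each tail
print("fraction of simulated sets with chi2 > 40:", np.mean(chi2_sim > 40))
print("fraction of simulated sets with chi2 < 4: ", np.mean(chi2_sim < 4))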
The version_information extension is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation. version_information is installed on the linux network at Bucknell.
%load_ext version_information
version_information numpy, scipy, matplotlib