Let's start with something familiar: the fitting of 5 data points
$(x_i, y_i)$ to the linear function $y(x) = mx + b$ to determine
values for the slope and intercept ($m$ and $b$). As we have done
before, we will be using the curve_fit function from the optimize sub-module of scipy.
Up to this point we have talked about qualitative assessments of whether we should have confidence in the underlying model that we assumed (for example, inspecting the best-fit line and the normalized residuals).
In this notebook we will begin the discussion of another, more quantitative, measure that should be added to the things we consider when answering the question: What confidence can we have in the model? Or, equivalently: What confidence can we have that we assumed the right function to fit to the data?
NOTE: There are no new computational tools introduced in this notebook. The fitting and plotting is something we have done many times, and simulating data sets that are consistent with a model is also something we have done before. The "new business" idea here is the use of a histogram of values of $\chi^2$ to draw conclusions about our confidence in the straight-line model.
Marty Ligare, August 2020
import numpy as np
from scipy import stats
from scipy import optimize
import matplotlib as mpl
import matplotlib.pyplot as plt
# Following is an IPython magic command that puts figures in notebook.
%matplotlib notebook
# M.L. modifications of matplotlib defaults
# Changes can also be put in matplotlibrc file,
# or effected using mpl.rcParams[]
mpl.style.use('classic')
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts subplot params for new size
Remember: $$ \chi^2 = \sum_i\frac{\left(y(x_i) - y_i\right)^2}{\alpha_i^2} $$
def f(x,m,b):
'''Simple linear function with slope m and intercept b'''
return m*x + b
def chi2(x, y, u, m, b):
'''Chisquare as a function of data (x, y, and yerr=u), and model
parameters slope and intercept (m and b)'''
return np.sum((y - f(x, m, b))**2/u**2)
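Before using chi2 on the real data, here is a quick sanity check on a hypothetical toy set (made-up values, not part of the data analyzed below): points that lie exactly on the line should give $\chi^2 = 0$, and shifting each point up by one uncertainty should give a $\chi^2$ equal to the number of points.
# Toy sanity check of chi2 (hypothetical values, not the data used below)
x_test = np.array([0., 1., 2.])
y_test = 2*x_test + 1                                # points exactly on y = 2x + 1
u_test = np.full(3, 0.5)
print(chi2(x_test, y_test, u_test, 2, 1))            # expect 0.0
print(chi2(x_test, y_test + u_test, u_test, 2, 1))   # expect 3.0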
# Or: data = np.loadtxt("file.dat")
# Format: [[x1,y1,u1], [x2,y2,u2], ... ] where u1 is uncertainty in y1
data = np.array([[1, 2.947032612427293, 0.5],
[2, 6.168779380682309, 0.5],
[3, 7.1618838821688, 0.5],
[4, 9.590549514954866, 0.5],
[5, 11.20657, 0.5]])
x, y, u = data.T
plt.figure()
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Data set #1')
plt.xlim(0,6)
plt.ylim(0,12)
plt.errorbar(x,y,yerr=u,fmt='o');
popt, pcov = optimize.curve_fit(f, x, y, sigma=u, absolute_sigma=True)
m_data = popt[0]
alpha_m = np.sqrt(pcov[0,0]) # Std error in slope
b_data = popt[1]
alpha_b = np.sqrt(pcov[1,1]) # Std error in intercept
chi2_data = chi2(x,y,u,*popt)
print("slope =", m_data,"+/-", alpha_m,"\n")
print("intercept =", b_data,"+/-", alpha_b,"\n")
print("chi2 from data =", chi2_data)
xc = np.linspace(0,6,201) # quasi-continuous set of x's for plot of function
plt.figure()
plt.title("Data set #1 with best fit line",fontsize=14)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.axhline(0, color='magenta')
plt.xlim(0,6) # Pad x-range on plot
plt.errorbar(x, y, yerr=u, fmt='o');
plt.plot(xc ,f(xc, m_data, b_data));
plt.figure()
plt.axhline(0,color='magenta')
plt.title('Data set #1 normalized residuals',fontsize=14)
plt.xlabel('$x$')
plt.ylabel('normalized residuals')
plt.grid(True)
plt.errorbar(x,(f(x,m_data,b_data)-y)/u,1,fmt='o')
plt.xlim(0,6);
The normalized residuals in the figure above look OK according to the qualitative assessments mentioned above, so let's proceed to a more quantitative measure by looking at the consequences of the model.
n_sim = 10000 # number of data sets to simulate
count = 0 # counter for keeping track of simulated sets
# that give value of chisq > value from real data
chi2_sim = np.zeros(n_sim) # Array for chisq values from simulated sets
for i in range(n_sim):
y_sim = stats.norm.rvs(m_data*x + b_data, u) # Generate simulated data
popt_sim, pcov_sim = optimize.curve_fit(f, x, y_sim, sigma=u, absolute_sigma=True)
a = chi2(x,y_sim,u,*popt_sim) # calculate chisq for simulated data
chi2_sim[i] = a # store value of chisq
if a > chi2_data: # increment counter
count += 1
print("% of simulated sets with chisq larger than that from orig. data =", 100*count/n_sim)
mean = np.mean(chi2_sim)
sigma = np.sqrt(np.var(chi2_sim))
print("mean of simulated values of chisq =",mean)
print("std of simulated values of chisq=", sigma)
nbins = 32 # Number of bins for histogram
low = 0 # low limit for histogram
#high = mean + 5*sigma
high = 16 # high limit for histogram
plt.figure()
plt.xlabel("$\chi^2$")
plt.ylabel("occurrences")
plt.title("Histogram for data set #1; equal sized bins")
plt.axvline(chi2_data, label='vertical line at $\chi^2 = 3.70$');
out = plt.hist(chi2_sim, nbins, [low,high], label="$\chi^2$ from simulated data")
#out[0],out[1]; # occurrences and bin boundaries (for use later)
plt.legend();
For a given model (here a linear relationship between $x$ and $y$, along with given values of $x_i$ and $\alpha_i$) the histogram gives a probability distribution for the minimized values of $\chi^2$ obtained from fits to data sets consistent with the model and uncertainties. Notice that this is a very asymmetric distribution.
The vertical line is at the value of $\chi^2 = 3.70$ from the fit to the original data. This value is right in the range of values we should expect from the model. To be more quantitative, the simulated data sets indicate that there is a 29% chance of getting a value of $\chi^2$ that is 3.7 or bigger, so it's not surprising that we would get such a value. Conclusion: There is no reason to reject the null hypothesis based on the observed value of $\chi^2$, i.e., there is no reason to reject the linear model.
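The 29% figure can also be read directly from the array of simulated values, and (as a hedged cross-check, using the same 3-degree-of-freedom chi-square distribution noted in the aside above) the analytic survival function should give a similar probability.
# Fraction of simulated chi2 values above the value from the real data;
# this is just a vectorized restatement of the counter in the loop above.
print("simulated P(chi2 > chi2_data) =", np.mean(chi2_sim > chi2_data))
# Analytic cross-check (aside, not part of the original analysis):
print("analytic P(chi2 > chi2_data) =", stats.chi2.sf(chi2_data, len(x) - 2))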
The version_information extension is from J.R. Johansson (jrjohansson at gmail.com). See Introduction to scientific computing with Python for more information and instructions for package installation. version_information has been installed system-wide on Bucknell linux networked computers.
%load_ext version_information
%version_information numpy, scipy, matplotlib