Minimization with MinKit¶
The main purpose of the MinKit package is to do minimizations of PDFs to data sets, finding the set of parameter values that reproduce better the data. The user must be aware about the type of PDF is trying to fit, and select a proper minimization cuantity (FCN) accordingly. For example, if we are dealing with an unbinned dataset, we need an FCN defined for binned data sets. The same stands for extended and non-extended PDFs. The user must know whether an unbinned extended maximum likelihood FCN is needed or not. To give an idea of how the minimization is done with MinKit, let’s create a simple model, generate some data, and fit it.
[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import minkit
# Build the PDF
x = minkit.Parameter('x', bounds=(10, 20))
c = minkit.Parameter('c', 15, bounds=(10, 20))
s = minkit.Parameter('s', 1, bounds=(0.1, 2))
g = minkit.Gaussian('g', x, c, s)
# Keep the initial values
initials = g.get_values()
# Generate data and do the fit
data = g.generate(10000)
with minkit.minimizer('uml', g, data) as minimizer:
minimizer.migrad()
# Plot the results
values, edges = minkit.data_plotting_arrays(data, bins=100)
centers = 0.5 * (edges[1:] + edges[:-1])
plt.hist(centers, bins=edges, weights=values, histtype='step', color='k', label='data');
gx, pdf_values = minkit.pdf_plotting_arrays(g, values, edges)
plt.plot(gx, pdf_values, color='blue', label='PDF')
plt.legend();
# Reset values
g.set_values(**initials)
FCN = 28288.575555243537 | TOTAL NCALL = 74 | NCALLS = 74 |
EDM = 2.456083689719908e-05 | GOAL EDM = 1e-05 | UP = 1.0 |
Valid | Valid Param | Accurate Covar | PosDef | Made PosDef |
True | True | True | True | False |
Hesse Fail | HasCov | Above EDM | Reach calllim | |
False | True | False | False |
+ | Name | Value | Hesse Error | Minos Error- | Minos Error+ | Limit- | Limit+ | Fixed? |
0 | c | 14.987 | 0.00995622 | 10 | 20 | No | ||
1 | s | 0.995523 | 0.00704013 | 0.1 | 2 | No |
Note that we have called minkit.minimizer in order to create a context and do the minimization. The first argument is the FCN to use, an unbinned maximum likelihood in this case. It is very important that we do not change the properties of the PDF inside the with statement. This is because the PDF has enabled a cache. This means that, if the parameters have been set to constant, the evaluation of the PDF over the data values might have been kept in the cache, to reduce the execution time.
Let’s create a more complex model, adding a background-like contribution, using an exponential.
[2]:
# Make the exponential PDF
k = minkit.Parameter('k', -0.1, bounds=(-1, 0))
e = minkit.Exponential('e', x, k)
# Add the two PDFs
y = minkit.Parameter('y', 0.5, bounds=(0, 1))
pdf = minkit.AddPDFs.two_components('pdf', g, e, y)
data = pdf.generate(10000)
# Keep the initial values
initials = pdf.get_values()
# Minimize the PDF
with minkit.minimizer('uml', pdf, data) as minimizer:
minimizer.migrad()
values, edges = minkit.data_plotting_arrays(data, bins=100)
centers = 0.5 * (edges[1:] + edges[:-1])
plt.hist(centers, bins=edges, weights=values, histtype='step', color='k', label='data');
pdf_centers, pdf_values = minkit.pdf_plotting_arrays(pdf, values, edges)
_, g_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='g')
_, e_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='e')
plt.plot(pdf_centers, e_values, color='orange', linestyle=':', label='Exponential')
plt.plot(pdf_centers, g_values, color='green', linestyle='--', label='Gaussian')
plt.plot(pdf_centers, pdf_values, color='blue', label='PDF')
plt.legend();
# Reset values
pdf.set_values(**initials)
FCN = 41568.87190680871 | TOTAL NCALL = 99 | NCALLS = 99 |
EDM = 5.483245260122549e-06 | GOAL EDM = 1e-05 | UP = 1.0 |
Valid | Valid Param | Accurate Covar | PosDef | Made PosDef |
True | True | True | True | False |
Hesse Fail | HasCov | Above EDM | Reach calllim | |
False | True | False | False |
+ | Name | Value | Hesse Error | Minos Error- | Minos Error+ | Limit- | Limit+ | Fixed? |
0 | y | 0.526921 | 0.00950927 | 0 | 1 | No | ||
1 | c | 14.9758 | 0.0202446 | 10 | 20 | No | ||
2 | s | 1.02638 | 0.020045 | 0.1 | 2 | No | ||
3 | k | -0.10163 | 0.0057237 | -1 | 0 | No |
Let’s now assume we want to know the two yields of the PDFs. To do this we need to modify the model, and do and use an extended likelihood during the minimization.
[3]:
# Build the extended PDF
ng = minkit.Parameter('ng', 5000, bounds=(0, 10000))
ne = minkit.Parameter('ne', 5000, bounds=(0, 10000))
pdf = minkit.AddPDFs.two_components('pdf', g, e, ng, ne)
initials = pdf.get_values()
# Generate and fit
data = pdf.generate(int(ng.value + ne.value))
ng.value = 6000 # vary the yields to flee from the minimum
ne.value = len(data) - ng.value
with minkit.minimizer('ueml', pdf, data) as minimizer:
minimizer.migrad()
values, edges = minkit.data_plotting_arrays(data, bins=100)
centers = 0.5 * (edges[1:] + edges[:-1])
plt.hist(centers, bins=edges, weights=values, histtype='step', color='k', label='data');
pdf_centers, pdf_values = minkit.pdf_plotting_arrays(pdf, values, edges)
_, g_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='g')
_, e_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='e')
plt.plot(pdf_centers, e_values, color='orange', linestyle=':', label='Exponential')
plt.plot(pdf_centers, g_values, color='green', linestyle='--', label='Gaussian')
plt.plot(pdf_centers, pdf_values, color='blue', label='PDF')
plt.legend();
# Reset the values
pdf.set_values(**initials)
FCN = -122354.9230188445 | TOTAL NCALL = 124 | NCALLS = 124 |
EDM = 2.1664581983657007e-05 | GOAL EDM = 1e-05 | UP = 1.0 |
Valid | Valid Param | Accurate Covar | PosDef | Made PosDef |
True | True | True | True | False |
Hesse Fail | HasCov | Above EDM | Reach calllim | |
False | True | False | False |
+ | Name | Value | Hesse Error | Minos Error- | Minos Error+ | Limit- | Limit+ | Fixed? |
0 | ng | 5052.31 | 107.283 | 0 | 10000 | No | ||
1 | ne | 4947.74 | 106.773 | 0 | 10000 | No | ||
2 | c | 15.0023 | 0.0206492 | 10 | 20 | No | ||
3 | s | 1.01169 | 0.0203878 | 0.1 | 2 | No | ||
4 | k | -0.0963174 | 0.00560069 | -1 | 0 | No |
Note that we have specified the ueml (unbinned extended maximum likelihood) FCN. If we set it to uml, we would get wrong values. Let’s now see what happens with binned data sets. In this case, we will use the bml FCN.
[4]:
import numpy as np
values, edges = np.histogram(data['x'].as_ndarray(), bins=100, range=x.bounds)
binned_data = minkit.BinnedDataSet.from_ndarray(edges, x, values)
with minkit.minimizer('bml', pdf, binned_data) as minimizer:
minimizer.migrad()
values, edges = minkit.data_plotting_arrays(data, bins=100)
centers = 0.5 * (edges[1:] + edges[:-1])
plt.hist(centers, bins=edges, weights=values, histtype='step', color='k', label='data');
pdf_centers, pdf_values = minkit.pdf_plotting_arrays(pdf, values, edges)
_, g_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='g')
_, e_values = minkit.pdf_plotting_arrays(pdf, values, edges, component='e')
plt.plot(pdf_centers, e_values, color='orange', linestyle=':', label='Exponential')
plt.plot(pdf_centers, g_values, color='green', linestyle='--', label='Gaussian')
plt.plot(pdf_centers, pdf_values, color='blue', label='PDF')
plt.legend();
# Reset the values
pdf.set_values(**initials)
FCN = 85.809988390516 | TOTAL NCALL = 100 | NCALLS = 100 |
EDM = 2.6451594374739883e-07 | GOAL EDM = 1e-05 | UP = 1.0 |
Valid | Valid Param | Accurate Covar | PosDef | Made PosDef |
True | True | True | True | False |
Hesse Fail | HasCov | Above EDM | Reach calllim | |
False | True | False | False |
+ | Name | Value | Hesse Error | Minos Error- | Minos Error+ | Limit- | Limit+ | Fixed? |
0 | ng | 5053.04 | 7422.08 | 0 | 10000 | No | ||
1 | ne | 4945.67 | 7381.12 | 0 | 10000 | No | ||
2 | c | 15.0018 | 0.0206669 | 10 | 20 | No | ||
3 | s | 1.01206 | 0.0204419 | 0.1 | 2 | No | ||
4 | k | -0.096293 | 0.00560712 | -1 | 0 | No |
Similarly, with the FCN chi2 we would be doing a fit to the chi-square, that is, we would be assuming that the input values, instead of following a Poisson distribution, they follow a Gaussian.