SciPy for Data Science
Contents
6. SciPy for Data Science¶
From SciPy User Guide:
SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. With SciPy, an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems, such as MATLAB, IDL, Octave, R-Lab, and SciLab.
The additional benefit of basing SciPy on Python is that this also makes a powerful programming language available for use in developing sophisticated programs and specialized applications. Scientific applications using SciPy benefit from the development of additional modules in numerous niches of the software landscape by developers across the world. Everything from parallel programming to web and data-base subroutines and classes have been made available to the Python programmer. All of this power is available in addition to the mathematical libraries in SciPy.
SciPy contains the following modules:
Special functions (
scipy.special
)Integration (
scipy.integrate
)Optimization (
scipy.optimize
)Interpolation (
scipy.interpolate
)Fourier Transforms (
scipy.fft
)Signal Processing (
scipy.signal
)Linear Algebra (
scipy.linalg
)Sparse eigenvalue problems with
ARPACK
Compressed Sparse Graph Routines (
scipy.sparse.csgraph
)Spatial data structures and algorithms (
scipy.spatial
)Statistics (
scipy.stats
)Multidimensional image processing (
scipy.ndimage
)File IO (
scipy.io
)
6.1. Statistics¶
The scipy.stats
module contains functionalities on probability
distributions. For those who are familiar with R, the density,
distribution, quantile, and sampling functions are implemented for a
large number of distributions.
For example, the generalized extreme value (GEV) distribution is
available as scipy.stats.genextreme
. See
Wiki.
from scipy.stats import genextreme
xi = -0.2
n = 10
## a random sample of size n
x = genextreme.rvs(xi, size = n)
## density
genextreme.pdf(x, c = xi)
genextreme.logpdf(x, c = xi) # for loglikelihood
## distribution
genextreme.cdf(x, c = xi)
genextreme.logcdf(x, c = xi) # for loglikelihood
## quantile
genextreme.ppf(genextreme.cdf(x, c = xi), c = xi) == x
array([False, True, True, True, True, True, True, False, True,
False])
genextreme.logpdf(x, c = xi)
array([-1.63039974, -3.58389552, -1.01126159, -1.15448166, -1.1432668 ,
-2.55482708, -1.8898379 , -2.27841851, -1.6561628 , -1.09567913])
See documentation for more.
6.2. Optimization¶
The scipy.optimize
module provides unconstrained and constrained
minimizations with the function minimize()
.
For illustration, consider the \(n\)-variate Rosenbrook function.
def rosen(x, a = 1, b = 100):
"""The Rosenbrock function"""
return sum(b*(x[1:] - x[:-1]**2.0)**2.0 + (a-x[:-1])**2.0)
The important arguments of minimize()
are:
fun
: objective functionx0
: initial guessargs
: extra arguments to the objective function (and its derivatives)method
: type of solver, includingNelder-Mead
,BFGS
, etc.
Derivatives can be supplied via jac
and hess
.
The returned object is of type scipy.optimize.OptimizeResult
. We
need these attributes:
x
: the solution to the optimizationfun
: the value of the minimized objective function (in a likelihood estimation problem, negative loglikelihood)hess_inv
: inverse of the Hessian matrix (in a likelihood estimation problem it is the variance matrix of the MLE)
import numpy as np
from scipy.optimize import minimize
x0 = np.array([1.3, 0.7]) # , 0.8, 1.9, 1.2])
res = minimize(rosen, x0,
method = 'Nelder-Mead',
options = {'xatol': 1e-8, 'disp': True})
res.x
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 79
Function evaluations: 150
array([1., 1.])
The Hessian matrix is not available from the Nelder-Mead method, but
from the BFGS method. Now we use res.x
as initial value and apply
the BFGS method to get hess_inv
.
res2 = minimize(rosen, res.x, method='BFGS',
options = {'maxiter': 0})
res2.x
array([1., 1.])
res2.hess_inv
array([[1, 0],
[0, 1]])
Let’s change the value of a
and b
.
res2 = minimize(rosen, res.x, args=(3, 100),
method = 'Nelder-Mead')
res2.x
array([3.00000422, 9.00002562])