Python Refreshment
5. Python Refreshment
You have programmed in Python. Regardless of your skill level, let us do some refreshing.
5.1. The Python World
Function: a block of organized, reusable code that completes a certain task.
Module: a file containing a collection of functions, variables, and statements.
Package: a structured directory containing a collection of modules and an __init__.py file, by which the directory is interpreted as a package.
Library: a collection of related functionality. It is a reusable chunk of code that we use by importing it into our program and calling its methods with the dot (.) operator.
See, for example, how to build a Python library.
Question: How to get the constant \(e\) to an arbitrary precision?
The math module stores the constant only as a double-precision float, so the digits printed beyond roughly the 16th significant digit are not those of \(e\).
import math
print("%0.20f" % math.e)
print("%0.80f" % math.e)
2.71828182845904509080
2.71828182845904509079559829842764884233474731445312500000000000000000000000000000
Now use the decimal module to compute \(e\) to an arbitrary precision.
import decimal  # arbitrary-precision decimal arithmetic
# set the precision to 150 significant digits
decimal.getcontext().prec = 150
decimal.Decimal(1).exp().to_eng_string()
# drop the leading "2." and keep only the fractional digits
decimal.Decimal(1).exp().to_eng_string()[2:]
'71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526'
Question: How to draw a random sample from a normal distribution and evaluate the density and distribution functions at these points?
from scipy.stats import norm
mu, sigma = 2, 4
mean, var, skew, kurt = norm.stats(mu, sigma, moments='mvsk')
print(mean, var, skew, kurt)
x = norm.rvs(loc = mu, scale = sigma, size = 10)
x
2.0 16.0 0.0 0.0
array([ 9.06761885, -2.6153151 , 0.53189381, -0.51285235, 2.69759196,
-2.80003676, 3.63858027, -4.00849686, -2.07662859, 0.57841288])
The pdf and cdf can be evaluated:
norm.pdf(x, loc = mu, scale = sigma)
array([0.02093759, 0.0512575 , 0.09323919, 0.08187523, 0.09823033,
0.04854598, 0.09170875, 0.03227632, 0.05933396, 0.0936317 ])
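The cdf is evaluated the same way as the pdf; with SciPy that is norm.cdf(x, loc = mu, scale = sigma). As a minimal cross-check that needs only the standard library, the sketch below uses statistics.NormalDist in place of SciPy. The three evaluation points are chosen here for illustration; they are not the random sample drawn above.

```python
from statistics import NormalDist

mu, sigma = 2, 4
dist = NormalDist(mu=mu, sigma=sigma)

# Illustrative points: one standard deviation below the mean, the mean,
# and one standard deviation above the mean.
points = [mu - sigma, mu, mu + sigma]

cdf_values = [dist.cdf(p) for p in points]
pdf_values = [dist.pdf(p) for p in points]

print(cdf_values)
print(pdf_values)
```

By symmetry, the cdf at the mean is 0.5, and the cdf values at mu - sigma and mu + sigma add up to 1.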
5.2. Writing a Function
Consider the Fibonacci sequence \(1, 1, 2, 3, 5, 8, 13, 21, 34, ...\). Each number is found by adding up the two numbers before it. We are going to compute the \(n\)th number in three ways.
The first is a recursive solution.
def fib_rs(n):
    if n == 1 or n == 2:
        return 1
    else:
        return fib_rs(n - 1) + fib_rs(n - 2)
%timeit fib_rs(10)
16 µs ± 77.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second uses dynamic programming with memoization.
def fib_dm_helper(n, mem):
    if mem[n] is not None:
        return mem[n]
    elif n == 1 or n == 2:
        result = 1
    else:
        result = fib_dm_helper(n - 1, mem) + fib_dm_helper(n - 2, mem)
    mem[n] = result
    return result

def fib_dm(n):
    mem = [None] * (n + 1)
    return fib_dm_helper(n, mem)
%timeit fib_dm(10)
3.8 µs ± 37.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
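The same memoization idea can be written more compactly with functools.lru_cache from the standard library. This is a sketch of an alternative, not the version timed above; the name fib_cached is made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    """n-th Fibonacci number; lru_cache stores every result already computed."""
    if n == 1 or n == 2:
        return 1
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(10))           # 55
print(fib_cached.cache_info())  # hit and miss counts of the cache
```

Note that the cache persists between calls, so timing fib_cached repeatedly would mostly measure cache lookups rather than the computation itself.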
The third is still dynamic programming but bottom-up.
def fib_dbu(n):
    if n <= 2:  # mem[2] does not exist when n == 1
        return 1
    mem = [None] * (n + 1)
    mem[1] = 1
    mem[2] = 1
    for i in range(3, n + 1):
        mem[i] = mem[i - 1] + mem[i - 2]
    return mem[n]
%timeit fib_dbu(500)
102 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Apparently, the three solutions have very different performance for larger n. Note that the third was timed at n = 500, whereas the first two were timed at n = 10.
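To see why the plain recursion scales so badly, one can count how many function calls it makes. The sketch below is a hypothetical instrumented variant of the recursive solution (the name fib_count is not from the text above); it returns both the Fibonacci number and the number of calls needed to obtain it.

```python
def fib_count(n):
    """Return (n-th Fibonacci number, number of calls made by plain recursion)."""
    if n == 1 or n == 2:
        return 1, 1
    a, calls_a = fib_count(n - 1)
    b, calls_b = fib_count(n - 2)
    # One call for this level plus all calls made by the two subproblems.
    return a + b, calls_a + calls_b + 1


for n in (10, 20, 25):
    value, calls = fib_count(n)
    print(n, value, calls)
```

The call count grows as fast as the Fibonacci numbers themselves, whereas the memoized and bottom-up versions compute each term only once.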
5.3. Variables versus Objects
In Python, variables and the objects they point to actually live in two different places in the computer memory. Think of variables as pointers to the objects they’re associated with, rather than being those objects. This matters when multiple variables point to the same object.
x = [1, 2, 3] # create a list; x points to the list
y = x # y also points to the same list in the memory
y.append(4) # append to y
x # x changed!
[1, 2, 3, 4]
Now check their addresses:
print(id(x)) # address of x
print(id(y)) # address of y
140183868556160
140183868556160
Nonetheless, some data types in Python are “immutable”, meaning that their values cannot be changed in place. One such example is strings.
x = "abc"
y = x
y = "xyz"
x
'abc'
Now check their addresses:
print(id(x)) # address of x
print(id(y)) # address of y
140184909075568
140183889055856
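In the example above, y = "xyz" rebinds y to a different object, which would leave x untouched for any type. The difference between immutable and mutable types shows up when an operation could, in principle, work in place. A minimal sketch contrasting a string with a list, using the is operator to test whether two names point to the same object:

```python
x = "abc"
y = x
print(x is y)   # True: both names point to the same string object
y += "d"        # strings cannot change in place, so a new object is built
print(x, y)     # abc abcd
print(x is y)   # False

a = [1, 2, 3]
b = a
b += [4]        # lists support in-place extension
print(a, b)     # [1, 2, 3, 4] [1, 2, 3, 4]
print(a is b)   # True

try:
    x[0] = "z"  # item assignment is not allowed on a string
except TypeError as err:
    print("TypeError:", err)
```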
Question: What’s mutable and what’s immutable?
As a rule of thumb, the built-in containers (lists, dictionaries, and sets) are mutable, whereas numbers, strings, and tuples are immutable.
Not all manipulations of mutable objects change the object rather than create a new object. Sometimes when you do something to a mutable object, you get back a new object. Manipulations that change an existing object, rather than create a new one, are referred to as “in-place mutations” or just “mutations.” So:
All manipulations of immutable types create new objects.
Some manipulations of mutable types create new objects.
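A short sketch of the distinction for lists, pairing each operation that creates a new object with a counterpart that mutates in place. The variable names are made up for illustration.

```python
nums = [3, 1, 2]
alias = nums            # second name for the same list

ordered = sorted(nums)  # sorted() builds a new list
print(ordered is nums)  # False
print(alias)            # [3, 1, 2]

nums.sort()             # list.sort() mutates in place
print(alias)            # [1, 2, 3]

longer = nums + [4]     # + builds a new list
print(alias)            # [1, 2, 3]

nums.append(4)          # append() mutates in place
print(alias)            # [1, 2, 3, 4]
```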
The fact that different variables may all point at the same object is preserved through function calls (a behavior known as “pass by object-reference”). So if you pass a list to a function, and that function manipulates that list using an in-place mutation, that change will affect any variable that was pointing to that same object outside the function.
x = [1, 2, 3]
y = x
def append_42(input_list):
    input_list.append(42)
    return input_list
append_42(x)
[1, 2, 3, 42]
Note that \(42\) has been appended to the list that both x and y point to.
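If the side effect is not wanted, the function can work on a copy instead. A minimal sketch, where append_42_copy is a hypothetical variant of the function above:

```python
def append_42_copy(input_list):
    """Return a new list ending in 42 and leave the caller's list unchanged."""
    result = list(input_list)  # shallow copy of the argument
    result.append(42)
    return result


x = [1, 2, 3]
y = x
z = append_42_copy(x)
print(x, y, z)  # [1, 2, 3] [1, 2, 3] [1, 2, 3, 42]
```

The copy is shallow: if the list held other mutable objects, those inner objects would still be shared.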
5.4. Number Representation
Numbers in a computer’s memory are represented in binary (bits that are either on or off).
5.4.1. Integers
If you are not careful, it is easy to be bitten by integer overflow when using NumPy and pandas in Python.
import numpy as np
x = np.array(2**63 - 1 , dtype='int')
x
# This should be the largest integer NumPy can store with
# the default int64 type (64 bits)
array(9223372036854775807)
What if we increment it by 1?
y = np.array(x + 1, dtype='int')
y
# Because of the overflow, it becomes negative!
array(-9223372036854775808)
In vanilla Python, integers have arbitrary precision: more digits are allocated when needed, so they do not overflow, at the cost of being slower.
2**63 * 1000
9223372036854775808000
This number is 1000 times larger than the NumPy value that overflowed above, but it is still displayed exactly, without any overflow.
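The jump to a negative value seen with NumPy follows from 64-bit two’s-complement wraparound: only the lowest 64 bits are kept, and the top bit is read as the sign. A sketch that emulates this with Python’s unbounded integers; the helper name wrap_int64 is made up for illustration and is not a NumPy function.

```python
def wrap_int64(value):
    """Interpret the lowest 64 bits of value as a signed two's-complement integer."""
    value &= (1 << 64) - 1   # keep only the lowest 64 bits
    if value >= 1 << 63:     # top bit set means a negative number
        value -= 1 << 64
    return value


largest = 2**63 - 1
print(wrap_int64(largest))      # 9223372036854775807
print(wrap_int64(largest + 1))  # -9223372036854775808
```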
5.4.2. Floating-Point Numbers
A standard double-precision floating-point number uses 64 bits: 1 for the sign, 11 for the exponent, and 52 for the fraction (significand). See https://en.wikipedia.org/wiki/Double-precision_floating-point_format. The bottom line is that not every real number is exactly representable.
0.1 + 0.1 + 0.1 == 0.3
False
0.3 - 0.2 == 0.1
False
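Because of this, floating-point results are usually compared with a tolerance rather than with ==. A minimal sketch using math.isclose from the standard library with its default relative tolerance:

```python
import math

total = 0.1 + 0.1 + 0.1
diff = 0.3 - 0.2

print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
print(diff == 0.1)               # False
print(math.isclose(diff, 0.1))   # True
```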
What is really going on?
import decimal
decimal.Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
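The 1 + 11 + 52 bit layout can be inspected directly. The sketch below uses the standard struct module to reinterpret the 8 bytes of a float as an integer and then splits it into the three fields; the function name double_fields is made up for illustration.

```python
import struct


def double_fields(value):
    """Split a float into its (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, fraction


for value in (1.0, -2.0, 0.1):
    sign, exponent, fraction = double_fields(value)
    # Print the unbiased exponent and the fraction as a 52-bit binary string.
    print(value, sign, exponent - 1023, f"{fraction:052b}")
```

For 0.1 the fraction is a repeating binary pattern cut off after 52 bits, which is why the stored value differs slightly from one tenth.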
Because the mantissa bits are limited, a double cannot represent a number that is both very big and very precise. A double can represent every integer up to \(2^{53}\) exactly; beyond that, it starts skipping integers.
2.1**53 +1 == 2.1**53
# Find a number larger than 2 to the 53rd
True
x = 2.1**53
for i in range(1000000):
    x = x + 1
x == 2.1**53
True
We add 1 to x one million times, yet it still equals its initial value, 2.1**53. At this magnitude the gap between adjacent representable doubles is much larger than 1, so each addition of 1 is rounded back to the same value.
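The size of the gap between adjacent doubles can be queried with math.ulp, available in the standard library since Python 3.9. A small sketch at the boundary \(2^{53}\) and at the value used above:

```python
import math

big = 2.0 ** 53
print(big - 1 + 1 == big)  # True: integers up to 2**53 are exact
print(big + 1 == big)      # True: 2**53 + 1 is not representable
print(big + 2 == big)      # False: 2**53 + 2 is representable
print(math.ulp(big))       # 2.0

x = 2.1 ** 53
gap = math.ulp(x)          # spacing between x and the next larger double
print(gap)                 # 16.0
print(x + 1 == x)          # True: 1 is less than half the gap
print(x + gap == x)        # False
```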
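Machine epsilon is the gap between 1 and the next larger representable floating-point number. It is often described, loosely, as the smallest positive floating-point number x such that 1 + x != 1.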
print(np.finfo(float).eps)
print(np.finfo(np.float32).eps)
2.220446049250313e-16
1.1920929e-07
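The double-precision value can be recovered without NumPy by repeated halving. This is a sketch under the assumption that Python floats are IEEE 754 doubles, which holds on common platforms; the result is compared with sys.float_info.epsilon from the standard library.

```python
import sys

eps = 1.0
# Halve eps until adding half of it to 1.0 no longer changes the result.
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)  # True
print(1.0 + eps != 1.0)               # True
print(1.0 + eps / 2 == 1.0)           # True
```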
5.5. Data Importation
NYC Open Data is a great resource of open data. One specific dataset of interest is Motor Vehicle Collisions - Crashes.
The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police-reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage.
The data is big. I only downloaded the data from January 1 to 25, 2022.
import pandas as pd
nyc_crash = pd.read_csv("../data/nyc_mv_collisions_202201.csv")
nyc_crash.head(10)
(index) | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01/01/2022 | 7:05 | NaN | NaN | NaN | NaN | NaN | EAST 128 STREET | 3 AVENUE BRIDGE | NaN | ... | NaN | NaN | NaN | NaN | 4491172 | Sedan | NaN | NaN | NaN | NaN |
1 | 01/01/2022 | 14:43 | NaN | NaN | 40.769993 | -73.915825 | (40.769993, -73.915825) | GRAND CENTRAL PKWY | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4491406 | Sedan | Sedan | NaN | NaN | NaN |
2 | 01/01/2022 | 21:20 | QUEENS | 11414.0 | 40.657230 | -73.841380 | (40.65723, -73.84138) | 91 STREET | 160 AVENUE | NaN | ... | NaN | NaN | NaN | NaN | 4491466 | Sedan | NaN | NaN | NaN | NaN |
3 | 01/01/2022 | 4:30 | NaN | NaN | NaN | NaN | NaN | Southern parkway | Jfk expressway | NaN | ... | Unspecified | NaN | NaN | NaN | 4491626 | Sedan | Sedan | NaN | NaN | NaN |
4 | 01/01/2022 | 7:57 | NaN | NaN | NaN | NaN | NaN | WESTCHESTER AVENUE | SHERIDAN EXPRESSWAY | NaN | ... | NaN | NaN | NaN | NaN | 4491734 | Sedan | NaN | NaN | NaN | NaN |
5 | 01/01/2022 | 13:07 | QUEENS | 11373.0 | 40.742737 | -73.876430 | (40.742737, -73.87643) | NaN | NaN | 89-22 43 AVENUE | ... | Unspecified | Unspecified | NaN | NaN | 4491843 | Sedan | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN |
6 | 01/01/2022 | 14:33 | NaN | NaN | 40.759945 | -73.838700 | (40.759945, -73.8387) | VAN WYCK EXPWY | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4491841 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
7 | 01/01/2022 | 6:00 | BROOKLYN | 11222.0 | 40.723910 | -73.948845 | (40.72391, -73.948845) | NaN | NaN | 132 ECKFORD STREET | ... | Unspecified | NaN | NaN | NaN | 4491833 | Sedan | NaN | NaN | NaN | NaN |
8 | 01/01/2022 | 5:17 | NaN | NaN | 40.746930 | -73.848660 | (40.74693, -73.84866) | GRAND CENTRAL PKWY | NaN | NaN | ... | Unsafe Lane Changing | NaN | NaN | NaN | 4491857 | Sedan | Sedan | NaN | NaN | NaN |
9 | 01/01/2022 | 1:30 | NaN | NaN | 40.819157 | -73.960380 | (40.819157, -73.96038) | HENRY HUDSON PARKWAY | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4491344 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
10 rows × 29 columns
There are 29 variables.
nyc_crash.columns
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
dtype='object')
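In the preview above, ZIP CODE appears as 11414.0 because a numeric column with missing values is read as floating point by default. A minimal sketch of one way around this, assuming ZIP codes should be kept as text. It reads a two-row in-memory sample copied from the preview (reduced to four columns) rather than the actual file, and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Two records from the preview above, reduced to four columns.
sample = io.StringIO(
    "CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE\n"
    "01/01/2022,7:05,,\n"
    "01/01/2022,21:20,QUEENS,11414\n"
)

# Read ZIP CODE as text so that present values are not turned into floats.
crash = pd.read_csv(sample, dtype={"ZIP CODE": str})

print(crash["ZIP CODE"].tolist())
print(crash.isna().sum())  # number of missing values per column
```

The same dtype argument can be passed when reading the full CSV file.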