You have programmed in Python. Regardless of your skill level, let us refresh some basics.
An __init__.py file is what makes a directory be interpreted as a package. See, for example, how to build a Python library.
Python has an extensive standard library that offers a wide range of facilities, as indicated by the long table of contents of its online documentation.
The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs.
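As a quick taste, here is an illustrative sketch (the modules chosen are arbitrary examples):
import os, json, random
os.getcwd()                      # ask the operating system for the working directory
json.dumps({"a": 1})             # serialize a dict to a JSON string
random.choice(["x", "y", "z"])   # draw one element pseudo-randomly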
Question: How to get the constant \(e\) to an arbitrary precision?
The constant is stored only in double precision.
import math
print("%0.20f" % math.e)
print("%0.80f" % math.e)2.71828182845904509080
2.71828182845904509079559829842764884233474731445312500000000000000000000000000000
Now use the decimal package to compute it to an arbitrary precision.
import decimal  # arbitrary-precision decimal arithmetic
# set the required number of digits to 150
decimal.getcontext().prec = 150
decimal.Decimal(1).exp().to_eng_string()
decimal.Decimal(1).exp().to_eng_string()[2:]
# print(decimal.Decimal(1).exp())
'71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526'
Question: How to draw a random sample from a normal distribution and evaluate the density and distribution functions at these points?
from scipy.stats import norm
mu, sigma = 2, 4
mean, var, skew, kurt = norm.stats(mu, sigma, moments='mvsk')
print(mean, var, skew, kurt)
x = norm.rvs(loc = mu, scale = sigma, size = 10)
x
# help(norm.stats)
2.0 16.0 0.0 0.0
array([ 4.36370223, 3.15427822, -11.42191397, 6.32565215,
3.25682451, 1.60895864, -2.01842932, 4.31944225,
-3.9236722 , -3.09416904])
The pdf and cdf can be evaluated:
norm.pdf(x, loc = mu, scale = sigma)
# norm.cdf(x, loc = mu, scale = sigma)
array([0.08375751, 0.09566822, 0.00035805, 0.05557844, 0.09493189,
       0.09926012, 0.06021397, 0.0843018 , 0.03331352, 0.04432582])
Consider the Fibonacci Sequence \(1, 1, 2, 3, 5, 8, 13, 21, 34, ...\). The next number is found by adding up the two numbers before it. We are going to solve the problem in four ways.
The first is a recursive solution.
def fib_rs(n):
if (n==1 or n==2):
return 1
else:
return fib_rs(n - 1) + fib_rs(n - 2)
%timeit fib_rs(10)
# import timeit
# t = timeit.timeit(lambda: fib_rs(10), number=10000)
# print(t, "seconds total")
# print(t/10000, "seconds per call")
3.7 μs ± 61 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The second uses dynamic programming memoization.
def fib_dm_helper(n, mem):
if mem[n] is not None:
return mem[n]
elif (n == 1 or n == 2):
result = 1
else:
result = fib_dm_helper(n - 1, mem) + fib_dm_helper(n - 2, mem)
mem[n] = result
return result
def fib_dm(n):
mem = [None] * (n + 1)
return fib_dm_helper(n, mem)
%timeit fib_dm(10)
865 ns ± 22.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
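As an aside, the standard library offers functools.lru_cache, which memoizes a function automatically; here is a minimal sketch of the same memoized Fibonacci (fib_lru is an illustrative name):
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every previously computed value
def fib_lru(n):
    if n <= 2:
        return 1
    return fib_lru(n - 1) + fib_lru(n - 2)

fib_lru(10)  # 55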
The third is still dynamic programming but bottom-up.
def fib_dbu(n):
mem = [None] * (n + 1)
    mem[1] = 1
    mem[2] = 1
for i in range(3, n + 1):
mem[i] = mem[i - 1] + mem[i - 2]
return mem[n]
%timeit fib_dbu(500)
22.8 μs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The fourth method: If we only need fib(n) (not all previous values), we can reduce memory usage by storing only the last two values.
def fib_dbu_m(n):
if n <= 2:
return 1
a, b = 1, 1
for _ in range(3, n + 1):
a, b = b, a + b
return b
%timeit fib_dbu_m(500)
12.3 μs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Clearly, the solutions have very different performance for larger n. We could more rigorously analyze the computational complexity and memory usage of each implementation.
Most Python/R programs are built from three core control structures:
sequence (statements run one after another)
branching with if / elif / else
looping with for or while
These show up everywhere in data science coding for data cleaning, simulation, feature engineering, model evaluation, etc.
In a sequence, the program executes statements in order. Later lines can use variables created earlier.
# A simple "data pipeline" in sequence: define -> compute -> report
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean_x = sum(x) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1) # sample variance
sd_x = var_x ** 0.5
mean_x, sd_x
(5.0, 2.138089935299395)
A helpful habit: write code as a readable sequence of steps, with meaningful variable names.
Branching lets the program make decisions based on conditions.
def letter_grade(score):
if score >= 93:
return "A"
elif score >= 90:
return "A-"
elif score >= 87:
return "B+"
elif score >= 83:
return "B"
elif score >= 80:
return "B-"
else:
return "below B-"
for s in [97, 91, 88, 81, 72]:
print(s, "->", letter_grade(s))97 -> A
91 -> A-
88 -> B+
81 -> B-
72 -> below B-
Common patterns:
comparison operators <=, <, >=, >
boolean operators and, or, not
membership test in
x = 7
(x > 0) and (x < 10), (x in [1, 3, 5, 7])
(True, True)
Loops repeat a block of code.
for loops (iterate over a collection):
values = [3, 1, 4, 1, 5]
total = 0
for v in values:
total += v
total
14
Often you want both index and value:
for i, v in enumerate(values):
    print(i, v)
0 3
1 1
2 4
3 1
4 5
while loops (repeat until a condition becomes false):
Use while when you don’t know in advance how many steps you need.
# Keep doubling until we reach 1000
x = 3
steps = 0
while x < 1000:
x *= 2
steps += 1
x, steps
(1536, 9)
break and continue:
break exits the loop early; continue skips to the next iteration.
nums = [2, 7, -1, 5, 0, 9]
for n in nums:
if n < 0:
print("Found a negative, stopping.")
break
if n == 0:
continue
print("Reciprocal:", 1 / n)Reciprocal: 0.5
Reciprocal: 0.14285714285714285
Found a negative, stopping.
Pythonic code often avoids manual loops for simple transformations:
values = [3, 1, 4, 1, 5]
squares = [v**2 for v in values]
big_only = [v for v in values if v >= 4]
squares, big_only
([9, 1, 16, 1, 25], [4, 5])
import numpy as np
# Square a large vector: Python loop vs NumPy vectorization
def square_loop(x_list):
out = []
for v in x_list:
out.append(v * v)
return out
def square_vectorized(x_np):
return x_np * x_np
n = 2_000_000
x_list = list(range(n))
x_np = np.arange(n)
# Timing
%timeit square_loop(x_list)
%timeit square_vectorized(x_np)
49.6 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
540 μs ± 1.73 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Filtering: Python loop vs NumPy boolean masking
def filter_loop(x_list, threshold):
out = []
for v in x_list:
if v >= threshold:
out.append(v)
return out
def filter_vectorized(x_np, threshold):
return x_np[x_np >= threshold]
threshold = n // 2
%timeit filter_loop(x_list, threshold)
%timeit filter_vectorized(x_np, threshold)
29.1 ms ± 378 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
613 μs ± 5.24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In real data science workflows, libraries like NumPy and pandas provide vectorized operations that are faster than Python-level loops. Still, loops are essential for simulations, custom logic, and “glue code”.
Beyond sequence, branching, and looping, Python has a few other common “building blocks” that show up frequently in real programs and data workflows.
Exception handling (try / except) lets your code respond to errors gracefully (e.g., messy input data) instead of crashing.
def safe_float(s):
try:
return float(s)
except ValueError:
return np.nan
[safe_float(x) for x in ["3.14", "oops", "2.0"]]
[3.14, nan, 2.0]
Iteration helpers make loops cleaner and less error-prone.
names = ["Ada", "Linus", "Grace"]
scores = [98, 85, 92]
for i, (name, score) in enumerate(zip(names, scores), start=1):
    print(i, name, score)
1 Ada 98
2 Linus 85
3 Grace 92
Context managers (with) manage resources automatically—especially files—so they get closed even if something goes wrong.
with open("example.txt", "w") as f:
f.write("Hello!\n")
with open("example.txt", "r") as f:
text = f.read()
text
'Hello!\n'
Object-oriented structure (classes) groups data and behavior together. You’ll see this in libraries, and occasionally you’ll write simple classes yourself.
class RunningMean:
def __init__(self):
self.total = 0.0
self.n = 0
def update(self, x):
self.total += x
self.n += 1
def value(self):
return self.total / self.n
rm = RunningMean()
for x in [2, 4, 6]:
rm.update(x)
rm.value()
4.0
Here is a function that performs the Monty Hall experiment. In this version, the host opens only one empty door.
import numpy as np
def montyhall(n_doors, n_trials):
doors = np.arange(1, n_doors + 1)
prize = np.random.choice(doors, size=n_trials)
player = np.random.choice(doors, size=n_trials)
host = np.array([np.random.choice([d for d in doors
if d not in [player[x], prize[x]]])
for x in range(n_trials)])
player2 = np.array([np.random.choice([d for d in doors
if d not in [player[x], host[x]]])
for x in range(n_trials)])
return {'noswitch': np.sum(prize == player),
            'switch': np.sum(prize == player2)}
Test it out with 3 doors.
montyhall(3, 1000)
{'noswitch': 328, 'switch': 672}
Then with 4 doors.
montyhall(4, 1000)
{'noswitch': 245, 'switch': 360}
The true winning probabilities for the two strategies with \(n\) doors are, respectively, \(1 / n\) and \(\frac{n - 1}{n (n - 2)}\).
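As a quick sanity check (a sketch added here, not part of the original), compare the simulated frequencies against these formulas:
# Simulated frequencies vs. theoretical probabilities
for n in [3, 4]:
    res = montyhall(n, 100_000)
    print(n, res['noswitch'] / 100_000, 1 / n,
          res['switch'] / 100_000, (n - 1) / (n * (n - 2)))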
The next version avoids explicit Python loops by vectorizing the simulation with NumPy.
import numpy as np
from typing import Dict, Optional
def montyhall_fast(
n_doors: int,
n_trials: int,
seed: Optional[int] = None
) -> Dict[str, int]:
"""
Run a Monty Hall simulation with `n_doors` doors and `n_trials` repetitions.
The host always opens exactly one empty door that is neither the prize door
nor the player's initial choice.
Parameters
----------
n_doors : int
Total number of doors in the game (must be >= 3).
n_trials : int
Number of independent trials to simulate.
seed : int, optional
Random seed for reproducibility.
Returns
-------
Dict[str, int]
A dictionary with counts of wins under two strategies:
- 'noswitch': staying with the initial choice
- 'switch' : switching after the host reveals one empty door
Examples
--------
>>> results = montyhall_fast(3, 100000, seed=42)
>>> results
{'noswitch': 33120, 'switch': 66880}
"""
# Create a random number generator
rng = np.random.default_rng(seed)
# Initial random assignments
# Generate prize location and player choice for all trials
prize = rng.integers(n_doors, size=n_trials, dtype=np.int32)
player = rng.integers(n_doors, size=n_trials, dtype=np.int32)
# Allocate array for the host's opened door
host = np.empty(n_trials, dtype=np.int32)
# Case 1: player == prize → host excludes only 'player'
# 'same' is a boolean array: same[i] = True if the player
# initially picked the prize in trial i.
same = prize == player
if np.any(same):
k = rng.integers(n_doors - 1, size=np.sum(same), dtype=np.int32)
p = player[same]
# If k is before p, keep it.
# If k is at/after p, shift it up by 1 to “jump over” p.
host[same] = k + (k >= p)
# Case 2: player != prize → host excludes 'player' and 'prize'
diff = ~same
if np.any(diff):
a = np.minimum(player[diff], prize[diff])
b = np.maximum(player[diff], prize[diff])
k = rng.integers(n_doors - 2, size=np.sum(diff), dtype=np.int32)
host[diff] = k + (k >= a) + (k >= (b-1))
# Player switches: exclude 'player' and 'host'
a2 = np.minimum(player, host)
b2 = np.maximum(player, host)
k2 = rng.integers(n_doors - 2, size=n_trials, dtype=np.int32)
player2 = k2 + (k2 >= a2) + (k2 >= (b2-1))
return {
"noswitch": int((prize == player).sum()),
"switch": int((prize == player2).sum()),
    }
Another, faster version uses Numba, a just-in-time (JIT) compiler for Python, which translates a subset of Python and NumPy into fast machine code via LLVM at runtime. A decorator is added to the numeric functions and, after a one-time compile on the first call (“warm-up”), later calls run much faster. One can toggle parallel=True for multi-core speedups.
from typing import Dict, Optional
import numpy as np
from numba import njit, prange
# --- internal helpers (compiled) ---
@njit
def _seed_numba(seed: int) -> None:
# Seed the RNG used inside Numba-compiled code
np.random.seed(seed)
@njit
def _trial_once(n_doors: int) -> (int, int):
"""
Run one Monty Hall trial (host opens exactly one empty door).
Returns (noswitch_win, switch_win) as 0/1 ints.
"""
prize = np.random.randint(0, n_doors)
player = np.random.randint(0, n_doors)
# host picks an empty door not equal to prize or player
if player == prize:
# exclude only 'player' (size n_doors-1)
k = np.random.randint(0, n_doors - 1)
host = k + (1 if k >= player else 0)
else:
# exclude 'player' and 'prize' (size n_doors-2)
a = player if player < prize else prize
b = prize if prize > player else player
k = np.random.randint(0, n_doors - 2)
host = k + (1 if k >= a else 0) + (1 if k >= (b-1) else 0)
# player switches: choose uniformly from doors != player and != host
a2 = player if player < host else host
b2 = host if host > player else player
k2 = np.random.randint(0, n_doors - 2)
player2 = k2 + (1 if k2 >= a2 else 0) + (1 if k2 >= (b2-1) else 0)
noswitch_win = 1 if prize == player else 0
switch_win = 1 if prize == player2 else 0
return noswitch_win, switch_win
@njit(parallel=True)
def _run_parallel(n_doors: int, n_trials: int) -> (int, int):
ns = np.zeros(n_trials, dtype=np.int64)
sw = np.zeros(n_trials, dtype=np.int64)
for i in prange(n_trials):
a, b = _trial_once(n_doors)
ns[i] = a
sw[i] = b
return int(ns.sum()), int(sw.sum())
@njit
def _run_serial(n_doors: int, n_trials: int) -> (int, int):
noswitch = 0
switch = 0
for _ in range(n_trials):
a, b = _trial_once(n_doors)
noswitch += a
switch += b
return noswitch, switch
# --- public API ---
def montyhall_numba(
n_doors: int,
n_trials: int,
seed: Optional[int] = None,
parallel: bool = True,
) -> Dict[str, int]:
"""
Monty Hall (host opens one empty door) using Numba-compiled loops.
Parameters
----------
n_doors : int
Number of doors (>= 3).
n_trials : int
Number of trials to simulate.
seed : int, optional
Seed for reproducibility.
parallel : bool, default True
Use a parallel loop over trials (multi-core).
Returns
-------
Dict[str, int]
{'noswitch': ..., 'switch': ...}
"""
if n_doors < 3:
raise ValueError("n_doors must be >= 3")
if seed is not None:
_seed_numba(int(seed)) # seed the compiled RNG
ns, sw = (_run_parallel(n_doors, n_trials)
if parallel else
_run_serial(n_doors, n_trials))
return {"noswitch": ns, "switch": sw}It wins when n_trials is very large (e.g., 10–100M) where loop overhead is amortized and you give it many threads, or the design avoids massive temporaries (not much in this example).
Let’s see their time comparison.
import timeit
# --- Timing function ---
def benchmark(n_doors: int = 3, n_trials: int = 100_000, seed: int = 42) -> None:
print(f"n_doors={n_doors}, n_trials={n_trials}")
# Ensure deterministic where possible
np.random.seed(seed)
# Time original
t_orig = min(timeit.repeat(lambda: montyhall(n_doors, n_trials),
repeat=3, number=1))
print(f"Original (lists): {t_orig:.4f}s")
# Time vectorized
t_vec = min(timeit.repeat(lambda: montyhall_fast(n_doors, n_trials, seed),
repeat=3, number=1))
print(f"Vectorized NumPy: {t_vec:.4f}s")
# Time Numba serial (compile excluded by earlier warm-up)
t_numba_ser = min(timeit.repeat(lambda: montyhall_numba(n_doors, n_trials,
seed, parallel=False), repeat=3, number=1))
print(f"Numba serial: {t_numba_ser:.4f}s")
# Time Numba parallel
t_numba_par = min(timeit.repeat(lambda: montyhall_numba(n_doors, n_trials,
seed, parallel=True), repeat=3, number=1))
print(f"Numba parallel: {t_numba_par:.4f}s")
benchmark(n_doors = 4, n_trials = 1_000_000)
n_doors=4, n_trials=1000000
Original (lists): 7.3364s
Vectorized NumPy: 0.0287s
Numba serial: 0.0172s
Numba parallel: 0.0021s
In Python, variables and the objects they point to actually live in two different places in the computer memory. Think of variables as pointers to the objects they’re associated with, rather than being those objects. This matters when multiple variables point to the same object.
x = [1, 2, 3] # create a list; x points to the list
y = x # y also points to the same list in the memory
y.append(4) # append to y
x # x changed!
[1, 2, 3, 4]
Now check their addresses
print(id(x)) # address of x
print(id(y)) # address of y
5326254656
5326254656
Nonetheless, some data types in Python are “immutable”, meaning that their values cannot be changed in place. One such example is strings.
x = "abc"
y = x
y = "xyz"
x
'abc'
Now check their addresses
print(id(x)) # address of x
print(id(y)) # address of y
4347122960
4893014704
Question: What’s mutable and what’s immutable?
Anything that is a collection of other objects is mutable, except tuples (and strings, which are immutable sequences of characters).
Not every manipulation of a mutable object changes it in place; sometimes you get back a new object instead. Manipulations that change an existing object, rather than create a new one, are referred to as “in-place mutations” or just “mutations.”
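For example, list.sort() mutates a list in place, while the built-in sorted() returns a new list and leaves the original untouched:
x = [3, 1, 2]
y = sorted(x)  # a new list; x is unchanged
x.sort()       # in-place mutation; x itself is reordered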
The fact that different variables may all point to the same object is preserved through function calls (a behavior known as “pass by object reference”). So if you pass a list to a function, and that function manipulates the list using an in-place mutation, the change will affect every variable that points to that same object outside the function.
x = [1, 2, 3]
y = x
def append_42(input_list):
input_list.append(42)
return input_list
append_42(x)
[1, 2, 3, 42]
Note that \(42\) has been appended to the single list that both x and y point to.
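If this aliasing is unwanted, pass a copy instead (a minimal sketch; copy.deepcopy would be needed for nested structures):
x = [1, 2, 3]
append_42(x.copy())  # the function mutates a shallow copy
x                    # still [1, 2, 3]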
Numbers in a computer’s memory are represented in binary (bits that are on or off).
If not careful, it is easy to be bitten by integer overflow when using NumPy and Pandas in Python.
import numpy as np
x = np.array(2 ** 63 - 1 , dtype = 'int')
x
# This should be the largest number numpy can store, with
# the default int64 type (64 bits)
array(9223372036854775807)
Note: on Windows and other platforms, dtype = 'int' may have to be changed to dtype = np.int64 for the code to execute. Source: Stack Overflow.
What if we increment it by 1?
y = np.array(x + 1, dtype = 'int')
y
# Because of the overflow, it becomes negative!
array(-9223372036854775808)
For vanilla Python integers, overflow is checked and more digits are allocated when needed, at the cost of being slow.
2 ** 63 * 1000
9223372036854775808000
This number is 1000 times larger than the prior one, yet it is displayed exactly, without any overflow.
A standard double-precision floating-point number uses 64 bits: 1 for the sign, 11 for the exponent, and 52 for the fraction (significand). See https://en.wikipedia.org/wiki/Double-precision_floating-point_format. The bottom line is that, of course, not every real number is exactly representable.
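To see the layout concretely, here is a sketch that unpacks the three bit fields of a double using the standard struct module (the value -6.25 is an arbitrary example):
import struct
# Reinterpret the 8 bytes of a double as a 64-bit unsigned integer
bits = struct.unpack(">Q", struct.pack(">d", -6.25))[0]
sign = bits >> 63                   # 1 sign bit
exponent = (bits >> 52) & 0x7FF     # 11 exponent bits, biased by 1023
fraction = bits & ((1 << 52) - 1)   # 52 fraction bits
sign, exponent - 1023, bin(fraction)  # (1, 2, '0b1001...'): -1.5625 * 2 ** 2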
If you have played the Game 24, here is a tricky one:
8 / (3 - 8 / 3) == 24
False
Surprise?
There are more.
0.1 + 0.1 + 0.1 == 0.3
False
0.3 - 0.2 == 0.1
False
What is really going on?
import decimal
decimal.Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
decimal.Decimal(8 / (3 - 8 / 3))
Decimal('23.999999999999989341858963598497211933135986328125')
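A practical consequence: compare floats with a tolerance rather than with ==, for instance with math.isclose from the standard library:
import math
math.isclose(0.1 + 0.1 + 0.1, 0.3)  # True
math.isclose(8 / (3 - 8 / 3), 24)   # True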
Because the mantissa bits are limited, a floating-point number cannot be both very large and very precise. Most computers can represent all integers up to \(2^{53}\); after that, the format starts skipping numbers.
2.1 ** 53 + 1 == 2.1 ** 53
# Find a number larger than 2 to the 53rd
True
x = 2.1 ** 53
for i in range(1000000):
x = x + 1
x == 2.1 ** 53
True
We added 1 to x one million times, yet it still equals its initial value, 2.1 ** 53. The number is so large that the computer cannot represent a change as small as adding 1 at this magnitude.
Machine epsilon is the smallest positive floating-point number x such that 1 + x != 1.
print(np.finfo(float).eps)
print(np.finfo(np.float32).eps)
2.220446049250313e-16
1.1920929e-07
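Machine epsilon can also be approximated directly with a textbook halving loop (a sketch; it recovers the float64 value above):
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
eps  # 2.220446049250313e-16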
Virtual environments in Python are essential tools for managing dependencies and ensuring consistency across projects. They allow you to create isolated environments for each project, with its own set of installed packages, separate from the global Python installation. This isolation prevents conflicts between project dependencies and versions, making your projects more reliable and easier to manage. It’s particularly useful when working on multiple projects with differing requirements, or when collaborating with others who may have different setups.
To set up a virtual environment, you first need to ensure that Python is installed on your system. Most modern Python installations come with the venv module, which is used to create virtual environments. Here’s how to set one up:
python3 -m venv myenv, where myenv is the name of the virtual environment to be created. Choose an informative name. This command creates a new directory named myenv (or your chosen name) in your project directory, containing the virtual environment.
To start using this environment, you need to activate it. The activation command varies depending on your operating system:
On Windows: myenv\Scripts\activate. On macOS/Linux: source myenv/bin/activate or . myenv/bin/activate. Once activated, your command line will typically show the name of the virtual environment, and you can then install and use packages within this isolated environment without affecting your global Python setup.
To exit the virtual environment, simply type deactivate in your command line. This will return you to your system’s global Python environment.
As an example, let’s install a package, like numpy, in this newly created virtual environment:
pip install numpy. This command installs the numpy library in your virtual environment. You can verify the installation by running pip list, which should show numpy along with its version.
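To make the environment reproducible, a common practice is to record the exact package versions with pip freeze > requirements.txt and later recreate them elsewhere with pip install -r requirements.txt.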