Your computer has an operating system (OS), which is responsible for managing the software packages on your computer. Each operating system has its own package management system. For example:
Linux: Linux distributions have a variety of package managers depending on the distribution. For instance, Ubuntu uses APT (Advanced Package Tool), Fedora uses DNF (Dandified Yum), and Arch Linux uses Pacman. These package managers are integral to the Linux experience, allowing users to install, update, and manage software packages easily from repositories.
macOS: macOS uses Homebrew as its primary package manager. Homebrew simplifies the installation of software and tools that aren’t included in the standard macOS installation, using simple commands in the terminal.
Windows: Windows users often rely on the Microsoft Store for apps and software. For more developer-focused package management, tools like Chocolatey and Windows Package Manager (Winget) are used. Additionally, recent versions of Windows have introduced the Windows Subsystem for Linux (WSL). WSL allows Windows users to run a Linux environment directly on Windows, unifying Windows and Linux applications and tools. This is particularly useful for developers and data scientists who need to run Linux-specific software or scripts, and it saves Windows users much of the trouble they faced before WSL existed.
Understanding the package management system of your operating system is crucial for effectively managing and installing software, especially for data science tools and applications.
A file system is a fundamental aspect of a computer’s operating system, responsible for managing how data is stored and retrieved on a storage device, such as a hard drive, SSD, or USB flash drive. Essentially, it provides a way for the OS and users to organize and keep track of files. Different operating systems typically use different file systems. For instance, NTFS and FAT32 are common in Windows, APFS and HFS+ in macOS, and Ext4 in many Linux distributions. Each file system has its own set of rules for controlling the allocation of space on the drive and the naming, storage, and access of files, which impacts performance, security, and compatibility. Understanding file systems is crucial for tasks such as data recovery, disk partitioning, and managing file permissions, making it an important concept for anyone working with computers, especially in data science and IT fields.
Navigating through folders in the command line, especially in Unix-like environments such as Linux, macOS, and the Windows Subsystem for Linux (WSL), is an essential skill for effective file management. The command cd (change directory) is central to this process. To move into a specific directory, you use cd followed by the directory name, like cd Documents. To go up one level in the directory hierarchy, you use cd .. (two dots). To return to the home directory, simply typing cd or cd ~ will suffice. The ls command lists all files and folders in the current directory, providing a clear view of your options for navigation. Mastering these commands, along with others like pwd (print working directory), which displays your current directory, equips you with the basics of moving around the file system in the command line, an indispensable skill for a wide range of computing tasks in Unix-like systems.
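The same navigation can also be done from inside Python; here is a minimal sketch using the standard os module (the Documents folder is just an illustrative assumption):
import os

print(os.getcwd())                  # like pwd: show the current working directory
print(os.listdir("."))              # like ls: list files and folders here
os.chdir(os.path.expanduser("~"))   # like cd or cd ~: go to the home directory
os.chdir("Documents")               # like cd Documents (assumes such a folder exists)
os.chdir("..")                      # like cd ..: go back up one level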
You have programmed in Python. Regardless of your skill level, let us do some refreshing.
A directory is interpreted as a Python package when it contains an __init__.py file. See, for example, how to build a Python library.
Python has an extensive standard library that offers a wide range of facilities, as indicated by the long table of contents of its documentation. See the documentation online.
The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs.
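As a small illustration, a few standard library modules can be used right away without installing anything; this sketch just prints some basic information with os, math, and json:
import os, math, json

print(os.name)                       # which operating system family Python is running on
print(math.sqrt(2))                  # mathematical functions from the math module
print(json.dumps({"a": 1, "b": 2}))  # serialize a dictionary to a JSON string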
Question: How to get the constant \(e\) to an arbitrary precision?
By default, the constant is only represented in double precision.
import math
print("%0.20f" % math.e)
print("%0.80f" % math.e)
2.71828182845904509080
2.71828182845904509079559829842764884233474731445312500000000000000000000000000000
Now use the package decimal to obtain it with arbitrary precision.
import decimal                     # arbitrary-precision decimal arithmetic
## set the required number of digits to 150
decimal.getcontext().prec = 150
decimal.Decimal(1).exp().to_eng_string()
decimal.Decimal(1).exp().to_eng_string()[2:]
'71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526'
Question: how to draw a random sample from a normal distribution and evaluate the density and distribution functions at these points?
from scipy.stats import norm

mu, sigma = 2, 4
mean, var, skew, kurt = norm.stats(mu, sigma, moments='mvsk')
print(mean, var, skew, kurt)
x = norm.rvs(loc = mu, scale = sigma, size = 10)
x
2.0 16.0 0.0 0.0
array([1.00445568, 9.95254072, 3.05981058, 6.41210179, 3.48799882,
7.85509556, 4.40049181, 9.26293574, 3.78529417, 1.66036129])
The pdf and cdf can be evaluated:
norm.pdf(x, loc = mu, scale = sigma)
array([0.09669389, 0.0138209 , 0.09629558, 0.05428184, 0.09306801,
0.03416512, 0.0833 , 0.01918402, 0.09028037, 0.09937669])
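The cdf can be evaluated in the same way, and its inverse ppf recovers the quantiles; a minimal sketch reusing x, mu, and sigma from the block above:
p = norm.cdf(x, loc = mu, scale = sigma)   # distribution function at the sampled points
norm.ppf(p, loc = mu, scale = sigma)       # inverse cdf; recovers x up to rounding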
Consider the Fibonacci sequence \(1, 1, 2, 3, 5, 8, 13, 21, 34, ...\). The next number is found by adding up the two numbers before it. We are going to use three ways to solve the problem.
The first is a recursive solution.
def fib_rs(n):
    if (n == 1 or n == 2):
        return 1
    else:
        return fib_rs(n - 1) + fib_rs(n - 2)
%timeit fib_rs(10)
7.12 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The second uses dynamic programming with memoization.
def fib_dm_helper(n, mem):
    if mem[n] is not None:
        return mem[n]
    elif (n == 1 or n == 2):
        result = 1
    else:
        result = fib_dm_helper(n - 1, mem) + fib_dm_helper(n - 2, mem)
    mem[n] = result
    return result

def fib_dm(n):
    mem = [None] * (n + 1)
    return fib_dm_helper(n, mem)
%timeit fib_dm(10)
2 µs ± 283 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
The third is still dynamic programming but bottom-up.
def fib_dbu(n):
    mem = [None] * (n + 1)
    mem[1] = 1
    mem[2] = 1
    for i in range(3, n + 1):
        mem[i] = mem[i - 1] + mem[i - 2]
    return mem[n]
%timeit fib_dbu(500)
64.8 µs ± 3.8 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Apparently, the three solutions have very different performance for larger n.
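To see this without the %timeit magic, here is a sketch using the standard timeit module; n = 30 is used because the recursive version would be far too slow for n = 500:
import timeit

# time each implementation 10 times on the same input
print(timeit.timeit("fib_rs(30)", globals=globals(), number=10))
print(timeit.timeit("fib_dm(30)", globals=globals(), number=10))
print(timeit.timeit("fib_dbu(30)", globals=globals(), number=10))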
Here is a function that performs the Monty Hall experiments.
import numpy as np
def montyhall(ndoors, ntrials):
    doors = np.arange(1, ndoors + 1) / 10
    prize = np.random.choice(doors, size=ntrials)    # door hiding the prize
    player = np.random.choice(doors, size=ntrials)   # player's initial pick
    host = np.array([np.random.choice([d for d in doors
                                       if d not in [player[x], prize[x]]])
                     for x in range(ntrials)])       # host opens a door with no prize
    player2 = np.array([np.random.choice([d for d in doors
                                          if d not in [player[x], host[x]]])
                        for x in range(ntrials)])    # player's pick after switching
    return {'noswitch': np.sum(prize == player), 'switch': np.sum(prize == player2)}
Test it out:
montyhall(3, 1000)
montyhall(4, 1000)
{'noswitch': 270, 'switch': 371}
The true values for the two strategies with \(n\) doors are, respectively, \(1 / n\) and \(\frac{n - 1}{n (n - 2)}\).
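As a quick sanity check, here is a sketch comparing the simulated frequencies with these theoretical values, using the montyhall function defined above (the specific n and ntrials are arbitrary choices):
n, ntrials = 4, 1000
result = montyhall(n, ntrials)
print(result['noswitch'] / ntrials, 1 / n)                  # no-switch: simulated vs 1/n
print(result['switch'] / ntrials, (n - 1) / (n * (n - 2)))  # switch: simulated vs (n-1)/(n(n-2))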
In Python, variables and the objects they point to actually live in two different places in the computer memory. Think of variables as pointers to the objects they’re associated with, rather than being those objects. This matters when multiple variables point to the same object.
x = [1, 2, 3]    # create a list; x points to the list
y = x            # y also points to the same list in the memory
y.append(4)      # append to y
x                # x changed!
[1, 2, 3, 4]
Now check their addresses:
print(id(x)) # address of x
print(id(y)) # address of y
4996899392
4996899392
Nonetheless, some data types in Python are “immutable”, meaning that their values cannot be changed in place. One such example is strings.
= "abc"
x = x
y = "xyz"
y x
'abc'
Now check their addresses:
print(id(x)) # address of x
print(id(y)) # address of y
4505841912
4616240416
Question: What’s mutable and what’s immutable?
Anything that is a collection of other objects is mutable, except tuples.
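For instance, assigning to an element of a list succeeds, while the same operation on a tuple raises an error; a minimal sketch:
x = [1, 2, 3]
x[0] = 99        # fine: lists are mutable

t = (1, 2, 3)
try:
    t[0] = 99    # tuples are immutable, so this raises TypeError
except TypeError as e:
    print(e)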
Not every manipulation of a mutable object changes it in place; sometimes an operation returns a new object instead. Manipulations that change an existing object, rather than create a new one, are referred to as “in-place mutations” or just “mutations.” So:
The fact that different variables may all be pointing at the same object is preserved through function calls (a behavior known as “pass by object-reference”). So if you pass a list to a function, and that function manipulates that list using an in-place mutation, that change will affect any variable that was pointing to that same object outside the function.
x = [1, 2, 3]
y = x

def append_42(input_list):
    input_list.append(42)
    return input_list

append_42(x)
[1, 2, 3, 42]
Note that both x and y have had \(42\) appended.
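Contrast this with a function that rebinds its argument to a new list instead of mutating it in place; a sketch with a hypothetical replace_with_42 helper:
x = [1, 2, 3]
y = x

def replace_with_42(input_list):
    input_list = input_list + [42]   # creates a new list; the caller's list is untouched
    return input_list

replace_with_42(x)
print(x)   # still [1, 2, 3]
print(y)   # still [1, 2, 3]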
Numbers in a computer’s memory are represented in binary (patterns of on and off bits).
If not careful, it is easy to be bitten by integer overflow when using NumPy and Pandas in Python.
import numpy as np
x = np.array(2 ** 63 - 1, dtype = 'int')
# This should be the largest number numpy can display, with
# the default int64 type (64 bits)
x
array(9223372036854775807)
Note: on Windows and other platforms, dtype = 'int' may have to be changed to dtype = np.int64 for the code to execute. Source: Stackoverflow.
What if we increment it by 1?
y = np.array(x + 1, dtype = 'int')
# Because of the overflow, it becomes negative!
y
array(-9223372036854775808)
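The exact representable range of each NumPy integer type can be queried with np.iinfo; a minimal sketch:
print(np.iinfo(np.int64).max)   # 9223372036854775807
print(np.iinfo(np.int64).min)   # -9223372036854775808
print(np.iinfo(np.int32).max)   # 2147483647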
For vanilla Python, overflow is checked and more digits are allocated when needed, at the cost of being slow.
2 ** 63 * 1000
9223372036854775808000
This number is 1000 times larger than the prior number, but is still displayed perfectly without any overflow.
A standard double-precision floating-point number uses 64 bits. Among them, 1 bit is for the sign, 11 bits are for the exponent, and 52 bits are for the fraction (significand); see https://en.wikipedia.org/wiki/Double-precision_floating-point_format. The bottom line is that, of course, not every real number is exactly representable.
If you have played the Game 24, here is a tricky one:
8 / (3 - 8 / 3) == 24
False
Surprise?
There are more.
0.1 + 0.1 + 0.1 == 0.3
False
0.3 - 0.2 == 0.1
False
What is really going on?
import decimal
decimal.Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
decimal.Decimal(8 / (3 - 8 / 3))
Decimal('23.999999999999989341858963598497211933135986328125')
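For this reason, floating-point equality tests are usually replaced by approximate comparisons; a sketch using math.isclose (np.isclose is the NumPy counterpart):
import math

print(math.isclose(0.1 + 0.1 + 0.1, 0.3))   # True: equal within a relative tolerance
print(math.isclose(8 / (3 - 8 / 3), 24))    # True
print(np.isclose(0.3 - 0.2, 0.1))           # True; np is numpy as imported earlier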
Because the mantissa bits are limited, a double cannot represent a floating-point number that is both very big and very precise. Most computers can represent all integers up to \(2^{53}\); after that, they start skipping numbers.
# Find a number larger than 2 to the 53rd
2.1 ** 53 + 1 == 2.1 ** 53
True
x = 2.1 ** 53
for i in range(1000000):
    x = x + 1
x == 2.1 ** 53
True
We added 1 to x a million times, but it still equals its initial value, 2.1 ** 53. This number is so large that adding 1 to it falls below the precision the floating-point representation can resolve, so the additions have no effect.
Machine epsilon is the smallest positive floating-point number x such that 1 + x != 1.
print(np.finfo(float).eps)
print(np.finfo(np.float32).eps)
2.220446049250313e-16
1.1920929e-07
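A quick sketch verifying the definition: adding eps to 1 is detectable, while adding half of eps is rounded away:
eps = np.finfo(float).eps
print(1.0 + eps != 1.0)        # True: eps is large enough to change 1.0
print(1.0 + eps / 2 == 1.0)    # True: half of eps is lost to rounding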
Virtual environments in Python are essential tools for managing dependencies and ensuring consistency across projects. They allow you to create isolated environments for each project, with its own set of installed packages, separate from the global Python installation. This isolation prevents conflicts between project dependencies and versions, making your projects more reliable and easier to manage. It’s particularly useful when working on multiple projects with differing requirements, or when collaborating with others who may have different setups.
To set up a virtual environment, you first need to ensure that Python is installed on your system. Most modern Python installations come with the venv module, which is used to create virtual environments. Here’s how to set one up:
Run python3 -m venv myenv, where myenv is the name of the virtual environment to be created; choose an informative name. This command creates a new directory named myenv (or your chosen name) in your project directory, containing the virtual environment.
To start using this environment, you need to activate it. The activation command varies depending on your operating system: on Windows, myenv\Scripts\activate; on macOS and Linux, source myenv/bin/activate or . myenv/bin/activate. Once activated, your command line will typically show the name of the virtual environment, and you can then install and use packages within this isolated environment without affecting your global Python setup.
To exit the virtual environment, simply type deactivate in your command line. This will return you to your system’s global Python environment.
As an example, let’s install a package, like numpy, in this newly created virtual environment: pip install numpy. This command installs the numpy library in your virtual environment. You can verify the installation by running pip list, which should show numpy along with its version.