2 Your Computer as a Tool for Discovery
2.1 Introduction
Many people use computers only through apps, clicking buttons designed by someone else. Data science works differently. Instead of staying inside fixed menus, you will learn to give the computer precise instructions so it can help you ask questions, test ideas, and make new discoveries. Thinking of the computer as a programmable machine opens a new way of working: your results become something you can recreate, improve, and share.
A key step toward this mindset is understanding how a computer organizes information. Every file on your machine lives in a folder, and every folder has a path that tells the computer exactly where it is. Data scientists work directly with these paths because tools such as Python, R, Git, and Quarto all expect you to know where your work lives. When you understand the file system, you can tell your tools exactly which data to use and where to save your results.
Much of data science relies on plain-text files. These include data files like .csv, scripts like .py or .R, and documents like .qmd. Plain text is transparent: you can open it anywhere, track changes, and process it automatically. This clarity is the reason modern analysis avoids mixing computation with formatting. In contrast, spreadsheets hide steps inside cells, and slides require manual updates each time your results change. They are useful for quick checks but cannot support serious analysis where every step must be clear.
This leads to one of the most important ideas in data science: reproducibility. Your future self and anyone who reads your work should be able to start with your raw data and code and arrive at the same results. Reproducibility protects you from accidental mistakes, forgotten steps, and lost work. It turns your analysis into something reliable, explainable, and extendable. As you progress through this book, everything you do will be built with reproducibility in mind.
This chapter sets the foundation for the tools that follow. Once you see your computer as a programmable partner rather than a collection of apps, learning the command line, version control, and reproducible documents becomes natural.
2.2 What’s Inside Your Computer
Before we talk to the computer through the command line, it helps to know what is inside the box. You do not need to become a hardware expert, but a few ideas will make later chapters much easier to understand.
At a high level, every computer used for data science has four key parts:
CPU (Central Processing Unit)
The CPU is the “brain” of the computer. It follows instructions one step at a time and does the general-purpose work in your programs.

RAM (Random-Access Memory)
RAM is the computer’s short-term memory. When you open a dataset in Python or R, it is copied into RAM so the CPU can work with it quickly. If you do not have enough RAM, large projects slow down or crash.

Storage (SSD or hard drive)
Storage is the long-term memory. Your files, photos, code, and datasets live here even after you shut down the computer. Solid-state drives (SSDs) are faster and more reliable than older spinning hard drives.

GPU (Graphics Processing Unit)
The GPU started as a special chip for drawing graphics and games. Modern data science uses GPUs for huge math problems, such as training deep learning models, because they can do many simple calculations in parallel.
You can think of the CPU as a student solving math problems, RAM as the open notebook on the desk, storage as the backpack and bookshelf, and the GPU as a team of helpers who can all work on similar problems at the same time.
For small school projects, almost any modern laptop will work. As you move toward bigger datasets and models, more RAM usually helps more than more CPU cores. If you often run out of memory when loading data, adding RAM or moving to a machine with more RAM makes a big difference. GPUs become important mainly for large deep learning or image-heavy projects.
2.3 Operating Systems and Why Professionals Love Linux
An operating system (OS) is the layer of software that connects your programs to the hardware. It controls files, devices, memory, and how programs run. The three main families you will see in data science are Windows, macOS, and Linux.
Linux
Linux is the standard in professional data science and on almost all servers and cloud machines. It is predictable: nothing major happens behind your back. When you install software or change a setting, it stays that way until you change it again. The command line tools you learn in this book behave the same way on almost every Linux system.

macOS
macOS is based on Unix, like Linux, and includes a built-in terminal that understands almost all the same commands. Many data scientists who use laptops choose macOS because it balances a friendly interface with powerful command line tools.

Windows
Windows makes it easy to start, but it also tries to “help” by hiding file extensions, changing paths, and running background tools that can confuse reproducible work. Different parts of Windows sometimes disagree about how to name files or run commands. These shortcuts can be convenient for everyday use but can teach habits that do not transfer well to Linux or the cloud.
In this book, we will prefer a Unix-style command line everywhere. On macOS and Linux, that means using the built-in Terminal app. On Windows, that means installing and using Git Bash, which brings a Unix-like terminal to Windows so that the same commands work on all three platforms.
2.3.1 Installing Software from the Command Line
Data scientists often install tools using package managers, which are programs that download, install, and update software for you from the command line. This is faster, more repeatable, and easier to document than clicking through many installer windows.
Here are the most common options:
Linux
Each Linux distribution comes with its own package manager: for example, `apt` on Ubuntu, `dnf` on Fedora, and `pacman` on Arch. These tools let you install almost everything you need with a single command.

macOS
On macOS, the most popular package manager is Homebrew. After installing Homebrew once, you can run commands like `brew install git` or `brew install python` to get new tools.

Windows
On Windows 10 and 11, you can use `winget` from the command line to install software. Once Git Bash is installed and set up, you can run commands such as

winget install Python.Python.3.12
winget install Git.Git
winget install RProject.R
winget install Quarto.Quarto

from a terminal and let Windows handle the downloads and installation steps.
Some developers use full Linux environments on Windows through the Windows Subsystem for Linux (WSL) or alternative package managers such as Chocolatey. These are powerful options when you do a lot of development on Windows, but you do not need them for this book.
2.4 How Computers Represent Numbers
Computers are built on binary, a number system that uses only zeros and ones. That design works well for storing whole numbers, but it creates some surprises when we work with decimals.
There are two basic kinds of numbers you will see in data science:
Integers (…, -2, -1, 0, 1, 2, …)
These are whole numbers with no decimal part. Computers can store many integers exactly.

Floating-point numbers (like 0.1, 2.75, or -3.14)
These are used for decimals and measurements. Most real-valued data in science and statistics are stored as floating-point numbers.
Because computers use binary, many simple-looking decimals cannot be stored exactly. For example, the decimal number 0.1 turns into a long repeating pattern in binary. The computer stores a very close approximation instead of the exact value. When you combine many such numbers, the tiny differences can show up as small rounding errors.
You may have seen examples where a language reports that 0.1 + 0.2 is 0.30000000000000004 instead of exactly 0.3. This is not a bug in Python or R. It is a consequence of how floating-point numbers are stored in hardware. Data scientists work with this by rounding results for display and by avoiding direct equality checks with decimals.
The key ideas to remember are:
- Some decimals cannot be represented exactly on a computer.
- Small rounding differences are normal in real-number calculations.
- We usually care about being “close enough” rather than perfectly exact.
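You can see all three ideas in practice by asking Python for the stored values directly from the terminal. A minimal sketch, assuming `python3` is on your PATH:

```shell
# The famous 0.1 + 0.2 example, run from the shell via Python.
python3 -c 'print(0.1 + 0.2)'                                   # prints 0.30000000000000004
# Round for display instead of expecting exact decimals.
python3 -c 'print(round(0.1 + 0.2, 2))'                         # prints 0.3
# Compare with a tolerance instead of a direct equality check.
python3 -c 'import math; print(math.isclose(0.1 + 0.2, 0.3))'   # prints True
```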
The standard format for floating-point numbers in most languages is called IEEE 754. It trades exactness for speed and a wide range of values. When you compare floating-point results across languages or machines, tiny differences are expected. When results must be exactly reproducible bit-for-bit, experts sometimes use special libraries, exact arithmetic, or careful control of the hardware and compiler settings.
2.5 The Command Line: Speaking Your Computer’s Language
Most people use computers through windows, buttons, and icons. The command line offers a different approach: you type short instructions that the computer understands directly. This way of working gives you transparency, repeatability, and control that clicking cannot provide. Data science relies on these qualities because your work must be clear, sharable, and reproducible. On Windows, you will use Git Bash; on macOS or Linux, you will use the Terminal app.
2.5.1 What the Command Line Is and Why Data Scientists Use It
The command line is an interface where you communicate with the computer by typing commands. Each command performs one well-defined action. Because every action appears plainly on the screen, the command line makes your steps visible and traceable. This transparency helps you understand what you are doing, and it allows others to follow your work. Clicking through menus, by contrast, leaves no reliable record. The command line also supports automation: a command that you type once can be saved in a script and reused whenever you need it. This repeatability is a cornerstone of reproducible data science.
When you save your commands in a script file, the computer can repeat those steps exactly. If you make a mistake, you can correct the script and run it again. If a friend wants to understand your analysis, you can send them the script instead of trying to describe what you clicked.
The command line also gives you access to tools that do not have a graphical interface at all. Many powerful utilities, including Git for version control and Quarto for reproducible documents, are designed to be run from the terminal. Learning the command line opens the door to these tools and lets you combine them in flexible ways.
At first, the command line may feel slower than clicking. That feeling fades as you learn the basic commands. Eventually, you will be able to move through folders, manage files, and run complex workflows with just a few keystrokes. Small scripts you write today can become the building blocks for larger projects later.
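As a concrete sketch of this scripting idea, a few commands could be saved in a file such as `setup.sh` (a hypothetical name; the folder and file names are illustrative, not from a real project):

```shell
# setup.sh — commands saved as a script so the steps can be repeated exactly.
mkdir -p project/data project/figures   # -p creates parent folders and ignores existing ones
echo "workspace ready" > project/status.txt
cat project/status.txt                  # prints: workspace ready
```

Running `bash setup.sh` performs the same steps every time, which is exactly the repeatability described above.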
2.5.3 Managing Files and Folders
The command line also lets you create, move, and delete files and folders. At first, these actions may feel risky, but they quickly become a precise way to organize your projects.
Common commands include:
`mkdir <name>` — create a new folder
`touch <filename>` — create an empty file
`rm <filename>` — remove a file
`rm -r <folder>` — remove a folder and everything inside
`cp <source> <destination>` — copy a file
`mv <old> <new>` — rename or move a file
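A short practice session using these commands, run in a scratch location of your choosing (the names below are just examples):

```shell
mkdir practice                                  # create a new folder
touch practice/notes.txt                        # create an empty file
cp practice/notes.txt practice/backup.txt       # copy the file
mv practice/backup.txt practice/old_notes.txt   # rename the copy
ls practice                                     # lists: notes.txt  old_notes.txt
rm practice/old_notes.txt                       # remove one file
rm -r practice                                  # remove the folder and its contents
```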
Clear and consistent naming makes your work easier to understand and avoids errors later, especially when your projects grow. Good naming practices for data science include:
- Use lower case whenever possible (`data/`, not `Data/`).
- Avoid spaces, which cause trouble in the terminal (`raw_data`, not `raw data`).
- Use dashes or underscores, but pick one and stay consistent (`weather-data` or `weather_data`).
- Avoid special characters such as `!`, `?`, `*`, `#`, or `&`.
- Prefer short, descriptive names (`scripts/`, `figures/`, `clean.py`).
- Organize by purpose, not by date alone. Use folders like `data/`, `scripts/`, `projects/`, and `output/`.
- Keep related files together, and avoid scattering pieces of the same project across unrelated locations.
Good naming makes your work predictable—for you, your future self, and anyone you collaborate with. It also reduces mistakes when writing paths or running scripts from the terminal.
2.5.4 Running Programs from the Terminal
The command line can also start programs and check whether your tools are installed correctly. Examples include:
`code .` — open the current folder in VS Code
`git --version` — check that Git is installed
`python --version` — check your Python installation
`R --version` — confirm that R is available
`quarto check` — verify that Quarto is installed correctly
Launching programs from the terminal reinforces the idea that your computer is programmable. It also prepares you for workflows where scripts and tools need to run together smoothly.
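These checks can also be combined into one small loop. A sketch, assuming typical command names (on your system Python may be `python` rather than `python3`):

```shell
# Report which common data science tools are available on this machine.
for tool in git python3 R quarto; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: installed"
  else
    echo "$tool: not found"
  fi
done
```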
2.5.5 Mini-Project: Creating Your First Data Science Workspace
To make these ideas concrete, you will now create a simple workspace for your future projects.
Open the Terminal
On Windows, start Git Bash. On macOS or Linux, open the Terminal app.

Find Your Home Directory
Run `pwd` to display your current working directory. If you are not already in your home directory, move there using `cd ~`.

Make a `ds4hs` Folder
Create a directory named `ds4hs` with `mkdir ds4hs`. Move into it using `cd ds4hs`. Confirm that you are in the right place by running `pwd`.

Add a Few Subfolders
Inside `ds4hs`, create directories named `data`, `analysis`, and `figures`. Check your work with `ls` to ensure they appear.

Practice Moving Around
Change into the `data` directory with `cd data`. Move back to `ds4hs` with `cd ..`. Try moving directly to `figures` using `cd figures`. Move up one level with `cd ..`. Return to the previous directory using `cd -`.

Open a Project Folder in VS Code
From inside the `ds4hs` directory, open the project in VS Code by running `code .` (If necessary, enable the `code` command in VS Code.)

Create a Small Project Template
Inside `ds4hs`, create a folder named `mini_project`. Within it, create subfolders `data`, `analysis`, and `figures`. Add an empty Quarto file using `touch analysis/analysis.qmd` and an empty CSV file using `touch data/example.csv`. Use `ls -R` to display the full directory structure.
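Once you are comfortable typing these steps one at a time, the whole template can also be built in a single pass. The paths follow the text; `mkdir -p` creates nested folders in one step:

```shell
# Build the mini_project template inside ds4hs in one pass.
mkdir -p ds4hs/mini_project/data ds4hs/mini_project/analysis ds4hs/mini_project/figures
touch ds4hs/mini_project/analysis/analysis.qmd   # empty Quarto file
touch ds4hs/mini_project/data/example.csv        # empty CSV file
ls -R ds4hs                                      # display the full structure
```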
2.6 Exercises
Check Your Location
Open Git Bash or Terminal and run `pwd`. Write down the full path and circle the folder you are currently in.

List Folder Contents
Use `ls` to display the files and folders in your working directory. Then run `ls -l` and note at least two new pieces of information you see compared with `ls`.

Create a Project Directory
Navigate to your home directory using `cd ~`. Create a folder named `ds4hs` with `mkdir ds4hs`. Move into it with `cd ds4hs` and verify your location using `pwd`.

Build a Basic Folder Structure
Inside `ds4hs`, create subfolders for `data`, `analysis`, and `figures`. Use `ls` to confirm that the folders were created. Practice moving between them with `cd`, `cd ..`, and `cd -`.

Open a Project Folder in VS Code
From inside the `ds4hs` directory, open the project in VS Code by running `code .` (If necessary, enable the `code` command in VS Code.) In the VS Code terminal, run `ls -R` to display the full directory structure.
⭐ Challenge (optional):
Inside `ds4hs`, create a folder named `mini_project` with subfolders `data`, `analysis`, and `figures`. Add an empty Quarto file using `touch analysis/analysis.qmd` and an empty CSV file using `touch data/example.csv`. Use `ls -R mini_project` to check that everything is in the right place.