2 Your Computer as a Tool for Discovery
2.1 Introduction
Many people use computers only through apps, clicking buttons designed by someone else. Data science works differently. Instead of staying inside fixed menus, you will learn to give the computer precise instructions so it can help you ask questions, test ideas, and make new discoveries. Thinking of the computer as a programmable machine opens a new way of working: your results become something you can recreate, improve, and share.
A key step toward this mindset is understanding how a computer organizes information. Every file on your machine lives in a folder, and every folder has a path that tells the computer exactly where it is. Data scientists work directly with these paths because tools such as Python, R, Git, and Quarto all expect you to know where your work lives. When you understand the file system, you can tell your tools exactly which data to use and where to save your results.
Much of data science relies on plain-text files. These include data files like .csv, scripts like .py or .R, and documents like .qmd. Plain text is transparent: you can open it anywhere, track changes, and process it automatically. This clarity is the reason modern analysis avoids mixing computation with formatting. In contrast, spreadsheets hide steps inside cells, and slides require manual updates each time your results change. They are useful for quick checks but cannot support serious analysis where every step must be clear.
This leads to one of the most important ideas in data science: reproducibility. Your future self and anyone who reads your work should be able to start with your raw data and code and arrive at the same results. Reproducibility protects you from accidental mistakes, forgotten steps, and lost work. It turns your analysis into something reliable, explainable, and extendable. As you progress through this book, everything you do will be built with reproducibility in mind.
This chapter sets the foundation for the tools that follow. Once you see your computer as a programmable partner rather than a collection of apps, learning the command line, version control, and reproducible documents becomes natural.
2.2 What’s Inside Your Computer
Before we talk to the computer through the command line, it helps to know what is inside the box. You do not need to become a hardware expert, but a few ideas will make later chapters much easier to understand.
At a high level, every computer used for data science has four key parts:
CPU (Central Processing Unit)
The CPU is the “brain” of the computer. It follows instructions one step at a time and does the general-purpose work in your programs.

RAM (Random-Access Memory)
RAM is the computer’s short-term memory. When you open a dataset in Python or R, it is copied into RAM so the CPU can work with it quickly. If you do not have enough RAM, large projects slow down or crash.

Storage (SSD or hard drive)
Storage is the long-term memory. Your files, photos, code, and datasets live here even after you shut down the computer. Solid-state drives (SSDs) are faster and more reliable than older spinning hard drives.

GPU (Graphics Processing Unit)
The GPU started as a special chip for drawing graphics and games. Modern data science uses GPUs for huge math problems, such as training deep learning models, because they can do many simple calculations in parallel.
You can think of the CPU as a student solving math problems, RAM as the open notebook on the desk, storage as the backpack and bookshelf, and the GPU as a team of helpers who can all work on similar problems at the same time.
For small school projects, almost any modern laptop will work. As you move toward bigger datasets and models, more RAM usually helps more than more CPU cores. If you often run out of memory when loading data, adding RAM or moving to a machine with more RAM makes a big difference. GPUs become important mainly for large deep learning or image-heavy projects.
2.3 Operating Systems and Why Professionals Love Linux
An operating system (OS) is the layer of software that connects your programs to the hardware. It controls files, devices, memory, and how programs run. The three main families you will see in data science are Windows, macOS, and Linux.
Linux
Linux is the standard in professional data science and on almost all servers and cloud machines. It is predictable: nothing major happens behind your back. When you install software or change a setting, it stays that way until you change it again. The command line tools you learn in this book behave the same way on almost every Linux system.

macOS
macOS is based on Unix, like Linux, and includes a built-in terminal that understands almost all the same commands. Many data scientists who use laptops choose macOS because it balances a friendly interface with powerful command line tools.

Windows
Windows makes it easy to start, but it also tries to “help” by hiding file extensions, changing paths, and running background tools that can confuse reproducible work. Different parts of Windows sometimes disagree about how to name files or run commands. These shortcuts can be convenient for everyday use but can teach habits that do not transfer well to Linux or the cloud.
In this book, we will prefer a Unix-style command line everywhere. On macOS and Linux, that means using the built-in Terminal app. On Windows, that means installing and using Git Bash, which brings a Unix-like terminal to Windows so that the same commands work on all three platforms.
2.3.1 Installing Software from the Command Line
Data scientists often install tools using package managers, which are programs that download, install, and update software for you from the command line. This is faster, more repeatable, and easier to document than clicking through many installer windows.
Here are the most common options:
Linux
Each Linux distribution comes with its own package manager: for example, `apt` on Ubuntu, `dnf` on Fedora, and `pacman` on Arch. These tools let you install almost everything you need with a single command.

macOS
On macOS, the most popular package manager is Homebrew. After installing Homebrew once, you can run commands like `brew install git` or `brew install python` to get new tools.

Windows
On Windows 10 and 11, you can use `winget` from the command line to install software. Once Git Bash is installed and set up, you can run commands such as

winget install Python.Python.3.12
winget install Git.Git
winget install RProject.R
winget install Quarto.Quarto

from a terminal and let Windows handle the downloads and installation steps.
Some developers use full Linux environments on Windows through the Windows Subsystem for Linux (WSL) or alternative package managers such as Chocolatey. These are powerful options when you do a lot of development on Windows, but you do not need them for this book.
2.4 How Computers Represent Numbers
Computers are built on binary, a number system that uses only zeros and ones. That design works well for storing whole numbers, but it creates some surprises when we work with decimals.
There are two basic kinds of numbers you will see in data science:
Integers (…, -2, -1, 0, 1, 2, …)
These are whole numbers with no decimal part. Computers can store many integers exactly.

Floating-point numbers (like 0.1, 2.75, or -3.14)
These are used for decimals and measurements. Most real-valued data in science and statistics are stored as floating-point numbers.
Because computers use binary, many simple-looking decimals cannot be stored exactly. For example, the decimal number 0.1 turns into a long repeating pattern in binary. The computer stores a very close approximation instead of the exact value. When you combine many such numbers, the tiny differences can show up as small rounding errors.
You may have seen examples where a language reports that 0.1 + 0.2 is 0.30000000000000004 instead of exactly 0.3. This is not a bug in Python or R. It is a consequence of how floating-point numbers are stored in hardware. Data scientists work with this by rounding results for display and by avoiding direct equality checks with decimals.
The key ideas to remember are:
- Some decimals cannot be represented exactly on a computer.
- Small rounding differences are normal in real-number calculations.
- We usually care about being “close enough” rather than perfectly exact.
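You can see all three ideas in practice by asking Python for the stored values directly from the terminal. A minimal sketch, assuming `python3` is on your PATH:

```shell
# The famous 0.1 + 0.2 example, run from the shell via Python.
python3 -c 'print(0.1 + 0.2)'                                   # prints 0.30000000000000004
# Round for display instead of expecting exact decimals.
python3 -c 'print(round(0.1 + 0.2, 2))'                         # prints 0.3
# Compare with a tolerance instead of a direct equality check.
python3 -c 'import math; print(math.isclose(0.1 + 0.2, 0.3))'   # prints True
```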
The standard format for floating-point numbers in most languages is called IEEE 754. It trades exactness for speed and a wide range of values. When you compare floating-point results across languages or machines, tiny differences are expected. When results must be exactly reproducible bit-for-bit, experts sometimes use special libraries, exact arithmetic, or careful control of the hardware and compiler settings.
2.5 The Command Line: Speaking Your Computer’s Language
Most people use computers through windows, buttons, and icons. The command line offers a different approach: you type short instructions that the computer understands directly. This way of working gives you transparency, repeatability, and control that clicking cannot provide. Data science relies on these qualities because your work must be clear, sharable, and reproducible. On Windows, you will use Git Bash; on macOS or Linux, you will use the Terminal app.
2.5.1 What the Command Line Is and Why Data Scientists Use It
The command line is an interface where you communicate with the computer by typing commands. Each command performs one well-defined action. Because every action appears plainly on the screen, the command line makes your steps visible and traceable. This transparency helps you understand what you are doing, and it allows others to follow your work. Clicking through menus, by contrast, leaves no reliable record. The command line also supports automation: a command that you type once can be saved in a script and reused whenever you need it. This repeatability is a cornerstone of reproducible data science.
When you save your commands in a script file, the computer can repeat those steps exactly. If you make a mistake, you can correct the script and run it again. If a friend wants to understand your analysis, you can send them the script instead of trying to describe what you clicked.
The command line also gives you access to tools that do not have a graphical interface at all. Many powerful utilities, including Git for version control and Quarto for reproducible documents, are designed to be run from the terminal. Learning the command line opens the door to these tools and lets you combine them in flexible ways.
At first, the command line may feel slower than clicking. That feeling fades as you learn the basic commands. Eventually, you will be able to move through folders, manage files, and run complex workflows with just a few keystrokes. Small scripts you write today can become the building blocks for larger projects later.
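As a concrete sketch of this scripting idea, a few commands could be saved in a file such as `setup.sh` (a hypothetical name; the folder and file names are illustrative, not from a real project):

```shell
# setup.sh — commands saved as a script so the steps can be repeated exactly.
mkdir -p project/data project/figures   # -p creates parent folders and ignores existing ones
echo "workspace ready" > project/status.txt
cat project/status.txt                  # prints: workspace ready
```

Running `bash setup.sh` performs the same steps every time, which is exactly the repeatability described above.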
2.5.3 Managing Files and Folders
The command line also lets you create, move, and delete files and folders. At first, these actions may feel risky, but they quickly become a precise way to organize your projects.
Common commands include:
`mkdir <name>` — create a new folder
`touch <filename>` — create an empty file
`rm <filename>` — remove a file
`rm -r <folder>` — remove a folder and everything inside
`cp <source> <destination>` — copy a file
`mv <old> <new>` — rename or move a file
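A short practice session using these commands, run in a scratch location of your choosing (the names below are just examples):

```shell
mkdir practice                                  # create a new folder
touch practice/notes.txt                        # create an empty file
cp practice/notes.txt practice/backup.txt       # copy the file
mv practice/backup.txt practice/old_notes.txt   # rename the copy
ls practice                                     # lists: notes.txt  old_notes.txt
rm practice/old_notes.txt                       # remove one file
rm -r practice                                  # remove the folder and its contents
```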
Clear and consistent naming makes your work easier to understand and avoids errors later, especially when your projects grow. Good naming practices for data science include:
- Use lower case whenever possible (`data/`, not `Data/`).
- Avoid spaces, which cause trouble in the terminal (`raw_data`, not `raw data`).
- Use dashes or underscores, but pick one and stay consistent (`weather-data` or `weather_data`).
- Avoid special characters such as `!`, `?`, `*`, `#`, or `&`.
- Prefer short, descriptive names (`scripts/`, `figures/`, `clean.py`).
- Organize by purpose, not by date alone. Use folders like `data/`, `scripts/`, `projects/`, and `output/`.
- Keep related files together, and avoid scattering pieces of the same project across unrelated locations.
Good naming makes your work predictable—for you, your future self, and anyone you collaborate with. It also reduces mistakes when writing paths or running scripts from the terminal.
2.5.4 Running Programs from the Terminal
The command line can also start programs and check whether your tools are installed correctly. Examples include:
`code .` — open the current folder in VS Code
`git --version` — check that Git is installed
`python --version` — check your Python installation
`R --version` — confirm that R is available
`quarto check` — verify that Quarto is installed correctly
Launching programs from the terminal reinforces the idea that your computer is programmable. It also prepares you for workflows where scripts and tools need to run together smoothly.
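These checks can also be combined into one small loop. A sketch, assuming typical command names (on your system Python may be `python` rather than `python3`):

```shell
# Report which common data science tools are available on this machine.
for tool in git python3 R quarto; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: installed"
  else
    echo "$tool: not found"
  fi
done
```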
2.5.5 Mini-Project: Creating Your First Data Science Workspace
To make these ideas concrete, you will now create a simple workspace for your future projects.
Open the Terminal
On Windows, start Git Bash. On macOS or Linux, open the Terminal app.

Find Your Home Directory
Run `pwd` to display your current working directory. If you are not already in your home directory, move there using `cd ~`.

Make a `ds4hs` Folder
Create a directory named `ds4hs` with `mkdir ds4hs`. Move into it using `cd ds4hs`. Confirm that you are in the right place by running `pwd`.

Add a Few Subfolders
Inside `ds4hs`, create directories named `data`, `analysis`, and `figures`. Check your work with `ls` to ensure they appear.

Practice Moving Around
Change into the `data` directory with `cd data`. Move back to `ds4hs` with `cd ..`. Try moving directly to `figures` using `cd figures`. Move up one level with `cd ..`. Return to the previous directory using `cd -`.

Open a Project Folder in VS Code
From inside the `ds4hs` directory, open the project in VS Code by running `code .` (If necessary, enable the `code` command in VS Code.)

Create a Small Project Template
Inside `ds4hs`, create a folder named `mini_project`. Within it, create subfolders `data`, `analysis`, and `figures`. Add an empty Quarto file using `touch analysis/analysis.qmd` and an empty CSV file using `touch data/example.csv`. Use `ls -R` to display the full directory structure.
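Once you are comfortable typing these steps one at a time, the whole template can also be built in a single pass. The paths follow the text; `mkdir -p` creates nested folders in one step:

```shell
# Build the mini_project template inside ds4hs in one pass.
mkdir -p ds4hs/mini_project/data ds4hs/mini_project/analysis ds4hs/mini_project/figures
touch ds4hs/mini_project/analysis/analysis.qmd   # empty Quarto file
touch ds4hs/mini_project/data/example.csv        # empty CSV file
ls -R ds4hs                                      # display the full structure
```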
2.6 Exercises
Check Your Location
Open Git Bash or Terminal and run `pwd`. Write down the full path and circle the folder you are currently in.

List Folder Contents
Use `ls` to display the files and folders in your working directory. Then run `ls -l` and note at least two new pieces of information you see compared with `ls`.

Create a Project Directory
Navigate to your home directory using `cd ~`. Create a folder named `ds4hs` with `mkdir ds4hs`. Move into it with `cd ds4hs` and verify your location using `pwd`.

Build a Basic Folder Structure
Inside `ds4hs`, create subfolders for `data`, `analysis`, and `figures`. Use `ls` to confirm that the folders were created. Practice moving between them with `cd`, `cd ..`, and `cd -`.

Open a Project Folder in VS Code
From inside the `ds4hs` directory, open the project in VS Code by running `code .` (If necessary, enable the `code` command in VS Code.) In the VS Code terminal, run `ls -R` to display the full directory structure.
⭐ Challenge (optional):
Inside `ds4hs`, create a folder named `mini_project` with subfolders `data`, `analysis`, and `figures`. Add an empty Quarto file using `touch analysis/analysis.qmd` and an empty CSV file using `touch data/example.csv`. Use `ls -R mini_project` to check that everything is in the right place.