3  Right Tools for Data Science Projects

3.1 Introduction

This chapter introduces the tools required to carry out reproducible data science projects in this course. A reproducible project is one in which the full analysis—data inputs, code, narrative, and results—can be rerun by someone else, or by the same analyst at a later time, with the same outcomes. Reproducibility is therefore a property of an entire project rather than of any single tool. The goal of this chapter is not to teach tools in isolation, but to explain how a small set of tools work together to support this project-level reproducibility.

Without a structured workflow, data science work tends to become fragile. Files are renamed or overwritten, intermediate results are lost, and analysis steps are no longer clear even to the original author. These problems become more severe as projects grow in size or involve collaboration. In an academic setting, they also make it difficult to review, grade, or extend work. The tools introduced in this chapter are motivated by these common failure modes and are chosen to prevent them in a systematic way.

The tools in this chapter form a coherent workflow rather than a loose collection of software. Git records the history of a project and makes changes explicit. GitHub provides a shared location for storing, submitting, and reviewing projects. Quarto allows code, text, and results to live in a single executable document. Python, R, or Julia provides the computational engine for data analysis. VSCodium serves as a coordinating environment that brings these tools together in daily work. Each tool plays a distinct role, and reproducibility emerges from their interaction.

This chapter includes parallel setup sections for Python, R, and Julia. You only need to choose one language and follow that section. If you already know one of these languages, start there. Otherwise, Python, R, and Julia are all reasonable choices. Quarto can technically execute code cells written in different languages within the same document, but for clarity and consistency you are expected to work primarily in a single language. Once your chosen language is set up, you can skip the other language sections; the remainder of the book assumes a single-language workflow rather than switching between languages.

All work produced using this setup is treated as a reproducible project. Files are organized in a clear directory structure, tracked with version control, and rendered into executable documents that combine narrative, code, and results. Screenshots, manually edited outputs, and undocumented analysis steps are not substitutes for this workflow. Once learned, this way of working is not limited to data science courses: the same tools can be used to manage assignments in other classes, maintain a research diary, write a blog, draft a novel, or even organize personal notes and creative writing. By the end of this chapter, you should be able to set up a complete project, make changes in a controlled way, and produce work that can be reliably revisited, reused, and shared.

In the rest of this chapter, tools are introduced in the order you need them in a real workflow: command line first, then a language, then Quarto, then Git, then an editor, and finally GitHub.

3.2 The Command Line Is the Foundation

This chapter assumes you have already learned the basics of the command line (also called the shell) earlier in the book. The command line is the foundation for the entire tooling workflow, because it is the same on any computer and it works with any editor.

3.2.1 Why the command line still matters

For many beginners, the command line feels old-fashioned compared to graphical menus. In practice, it remains central to technical work for several reasons.

  • Speed. Typing commands is often faster than navigating menus. Autocompletion means you rarely type full filenames or commands.
  • Precision. A command specifies exactly which program and which files are used. There is no ambiguity about what happened.
  • Reproducibility. Commands can be written down, copied, and rerun. Mouse clicks cannot be reliably reproduced.
  • Universality. The same commands work in a regular terminal, on a remote server, or inside an editor such as VSCodium.
  • Professional practice. Serious technical work, on servers, in automated pipelines, and in shared computing environments, is carried out largely through command-line interfaces. Even in movies and television, technical experts are almost always shown typing commands rather than clicking icons, and the cliché reflects reality: professional work rewards tools that favor speed, precision, and repeatability.

Throughout this chapter, you will repeatedly do three things:

  • Navigate into a project folder
  • Run a program (Git, Quarto, Python/R/Julia)
  • Check what happened (files created, versions, error messages)

When you install a tool, you should also verify it from the command line. A typical verification pattern is:

toolname --version

For example:

git --version
quarto --version
python --version
R --version
julia --version

You do not need to memorize commands. The goal is to build the habit of checking what is installed and what version you are using.
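The verification pattern above can also be scripted. The sketch below checks whether each tool is on the PATH before asking for its version; the tool names in the loop are examples, so substitute the ones you actually installed.

```shell
# Check whether each tool is on the PATH before asking for its version.
# "command -v" prints the tool's full path if it is found, nothing otherwise.
for tool in git quarto python; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at: $(command -v "$tool")"
  else
    echo "$tool NOT found: install it or fix your PATH"
  fi
done
```

Running this after each installation step gives you an immediate, repeatable record of what is available on your system.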

3.3 The Workflow Before the Tools (A Conceptual Preview)

Before installing anything, it helps to see the full workflow at a high level. Reproducible data science is not one tool. It is a small system of tools that work together.

A typical workflow looks like this:

  • Write a Quarto file (.qmd) that contains text and code
  • Run the code using your chosen language (Python, R, or Julia)
  • Render the Quarto file into an output document (usually HTML)
  • Use Git to record snapshots of your work (commits)
  • Use GitHub to share the work and collaborate

The transcript below is a preview. Do not try to run it yet. You will install the tools and run these commands later, step by step.

# create a project folder
mkdir my-project
cd my-project

# start version control
git init

# write a Quarto file (in an editor)
mkdir analysis
# ... create and edit analysis/report.qmd in your editor ...

# then render it
quarto render analysis/report.qmd

# record your work with Git
git add analysis/report.qmd
git commit -m "First report"

# share your work on GitHub (after setup)
git remote add origin <url>
git push -u origin main

If you understand the purpose of each line in this transcript by the end of the chapter, you have learned the core tooling workflow.

3.4 Gate 1: Choose a Programming Language and Set Up Its Environment

In this book, you will write code in one programming language. Choose exactly one language to start (Python, R, or Julia), and follow the installation instructions for that language only. You can always learn a second language later.

This gate has one goal: you should be able to run a short piece of code from the command line.

3.4.1 Python

Python is the most widely used programming language in modern data science and an excellent first language for beginners. Its clean, readable syntax allows students to focus on ideas rather than punctuation, while its large ecosystem means that most common tasks already have well-tested libraries. In this course, Python will be used to write code, analyze data, and produce graphics. These skills form the foundation for reproducible data science workflows introduced later in the book.

Python has earned its central role because it balances power and accessibility. The base language is expressive and easy to read, and many core tools are included by default. Beyond that, widely used libraries such as pandas for data analysis, matplotlib for visualization, and scikit-learn for machine learning allow students to move quickly from simple examples to realistic projects. While R is also an excellent language, especially for statistics, Python will be the default language in this book, with R treated as an optional companion.

3.4.1.1 Installing Python (Python 3.12)

We will use Python 3.12. At the time of writing, this version is supported by major scientific and machine-learning libraries, including TensorFlow and PyTorch. From the beginning, we use the command line to install and manage software so that the workflow is transparent and reproducible.

Choose the instructions for your operating system.

Windows (using winget)

Open PowerShell (not Git Bash) and run:

winget install -e --id Python.Python.3.12

After installation, close and reopen your terminal so that python is available on the PATH.

macOS (using Homebrew)

If Homebrew is not installed, follow the instructions at https://brew.sh.

Then install Python 3.12 by running:

brew install python@3.12

Homebrew installs Python in a standard location and makes it available from the terminal.

Linux (Debian / Ubuntu using apt)

These instructions assume a Debian- or Ubuntu-based system.

First update package information:

sudo apt update

Install Python 3.12 and the virtual environment module:

sudo apt install python3.12 python3.12-venv

Verify the installation:

python3.12 --version

You should see output similar to:

Python 3.12.x

If this command fails, resolve the installation before continuing.

3.4.1.2 Creating a virtual environment

Virtual environments should be created in a deliberate and consistent location. We recommend creating a dedicated directory in your home folder to store environments for this course.

From a terminal, create a directory named envs in your home directory:

mkdir -p ~/envs

Create a virtual environment named ds-env inside this directory (on Linux, use python3.12 in place of python if plain python is not on the PATH):

python -m venv ~/envs/ds-env

Activate the environment:

  • Windows (Git Bash):

    source ~/envs/ds-env/Scripts/activate
  • macOS / Linux:

    source ~/envs/ds-env/bin/activate

When the environment is active, your prompt will usually change to show (ds-env).

Upgrade the package manager and install a small set of core libraries:

python -m pip install --upgrade pip
pip install numpy pandas matplotlib

All Python packages for this course should be installed inside this environment.

To leave the environment, run:

deactivate
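To make the environment itself reproducible, record its packages in a requirements file that lives in the project and is tracked by Git. The file below is a hand-written sketch with unpinned package names; in practice you would generate a fully pinned file from the active environment with python -m pip freeze > requirements.txt.

```shell
# Write a minimal requirements file listing the project's direct dependencies.
cat > requirements.txt <<'EOF'
numpy
pandas
matplotlib
EOF

# Later, inside an activated environment, the same set of packages can be
# recreated with:
#   python -m pip install -r requirements.txt
cat requirements.txt
```

This file plays the same role for Python that renv.lock plays for R and Manifest.toml plays for Julia, as discussed later in this chapter.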

3.4.1.3 A short Python warm-up

Before moving on, you should be comfortable with:

  • running Python from the command line,
  • writing and executing simple scripts,
  • using variables, lists, and loops,
  • importing and using packages.

For a concise, official warm-up that can be completed in 2–3 hours, work through the opening chapters of the official Python Tutorial in the Python documentation at docs.python.org: using the interpreter, the informal introduction to Python, control flow, data structures, and modules.

Read these sections in order. They provide just enough structure to begin experimenting with Python and to learn additional features on demand.

After completing this warm-up, you should be ready to work interactively and write small scripts. The next section builds on this foundation to introduce a reproducible data science workflow.

3.4.2 R

R is a programming language designed for statistical computing and graphics. It is widely used in statistics, biostatistics, and the parts of data science that emphasize modeling, inference, and visualization. R also has an unusually strong culture of packaging and documentation, which makes it well suited for project-based work where others need to rerun and review an analysis.

R is often preferred when the workflow depends on statistical modeling, publication-quality graphics, or domain-specific methods that are most mature in the R ecosystem. In practice, many projects use both Python and R, selecting the language based on the task rather than loyalty to a tool. This book uses Python as the default engine, but the workflow in this chapter applies equally well to R.

To keep R-based projects reproducible, you must manage two things carefully: the R version and the set of R packages used by the project. The installation steps below install R, and the environment steps show how to record and restore project-specific package versions.

3.4.2.1 Installing R (R 4.5)

Choose the instructions for your operating system.

Windows (using winget)

Open PowerShell (not Git Bash) and run:

winget install -e --id RProject.R

After installation, close and reopen your terminal.

Verify that R is available:

R --version
Rscript --version

If R is not found, it is usually because the installer did not add R to your PATH. In that case, locate your R installation (often under C:\Program Files\R\) and either run R.exe / Rscript.exe from that folder or add the appropriate bin directory to your PATH.
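In Git Bash, the PATH can be extended for the current session with a single line. The version folder R-4.5.1 below is an example; check the actual folder name under your R installation.

```shell
# Add R's bin directory to PATH for this Git Bash session only.
# Append the same line to ~/.bashrc to make the change permanent.
export PATH="$PATH:/c/Program Files/R/R-4.5.1/bin"
echo "$PATH"
```

After this, R --version and Rscript --version should work in the same terminal.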

macOS (using Homebrew)

Install R:

brew install r

Verify:

R --version
Rscript --version

Linux (Debian / Ubuntu using apt)

For a straightforward installation, use your system package manager:

sudo apt update
sudo apt install r-base r-base-dev

Verify:

R --version
Rscript --version

If your distribution provides an older R than you need, install from the CRAN-maintained Ubuntu repository and follow the current instructions in the CRAN “Ubuntu Packages For R” guide.

3.4.2.2 Creating a project library with renv

R packages are installed into libraries, which are directories on disk. If you install packages globally, different projects can silently share (and overwrite) the same dependencies. This is convenient at first, but it eventually breaks reproducibility when package versions change.

For R projects in this book, use renv to manage a per-project package library. renv creates a project-local library and records dependency versions in a lockfile (renv.lock). Anyone with the same R version can restore the project library from that lockfile.

First, install renv (once per machine).

R -q -e "install.packages('renv')"

Then, from the root directory of an R project, initialize renv:

R -q -e "renv::init()"

Install core packages used in many projects:

R -q -e "install.packages(c('ggplot2', 'dplyr', 'readr'))"

Snapshot the environment to update renv.lock:

R -q -e "renv::snapshot()"

On another machine (or after deleting the project library), restore the environment from the lockfile:

R -q -e "renv::restore()"

In a Git repository, commit renv.lock and the renv/ infrastructure files created by renv. Do not commit the installed package binaries in renv/library/ unless you have a specific reason to do so.
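One way to enforce this is to list the library directory in .gitignore explicitly. The sketch below follows renv's default directory layout; note that recent versions of renv typically write a suitable renv/.gitignore for you during renv::init(), so check before duplicating it.

```shell
# Append ignore rules so the per-project package library is never committed.
cat >> .gitignore <<'EOF'
renv/library/
renv/local/
renv/staging/
EOF
```

With these rules in place, git status will show renv.lock and the renv infrastructure files but not the installed packages themselves.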

3.4.2.3 A short R warm-up

Before moving on, you should be comfortable with:

  • starting R from the command line,
  • running one-line commands with R -e,
  • writing and running scripts with Rscript,
  • using vectors, lists, and data frames,
  • loading packages and reading help pages.

For an official warm-up that can be completed in a few hours, use the R Core Team manual “An Introduction to R” and work through the sections on objects, data structures, and graphics. As you work, practice using help() and help.search() to find documentation from within R.

After completing this warm-up, you should be ready to write small R scripts and to render Quarto documents using R as the execution engine.

3.4.3 Julia

Julia is a programming language designed for numerical and scientific computing. It combines a high-level, expressive syntax with performance that is often comparable to low-level languages such as C or Fortran. This design makes Julia attractive for simulation-heavy workloads, optimization, and research code where clarity and speed are both important.

Julia is increasingly used in data science and applied statistics when projects involve custom algorithms, large simulations, or performance-critical components that would be cumbersome to write in other high-level languages. It is less ubiquitous than Python or R, but its package ecosystem is mature enough for many modeling, visualization, and data manipulation tasks.

As with Python and R, reproducibility in Julia depends on controlling the language version and the exact versions of packages used. Julia was designed with this in mind: its built-in package manager records full dependency graphs for each project, making environment management a first-class feature rather than an add-on.

3.4.3.1 Installing Julia (1.x series)

Choose the instructions for your operating system.

Windows (using winget)

Open PowerShell (not Git Bash) and run:

winget install -e --id JuliaLang.Julia

After installation, close and reopen your terminal.

Verify:

julia --version

macOS (using Homebrew)

Install Julia:

brew install julia

Verify:

julia --version

Linux

Download the official Linux binary from the Julia website and extract it to a convenient location, such as /opt/julia or your home directory. Then add the bin directory to your PATH.

Verify:

julia --version

If julia is not found, check that the directory containing the Julia binary is on your PATH.
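For example, if Julia was extracted under /opt/julia (an illustrative location, not a requirement), the PATH can be extended like this:

```shell
# Make the julia binary visible by adding its bin directory to PATH.
# Append this line to ~/.bashrc so it applies to every new shell.
export PATH="$PATH:/opt/julia/bin"
echo "$PATH"
```

Open a new terminal (or source ~/.bashrc) and rerun julia --version to confirm.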

3.4.3.2 Project environments with Julia’s package manager

Julia uses project environments to isolate dependencies. Each project is associated with a Project.toml file that lists direct dependencies and a Manifest.toml file that records the exact versions of all packages, including transitive dependencies.

To create or activate a project environment, navigate to the project root directory and start Julia:

julia

At the Julia prompt, activate a local environment:

]
activate .

Pressing ] at the julia> prompt switches Julia into package manager mode, and the prompt changes to pkg>. The command activate . tells Julia to use a project environment stored in the current directory.

Add commonly used packages:

]
add DataFrames CSV Plots

This creates (or updates) Project.toml and Manifest.toml in the project directory. These files fully specify the environment.

On another machine, or after cloning the repository, activate the project and instantiate the environment:

]
activate .
instantiate

In a Git repository, commit both Project.toml and Manifest.toml. These files play the same role as requirements.txt or renv.lock in other languages, but with stricter guarantees about reproducibility.

3.4.3.3 A short Julia warm-up

Before moving on, you should be comfortable with:

  • starting Julia from the command line,
  • using the Julia REPL and its help system,
  • activating and instantiating project environments,
  • working with arrays, dictionaries, and tables,
  • loading packages with using and import.

A concise and authoritative starting point is the official Julia manual “Getting Started” and the sections on the REPL, packages, and performance tips. After this warm-up, you should be able to write small Julia scripts and use Julia as an execution engine in Quarto documents when a project benefits from Julia’s performance and numerical strengths.

3.5 Gate 2: Quarto (Reproducible Documents for Real Data Science)

Now that you can run code in your chosen language, you can use Quarto to combine code, text, and results in one reproducible document.

3.5.1 Why Quarto?

Quarto is a tool for writing documents that combine text, code, figures, and results in a single, executable source file. Instead of keeping separate word-processor files, exported plots, screenshots, and loose scripts, Quarto keeps analysis and narrative together and regenerates results automatically whenever code or data change. This makes work transparent, reproducible, and easier to review.

Quarto is widely used in data science for technical reports, notebooks, and presentations where correctness and traceability matter. Compared to traditional notebooks, Quarto emphasizes documents first and interactivity second. Compared to Word or PowerPoint, it prioritizes reproducible computation over manual formatting. These properties make Quarto well suited for teaching and for real data science projects, where reasoning and evidence are more important than appearance.

3.5.2 Installation and setup (CLI-first)

Quarto is a command-line tool. We install and use it from the terminal, independent of any editor.

Windows (using winget)

Open PowerShell and run:

winget install --id Posit.Quarto -e

Close and reopen the terminal after installation.

macOS (using Homebrew)

brew install quarto

Linux (Debian / Ubuntu)

Quarto is not available in the standard Debian or Ubuntu package repositories. Download the .deb package from the Quarto website and install it with apt:

sudo apt install ./quarto-<version>-linux-amd64.deb

After installation, verify that Quarto is available:

quarto check

This command reports available engines (such as Python) and confirms that Quarto is correctly installed.

Editor support (optional)

Quarto works entirely from the command line. If you use VSCodium, you may optionally install the Quarto extension for syntax highlighting and convenience features. This is not required for rendering documents.

3.5.3 Quarto and Python environments

Quarto does not manage Python installations or virtual environments. When rendering a document, it uses the Python interpreter available on your PATH.

For this course, you must activate your course environment before running Quarto commands.

source ~/envs/ds-env/bin/activate

On Windows (Git Bash):

source ~/envs/ds-env/Scripts/activate

After activation, verify:

python --version

Quarto will now execute all Python code chunks using this environment.

3.5.4 Anatomy of a Quarto file

A Quarto document (.qmd) has three main components, executed from top to bottom during rendering.

YAML header

The YAML header appears at the top of the file between --- lines and controls document-level settings such as title and output format.

---
title: "My Document"
format: html
---
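The header can carry more than a title and a format. A slightly fuller sketch is shown below; the values are placeholders, and the execute option echo: true asks Quarto to show each code chunk alongside its output.

```yaml
---
title: "My Document"
author: "Your Name"
date: today
format: html
execute:
  echo: true
---
```

Only title and format are needed for the examples in this chapter; additional fields can be added as your documents grow.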

Markdown body

The body contains text written in Markdown: headings, paragraphs, lists, links, and mathematical notation. This is where you explain your analysis and interpret results.

Code chunks

Code chunks contain executable code. A chunk opens with a line containing three backticks followed by {python} and closes with a line of three backticks. During rendering, Quarto runs the code and inserts the output directly into the document. In this course, we use Python by default.

print("Hello, Quarto")

3.5.5 First reproducible notebook

A minimal workflow for your first Quarto document is:

  • create a source file,
  • write text and code together,
  • render the document from the command line.

Create a new file named my_first.qmd with the following header:

---
title: "my-first-quarto-notebook"
format: html
---

Add a short paragraph:

This document demonstrates a simple, reproducible analysis written in
Python using Quarto.

Before running this example, make sure the required plotting library is installed in your active environment:

pip install matplotlib

Insert a Python code chunk that produces a figure:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])
plt.title("A Simple Plot")
plt.show()

Activate your environment and render:

source ~/envs/ds-env/bin/activate
quarto render my_first.qmd

This produces an HTML file that contains the text, code, and generated output.

3.6 Gate 3: Git (Local Version Control for Reproducible Projects)

Git comes after you can run code and render a document, because Git is most useful when it records real work: your source files and the outputs you choose to keep.

3.6.1 Why Git?

Version control is a foundational skill for data science because it treats your work as a living project with a complete history. Git lets you track every change you make, recover from mistakes, and collaborate without overwriting anyone’s work. Unlike saving multiple file versions by hand (e.g., project_final_v12_REAL), Git provides a precise, automatic timeline of your edits. This makes your work reproducible, auditable, and shareable, which are essential habits for scientific computing and data science projects.

Git is used to manage data science projects as evolving artifacts rather than one-time snapshots. Analyses typically change as data are cleaned, models are revised, and interpretations improve. A Git repository should contain only the files needed to understand, reproduce, and extend a project—nothing more.

3.6.2 Installing Git

Git must be installed before it can be used. Use the method appropriate for your system.

  • Windows (PowerShell):
winget install --id Git.Git -e

This installs Git together with Git Bash, a terminal environment used throughout this book. Do not run winget inside Git Bash.

  • macOS (Homebrew):
brew install git
  • Debian / Ubuntu Linux (including WSL):
sudo apt update
sudo apt install git

After installation, configure Git once so your work is properly attributed:

git config --global user.name "Your Name"
git config --global user.email "your@email.com"

Verify that Git is installed and configured correctly:

git --version
git config --list

3.6.3 Essential Git Commands

Git is most effective when you stage files deliberately and keep the repository clean. The following commands form the core of everyday Git usage in data science projects.

  • git init initializes a repository in a project folder.
  • git status shows which files have changed.
  • git add <file> stages selected files to be recorded.
  • git commit -m "message" saves a snapshot of staged changes.
  • git diff displays differences between versions.
  • git push sends committed changes to a remote repository, such as one hosted on GitHub.
  • git pull retrieves updates from a remote repository, such as one hosted on GitHub.

The last two commands, git push and git pull, are where Git connects to an online platform. A common example of such a platform is GitHub, which serves as a shared location for backing up work and collaborating on projects. The next section focuses on GitHub itself and how it is used in our workflow.

Avoid using git add . as a default habit. It often stages generated files, temporary outputs, or other unintended content. Staging files explicitly helps keep repositories small, readable, and reproducible.

3.6.4 Basic Workflow

Git works best when your project is organized as a single folder containing scripts, Quarto files, and documentation. A typical workflow follows a simple cycle: edit files, inspect what changed, stage selected files, and commit them with a short message explaining what was done.

cd my-project
git init
git status
git add README.md
git add analysis.qmd
git commit -m "initial project structure"

Git can also connect your local project to a remote repository hosted online. This idea is introduced above when discussing git push and git pull, and it is developed in detail in the next section, which focuses on GitHub as a concrete example of an online hosting platform.
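The cycle above can be rehearsed end to end in a scratch directory before you apply it to real work. In the sketch below, the /tmp/git-demo path and the identity flags passed with -c are illustrative; on your own machine the global user.name and user.email configured earlier are used automatically.

```shell
# Create a throwaway repository and make one commit in it.
mkdir -p /tmp/git-demo
cd /tmp/git-demo
git init -q
echo "# Demo project" > README.md
git status --short    # README.md appears as untracked
git add README.md
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "initial project structure"
git log --oneline     # one line per commit in the history
```

Deleting /tmp/git-demo afterwards removes the repository completely, since Git stores everything inside the project folder.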

3.6.5 Good Practices for Git

Good Git usage is less about memorizing commands and more about developing clean, repeatable habits. The following practices are especially important for reproducible data science projects:

  • Keep the repository clean. Track only files that are needed to understand, reproduce, and extend the project. Avoid committing generated files, rendered outputs, large raw datasets, or temporary artifacts unless they are explicitly required.

  • Stage files deliberately. Add files explicitly rather than staging everything at once. This helps prevent accidental inclusion of unnecessary or generated content.

  • Commit small, logical changes. Each commit should represent a coherent step in the project, making the history easier to read and reason about.

  • Write informative commit messages. A short message explaining what changed and why is more valuable than a vague description.

  • Use .gitignore consistently. The .gitignore file is the primary mechanism for keeping a repository clean over time. It tells Git which files should never be tracked, such as temporary outputs, cache directories, and system-specific files.
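A starter .gitignore for a project following this chapter's workflow might look like the sketch below; adjust the entries to your language and to the outputs your project actually generates.

```shell
# Write a starter .gitignore covering common generated files.
cat > .gitignore <<'EOF'
# Python caches and environments
__pycache__/
*.pyc
.venv/

# Rendered Quarto output and caches
*.html
/.quarto/
*_files/

# Editor and OS files
.DS_Store
Thumbs.db
EOF
cat .gitignore
```

Commit the .gitignore file itself, so every collaborator's repository stays clean in the same way.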

3.7 Gate 4: VSCodium (An Integrated Development Environment)

By this point, you can already do everything you need from the command line. An IDE does not replace the command line. It strengthens your workflow by combining editing, execution, and project organization in one environment.

3.7.1 An IDE Is a Unified Workspace

An Integrated Development Environment (IDE) is a unified workspace for writing code, running programs, and managing projects.

Instead of switching between a text editor, a terminal window, and a file browser, you work in one coordinated system. An IDE typically provides:

  • Syntax highlighting (color-coding parts of code for readability)
  • Real-time error detection while you type
  • Integrated terminals
  • Project-wide search and navigation
  • Version control integration (such as Git)

The goal is not convenience alone. The goal is structured, reproducible work.

Check-in question:

If you can already use the command line, why might working in a unified environment reduce mistakes?

3.7.2 Why This Book Uses VSCodium

This book uses VSCodium because it is fully open-source software.

Two closely related editors exist:

  • Visual Studio Code (distributed by Microsoft)
  • VSCodium (community-built from the same open-source code)

Visual Studio Code is based on open-source code. However, the official Microsoft build includes additional proprietary components such as telemetry (automatic usage reporting) and Microsoft-specific licensing.

VSCodium is built from the same public source code but removes proprietary components and telemetry. It is distributed under fully open-source licenses.

Throughout this book, we choose open-source tools whenever possible. Open-source software makes its source code publicly available. Anyone can inspect it, modify it, and redistribute it under its license.

This matters for three reasons:

  1. Transparency — the code can be examined.
  2. Reproducibility — the tools remain accessible.
  3. Longevity — projects do not depend on a single vendor.

If you already use Visual Studio Code, you may continue using it; for the workflow in this book, the two editors behave identically.

3.7.3 VSCodium Supports Multiple Languages and Terminals

VSCodium is not just a text editor. It is a multi-language development environment.

You can:

  • Edit Python, R, Julia, Markdown, and Quarto files
  • Open multiple terminals at the same time
  • Run different shells in different terminals
  • Work in several programming environments simultaneously

For example, you might:

  • Run Python in one terminal
  • Run R in another
  • Use Git in a third
  • Edit a Quarto document side-by-side with your analysis script

All within one window.

This reduces context switching and keeps your project organized.

3.7.4 Installation

Install VSCodium using the method appropriate for your system.

Windows

Install using winget:

winget install -e --id VSCodium.VSCodium

macOS

Install via Homebrew:

brew install --cask vscodium

Linux

Install using your distribution’s package manager or the official packages provided on the VSCodium website.

After installation, launch VSCodium and open the integrated terminal (View — Terminal). Confirm that it can access the tools you installed earlier:

git --version
quarto --version

Also verify the language you chose to work with:

python --version   # if using Python
R --version        # if using R
julia --version    # if using Julia

If these commands run successfully, VSCodium is correctly connected to your system.

3.7.5 Extensions Connect Languages to the Editor

Extensions allow VSCodium to understand specific languages and tools.

Install only what you need. A focused setup is easier to maintain.

Core extensions (recommended for everyone):

  • Quarto — Authoring and rendering Quarto documents
  • GitLens — Enhanced Git history and comparison tools

Language-specific extensions (choose one):

  • Python — Editing, linting, debugging, notebook support
  • R — Script editing and execution support
  • Julia — Language support and code execution

Important:

Extensions do not install Python, R, Julia, Git, or Quarto. They connect VSCodium to tools already installed on your system.

3.7.6 The Integrated Terminal Is Part of the Workflow

The integrated terminal turns VSCodium into a full command-line workspace.

Open it with View — Terminal or Ctrl+` (Control + backtick).

You can open multiple terminals and select different shells for each one. This allows you to:

  • Run Git commands
  • Execute Python, R, or Julia scripts
  • Render Quarto documents
  • Navigate your project directory

On Windows, using Git Bash often makes commands consistent with Unix-like systems. After installing Git for Windows, open the command palette, choose Preferences: Open User Settings (JSON), and add:

"terminal.integrated.defaultProfile.windows": "Git Bash"

Restart the terminal after saving.
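The setting must live inside the top-level object of the settings file. A minimal settings.json containing only this option looks like:

```json
{
  "terminal.integrated.defaultProfile.windows": "Git Bash"
}
```

If your settings file already contains other entries, add the line inside the existing braces, separated from its neighbors by commas.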

3.7.7 Working with Projects, Not Individual Files

In VSCodium, you should open folders, not individual files.

When you open a folder, VSCodium treats it as a project. It tracks file structure, enables Git integration, and supports project-wide search.

Use File — Open Folder and select your project directory (for example, ds4e/).

This reinforces an important habit:

Data science is organized around projects, not isolated scripts.

3.7.8 Good Practices

  • Always open a project folder rather than a single file
  • Keep all work in plain text (scripts, notes, Quarto documents)
  • Use integrated terminals for Git and language commands
  • Keep all files inside your project directory
  • Avoid storing active work on the desktop

These habits support clarity, organization, and reproducibility.

3.8 Gate 5: GitHub (Hosting and Collaboration for Git Projects)

This section assumes you already know the basics of Git on your own computer. GitHub adds sharing and collaboration on top of local Git.

3.8.1 What GitHub Is and Why It Matters

GitHub is an online platform that hosts Git repositories. Git manages project history on your computer, while GitHub stores a synchronized copy online. GitHub adds four capabilities that local Git alone does not provide.

  • Backup: a remote copy protects against local hardware failure.
  • Work anywhere: the same repository can be used on multiple machines; you can work offline and synchronize later.
  • Collaboration: multiple contributors can work safely without overwriting each other.
  • Review: changes can be inspected and discussed before they are merged.

Once a local repository is connected to GitHub, only two commands are needed for routine synchronization.

  • git push uploads committed local changes to GitHub.
  • git pull downloads changes from GitHub and integrates them locally.
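Push and pull can be tried safely without a GitHub account by using a local bare repository as a stand-in remote. The sketch below assumes a Unix-like shell; every path and name in it is a placeholder for the demo.

```shell
set -e
tmp=$(mktemp -d)                      # throwaway directory for the demo

# A bare repository plays the role of GitHub's copy
git init -q --bare "$tmp/remote.git"

# A working repository plays the role of your local project
git init -q "$tmp/work"
cd "$tmp/work"
git config user.email "you@example.com"   # placeholder identity
git config user.name  "Your Name"
git remote add origin "$tmp/remote.git"

echo "# Demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch -M main

git push -q -u origin main            # upload committed changes to the "remote"
git pull -q origin main               # integrate remote changes (a no-op here)
```

Replacing the local path with an SSH URL such as git@github.com:yourname/example-repo.git turns this rehearsal into the real workflow.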

3.8.2 Authentication on GitHub (SSH)

GitHub requires authentication before it accepts git push. A common method is SSH key authentication.

SSH uses a key pair.

  • Private key: stays on your computer and must never be shared.
  • Public key: added to your GitHub account to identify your computer.

Generate a key pair locally.

ssh-keygen -t ed25519 -C "your_email@example.com"

The key pair will be stored in the .ssh folder in your home directory.

Display the public key.

cat ~/.ssh/id_ed25519.pub

Copy the public key to your clipboard.

  • macOS (Terminal):
pbcopy < ~/.ssh/id_ed25519.pub
  • Windows (Git Bash):
cat ~/.ssh/id_ed25519.pub | clip
  • Windows (PowerShell):
Get-Content $env:USERPROFILE\.ssh\id_ed25519.pub | clip
  • Linux / WSL:
sudo apt update
sudo apt install xclip
xclip -selection clipboard < ~/.ssh/id_ed25519.pub

Add the public key on GitHub.

  • Profile menu — Settings
  • SSH and GPG keys — New SSH key
  • Paste the key and save

Test the connection.

ssh -T git@github.com

A successful test prints a short greeting containing your GitHub username, which confirms that SSH authentication is working.

For up-to-date, system-specific authentication instructions, you can also ask an AI assistant. For example:

  • “How do I set up SSH keys for GitHub on Windows using Git Bash and verify that git push works?”

3.8.3 Creating and Publishing a New Repository

This workflow applies only if you already have a local Git repository. If not, create one first with git init.

First, create a new empty repository on GitHub (do not initialize it with a README). Then connect your local repository to GitHub.

git remote add origin git@github.com:yourname/example-repo.git
git branch -M main
git push -u origin main

Here, yourname is your GitHub username and example-repo is the repository name you chose on GitHub.

3.8.4 Cloning an Existing Repository

Cloning is used when the repository already exists on GitHub. Suppose someone is the GitHub username or organization that owns a repository called project. To clone this repository to your own computer, first cd to the folder where you want to put the repository and then run:

git clone git@github.com:someone/project.git

Cloning creates a new local directory containing the full project history.

3.9 First Data Science Project: Putting Them All Together

This project section is a guided recap. You will reuse the same tools from the earlier gates, in the same order.

This chapter integrates everything from Part I—command line, VSCodium, Git, GitHub, Quarto, and your chosen language (Python, R, or Julia)—into a single, coherent project workflow. The goal is to let students experience how real data science work is done: create a clean folder, version it with Git, write a reproducible Quarto file, and share the final analysis.

3.9.1 Choosing a Simple, Meaningful Dataset

The first step in any project is choosing a dataset that is small, clean, and intrinsically interesting. Students should select something they care about so they remain motivated while practicing the workflow.

Good examples include:

  • NYC 311 complaint counts for a single neighborhood
  • School lunch nutrition data from USDA open data
  • A small sports dataset (NBA scores, soccer goals, WNBA box scores)
  • Trends in daily steps from a personal fitness tracker
  • Any two-column CSV they record themselves (date + measurement)

Best practice is to avoid large, messy datasets for this first project. Students should aim to complete an end-to-end analysis rather than get stuck in heavy cleaning.

3.9.2 Setting Up the Project Folder

A clean folder structure helps keep the project reproducible and organized. Students use the command line to create folders and set up a Git repository.

Recommended structure:

  • data/ — raw datasets in CSV or JSON
  • analysis/ — Quarto notebooks
  • figures/ — automatically generated plots
  • README.md — short description of the project

Key steps:

  • Use the command line to create the folder and subfolders (mkdir my-project, cd my-project, mkdir data analysis figures)
  • Initialize Git with git init
  • Create a short README.md, then make the first commit with git add README.md and git commit -m "Initial project structure"

Students should verify that Git is tracking the project by running git status and confirming the working tree is clean.
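The steps above can be run as one sequence. This is a sketch: my-project and the identity settings are placeholders to replace with your own.

```shell
set -e
cd "$(mktemp -d)"                     # demo only: work in a throwaway directory

mkdir my-project && cd my-project
mkdir data analysis figures

git init -q
git config user.email "you@example.com"   # placeholder; use your real identity
git config user.name  "Your Name"

echo "# My Project" > README.md       # short description of the project
git add README.md
git commit -q -m "Initial project structure"

git status --short                    # no output means the working tree is clean
```

After the commit, git status reports a clean working tree, which confirms that Git is tracking the project.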

3.9.3 Writing a Full Quarto Analysis

The core of the project is a reproducible Quarto notebook that explains the data, code, and conclusions in one document. The notebook should include:
  • A clear statement of the question (e.g., “How do 311 noise complaints differ between weekdays and weekends?”)
  • Code to import the dataset
  • Two or three meaningful visualizations (bar plots, line plots, scatterplots, histograms)
  • Short summary paragraphs explaining the patterns

A minimal workflow:

  1. Create analysis/project.qmd in VSCodium.
  2. Add a YAML header with a title, author, and format.
  3. Insert code chunks to load the dataset and inspect its structure.
  4. Generate plots and save outputs to the figures/ folder.
  5. Render the notebook to HTML using the Quarto extension in VSCodium.

Students should keep text and code in one place—not separate PowerPoints, Word files, or screenshots. Quarto ensures everything is reproducible.
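As a sketch, a minimal project.qmd might look like the following. The question, file name complaints.csv, and the Python chunk are all hypothetical, and the chunk assumes pandas is installed.

````markdown
---
title: "311 Noise Complaints: Weekdays vs. Weekends"
author: "Your Name"
format: html
---

## Question

How do 311 noise complaints differ between weekdays and weekends?

```{python}
import pandas as pd

df = pd.read_csv("../data/complaints.csv")  # hypothetical dataset
df.head()
```
````

Rendering this file produces an HTML report in which the text, code, and output appear together.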

3.9.4 Publishing or Sharing Work

Once the analysis renders cleanly, students can make the project public (or share privately).

Options include:

  • Push the project to GitHub with
    git add README.md analysis/project.qmd, git commit, and git push
  • Share the rendered HTML via a GitHub repository
  • Optionally, enable GitHub Pages so the report becomes a public website at https://username.github.io/my-project

This final step completes the full data science cycle: version control, reproducible notebook, and public sharing.