3 Reproducible Data Science
Data science projects should be reproducible to be trustworthy. Dynamic documents facilitate reproducibility. Quarto is an open-source dynamic document preparation system, ideal for scientific and technical publishing. According to the official website, Quarto can be used to:
- Create dynamic content with Python, R, Julia, and Observable.
- Author documents as plain text markdown or Jupyter notebooks.
- Publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more.
- Author with scientific markdown, including equations, citations, cross references, figure panels, callouts, advanced layout, and more.
3.1 Introduction to Quarto
To get started with Quarto, see the documentation on the Quarto website.
For a clean workflow, I suggest that you use VS Code as your IDE. An ipynb file is stored as JSON with extra formatting and metadata, so its plain text is not as clean as that of a qmd file. There are, of course, tools to convert between the two representations of a notebook. For example:
quarto convert hello.ipynb # converts to qmd
quarto convert hello.qmd   # converts to ipynb
We will use Quarto for homework assignments, classnotes, and presentations. You will see them in action through in-class demonstrations. The following sections in the Quarto Guide are immediately useful.
A template for homework is in this repo (hwtemp.qmd) to get you started with homework assignments.
3.2 Compiling the Classnotes
The sources of the classnotes are at https://github.com/statds/ids-f24. This is also the source tree that you will contribute to this semester. I expect that you clone the repository to your own computer, update it frequently, and compile the latest version on your computer (reproducibility).
To compile the classnotes, you need the following tools: Git, Quarto, and Python.
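A quick way to confirm that these tools are installed is to look for them on your PATH. The following is a minimal sketch, not part of the classnotes repository: the function name find_missing_tools is hypothetical, and the executable names git, quarto, and python3 are assumptions (on Windows, for instance, the Python executable is often named python).

```python
import shutil


def find_missing_tools(tools=("git", "quarto", "python3")):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If a tool is reported as missing, install it or adjust the assumed executable name before moving on.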
3.2.1 Set up your Python Virtual Environment
I suggest that a Python virtual environment for the classnotes be set up in the current directory for reproducibility. A Python virtual environment is simply a directory with a particular file structure, which contains a specific Python interpreter and software libraries and binaries needed to support a project. It allows us to isolate our Python development projects from our system installed Python and other Python environments.
To create a Python virtual environment for our classnotes:
python3 -m venv .ids-f24-venv
Here .ids-f24-venv is the name of the virtual environment to be created. Choose an informative name. This only needs to be set up once.
To activate this virtual environment:
. .ids-f24-venv/bin/activate
After activating the virtual environment, you will see (.ids-f24-venv) at the beginning of your shell prompt. From then on, the Python interpreter and packages in use will be the local versions in this virtual environment, without interfering with your system-wide installation or other virtual environments.
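To verify from within Python that the active interpreter really is the one in the virtual environment, you can compare sys.prefix with sys.base_prefix. This is a small sketch for environments created with the venv module, as above; the helper name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

When the environment is active, the interpreter path printed should lie inside the .ids-f24-venv directory.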
To install the Python packages that are needed to compile the classnotes, we have a requirements.txt file that specifies the packages and their versions. They can be installed easily with:
pip install -r requirements.txt
If you are interested in learning how to create the requirements.txt file, just put your question into a Google search.
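As a starting point, a common way to produce such a file is to run python -m pip freeze > requirements.txt inside the activated environment. The sketch below shows the same idea in Python using the standard library; it illustrates the name==version format and is not necessarily how the requirements.txt in the classnotes repository was produced. The function names format_requirements and installed_packages are hypothetical.

```python
from importlib.metadata import distributions


def format_requirements(packages):
    """Turn (name, version) pairs into sorted 'name==version' lines."""
    # Drop incomplete entries and duplicates before formatting.
    unique = {(name, version) for name, version in packages if name and version}
    ordered = sorted(unique, key=lambda pair: (pair[0].lower(), pair[1]))
    return [f"{name}=={version}" for name, version in ordered]


def installed_packages():
    """Yield (name, version) for each distribution visible to this interpreter."""
    for dist in distributions():
        yield dist.metadata.get("Name"), dist.version


if __name__ == "__main__":
    # Print the pinned list; redirect the output to a file to save it.
    print("\n".join(format_requirements(installed_packages())))
```

Pinning exact versions this way is what lets someone else recreate the same environment later.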
To exit the virtual environment, simply type deactivate in your command line. This will return you to your system’s global Python environment.
3.2.2 Clone the Repository
Clone the repository to your own computer. In a terminal (command line), go to an appropriate directory (folder), and clone the repo. For example, if you use ssh for authentication:
git clone git@github.com:statds/ids-f24.git
3.2.3 Render the Classnotes
Assuming Quarto has been set up, we render the classnotes in the cloned repository:
cd ids-f24
quarto render
If there are error messages, search for solutions and clear them. Otherwise, the HTML version of the notes will be available under _book/index.html, which is the default location of the output.
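Once rendering finishes, a short check can confirm that the output landed where expected. This is a sketch that assumes the default _book output directory mentioned above and that it is run from the root of the cloned repository; the helper name rendered_index is hypothetical.

```python
from pathlib import Path


def rendered_index(project_dir="."):
    """Return the path to _book/index.html under project_dir, or None if absent."""
    index = Path(project_dir) / "_book" / "index.html"
    return index if index.is_file() else None


index = rendered_index(".")
if index is None:
    print("No rendered notes found; run `quarto render` first.")
else:
    # A file:// URI can be pasted directly into a browser's address bar.
    print("Open this file in a browser:", index.resolve().as_uri())
```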
