3 Reproducible Data Science

Data science projects should be reproducible to be trustworthy. Dynamic documents facilitate reproducibility. Quarto is an open-source dynamic document preparation system, ideal for scientific and technical publishing. According to the official website, Quarto can be used to:

Create dynamic content with Python, R, Julia, and Observable.
Author documents as plain text markdown or Jupyter notebooks.
Publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more.
Author with scientific markdown, including equations, citations, cross references, figure panels, callouts, advanced layout, and more.

3.1 Introduction to Quarto

To get started with Quarto, see the documentation on the Quarto website.

For a clean style, I suggest that you use VS Code as your IDE. The ipynb files carry extra formatting metadata in plain text, so they are not as clean as qmd files. There are, of course, tools to convert between the two representations of a notebook. For example:

quarto convert hello.ipynb # converts to qmd
quarto convert hello.qmd   # converts to ipynb

We will use Quarto for homework assignments, classnotes, and presentations. You will see them in action through in-class demonstrations. The following sections in the Quarto Guide are immediately useful.

A template for homework is in this repo (hwtemp.qmd) to get you started with homework assignments.

3.2 Compiling the Classnotes

The sources of the classnotes are at https://github.com/statds/ids-s25. This is also the source tree that you will contribute to this semester. I expect that you clone the repository to your own computer, update it frequently, and compile the latest version on your computer (reproducibility).

To compile the classnotes, you need the following tools: Git, Quarto, and Python.

3.2.1 Set up your Python Virtual Environment

I suggest that a Python virtual environment for the classnotes be set up in the current directory for reproducibility. A Python virtual environment is simply a directory with a particular file structure, which contains a specific Python interpreter and software libraries and binaries needed to support a project. It allows us to isolate our Python development projects from our system installed Python and other Python environments.

To create a Python virtual environment for our classnotes:

python3 -m venv .ids-s25-venv

Here .ids-s25-venv is the name of the virtual environment to be created. Choose an informative name. This only needs to be set up once.

To activate this virtual environment:

. .ids-s25-venv/bin/activate

After activating the virtual environment, you will see (.ids-s25-venv) at the beginning of your shell prompt. Then, the Python interpreter and packages needed will be the local versions in this virtual environment, without interfering with your system-wide installation or other virtual environments.
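One way to confirm from within Python that the environment is active is to compare `sys.prefix` with `sys.base_prefix`; the two differ inside an environment created by `venv`. This is a minimal sketch, and the helper name `in_virtualenv` is an illustrative choice, not something defined in the classnotes.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter in use:", sys.executable)
```

If the first line reports False after activation, the shell is probably still picking up a different Python interpreter.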

To install the Python packages that are needed to compile the classnotes, we have a requirements.txt file that specifies the packages and their versions. They can be installed easily with:

pip install -r requirements.txt

If you are interested in learning how to create the requirements.txt file, just put your question into a Google search.
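As a starting point for that search: a requirements file is a plain-text list of pinned packages, one `name==version` per line, and `pip freeze > requirements.txt` is a common way to produce one from the active environment. The sketch below builds the same kind of lines with the standard-library `importlib.metadata` module. The function name `pinned_requirements` is hypothetical, and the result only approximates what `pip freeze` reports.

```python
from importlib import metadata


def pinned_requirements() -> list[str]:
    """List installed distributions as sorted 'name==version' lines."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

Run inside the activated virtual environment, this lists only the packages installed there, which is what makes the file useful for reproducing the environment elsewhere.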

To exit the virtual environment, simply type deactivate in your command line. This will return you to your system’s global Python environment.

3.2.2 Clone the Repository

Clone the repository to your own computer. In a terminal (command line), go to an appropriate directory (folder), and clone the repo. For example, if you use ssh for authentication:

git clone git@github.com:statds/ids-s25.git

3.2.3 Render the Classnotes

Assuming Quarto has been set up, we render the classnotes in the cloned repository:

cd ids-s25
quarto render

If there are error messages, search for solutions and clear them. Otherwise, the HTML version of the notes will be available under _book/index.html, which is the default location of the output.

3.2.4 Login Requirements

For some illustrations, you need to interact with certain sites that require account information. For example, for Google map services, you need to save your API key in a file named api_key.txt in the root folder of the source. Another example is to access the US Census API, where you would need to register an account and get your Census API Key.
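Code that needs the key can then read it from the file instead of hard-coding it. The following is a minimal sketch that assumes the key is the only content of api_key.txt; the function name `read_api_key` is hypothetical.

```python
from pathlib import Path


def read_api_key(path: str = "api_key.txt") -> str:
    """Return the API key stored in a plain-text file, stripped of whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        # An empty file usually means the key was never saved.
        raise ValueError(f"{path} is empty; expected an API key")
    return key
```

Keeping the key in a separate file makes it easy to exclude from version control, so the key itself is never published with the source.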

3.3 The Data Science Life Cycle

This section summarizes Chapter 2 of Veridical Data Science (Yu & Barter, 2024), which introduces the data science life cycle (DSLC). The DSLC provides a structured way to think about the progression of data science projects. It consists of six stages, each with a distinct purpose:

Stage 1: Problem formulation and data collection
Collaborate with domain experts to refine vague questions into ones that can realistically be answered with data. Identify what data already exists or design new collection protocols. Understanding the collection process is crucial for assessing how data relates to reality.
Stage 2: Data cleaning, preprocessing, and exploratory data analysis
Clean data to make it tidy, unambiguous, and correctly formatted. Preprocess it to meet the requirements of specific algorithms, such as handling missing values or scaling variables. Exploratory data analysis (EDA) summarizes patterns using tables, statistics, and plots, while explanatory data analysis polishes visuals for communication.
Stage 3: Exploring intrinsic data structures (optional)
Techniques such as dimensionality reduction simplify data into lower-dimensional forms, while clustering identifies natural groupings among observations. Even if not central to the project, these methods often enhance understanding.
Stage 4: Predictive and/or inferential analysis (optional)
Many projects are cast as prediction tasks, training algorithms like regression or random forests to forecast outcomes. Inference focuses on estimating population parameters and quantifying uncertainty. This book emphasizes prediction while acknowledging inference as important in many domains.
Stage 5: Evaluation of results
Findings should be evaluated both qualitatively, through critical thinking, and quantitatively, through the PCS framework. PCS stands for predictability, computability, and stability:
- Predictability asks whether findings hold up in relevant future data.
- Computability asks whether methods are feasible with available computational resources.
- Stability asks whether conclusions remain consistent under reasonable changes in data, methods, or judgment calls.
  Together, PCS provides a foundation for assessing the reliability of data-driven results.
Stage 6: Communication of results
Results must be conveyed clearly to intended audiences, whether through reports, presentations, visualizations, or deployable tools. Communication should be tailored so findings can inform real-world decisions.

The DSLC is not a linear pipeline—analysts often loop back to refine earlier steps. The chapter also cautions against data snooping, where patterns discovered during exploration are mistaken for reliable truths. Applying PCS ensures that results are not only technically sound but also trustworthy and interpretable across the life cycle.
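To make the stability question from Stage 5 concrete, the sketch below recomputes a simple summary (the sample mean) on resampled versions of a dataset and reports how much it moves. Bootstrap resampling is only one possible data perturbation, chosen here for illustration; it is not presented as the procedure prescribed by the book. The data values and the function name `bootstrap_means` are made up.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=200, seed=0):
    """Recompute the sample mean on bootstrap resamples of the data."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    n = len(values)
    return [
        statistics.fmean(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_resamples)
    ]


if __name__ == "__main__":
    data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.6]  # made-up measurements
    means = bootstrap_means(data)
    print("original mean:", round(statistics.fmean(data), 3))
    print("spread of resampled means:", round(statistics.stdev(means), 3))
```

A small spread relative to the original estimate suggests the conclusion is stable under this particular perturbation; a large spread is a signal to revisit earlier stages.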

3.4 A Primer of Markdown

This section was prepared by Jingang Chen, an undergraduate junior pursuing a dual degree in computer science and statistical data science.

3.4.1 Introduction

This section focuses on the syntax of Markdown, a lightweight markup language that allows the user to write content in plain text. Markdown can be rendered to various formats such as HTML and PDF, and it is widely used in open-source documentation.

3.4.2 Headers

In Markdown, heading levels for sections and subsections are denoted by # (atx-style) at the beginning of the line. The number of hashtags denotes the heading level, with more hashtags indicating lower-level (smaller) headings. There are six heading levels in total.

# Header 1 (Main)
## Header 2 (Subheading)
### Header 3 (Subheading)
#### Header 4
##### Header 5
###### Header 6

A space is required after the hashtags to denote that it is a heading.
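The two rules above (one to six hashtags, then a required space) can be expressed as a short pattern. This is a simplified sketch for illustration, not the parser that Quarto or Pandoc actually uses; the names `ATX_HEADING` and `parse_heading` are hypothetical.

```python
import re

# One to six '#' characters, at least one space or tab, then the heading text.
ATX_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$")


def parse_heading(line: str):
    """Return (level, text) for an atx-style heading line, or None otherwise."""
    match = ATX_HEADING.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


if __name__ == "__main__":
    for sample in ["# Header 1 (Main)", "### Header 3 (Subheading)", "##No space"]:
        print(repr(sample), "->", parse_heading(sample))
```

A line such as `##No space` is not recognized because the space after the hashtags is missing, and seven hashtags are rejected because only six levels exist.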

Headings can also be denoted by underlining the text with = or - signs, though this works only for first- and second-level headers, respectively.

Header 1
===
Header 2
----

Any number of = and - signs will work to create those top two headings.

3.4.3 Paragraph and Line Break Convention

To separate text into paragraphs, make sure there is at least one blank line between the blocks of text.

This is the first paragraph, which can contain multiple lines, and as long as there
is no blank line in it, this block of text is one paragraph.

This is an example of a second paragraph. This one is separated by a blank line
from the paragraph above to denote a different paragraph.

Output:

This is the first paragraph, which can contain multiple lines, and as long as there is no blank line in it, this block of text is one paragraph.

This is an example of a second paragraph. This one is separated by a blank line from the paragraph above to denote a different paragraph.

To create a line break within a paragraph, end whatever line you’re on with two spaces, then press enter/return to start a new line. The break tag <br> is also sufficient.

This is one line.  
This is a line that is separated from the line above using two spaces. <br>
This is a third line separated using the `<br>` break tag.

Output:

This is one line.
This is a line that is separated from the line above using two spaces.
This is a third line separated using the <br> break tag.

Text enclosed in <> in Markdown is an HTML tag, which offers another way to structure the document.

3.4.4 Horizontal Rules

Horizontal rules visually separate document sections. In Markdown, create one by placing three or more of one of three characters on a line by itself: *, -, or _.


paragraph 1.

******

paragraph 2.

------

paragraph 3.

______

paragraph 4.

Output:

paragraph 1.

paragraph 2.

paragraph 3.

paragraph 4.

When using horizontal rules, make sure each one has a blank line above and below it. There shouldn’t be any text directly adjacent to it.

3.4.5 Text Formatting

3.4.5.1 Bolding and Italics

Italic text uses single asterisks (*) or underscores (_) around the text, while bold text uses double asterisks or underscores. To make the text both bold and italic, put three asterisks or underscores around the text.

Syntax	Output
`*Italicized*` `_Italicized_`	Italicized
`**Bolded**` `__Bolded__`	Bolded
`***bold & italics***` `___bold & italics___`	bold & italics

Emphasis can also be placed within a word, but only * can be used for this, not _.

Syntax	Output
`superfragalist expialidocious`	superfragalist expialidocious

3.4.5.2 Strikethrough

Strikethrough uses ~~ around the text to cross it out.

Syntax	Output
`~~strikethrough text~~`	~~strikethrough text~~

3.4.5.3 Superscript and Subscript

To superscript, use ^ around the desired text.

For subscript, use ~ around the desired text.

Syntax	Output
`superscript^2^`	superscript²
`subscript~2~`	subscript₂

3.4.5.4 Underlining and Highlighting

To underline, bracket the desired text with [], and then follow that using {.underline}.

To highlight, bracket the desired text with [], and follow that using {.mark}. Alternatively you can start the text with <mark> and end it with </mark>.

Syntax	Output
`[underlined text]{.underline}`	underlined text
`[highlighted]{.mark}` `<mark>highlighted</mark>`	highlighted

3.4.5.5 Escape Characters

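If you want to display Markdown syntax characters literally, put a backslash (\) immediately before each one. In the example below, the backslash at the end of a line is not part of the escaping; it forces a line break.

\# Not a heading\
\*\*not bolded\*\*\
\[Not a link]\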

Output:

# Not a heading
**not bolded**
[Not a link]
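When text is generated by a program, the escaping can be automated. The sketch below prefixes every character in a fixed set of Markdown punctuation with a backslash. The set is an assumption made for illustration, and the approach deliberately over-escapes (for example, every period), which is harmless but noisier than escaping by hand. The name `escape_markdown` is hypothetical.

```python
import re

# Punctuation that Markdown may interpret as formatting syntax.
_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!>~^|])")


def escape_markdown(text: str) -> str:
    """Prefix each Markdown-significant character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


if __name__ == "__main__":
    for sample in ["# Not a heading", "**not bolded**", "[Not a link]"]:
        print(escape_markdown(sample))
```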

3.4.6 Blockquotes

Blockquotes are a way to highlight quoted content or important information in the document. They are denoted by > at the beginning of the line.

> This is a blockquote
>
> second blockquote
>
> third blockquote

Output:

This is a blockquote

second blockquote

third blockquote

To keep blockquote paragraphs separated, make sure that each one is separated from the next by a line containing only >. Placing blockquote lines adjacent to each other will result in all the text being in the same paragraph.

> These blockquotes
> are not separated and
> are all in one line

Output:

These blockquotes are not separated and are all in one line

Blockquotes can also be nested by using multiple > in one line.

> This is a blockquote
>
> > This is a nested blockquote
> >
> > > Third level blockquote

Output:

This is a blockquote

This is a nested blockquote

Third level blockquote

3.4.7 Lists

Lists and nested lists can be structured in Markdown either unordered or ordered. For nested lists, make sure to include 4 spaces to properly indent the nested list (applies to both ordered and unordered).

3.4.7.1 Unordered Lists

For unordered lists, *, +, or - can be used to make a list.

* Item 1
    * Subitem
        * Another subitem
* Item 2
* Item 3

and

+ Item 1
    + Subitem
        + Another subitem
+ Item 2
+ Item 3

- Item 1
    - Subitem
        - Another subitem
- Item 2
- Item 3

as well as a mix of all three:

* Item 1
    + Subitem
        - Another subitem
* Item 2
* Item 3

all yield the same output:

- Item 1
    - Subitem
        - Another subitem
- Item 2
- Item 3

3.4.7.2 Ordered Lists

Ordered lists use numbers with periods.

1. Item 1
2. Item 2
    1. Sub item
3. Item 3

The numbers don’t necessarily have to be ordered, and they can be duplicated as well. Whatever number the list starts on will be the one that it will count from no matter the numbers that come after it.

1. Item 1
1. Item 2
    1. Sub Item
1. Item 3

1. Item 1
4. Item 2
    2. Sub Item
7. Item 3

All three of these lists will yield the same result:

1. Item 1
2. Item 2
    1. Sub Item
3. Item 3

However, if the list were to start on 2, it would start counting from 2 no matter the order that follows.

2. Item 1
2. Item 2
5. Item 3

Output:

2. Item 1
3. Item 2
4. Item 3
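The numbering rule can be summarized in a few lines: only the first marker matters, and every later item counts up from it. The sketch below models that rule for a flat list (no nesting); the function name `rendered_numbers` is hypothetical.

```python
def rendered_numbers(markers):
    """Numbers displayed for a flat ordered list, given the typed markers."""
    if not markers:
        return []
    start = markers[0]  # only the first marker determines the numbering
    return [start + offset for offset in range(len(markers))]


if __name__ == "__main__":
    for typed in ([1, 1, 1], [1, 4, 7], [2, 2, 5]):
        print(typed, "->", rendered_numbers(typed))
```

This is why typing 1. for every item is a convenient habit: items can be reordered or inserted without renumbering the source.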

3.4.7.3 Task List

To denote a list as a series of tasks, start each line with - [ ] for an unchecked task or - [x] for a checked task.

- [ ] Task 1
- [x] Task 2

Output:

☐ Task 1
☑ Task 2

3.4.7.4 Definition lists

Definition lists can be created with the following convention:

term
: definition

term2
: definition2

Output:

term: definition
term2: definition2

3.4.7.5 Some Additional Notes About Lists

The lists described above support a few additional features.

A list can continue after a break in between. For ordered lists, the numbering still follows through after an interruption.

1. Item 1

interruption text

2. Item 2

Output:

1. Item 1

interruption text

2. Item 2

Text that isn’t numbered or in bullet points can also be added below a list item by indenting it four spaces. Code chunks can be added in this case as well.

1. ordered list
2. item 2
    continued after indenting 4 spaces
    ```python
    print("Hello, World!")
    ```
    A. sub-sub-item 1

Output:

1. ordered list
2. item 2

continued after indenting 4 spaces
```
print("Hello, World!")
```
1. sub-sub-item 1

3.4.7.6 Inline Code

Markdown allows for inline code and code blocks.

To insert inline code, use backticks (`) around the text.

This is an example of `inline code` in a text.

Output: This is an example of inline code in a text.

To display a literal ` inside inline code, surround the code with double backticks (``) and put a space between them and the `.

3.4.7.7 Code Blocks

To create code blocks, ``` can be used.

    ```
    some code
    ```

Output:

some code

Alternatively, code blocks can be created by indenting each line of code by four or more spaces.

    This is code using the indentation of four spaces
        This line has more than four spaces

A language can also be added to specify the language of the code blocks if ``` is used.

```python
print('some python code')
```

Output:

print('some python code')

To make the code executable, put {} around the language name in the opening fence of the code block.

    ```{python}
    print('some python code')
    ```

Output:

print('some python code')

some python code

3.4.8 Formulas and Equations

Markdown supports LaTeX-style expressions. Mathematical expressions can be written inline (enclosed by $) or as display math (enclosed by $$).

Syntax	Output
`Inline: $x^2 + y^2 = z^2$`	Inline: $x^2 + y^2 = z^2$
`Display:` `$$x^2 + y^2 = z^2$$`	Display: \[x^2 + y^2 = z^2 \tag{3.1}\]

For more information on how to use LaTeX expressions, visit https://www.overleaf.com/learn and look under the “Mathematics” section.

3.4.9 Link Embedding

Links in Markdown can either be added inline or as a reference.

To add an inline link, use [] around the text that leads to the link, followed by actually inputting the link in ().

This is an example link that will lead to the
[Markdown Guide](https://www.markdownguide.org/).

Output: This is an example link that will lead to the Markdown Guide.

Alternatively, for inline links, the URL can be directly added without linking it to text by just using <> around the URL.

Link to <https://www.markdownguide.org/>.

Output: Link to https://www.markdownguide.org/.

For reference-style links, the link text in [] is followed by a second [] holding a label (a number or text), and the label is defined separately together with its URL.

Example of reference-style link leading to the [Markdown Guide][link].

[link]: https://www.markdownguide.org

Output:

Example of reference-style link leading to the Markdown Guide.
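To see how the two parts of a reference-style link fit together, the sketch below collects the label definitions and then pairs every `[text][label]` reference with its URL. The patterns are simplified assumptions that cover the form shown above, not a full Markdown parser, and the names are hypothetical.

```python
import re

# A definition line such as "[link]: https://example.org".
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)
# A reference such as "[Markdown Guide][link]".
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def resolve_reference_links(markdown: str):
    """Pair each [text][label] reference with the URL its label defines."""
    # Labels are matched without regard to letter case.
    urls = {label.lower(): url for label, url in DEFINITION.findall(markdown)}
    return [
        (text, urls.get(label.lower()))  # None when the label is undefined
        for text, label in REFERENCE.findall(markdown)
    ]


if __name__ == "__main__":
    source = (
        "Example of reference-style link leading to the [Markdown Guide][link].\n"
        "\n"
        "[link]: https://www.markdownguide.org\n"
    )
    print(resolve_reference_links(source))
```

A reference whose label has no definition comes back paired with None, which is a quick way to spot broken links before rendering.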

3.4.10 Images

To insert an image into markdown, the convention is to first add !, followed by a caption to the image wrapped in [], and finally the file path or URL to the image enclosed with ().

In addition, the image can be turned into a link by first wrapping the three parts mentioned above in [], followed by the link target wrapped in (). The image source can be either a local file path stored on your computer or a direct URL of an image found online. For reproducibility, this example uses a direct URL pointing to an online image.

[![UConn Husky Logo](https://lofrev.net/wp-content/photos/2016/06/uconn_huskies_logo.jpg)](https://uconn.edu/)

Output: (image: UConn Husky Logo, linked to https://uconn.edu/)

3.4.11 Tables

To create tables, the | character is used to separate the table into columns, while - is used to separate the header row from the rest of the data. A caption can be added below the table, after a blank line, on a line starting with :.

| Col 1 | Col 2 | Col 3 | Col 4 |
|------|-----|----------|-------|
|   a  |  b  |    c     |  d    |
|  e   |  f  |  g       |  h    |
|   i  |   j |    k     |    l  |

: Sample Table 1

Output:

Table 3.1: Sample Table 1

Col 1	Col 2	Col 3	Col 4
a	b	c	d
e	f	g	h
i	j	k	l

Columns can also be aligned to the left, right, or center by adding a colon (:) to the left, right, or both sides of the -’s in the separator row.

| Right Col | Left Col | Center | Default |
|----------:|:---------|:------:|---------|
|        a  |       b  |    c   |    d    |
|  e        |       f  |   g    |    h    |
|   i       |        j |  k     |      l  |

: Sample Table 2

Output:

Sample Table 2
Right Col	Left Col	Center	Default
a	b	c	d
e	f	g	h
i	j	k	l
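Typing pipe tables by hand gets tedious for data that already lives in Python. The sketch below builds the table text, including the alignment colons, from a list of headers and rows. The function name `to_pipe_table` and the alignment codes 'l', 'r', and 'c' are hypothetical choices for this example.

```python
def to_pipe_table(headers, rows, align=None):
    """Build a pipe table; align holds 'l', 'r', 'c', or None per column."""
    # Separator cells: the colon position encodes the column alignment.
    rules = {"l": ":---", "r": "---:", "c": ":---:", None: "---"}
    align = align or [None] * len(headers)

    def line(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    out = [line(headers), line(rules[a] for a in align)]
    out.extend(line(row) for row in rows)
    return "\n".join(out)


if __name__ == "__main__":
    print(to_pipe_table(
        ["Right Col", "Left Col", "Center", "Default"],
        [["a", "b", "c", "d"], ["e", "f", "g", "h"]],
        align=["r", "l", "c", None],
    ))
```

The cells are not padded to equal width; as the examples above show, the spacing inside cells does not affect the rendered table.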

3.4.12 Cross Referencing

Before a section can be cross-referenced, it must have a label. Attach one by adding {} after the section heading, with a label inside that starts with #sec- followed by an identifier of your choice.

### Tables {#sec-tables}

Then, to reference the section, use @ followed by the label to create a direct link to the section. Optionally, you can wrap it in [] and add any additional text to the link.

Refer back to the Tables section in [section @sec-tables].

Output:

Refer back to the Tables section in section 3.4.11.

To refer to figures and images, make sure to input the label after the URL. Make sure in the label to start with #fig- to specify that it is a figure that is being referenced.

[![UConn Husky Logo](https://lofrev.net/wp-content/photos/2016/06/uconn_huskies_logo.jpg){#fig-sample}](https://uconn.edu/)

Now it can be referred back with the same convention as referencing a section:

This refers to @fig-sample in the Images section.

Output:

This refers to Figure 3.1 in the Images section.

To cross reference tables, include the label on the line where the caption is specified. Make sure to start the label with #tbl- to specify that it’s a table that is being referenced.

| Col 1 | Col 2 | Col 3 | Col 4 |
|------|-----|----------|-------|
|   a  |  b  |    c     |  d    |
|  e   |  f  |  g       |  h    |
|   i  |   j |    k     |    l  |

: Sample Table 1 {#tbl-sample1}

Now to reference the table back:

@tbl-sample1 refers back to the first table in the Tables section.

Output:

Table 3.1 refers back to the first table in the Tables section.

Finally, to reference an equation, make sure to include the label after the equation, outside of the $$, starting with #eq-.

$$x^2 + y^2 = z^2$$ {#eq-sample}

Now to reference the equation from the Equations section back:

@eq-sample is the Pythagorean theorem referenced from the Equations section.

Output:

Equation 3.1 is the Pythagorean theorem referenced from the Equations section.
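A common mistake is to reference a label that was never attached to anything. The sketch below scans a document for the four prefixes covered in this section and lists references that have no matching `{#...}` label. It is a rough check under simplifying assumptions (for example, it does not distinguish a reference from an e-mail address containing the same text), and the names are hypothetical.

```python
import re

# A reference: @ followed by one of the four prefixes and an identifier.
CROSSREF = re.compile(r"@((?:sec|fig|tbl|eq)-[A-Za-z0-9_-]+)")
# A label definition such as {#fig-sample}.
LABEL = re.compile(r"\{#((?:sec|fig|tbl|eq)-[A-Za-z0-9_-]+)\}")


def undefined_crossrefs(text: str):
    """Return referenced labels that are never defined in the text."""
    defined = set(LABEL.findall(text))
    return [ref for ref in CROSSREF.findall(text) if ref not in defined]


if __name__ == "__main__":
    source = (
        "### Tables {#sec-tables}\n"
        "Refer back to [section @sec-tables] and to @fig-sample.\n"
    )
    print(undefined_crossrefs(source))
```

In this example the section label is defined but the figure label is not, so only the figure reference is reported.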

3.4.13 Footnotes

Footnotes, often denoted by superscripts, are placed at the bottom of a page and provide additional information or references related to a specific part of the text. To insert a footnote, write [] in the text with ^ as the first character inside the brackets, followed by the name of the footnote. Then define the footnote on a new line by repeating the marker, followed by a colon and the note itself.

This is where you can place a footnote,[^1] sometimes multiple can be placed in 
one sentence.[^longnote]

[^1]: This is a footnote.

[^longnote]: This is a long footnote, which can have paragraphs.
   
    Make sure to use the four spaces to indent so that the subsequent paragraphs 
    belong to the same footnote.

    ```
    {code can also be inserted in here}
    ```

    End footnote

Separate paragraph here to show that this isn't part of the footnote.

Output:

This is where you can place a footnote,¹ sometimes multiple can be placed in one sentence.²

Separate paragraph here to show that this isn’t part of the footnote.

3.4.14 Conclusion

Markdown is a versatile markup language that simplifies writing for the web in a way that is readable and convenient. Today, it is widely used to present the work that is being done in a variety of areas, and Markdown provides a clean way to organize the content being presented and structure the documents well.

3.4.15 Further Reading

¹ This is a footnote.

² This is a long footnote, which can have paragraphs.

Make sure to use the four spaces to indent so that the subsequent paragraphs belong to the same footnote.
```
{code can also be inserted in here}
```
End footnote