Introduction to Data Science

STAT 3255/5255 @ UConn, Fall 2025

Author

Jun Yan and Students in Fall 2025

Published

September 16, 2025

Preliminaries

The notes were developed with Quarto; for details about Quarto, visit https://quarto.org/docs/books.

This book is free and is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.

Sources at GitHub

These lecture notes for STAT 3255/5255 in Fall 2025 represent a collaborative effort between Professor Jun Yan and the students enrolled in the course. This cooperative approach to education was facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our GitHub repository at https://github.com/statds/ids-f25.

Students contributed to the lecture notes by submitting pull requests to our GitHub repository. This method not only enriched the course material but also provided students with practical experience in collaborative software development and version control.

For those interested, class notes from Spring 2025, Fall 2024, Spring 2024, Spring 2023, and Spring 2022 are also publicly accessible. These archives offer insights into the evolution of the course content and the different perspectives brought by successive student cohorts.

Compiling the Classnotes

To reproduce the classnotes output on your own computer, here are the necessary steps. See Section Compiling the Classnotes for details.

  • Clone the classnotes repository to an appropriate location on your computer; see Chapter 2  Project Management for using Git.
  • Set up a Python virtual environment in the root folder of the source; see Section Virtual Environment.
  • Activate your virtual environment.
  • Install all the packages specified in requirements.txt in your virtual environment:
pip install -r requirements.txt
  • Some chapters interact with sites that require account information. For example, for Google Maps services, you need to save your API key in a file named api_key.txt in the root folder of the source.
  • Render the book with quarto render from the root folder on a terminal; the rendered book will be stored under _book.
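
As an illustration of the API key step above, here is a minimal sketch of how a chapter might load the key at render time. The helper name read_api_key is hypothetical and not part of the classnotes source; only the file name api_key.txt comes from the steps above.

```python
from pathlib import Path


def read_api_key(path="api_key.txt"):
    """Return the key stored in a plain-text file, stripped of whitespace.

    Hypothetical helper: the default path assumes the code runs from the
    root folder of the source, where api_key.txt is expected to live.
    """
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"{key_file} not found; save your API key there before rendering."
        )
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} is empty.")
    return key
```

Keeping the key in a separate file, rather than in a qmd source, lets the file be excluded from version control so the key never reaches a public repository.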

Midterm Project

Reproduce NYC street flood research (Agonafir, Lakhankar, et al., 2022; Agonafir, Pabon, et al., 2022).

Four students will be selected to present their work in a workshop at the 2025 NYC Open Data Week. You are welcome to invite your family and friends to join the workshop.

Final Project

Students are encouraged to start designing their final projects from the beginning of the semester. Many open datasets are available for this purpose. Here is a list of useful data challenges:

Adapting to Rapid Skill Acquisition

In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):

When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.

This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and utilize the vast resources available to solve complex problems in data science. Example tasks include installing needed software (or even hardware) and searching for solutions to problems you encounter.

Wishlist

This is a wish list from all members of the class (alphabetical order, last name first, comma, then first name). Here is an example.

  • Yan, Jun
    • Make practical data science tools accessible to undergraduates.
    • Pass real-world data science project experience to students.
    • Co-develop a Quarto book in collaboration with the students.
    • Train students to participate in real data science competitions.

Add yours through a pull request; note the syntax of nested lists in Markdown.

Students in 3255

  • Agostino, Michael Angelo
    • Further understanding of github and improve my workflow.
    • Build a project in order to gain real world data science experience.
  • Blake, Roger Cooley
    • Gain experience using git and github
    • Complete a project that transforms real-world data into valuable insights
    • Master control over my computer and its file system
  • Cao, Enshuo
  • Chen, Irene
    • Become proficient in using Git and Python
    • Deepen machine learning and data science skills
  • Chen, Jingang Calvin
    • gain exposure and hands on experience to machine learning models
    • become proficient in Git and Github
    • apply statistical and data analysis skills to a data science project
  • Fazzina, Sophia Carmen
    • Get comfortable using Git and GitHub
    • Work on a data science project from start to finish
    • Get a good grade in this class
  • Haerter, Alejandro Erik
  • Lawrence, Claire Elise
  • Levine, Hannah Maya
  • Lucey, Sonia Niamh
    • Learn practical applications of Data Science and Economics
    • Learn how to use Git/GitHub
    • Improve Python/coding skills
  • Mayer-Costa, Jaden Paulo
    • Gain hands on experience working with real-world data
    • Become proficient in using Git and Github.
    • Continue adding to python skills and using code editing software.
  • Milun, Lewis Aaron
    • Become proficient in Git
    • Learn more about the processes involved in Data Science
  • Montalvo, Victor Samuel
  • Patel, Sahil Sanjay
    • I want to learn more about time series forecasting
    • I want to be more comfortable using git and GitHub
    • I want to learn about the applications of Data Science in Finance
  • Patel, Tulsi Manankumar
  • Perkins, Jack Thomas
    • Be able to incorporate git into my own workflow.
    • Use python to boost efficiency in data problems.
    • Become more comfortable with python for data science.
  • Saltus, Quinn Lloyd Turner
    • Gain proficiency in data visualization with Python
    • Build experience using version control to expand the scope of my projects
    • Learn when and how to apply libraries (such as numpy & pytorch) to improve my code’s performance
    • Become familiar with machine learning tools and techniques
  • Saxena, Aanya
  • Schlessel, Jacob E
    • Learn about different classification algorithms
    • Practice using python for analyzing data
    • Become comfortable with Git and Github
  • Sgro, Owen Bentley
  • Tang, Wilson Chen
    • I want to be very comfortable with using GitHub and GitBash
    • I want to learn how to have a clear style
    • I want to use programming tools in a professional way
  • Tran, Justin
    • To become proficient in the knowledge of Git.
    • To adopt command line knowledge into the workforce.
    • Fostering good practices with commits.
  • White, Abigail Lynn
    • Become more comfortable working with GitHub.
    • Learn and memorize more commands used in the Terminal.
    • Advance my statistical skills through a data science project.
  • Wishneski, Emma Irene
  • Yoon, Jessica Nari
  • Zhang, Mark Justin
    • Get comfortable with command line and Git
    • Learn to make my own data science projects
    • learn theory and application of ML algorithms

Students in 5255

  • Anzalone, Matthew James
    • Build professional workflow habits for a data-science career
    • Increase my Python knowledge enough that I could eventually create usable packages
    • Improve my data-driven thinking outside of the bounds of economics
  • Gomez-Haibach, Konrad
  • Plotnikov, Alexander
    • Gain working knowledge of Git and project management.
    • Learn AI and Machine Learning algorithms.
    • Apply skillset by working on different projects.

Course Logistics

Presentation Orders

The topic presentation order is set up in class.

import random

## read the undergraduate (3255) and graduate (5255) rosters, one name per line
with open('rosters/3255.txt', 'r') as file:
    ug = [line.strip() for line in file]
with open('rosters/5255.txt', 'r') as file:
    gr = [line.strip() for line in file]
presenters = ug + gr

## seed jointly set by the class
random.seed(2819 + 4075 + 6227 + 5139 + 4768 + 109)
## draw a random permutation; the original list is left unchanged
random.sample(presenters, len(presenters))
## random.shuffle(presenters) # This would shuffle the list in place
['Chen,Jingang Calvin',
 'Montalvo,Victor Samuel',
 'Cao,Enshuo',
 'Chen,Irene',
 'Haerter,Alejandro Erik',
 'Saxena,Aanya',
 'Gomez-Haibach,Konrad',
 'Sgro,Owen Bentley',
 'Lucey,Sonia Niamh',
 'Yoon,Jessica Nari',
 'Patel,Sahil Sanjay',
 'Plotnikov,Alexander',
 'Milun,Lewis Aaron',
 'Patel,Tulsi Manankumar',
 'Perkins,Jack Thomas',
 'Fazzina,Sophia Carmen',
 'Tang,Wilson Chen',
 'Wishneski,Emma Irene',
 'Lawrence,Claire Elise',
 'White,Abigail Lynn',
 'Mayer-Costa,Jaden Paulo',
 'Anzalone,Matthew James',
 'Zhang,Mark Justin',
 'Saltus,Quinn Lloyd Turner',
 'Tran,Justin',
 'Levine,Hannah Maya',
 'Blake,Roger Cooley',
 'Schlessel,Jacob E',
 'Agostino,Michael Angelo']
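
The draw above is reproducible because the seed is fixed before sampling. The following sketch illustrates this with a made-up roster and an arbitrary seed (both are assumptions for illustration, not the class roster or the class seed): resetting the same seed yields the same order, and random.sample leaves the input list untouched.

```python
import random

# Hypothetical roster; the real one is read from the rosters/ files.
presenters = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]

random.seed(2025)  # arbitrary seed chosen for this illustration
order_a = random.sample(presenters, len(presenters))

random.seed(2025)  # resetting the same seed reproduces the same draw
order_b = random.sample(presenters, len(presenters))

print(order_a == order_b)                     # same seed, same order
print(sorted(order_a) == sorted(presenters))  # a permutation of the roster
print(presenters)                             # input list is unchanged
```

Anyone who runs the class code with the agreed seed and the same roster files can therefore verify the published order.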

Switching slots is allowed as long as you find someone who is willing to switch with you. In this case, make a pull request to switch the order and let me know.

You are welcome to choose the topic that interests you the most, subject to some order restrictions. For example, decision trees should be presented before random forests or extreme gradient boosting. Such dependencies can justify requests for switching slots.

Presentation Task Board

Talk to the professor about your topics at least one week prior to your scheduled presentation. Here are some example tasks:

  • Markdown jumpstart
  • Effective data science communication
  • Import/Export data
  • Data manipulation with Pandas
  • Accessing US census data
  • Arrow as a cross-platform data format
  • Statistical analysis for proportions and rates
  • Database operations with Structured Query Language (SQL)
  • Grammar of graphics
  • Handling spatial data
  • Spatial data with GeoPandas
  • Visualize spatial data in a Google map with gmplot
  • Animation
  • Support vector machine
  • Random forest
  • Gradient boosting machine
  • Naive Bayes
  • Neural networks basics
  • MLP/ANN/CNN/RNN/LSTM
  • Uniform manifold approximation and projection
  • Automatic differentiation
  • Deep learning
  • TensorFlow
  • Autoencoders
  • K-means clustering
  • Principal component analysis
  • Reinforcement learning
  • Developing a Python package
  • Web scraping
  • Personal webpage on GitHub
  • Making presentations with Quarto

Topic Presentation Schedule

The topic presentation is worth 20 points. It includes:

  • Consulting on topic selection one week in advance (4 points).
  • Delivering the presentation in class (10 points).
  • Contributing to the class notes within two weeks following the presentation (6 points).

Please use the following table to sign up.

Date Presenter Topic
09/18 Chen, Jingang Calvin
09/23 Montalvo, Victor Samuel
09/25 Cao, Enshuo
09/30 Chen, Irene
10/02 Haerter, Alejandro Erik
10/02 Saxena, Aanya
10/07 Gomez-Haibach, Konrad
10/07 Sgro, Owen Bentley
10/09 Lucey, Sonia Niamh
10/09 Yoon, Jessica Nari
10/14 Patel, Sahil Sanjay
10/14 Plotnikov, Alexander
10/16 Milun, Lewis Aaron
10/16 Patel, Tulsi Manankumar
10/23 Perkins, Jack Thomas
10/23 Fazzina, Sophia Carmen
10/28 Tang, Wilson Chen
10/28 Wishneski, Emma Irene
10/30 Lawrence, Claire Elise
11/04 White, Abigail Lynn
11/04 Mayer-Costa, Jaden Paulo
11/06 Anzalone, Matthew James
11/06 Zhang, Mark Justin
11/11 Saltus, Quinn Lloyd Turner
11/11 Tran, Justin
11/11 Levine, Hannah Maya
11/13 Blake, Roger Cooley
11/13 Schlessel, Jacob E
11/13 Agostino, Michael Angelo

Final Project Presentation Schedule

We use the same order as the topic presentations for the undergraduate final presentations. An introduction on how to use Quarto to prepare presentation slides is available under the templates directory in the classnotes source tree, thanks to Zachary Blanchard; it can be used as a starting template.

Date Presenter
11/18 Chen, Jingang Calvin; Montalvo, Victor Samuel; Cao, Enshuo; Chen, Irene; Haerter, Alejandro Erik
11/20 Saxena, Aanya; Sgro, Owen Bentley; Lucey, Sonia Niamh; Yoon, Jessica Nari; Patel, Sahil Sanjay
12/02 Milun, Lewis Aaron; Patel, Tulsi Manankumar; Perkins, Jack Thomas; Fazzina, Sophia Carmen; Tang, Wilson Chen
12/04 Wishneski, Emma Irene; Lawrence, Claire Elise; White, Abigail Lynn; Mayer-Costa, Jaden Paulo; Zhang, Mark Justin
12/??? Saltus, Quinn Lloyd Turner; Tran, Justin; Levine, Hannah Maya; Blake, Roger Cooley; Schlessel, Jacob E; Agostino, Michael Angelo
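
The grouping in the table above can be sketched in code. This is only an illustration under stated assumptions: the function split_into_sessions, the placeholder names, and the rule that leftover presenters join the last session are hypothetical, chosen because they reproduce the group sizes shown in the table.

```python
def split_into_sessions(order, n_sessions):
    """Split an ordered list into n_sessions consecutive groups.

    Each group gets len(order) // n_sessions presenters; any leftover
    presenters are placed in the last group (an assumed rule).
    """
    if n_sessions < 1 or len(order) < n_sessions:
        raise ValueError("need at least one presenter per session")
    size = len(order) // n_sessions
    cut = size * (n_sessions - 1)  # start index of the last group
    groups = [order[i:i + size] for i in range(0, cut, size)]
    groups.append(order[cut:])
    return groups


# Placeholder names standing in for the 29 presenters in topic order.
topic_order = [f"student_{i:02d}" for i in range(1, 30)]
# Hypothetical set of graduate students, who are excluded here.
graduate = {"student_07", "student_12", "student_22"}

undergrad_order = [s for s in topic_order if s not in graduate]
sessions = split_into_sessions(undergrad_order, 5)
print([len(g) for g in sessions])  # [5, 5, 5, 5, 6]
```

Filtering before splitting keeps the undergraduates in their topic-presentation order, which is the property the schedule relies on.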

Contributing to the Class Notes

Contributions to the class notes are made through a pull request.

  • Start a new branch and switch to the new branch.
  • On the new branch, add a qmd file for your presentation.
  • If using Python, create and activate a virtual environment with requirements.txt.
  • Edit _quarto.yml and add a line for your qmd file to include it in the notes.
  • Work on your qmd file, test with quarto render.
  • When satisfied, commit and make a pull request with your quarto files and an updated requirements.txt.

I have added a template file _mysection.qmd as an example, which is included in index.qmd. See also how _ethics.qmd is included in intro.qmd.

Tips on making contributions:

  • No plagiarism.
  • Avoid external graphics.
  • Use simulated data.
  • Use data from homework assignments.
  • Cite article/book references (learn how from our sources).
  • Include a subsection of Further Readings.
  • Test on your own computer before making a pull request.
  • Send me your presentation two days in advance for feedback.

For more detailed style guidance, please see my notes on statistical writing.

Plagiarism is to be prevented. Remember that these class notes are publicly available online with your names attached. Here are some resources on how to avoid plagiarism. In particular, in our course, one convenient way to avoid plagiarism is to use our own data (e.g., NYC Open Data). Combined with your own explanation of the code chunks, it would be hard to plagiarize.

Homework Logistics

Workflow of Submitting Homework Assignments

  • Click the GitHub Classroom assignment link in the HuskyCT announcement.
  • Accept the assignment and follow the instructions to create an empty repository.
  • Make a clone of the repo at an appropriate folder on your own computer with git clone.
  • Go to this folder, add your qmd source, work on it, and group your changes to different commits.
  • Push your work to your GitHub repo with git push.
  • Create a new release and put the generated pdf file in it for ease of grading.

Requirements

  • Use the repo from GitHub Classroom to submit your work. See Chapter 2  Project Management.
    • Keep the repo clean (no tracking generated files).
      • Never “Upload” your files; use the git command lines.
      • Make commit message informative (think about the readers).
    • Make at least 10 commits and form a style of frequent small commits.
  • Track quarto sources only in your repo. See Chapter 3  Reproducible Data Science.
  • For the convenience of grading, add your standalone html or pdf output to a release in your repo.
  • For standalone pdf output, you will need to have LaTeX installed.
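
To check the "keep the repo clean" requirement before pushing, one option is to scan the list of tracked files for generated output. The sketch below is a hypothetical helper, not part of the course tooling: the function name and the set of extensions are assumptions, and the example paths are made up. In practice the list of tracked paths could come from the output of git ls-files.

```python
from pathlib import PurePosixPath

# Assumed set of "generated" extensions; adjust it to your own workflow.
GENERATED_SUFFIXES = {".pdf", ".html", ".aux", ".log"}


def find_generated_files(tracked_paths):
    """Return the tracked paths whose extension marks them as generated."""
    return [
        p for p in tracked_paths
        if PurePosixPath(p).suffix.lower() in GENERATED_SUFFIXES
    ]


# Made-up example of what `git ls-files` might list in a homework repo.
tracked = [
    "hw1.qmd",
    "README.md",
    "hw1.pdf",
    "hw1_files/figure.html",
    ".gitignore",
]
print(find_generated_files(tracked))  # ['hw1.pdf', 'hw1_files/figure.html']
```

Any file the check reports belongs in a release rather than in the tracked history, and its pattern is a candidate for .gitignore.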

Quizzes about Syllabus

  • Do I accept late homework?
  • Could you list a few examples of email etiquette?
  • How would you lose style points?
  • Would you use CLI and GUI?
  • What’s the first date on which you have to complete something about your final project?
  • Can you use AI for any task in this course?
  • If you need a reference letter, how could you help me to help you?

Practical Tips

Data analysis

  • Use an IDE so you can play with the data interactively
  • Collect code that has been tested into a script for batch processing
  • During data cleaning, keep in mind how each variable will be used later
  • Do not keep large data files in a repo; agree on a reasonable location with your collaborators

Presentation

  • Don’t forget to introduce yourself if there is no moderator.
  • Highlight your research questions and results, not code.
  • Give an outline, carry it out, and summarize.
  • Use your own examples to reduce the risk of plagiarism.

My Presentation Topic (Template)

This section was prepared by John Smith.

Use Markdown syntax. If not clear on what to do, learn from the class notes sources.

  • Pay attention to the sectioning levels.
  • Cite references with their bib key.
  • In examples, maximize usage of datasets that the class is familiar with.
  • Could use datasets in Python packages or downloadable on the fly.
  • Test your section by quarto render <filename.qmd>.

Introduction

Here is an overview.

Sub Topic 1

Put materials on topic 1 here.

Python examples can be put into python code chunks:

# import pandas as pd

# do something

Sub Topic 2

Put materials on topic 2 here.

Sub Topic 3

Put materials on topic 3 here.

Conclusion

Put summaries here.

Further Readings

Put links to further materials.