Introduction to Data Science

STAT 3255/5255 @ UConn, Spring 2026

Authors
Published

February 17, 2026

Preliminaries

The notes are developed with Quarto; for details about Quarto, visit https://quarto.org/docs/books.

Acknowledgement

These lecture notes for STAT 3255/5255 in Spring 2026 will be built upon the notes from Professor Jun Yan and former students enrolled in the course.

For those interested, class notes from Fall 2025, Spring 2025, Fall 2024, Spring 2024, Spring 2023, and Spring 2022 are publicly accessible. These archives offer insights into the evolution of the course content and the different perspectives brought by successive student cohorts.

Sources at GitHub

We will adopt a cooperative approach, facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our GitHub repository at https://github.com/statds/ids-s26.

Students will be asked to contribute to the notes by submitting pull requests to our GitHub repository. This method not only enriches the course material but also gives students practical experience in collaborative software development and version control.

Compiling the Classnotes

To reproduce the classnotes output on your own computer, here are the necessary steps. See Section Compiling the Classnotes for details.

  • Clone the classnotes repository to an appropriate location on your computer; see Chapter 2  Project Management for using Git.
  • Set up a Python virtual environment in the root folder of the source; see Section Virtual Environment.
  • Activate your virtual environment.
  • Install all the packages specified in requirements.txt in your virtual environment:
pip install -r requirements.txt
  • Some chapters interact with sites that require account information. For example, for Google map services, you need to save your API key in a file named api_key.txt in the root folder of the source.
  • Render the book with quarto render from the root folder on a terminal; the rendered book will be stored under _book.
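
A chapter that needs the key could read it from api_key.txt along the following lines. This is a minimal sketch, not code from the class notes: the helper name `load_api_key` and its `root` argument are hypothetical, and it assumes the file holds nothing but the key.

```python
from pathlib import Path


def load_api_key(root="."):
    """Return the API key stored in api_key.txt under the given folder."""
    key_file = Path(root) / "api_key.txt"
    if not key_file.exists():
        raise FileNotFoundError(
            f"{key_file} not found; save your API key there first."
        )
    # Strip the trailing newline that most editors add.
    return key_file.read_text(encoding="utf-8").strip()
```

Reading the key at render time keeps it out of the qmd sources, as long as api_key.txt itself is not tracked by Git.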

Midterm Exam

TBD

Final Project

Students are encouraged to start designing their final projects from the beginning of the semester. There are many open datasets that can be used. Here is a list of useful data challenges:

Adapting to Rapid Skill Acquisition

In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):

When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.

This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and use the vast resources available to solve complex problems in data science. Example tasks include installing needed software (or even hardware) and searching for solutions to problems as they arise.

Wishlist

This is a wish list from all members of the class (alphabetical order, last name first, comma, then first name). Here is an example.

  • Chen, Kun
    • Introduce practical data science tools to undergraduates.
    • Pass real-world data science project experience to students.
    • Teach students to think critically and statistically.

Add yours through a pull request; note the syntax of nested lists in Markdown.

Students in STAT 3255

  • Bennett, Emily
    • Become more experienced and confident with Python
    • Gain experience with data science problems and relate them to the real world
    • Develop better understanding in GitHub, Git, and VSCode
  • Budnick, Kayleigh
    • Develop a strong understanding of the data science process from start to finish
    • Gain experience working with political science datasets
    • Improve skills in Python and Git
  • Burns, Kyle
    • Get better at working with Python, Git, and other common data science tools
    • Gain experience with working on an actual data science project
    • Apply my knowledge of statistical methods to real world data science problems
  • Carbone, Vincenzo
    • Connect data science to the world of sports analytics.
    • Improve my literacy with VS Code, Quarto, and Git.
    • Have an in-depth understanding of Python code.
  • Davis, Reid
    • Focus on real world data and modeling to solve problems
    • Work start to finish on a full fledged data science project
    • Improve my knowledge of Python, Git, Quarto and working in repositories
  • Desai, Alysha
    • Gain more hands-on experience cleaning, organizing, and breaking apart real-world datasets to better understand how raw data is transformed into usable insights.
    • Improve my skills in preparing presentations of my findings and become more confident explaining more complex topics to an audience.
    • Become more familiar with version control tools like Git and GitHub
  • Faisal, Zaynab
    • Gain a better understanding of data science and real life applications
    • Become more comfortable using Git/GitHub
    • Improve python and coding skills
  • Ibrahim, Omar
    • I want to gain expertise in using tools such as Git and Quarto in order to effectively communicate and collaborate through projects.
    • I want to learn how to clean and preprocess raw data to prepare it for projects.
    • Lastly, I want to learn how to ask the right questions and solve them using data.
  • Jiang, Ryan
    • Enhance my Python and coding abilities
    • Introduce myself to the world of Data Science
    • Become more comfortable presenting findings on specific data and explaining them beyond the words on the slides.
  • Jones, Cody
    • Understand the role of Git in Data Science
    • Improve my communication and collaboration skills
    • Strengthen my skills with python
  • Kwak, Jinha
    • Become more comfortable with Python and more proficient at using editors
    • Learn how to use Git and GitHub for collaboration
    • Find topics I’m interested in where data science can be applied to and learn how to apply it
  • Lacasse, Violet
    • Learn how to find/grab/handle datasets.
    • Become familiar with Git and how it can be used to collaborate.
    • Learn more about data science and how it can help in my data analysis (domain: earth data science) major.
  • Landolphi, Joseph
    • Improve my efficiency with Python
    • Be able to work with real-world data sets for future results
    • Gain experience using data sets for sports analytics
  • Lawrence, Claire
    • Become more comfortable coding in Python
    • Learn to use Git and Github to collaborate with peers
    • Work on a project from start to finish using real datasets and topics that interest me
  • Liu, Kevin
    • Use Git better in a collaborative environment and learn more complex commands
    • Apply data analysis to translate data to real-world results
    • Understand the use of Quarto in data analysis
  • Mccabe, Scott
  • Mohan, Harish
    • Enhance my skills in python
    • Gain expertise in working with raw data and translating it to real world results
    • Learn more about the data pipeline and how to use it in the real world
  • Orsini, Ronnie
    • Learn how to apply data science techniques to real world datasets
    • Improve collaboration skills using Git and GitHub
    • Better understand Quarto for data analysis
  • Patel, Reesha
    • Learn about the process between identifying a problem/question and using data science to find solutions/get insights.
    • Improve my knowledge of packages in Python (scikit-learn, seaborn).
    • Apply what I’ve learned in my classes to actual, real-world datasets.
  • Patel, Vrajkumar
  • Sawyer, Riley
    • Learn how to process and interpret real data using Python.
    • Work in a collaborative setting using GitHub.
    • Gain experience using Quarto and GitHub.
  • Sudarsanam, Shreya
    • Improve my understanding of how to choose which programming language to use (i.e. R, Python, and SQL), and how to evaluate the strengths and weaknesses of popular languages in each step of the data analysis pipeline.
    • Become more familiar with machine learning in research and industry based contexts.
    • Learn/Demystify the process of completing a data science project from start to finish. This may involve mining, cleaning, and visualizing data and creating machine learning models to help make valuable inferences.
  • Tessman, Sean
    • Learn how data science applies to sports analytics and performance.
    • Build a strong reproducible workflow with Git, Quarto, and Python.
    • Gain experience with real-world datasets and examples. Also learn how to identify the problem at hand and how to find a solution using data science.
  • Trnka, Jonathan
    • I want to get more comfortable using github, linux commands, and more.
    • Learn more about ML and AI modeling.
    • I want to apply what I know to real world data sets from beginning to end.
  • Watanabe, Sara
    • Learn how to effectively use GitHub for collaboration on data
    • Learn how to apply data science techniques to real world problems
    • Improve my Python programming skills for data analysis
  • Wolven, Alexander
    • Build hirable skills in Python, Quarto, and Git through projects.
    • Learn about machine learning and AI modeling.
    • Gain experience working with real datasets from cleaning to analysis and clear presentation.
  • Zhang, Jianan
    • Learn how to use Git and GitHub effectively for collaborative data science projects.
    • Improve Python data analysis skills (NumPy, Pandas, Matplotlib) through applied projects.
  • Zharyy, Sofia
    • To increase my comfort with Git and GitHub; I have never used them for a course before.
    • To learn different and applicable methods that intermingle geographic information science and statistical data science.
    • Improve my research capabilities to better equip me for related careers.

Students in STAT 5255

  • Last name, First name
    • Wish 1
    • Wish 2
    • Wish 3

Course Logistics

Topic Presentation Orders

The topic presentation order is set up in class.

## Read the undergraduate and graduate rosters, one name per line
with open('rosters/3255.txt', 'r') as file:
    ug = [line.strip() for line in file]
with open('rosters/5255.txt', 'r') as file:
    gr = [line.strip() for line in file]
## presenters = ug + gr
presenters = [x for x in (ug + gr) if x]   # removes empty lines
print(f"Number of presenters: {len(presenters)}")

import random
## seed jointly set by the class
seed_s26 = 723 + 2026 + 125
print(f"Random seed: {seed_s26}")
random.seed(seed_s26)
## Draw a random permutation; the original list is left unchanged
random.sample(presenters, len(presenters))
## random.shuffle(presenters) # This would shuffle the list in place

Number of presenters: 30
Random seed: 2874
['Desai, Alysha',
 'Kwak, Jinha',
 'Jackson, Brooke',
 'Sudarsanam, Shreya',
 'Sawyer, Riley',
 'Faisal, Zaynab',
 'Tessman, Sean',
 'Nash, Jayden',
 'Watanabe, Sara',
 'Jiang, Ryan',
 'Jones, Cody',
 'Davis, Reid',
 'Lawrence, Claire',
 'Trnka, Jonathan',
 'Orsini, Ronnie',
 'Patel, Vrajkumar',
 'Liu, Kevin',
 'Landolphi, Joseph',
 'Burns, Kyle',
 'Mohan, Harish',
 'Carbone, Vincenzo',
 'Bennett, Emily',
 'Budnick, Kayleigh',
 'Ibrahim, Omar',
 'Patel, Reesha',
 'Zhang, Jianan',
 'Mccabe, Scott',
 'Lacasse, Violet',
 'Wolven, Alexander',
 'Zharyy, Sofia']
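
Anyone can regenerate the order above because the seed is fixed before sampling. The sketch below illustrates that idea on a hypothetical four-name roster. The helper `draw_order` is not part of the class notes, and it uses a local `random.Random` generator rather than the global seed used above.

```python
import random

# Hypothetical roster, for illustration only.
roster = ["Doe, Jane", "Roe, Richard", "Poe, Edgar", "Lee, Ann"]


def draw_order(names, seed):
    """Return a shuffled copy of names, determined by the seed."""
    rng = random.Random(seed)  # local generator; global state untouched
    return rng.sample(names, len(names))


first = draw_order(roster, seed=2874)
second = draw_order(roster, seed=2874)
print(first == second)                  # True: same seed, same order
print(sorted(first) == sorted(roster))  # True: a permutation of the roster
```

Because the input list must also be identical, the roster files need to stay unchanged for the draw to be reproduced.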

Switching slots is allowed as long as you find someone who is willing to switch with you. In this case, make a pull request to switch the order and let me know.

You are welcome to choose the topic that interests you most, subject to some order restrictions. For example, decision trees should be presented before random forests or extreme gradient boosting. Such dependencies can justify requests to switch slots.

Presentation Task Board

Talk to the professor about your topics at least one week prior to your scheduled presentation. Here are some example tasks:

  • Markdown jumpstart
  • Import/Export data
  • Data manipulation with Pandas
  • Accessing US census data
  • Database operation with Structured Query Language (SQL)
  • Grammar of graphics
  • Visualizing spatial data
  • Spatial data with GeoPandas
  • Visualize spatial data in a Google map with gmplot
  • Animation
  • Statistical analysis for proportions and rates
  • False discovery rate control
  • Principal component analysis
  • Multi-dimensional scaling
  • t-SNE
  • Uniform manifold approximation and projection (UMAP)
  • Autoencoders
  • K-means clustering
  • Finite mixture model
  • Least absolute shrinkage and selection operator (Lasso)
  • Logistic regression and its extensions
  • Support vector machine
  • Random forest
  • Gradient boosting machine
  • Neural networks basics
  • MLP/ANN/CNN/RNN/LSTM
  • Deep learning
  • Natural language processing
  • Large language models (LLM)
  • LLM agents
  • Automatic differentiation
  • Reinforcement learning
  • Developing a Python package
  • Web scraping

Topic Presentation Schedule

The topic presentation is worth 20 points. It includes:

  • Topic selection consultation in advance (4 points).
  • Delivering the presentation in class (8 points). Your presentation should be about 20 minutes.
  • Contributing to the class notes within two weeks following the presentation (8 points).

Please use the following table to sign up.

| Date | Presenter | Topic |
|---|---|---|
| 02/16 | Desai, Alysha | K-Means Clustering |
| 02/16 | Kwak, Jinha | Data manipulation with Pandas |
| 02/18 | Sudarsanam, Shreya | Database operation with Structured Query Language (SQL) |
| 02/18 | | |
| 02/23 | Sawyer, Riley | |
| 02/23 | Faisal, Zaynab | Import/Export Data |
| 02/25 | Tessman, Sean | Web Scraping |
| 02/25 | | |
| 03/02 | Watanabe, Sara | Random Forest |
| 03/02 | Jiang, Ryan | Animation |
| 03/04 | Jones, Cody | Visualizing Spatial Data |
| 03/04 | Davis, Reid | Markov Chains in Python |
| 03/09 | Lawrence, Claire | Visualizing Spatial Data |
| 03/09 | Trnka, Jonathan | LLM Agents |
| 03/11 | Orsini, Ronnie | Logistic Regression |
| 03/11 | Patel, Vrajkumar | |
| 03/23 | Liu, Kevin | Natural language processing |
| 03/23 | Landolphi, Joseph | Exploratory analysis of sports/softball data |
| 03/25 | Burns, Kyle | Support vector machine |
| 03/25 | Mohan, Harish | Large language models (LLM) |
| 03/30 | Carbone, Vincenzo | Neural networks (used in predicting sports statistics) |
| 03/30 | Bennett, Emily | Lasso |
| 04/01 | Budnick, Kayleigh | Statistical analysis for proportions and rates |
| 04/01 | Ibrahim, Omar | |
| 04/06 | Patel, Reesha | Intro to Power BI & Tableau for Data Visualization |
| 04/06 | Zhang, Jianan | |
| 04/08 | Mccabe, Scott | |
| 04/08 | Lacasse, Violet | Spatial data with GeoPandas |
| 04/13 | Wolven, Alexander | Neural Networks Basics |
| 04/13 | Zharyy, Sofia | |

Final Project Presentation Schedule

We use the same order as the topic presentations for the undergraduate final presentations. An introduction on how to use Quarto to prepare presentation slides is available under the templates directory in the classnotes source tree, thanks to Zachary Blanchard; it can be used as a starting template.

| Date | Presenter |
|---|---|
| 04/15 | |
| 04/20 | |
| 04/22 | |
| 04/27 | |
| 04/29 | |
| Scheduled Final Exam Time | |

Contributing to the Class Notes

Contributions to the class notes are made through a pull request.

  • Synchronize your local repo of the classnotes with my classnotes repo.
  • Start a new branch and switch to the new branch.
  • On the new branch, add a qmd file for your presentation.
  • If using Python, create and activate a virtual environment with requirements.txt.
  • Work on your qmd file, test with quarto render.
  • When satisfied, commit and make a pull request with your Quarto files and an updated requirements.txt.
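
The steps above map onto a short sequence of shell commands, sketched below as a dry run that only prints them. Several names are assumptions for illustration: the remote `upstream` is taken to point at the instructor's repository and `origin` at your fork, the default branch is taken to be `main`, and the branch, file, and function names are hypothetical. Setting up the virtual environment is left out.

```python
import shlex


def contribution_commands(branch, qmd_file, message):
    """List the shell commands for one contribution, in order."""
    return [
        # Bring the local copy up to date (remote/branch names assumed).
        ["git", "pull", "upstream", "main"],
        # Create a new branch and switch to it.
        ["git", "switch", "-c", branch],
        # Render the new section to test it.
        ["quarto", "render", qmd_file],
        # Stage and commit the sources.
        ["git", "add", qmd_file, "requirements.txt"],
        ["git", "commit", "-m", message],
        # Publish the branch to your fork before opening a pull request.
        ["git", "push", "-u", "origin", branch],
    ]


# Print the commands instead of running them.
for cmd in contribution_commands("my-topic", "_mytopic.qmd", "Add my topic"):
    print(shlex.join(cmd))
```

The pull request itself is then opened from the pushed branch on GitHub.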

I have added a template file _mysection.qmd as an example, which is included in index.qmd. See also how _ethics.qmd is included in 05-ethics_communication.qmd.

Here is a checklist to help smooth the process.

  • Get approval for your topic at least one week in advance. Otherwise you lose points.
  • No plagiarism. Under no circumstances should you copy someone else’s notes and use them for your contribution.
  • No YAML header. The whole source tree is controlled by _quarto.yml.
  • The top heading level of your contribution is section (##). See existing sections for examples.
  • Keep line width under 80 characters.
  • Include a subsection (###) on further readings.
  • Avoid dependence on external files (e.g., data, images, etc.). Use example datasets that are already in the data folder or that come with Python packages.
  • No usage of copyrighted images.
  • When citing article/book references, use BibTeX (learn how from our sources).
  • Test on your own computer before making a pull request.
  • Send me your presentation two days in advance if you want feedback.
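
A few of the checklist items are mechanical enough to check before making a pull request. The sketch below is one possible self-check, not a tool that ships with the class notes. The function name `check_section` is hypothetical, and the rules it applies (a header is a first line of `---`, headings are lines starting with `#` outside fenced code chunks) are simplifying assumptions.

```python
def check_section(text, max_width=80):
    """Return a list of checklist problems found in qmd source text."""
    problems = []
    lines = text.splitlines()
    # A YAML header starts with '---' on the very first line.
    if lines and lines[0].strip() == "---":
        problems.append("has a YAML header")

    headings = []
    in_fence = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("```"):
            in_fence = not in_fence  # entering or leaving a code chunk
        elif not in_fence and line.startswith("#"):
            headings.append(line)    # comments inside chunks are skipped
        if len(line) >= max_width:
            problems.append(f"line {number} is {len(line)} characters wide")

    if not headings or not headings[0].startswith("## "):
        problems.append("top heading is not a section (##)")
    has_readings = any(
        h.startswith("### ") and "reading" in h.lower() for h in headings
    )
    if not has_readings:
        problems.append("no subsection (###) on further readings")
    return problems


sample = "## My Topic\n\nSome text.\n\n### Further Readings\n\nA link.\n"
print(check_section(sample))  # []
```

An empty list means none of these four checks failed; the remaining items, such as plagiarism and image copyright, still need human judgment.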

For more detailed style guidance, please see notes on statistical writing.

Plagiarism is to be prevented. Remember that these class notes are publicly available online with your names attached. Here are some resources on how to avoid plagiarism.

Homework Logistics

Workflow of Submitting Homework Assignment

  • Click the GitHub Classroom assignment link in the HuskyCT announcement.
  • Accept the assignment and follow the instructions to create an empty repository.
  • Clone the repo into an appropriate folder on your own computer with git clone.
  • Go to this folder, add your qmd source, work on it, and group your changes into separate commits.
  • Push your work to your GitHub repo with git push.
  • Create a new release and put the generated pdf file in it for ease of grading.

Requirements

  • Use the repo from GitHub Classroom to submit your work. See Chapter 2  Project Management.
    • Keep the repo clean (do not track generated files).
      • Never “Upload” your files; use the git command line.
      • Make commit messages informative (think about the readers).
    • Make at least 10 commits and form a style of frequent small commits.
  • Track quarto sources only in your repo. See Chapter 3  Reproducible Data Science.
  • For the convenience of grading, add your standalone html or pdf output to a release in your repo.
  • For standalone pdf output, you will need to have LaTeX installed.
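
One way to check the "keep the repo clean" requirement is to scan the list of tracked files for rendered output. The sketch below assumes that `.html` and `.pdf` files are the generated ones, which may not match every project. The function name is hypothetical, and the paths stand in for what `git ls-files` would list.

```python
from pathlib import PurePosixPath

# Assumed set of generated outputs; adjust to your own project.
GENERATED_SUFFIXES = {".html", ".pdf"}


def tracked_generated_files(tracked):
    """Return tracked paths that look like rendered output."""
    return [
        path for path in tracked
        if PurePosixPath(path).suffix.lower() in GENERATED_SUFFIXES
    ]


# Paths as `git ls-files` would list them (hypothetical example).
tracked = ["hw1.qmd", "README.md", "hw1.pdf", "figs/plot.html"]
print(tracked_generated_files(tracked))  # ['hw1.pdf', 'figs/plot.html']
```

Similarly, `git rev-list --count HEAD` reports the number of commits on the current branch, which can be compared against the ten-commit minimum.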

Quizzes about Syllabus

  • Do I accept late homework?
  • Could you list a few examples of email etiquette?
  • How would you lose style points?
  • Would you use CLI and GUI?
  • What’s the first date on which you have to complete something about your final project?
  • Can you use AI for any task in this course?
  • If you need a reference letter, how could you help me to help you?

Practical Tips

Data analysis

  • Use an IDE so you can play with the data interactively
  • Collect code that has been tested into a script for batch processing
  • During data cleaning, keep in mind how each variable will be used later
  • Do not keep large data files in a repo; agree on a reasonable location with your collaborators

Presentation

  • Don’t forget to introduce yourself if there is no moderator.
  • Highlight your research questions and results, not code.
  • Give an outline, carry it out, and summarize.
  • Use your own examples to reduce the risk of plagiarism.

My Presentation Topic (Template)

This section was prepared by John Smith.

Use Markdown syntax. If not clear on what to do, learn from the class notes sources.

  • Pay attention to the sectioning levels.
  • Cite references with their bib key.
  • In examples, use datasets that the class is familiar with as much as possible.
  • You may use datasets that come with Python packages or that can be downloaded on the fly.
  • Test your section by quarto render <filename.qmd>.

Introduction

Here is an overview.

Sub Topic 1

Put materials on topic 1 here.

Python examples can be put into python code chunks:

# import pandas as pd

# do something

Sub Topic 2

Put materials on topic 2 here.

Sub Topic 3

Put materials on topic 3 here.

Conclusion

Put summaries here.

Further Readings

Put links to further materials.