Introduction to Data Science

STAT 3255/5255 @ UConn, Fall 2024

Authors

Jun Yan and Students in Fall 2024:

Jack Bienvenue

Zachary Blanchard

Sara Clokey

Thea Johnson

Rahul Manna

Julia Mazzola

Deyu Xu

Published

November 11, 2024

Preliminaries

The notes were developed with Quarto; for details about Quarto, visit https://quarto.org/docs/books.

This book is free and is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.

Sources at GitHub

These lecture notes for STAT 3255/5255 in Fall 2024 represent a collaborative effort between Professor Jun Yan and the students enrolled in the course. This cooperative approach to education was facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our GitHub repository at https://github.com/statds/ids-f24.

Students contributed to the lecture notes by submitting pull requests to our GitHub repository. This method not only enriched the course material but also provided students with practical experience in collaborative software development and version control.

For those interested in exploring the lecture notes from previous years, the Spring 2024, Spring 2023, and Spring 2022 editions are also publicly accessible. These archives offer insights into the evolution of the course content and the different perspectives brought by successive student cohorts.

Compiling the Classnotes

To reproduce the classnotes output on your own computer, here are the necessary steps:

  • Clone the classnotes repository to an appropriate location on your computer.
  • Set up a Python virtual environment in the root folder of the source.
  • Install all the packages specified in requirements.txt.
  • For some chapters that need to interact with Google map services, you need to save your API key in a file named api_key.txt in the root folder of the source.
  • Render the book with quarto render from the root folder on a terminal; the rendered book will be stored under _book.
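As a minimal sketch of how a chapter might pick up the key saved in api_key.txt, the helper below reads the file and strips surrounding whitespace. The function name read_api_key is hypothetical and is not taken from the classnotes source; only the file name api_key.txt comes from the steps above.

```python
from pathlib import Path


def read_api_key(path="api_key.txt"):
    """Return the API key stored in a plain-text file (hypothetical helper)."""
    key_file = Path(path)
    if not key_file.is_file():
        # Fail early with a message that points back to the setup step.
        raise FileNotFoundError(
            f"{key_file} not found; save your API key there first."
        )
    # Strip the trailing newline that most editors add.
    return key_file.read_text(encoding="utf-8").strip()
```

Keeping the key in a separate file, rather than in a qmd file, makes it easier to leave the key out of version control.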

Midterm Project

NYC noise complaints made to NYPD in the week of July 4, 2024. See details in the exercises.

Final Project

Students are encouraged to start designing their final projects from the beginning of the semester. Many open datasets are available for this purpose. Here is a list of data challenges that you may find useful:

If you work on sports analytics, you are welcome to submit a poster to UConn Sports Analytics Symposium (UCSAS) 2024.

Adapting to Rapid Skill Acquisition

In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):

When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.

This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and utilize the vast resources available to solve complex problems in data science. Example tasks include installing needed software (or even hardware) and searching for solutions to problems encountered along the way.

Wishlist

This is a wish list from all members of the class (alphabetical order, last name first, comma, then first name). Here is an example.

  • Yan, Jun
    • Make practical data science tools accessible to undergraduates
    • Co-develop a Quarto book in collaboration with the students
    • Train students to participate in real data science competitions

Add yours through a pull request; note the syntax of nested lists in Markdown.

  • Akach, Suha
    • Challenge and push myself to be better at Python and all its libraries.
    • Be confident in my abilities of programming and making statistical inferences that are correct.
    • Be able to create my own personal project in class on time.
  • Astle, Jaden
    • I’ve used Git before, but I’d like to become more comfortable using it and get more used to different issues that arise.
    • I’d like to learn more effective ways to “tell the story” of data analysis and show empowering visualizations.
    • I’d like to explore more methods that professional data scientists use in their model trainings to share with UConn’s Data Science Club.
  • Babiec, Owen
    • Become more comfortable with Git and GitHub and their applications
    • Better understand the Data Science pipeline and workflow
    • Learn how to show my skills I have learned in this class during interviews
  • Baptista, Stef
    • Develop a project/presentation suitable enough for industry
    • Improve on my data science skills regarding pandas and numpy
    • Understanding the scope of packages in Python as a language
  • Bienvenue, Jack
    • Learn professional visualization techniques, particularly for geospatial data
    • Foster a high level of working knowledge of Git
    • Create a small portfolio of examples and projects for later reference
  • Blanchard, Zachary
    • Gain experience working and collaborating on projects in Git
    • Improve computer programming skills and familiarity with Python
    • Teach other students about creating presentations using Quarto
  • Borowski, Emily
    • Gain a greater understanding of Quarto and GitHub
    • Become more comfortable with my coding abilities
    • Acquire a deeper understanding of data science
  • Clokey, Sara
    • Become more familiar with GitHub and Quarto
    • Execute a data science project from start to finish
  • Desroches, Melanie
    • Explore the field of data science as a possible future career
    • Develop data science and machine learning skills
    • Become better at programming with Python and using Git/GitHub
  • Febles, Xavier
    • Gain a further understanding of GitHub
    • Develop data visualization skills
    • Learn applications of skills learned in previous courses
  • Jha, Aansh
    • Be a better student of the data science field
    • Hone Git to work in collaborative workspaces
    • Learn better methods in data visualization
  • Johnson, Dorothea
    • Enter data science contests
    • Familiarize myself with using Python for data science
    • Develop a proficiency in GitHub
  • Kashalapov, Olivia
    • Better understand neural networks
    • Machine learning utilizing Python
    • Creating and analyzing predictive models for informed decision making
  • Manna, Rahul
    • Use knowledge gained and skills developed in class to study real-world problems such as climate change.
    • Obtain a basic understanding of machine learning
  • Mazzola, Julia
    • Become proficient in Git and GitHub.
    • Have a better understanding of data science best practices and techniques.
    • Deepen my knowledge of Python programming concepts and libraries.
  • Paricharak, Aditya
    • Master the command-line interface
    • Apply my statistical knowledge and skills to course work
    • Understand how to work with datasets
  • Parvez, Mohammad Shahriyar
    • Familiarizing myself with GitHub to effectively track and manage the entire data analysis process.
    • Adopting Quarto for improved documentation of my data workflows.
    • Exploring advanced techniques for data analysis and visualization.
    • Developing my personal Git repository and publishing data projects as a professional website.
  • Tan, Qianruo
    • Learn how to use GitHub, and create my own page
    • Get a good grade in this class
    • Learn more about how to process data
  • Xu, Deyu
    • Be proficient in using Python to process data.
    • Learn the basics of machine learning.
    • Have a basic understanding of data science.
    • Lay a solid foundation for GNN and Bayes neural network.

Presentation Orders

The topic presentation order is set up in class.

with open('rosters/3255.txt', 'r') as file:
    ug = [line.strip() for line in file]
with open('rosters/5255.txt', 'r') as file:
    gr = [line.strip() for line in file]
presenters = ug + gr
target = "Blanchard"  # pre-arranged 1st presenter
presenters = [name for name in presenters if target not in name]

import random
## seed jointly set by the class
random.seed(5347 + 2896 + 9050 + 1687 + 63)
random.sample(presenters, len(presenters))
## random.shuffle(presenters) # This would shuffle the list in place
['Xu,Deyu',
 'Clokey,Sara Karen',
 'Johnson,Dorothea Trixie',
 'Febles,Xavier Milan',
 'Cai,Yizhan',
 'Bienvenue,Jack Noel',
 'Mazzola,Julia Cecelia',
 'Akach,Suha',
 'Manna,Rahul',
 'Astle,Jaden Bryce',
 'Kashalapov,Olivia',
 'Borowski,Emily Helen',
 'Tan,Qianruo',
 'Desroches,Melanie',
 'Paricharak,Aditya Sushant',
 'Jha,Aansh',
 'Babiec,Owen Thomas',
 'Baptista,Stef Clare',
 'Parvez,Mohammad Shahriyar']
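The role of the seed and the difference between random.sample and random.shuffle noted in the comments above can be illustrated with a small self-contained sketch. The roster names and the seed value 2024 below are made up for illustration; they are not the class roster or the class seed.

```python
import random

# Hypothetical roster, used only for illustration.
names = ["Ada", "Grace", "Alan", "Edsger"]

# Re-seeding with the same value reproduces the same random order.
random.seed(2024)
first = random.sample(names, len(names))
random.seed(2024)
second = random.sample(names, len(names))

# random.sample returns a new list and leaves `names` untouched,
# whereas random.shuffle reorders the list it is given in place.
in_place = names.copy()
random.seed(2024)
random.shuffle(in_place)

print(first == second)  # the two seeded draws agree
print(names)            # the original roster is unchanged
print(in_place)         # a shuffled copy of the roster
```

Setting the seed from numbers agreed on in class means anyone can rerun the code and verify the order.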

Switching slots is allowed as long as you find someone who is willing to switch with you. In this case, make a pull request to switch the order and let me know.

You are welcome to choose the topic that interests you the most, subject to some order restrictions. For example, decision trees should be presented before random forests or extreme gradient boosting. This justifies certain requests for switching slots.

Course Logistics

Presentation Task Board

Here are some example tasks:

  • Making presentations with Quarto
  • Data science ethics
  • Data science communication skills
  • Import/Export data
  • Arrow as a cross-platform data format
  • Database operation with Structured Query Language (SQL)
  • Grammar of graphics
  • Handling spatial data
  • Visualize spatial data in a Google map
  • Animation
  • Classification and regression trees
  • Support vector machine
  • Random forest
  • Naive Bayes
  • Bagging vs boosting
  • Neural networks
  • Deep learning
  • TensorFlow
  • Autoencoders
  • Reinforcement learning
  • Calling C/C++ from Python
  • Calling R from Python and vice versa
  • Developing a Python package

Please use the following table to sign up.

| Date | Presenter | Topic |
|---|---|---|
| 09/11 | Zachary Blanchard | Making presentations with Quarto |
| 09/16 | Deyu Xu | Import/Export data |
| 09/18 | Sara Clokey | Communications in data science |
| 09/23 | Dorothea Johnson | Database with SQL |
| 09/25 | Xavier Febles | Statistical tests |
| 09/30 | Jack Bienvenue | Visualizing spatial data in a Google Map |
| 10/02 | Julia Mazzola | Data Visualization with Plotnine |
| 10/07 | Suha Akach | Naive Bayes classifier |
| 10/09 | Rahul Manna | Animation |
| 10/23 | Jaden Astle | Classification and regression trees |
| 10/23 | Olivia Kashalapov | Synthetic Minority Oversampling Technique (SMOTE) |
| 10/28 | | Data science alumni panel |
| 10/30 | Emily Borowski | Random Forest |
| 10/30 | Aditya Paricharak | Neural Networks |
| 11/04 | Melanie Desroches | |
| 11/06 | Qianruo Tan | Reinforcement Learning |
| 11/11 | Aansh Jha | K-means clustering |
| 11/11 | Owen Babiec | Calling R from Python and vice versa |
| 11/13 | Stef Baptista | |
| 11/13 | Mohammad Parvez | Developing a Python package |

Final Project Presentation Schedule

We use the same order as the topic presentations for the undergraduate final presentations. An introduction to preparing presentation slides with Quarto is available under the templates directory in the classnotes source tree, thanks to Zachary Blanchard; it can be used as a starting template.

| Date | Presenters |
|---|---|
| 11/18 | Sara Clokey; Dorothea Johnson; Xavier Febles; Jack Bienvenue |
| 11/20 | Julia Mazzola; Suha Akach; Rahul Manna; Jaden Astle |
| 12/02 | Olivia Kashalapov; Emily Borowski; Qianruo Tan; Melanie Desroches |
| 12/04 | Aditya Paricharak; Aansh Jha; Owen Babiec; Stef Baptista |

Contributing to the Class Notes

Contributions to the class notes are made through a pull request.

  • Start a new branch and switch to the new branch.
  • On the new branch, add a qmd file for your presentation.
  • If using Python, create and activate a virtual environment with requirements.txt.
  • Edit _quarto.yml to add a line for your qmd file so that it is included in the notes.
  • Work on your qmd file, test with quarto render.
  • When satisfied, commit and make a pull request with your quarto files and an updated requirements.txt.

I have added a template file mysection.qmd and a new line to _quarto.yml as an example.

For more detailed style guidance, please see my notes on statistical writing.

Plagiarism is to be prevented. Remember that these class notes are publicly available online with your names attached. Here are some resources on how to avoid plagiarism. In particular, in our course, one convenient way to avoid plagiarism is to use our own data (e.g., NYC Open Data). Combined with your own explanation of the code chunks, it would be hard to plagiarize.

Homework Requirements

  • Use the repo from Git Classroom to submit your work. See Section 2, Project Management.
    • Keep the repo clean (no tracking generated files).
      • Never “Upload” your files; use the git command line.
      • Make commit messages informative (think about the readers).
    • Make at least 10 commits and form a style of frequent small commits.
  • Use Quarto source only. See Section 3, Reproducible Data Science.
  • For the convenience of grading, add your standalone html or pdf output to a release in your repo.
  • For standalone pdf output, you will need to have LaTeX installed.

Practical Tips

Data analysis

  • Use an IDE so you can play with the data interactively
  • Collect code that has been tested into a script for batch processing
  • During data cleaning, keep in mind how each variable will be used later
  • Do not keep large data files in a repo; agree on a reasonable location with your collaborators
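One way to read the tip about collecting tested code into a script is sketched below: steps tried out interactively are wrapped in functions so that the whole cleaning job can be rerun in one call. The file layout, the column names id and value, and the function names clean_rows and main are all hypothetical, chosen only for illustration.

```python
import csv


def clean_rows(rows):
    """Drop rows with a missing 'value' field and convert the rest to float."""
    cleaned = []
    for row in rows:
        raw = (row.get("value") or "").strip()
        if raw == "":
            continue  # skip records with no measurement
        cleaned.append({"id": row["id"], "value": float(raw)})
    return cleaned


def main(in_path, out_path):
    """Read the raw CSV, clean it, write the result, and return the row count."""
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = clean_rows(rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(cleaned)
    return len(cleaned)
```

A call such as main("raw.csv", "clean.csv"), with paths of your choosing, then reruns the whole cleaning step without retyping anything at the console.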

Presentation

  • Don’t forget to introduce yourself if there is no moderator.
  • Highlight your research questions and results, not code.
  • Give an outline, carry it out, and summarize.
  • Use your own examples to reduce the risk of plagiarism.

My Presentation Topic (Template)

Introduction

Put an overview here. Use Markdown syntax.

Sub Topic 1

Put materials on topic 1 here.

Python examples can be put into python code chunks:

import pandas as pd

# do something

Sub Topic 2

Put materials on topic 2 here.

Sub Topic 3

Put materials on topic 3 here.

Conclusion

Put summaries here.