Introduction to Data Science

STAT 3255/5255 @ UConn, Spring 2024

Authors

Jun Yan and Students in Spring 2024:

Jack Dennison, Matt Elliott, Joshua Lee, Ge Li, Olivia Massad

Abigail Mori, Leon Nguyen, Pratham Patel, Isabelle Perez

Alex Pugh, William Qualls, Weijia Wu, Vincent Xie

Emmanuel Yankson, Xingye Zhang

Published

May 1, 2024

Preliminaries

The notes were developed with Quarto; for details about Quarto, visit https://quarto.org/docs/books.

Sources at GitHub

These lecture notes for STAT 3255/5255 in Spring 2024 represent a collaborative effort between Professor Jun Yan and the students enrolled in the course. This cooperative approach to education was facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our Spring 2024 repository at https://github.com/statds/ids-s24.

Students contributed to the lecture notes by submitting pull requests to our dedicated GitHub repository. This method not only enriched the course material but also provided students with practical experience in collaborative software development and version control.

For those interested in exploring the lecture notes from previous years, the Spring 2023 and Spring 2022 editions are both publicly accessible. These archives offer valuable insights into the evolution of the course content and the different perspectives brought by successive student cohorts.

Midterm Project

Our mid-term project on rodent sightings in New York City was showcased in a virtual session at the NYC Open Data Week 2024 entitled “Landscape of Rodent Sightings in New York City: A Data Science Showcase” on Wednesday, March 20, 2024. The project description is at Section 12  Exercises.

The presenters were

Final Project

Students are encouraged to start designing their final projects from the beginning of the semester. Many open data sets are available for this purpose. Here is a list of data challenges that you may find useful:

If you work on sports analytics, you are welcome to submit a poster to UConn Sports Analytics Symposium (UCSAS) 2024.

Adapting to Rapid Skill Acquisition

In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):

When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.

This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and utilize the vast resources available to solve complex problems in data science.

Wishlist

This is a wish list from all members of the class (alphabetical order, last name first, comma, then first name). Add yours through a pull request; note the syntax of nested lists in Markdown.

  • Chugh, Charitarth
    • Get better at analyzing data/features
    • Learn more about xgboost & gradient boosted trees.
  • Dennison, Jack
    • Learn how to use Git and GitHub
    • Be able to apply my skills in Python and Git to data analytics tasks
  • Elliott, Matt
    • Facilitate myself into becoming a Data Scientist
    • Learn new skills such as Quarto and GitHub
  • Lee, Joshua
    • Improve model optimization techniques
    • Learn how to conduct better feature engineering
    • Learn how to perform better model selection and feature selection
    • Learn how to deploy ML models and processes to the cloud
  • Massad, Olivia
    • Be able to use Git effectively
    • Gain knowledge about Data Science and its importance
  • Mori, Abigail
    • Become proficient using Git
    • Learn how to properly communicate statistical evidence and findings
  • Nguyen, Leon
    • Become proficient in utilizing Git and GitHub workflow processes
    • Develop proficiency in Quarto and Python packages
    • Create a data science project start to finish for portfolio work
  • Patel, Pratham
    • Become more proficient and efficient with GitHub and Python
    • Get a deeper understanding and appreciation of the Data Science workflow
    • Understand collaboration and project creation on GitHub
  • Perez, Isabelle
    • Become comfortable working with git and quarto
    • Learn data management strategies and the relevant programming skills
  • Pugh, Alex
    • Increase my knowledge of Git and Python
    • Learn to efficiently clean a data set
  • Qualls, William
    • Better understand the Data Science Pipeline
    • Gain practical knowledge with tools such as GitHub that aren’t covered in other classes
  • Schober, Henry
    • Be more proficient in Git and Python
    • Deepen my understanding of Data Science
  • Taki, William
    • Get comfortable with Git and Python
    • Use the learnings from this class to help with STAT 33494W
  • Woo, Madison
    • Be able to comfortably use Git and Python
    • Learn about project management and data science
  • Xie, Vincent
    • Become more proficient with Git.
    • Learn how to create a proper data science project.
    • Be introduced to core concepts in data science.
  • Yan, Jun
    • Make data science more accessible to undergraduates
    • Co-develop a Quarto book in collaboration with the students
    • Train students to participate in real data science competitions
  • Yankson, Emmanuel
    • Get better with python
    • Get an A in STAT 3255
  • Zhang, Xingye
    • Get better with computers.
    • Get an A in STAT 3255.
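
For anyone adding an entry, the nested-list syntax mentioned above can be sketched with a short Python helper. This is only an illustration: the function name `wishlist_markdown` and the sample entries are hypothetical, and the four-space indent for sub-items is one convention that Pandoc-based tools such as Quarto accept.

```python
def wishlist_markdown(wishes):
    """Render {"Last, First": [wish, ...]} as a nested Markdown list."""
    lines = []
    for name in sorted(wishes):            # alphabetical by last name
        lines.append(f"- {name}")          # top-level item
        for wish in wishes[name]:
            lines.append(f"    - {wish}")  # nested item, indented four spaces
    return "\n".join(lines)


# Hypothetical entries, for illustration only.
example = {
    "Roe, Richard": ["Practice data cleaning"],
    "Doe, Jane": ["Learn Git", "Learn Quarto"],
}
print(wishlist_markdown(example))
```

The printed text can be pasted into the qmd source; each name becomes a top-level item and each wish an indented sub-item beneath it.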

Presentation Orders

The topic presentation order is set up in class.

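import random

# Read the undergraduate (3255) and graduate (5255) rosters, one name per line.
with open('rosters/3255.txt', 'r') as file:
    ug = [line.strip() for line in file]
with open('rosters/5255.txt', 'r') as file:
    gr = [line.strip() for line in file]
presenters = ug + gr

# Fix the seed so that the random order is reproducible.
random.seed(4737 + 8852 + 3196 + 2344 + 47) # jointly set by the class on 01/24/2024
# Return a new list in random order; presenters itself is left unchanged.
random.sample(presenters, len(presenters))
## random.shuffle(presenters) # This would shuffle the list in place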
['Elliott,Matt A',
 'Wu,Weijia',
 'Lek,Victor Khun',
 'Taki,William Hiroyasu',
 'Schober,Henry',
 'Lee,Joshua Jian',
 'Patel,Pratham Subhas',
 'Li,Ge',
 'Zhang,Xingye',
 'Dennison,Jack Thomas',
 'Massad,Olivia Grace',
 'Perez,Isabelle Daenerys Halpine',
 'Yankson,Emmanuel Opoku',
 'Li,David',
 'Mori,Abigail Kanoelani Shim',
 'Nguyen,Leon Duc',
 'Pugh,Alex',
 'Chugh,Charitarth',
 'Xie,Vincent',
 'Vijayaraghavendra,Jyothsna',
 'Qualls,William Wayne',
 'Woo,Madison Nicole',
 'Hook,Braedon',
 'Chowaniec,Amelia Elizabeth']
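
The draw above depends on the class rosters, which are not reproduced here. As a minimal sketch with a hypothetical roster and seed, the following illustrates why seeding makes the order reproducible and how `random.sample` differs from `random.shuffle`.

```python
import random

# Hypothetical roster and seed, for illustration only.
roster = ["Ada", "Ben", "Cy", "Di", "Eve"]
seed = 2024

random.seed(seed)
first = random.sample(roster, len(roster))   # new list; roster is unchanged

random.seed(seed)
second = random.sample(roster, len(roster))  # same seed, same draw

shuffled = list(roster)   # work on a copy, since shuffle acts in place
random.seed(seed)
random.shuffle(shuffled)  # returns None and reorders `shuffled` itself

print(first == second)                  # True: the seeded draw is reproducible
print(sorted(first) == sorted(roster))  # True: a permutation of the roster
```

Anyone who runs the same code with the same seed and the same roster gets the same order, which is what lets a jointly chosen seed be verified by the whole class.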

Switching slots is allowed as long as you find someone who is willing to switch with you. In this case, make a pull request to switch the order and let me know.

You are welcome to choose the topic that interests you most, subject to some order restrictions. For example, decision trees should be presented before random forests or extreme gradient boosting. Such restrictions can justify requests for switching slots.

Course Logistics

Presentation Task Board

Here are some example tasks:

  • Data science ethics
  • Data science communication skills
  • Import/Export data
  • Arrow as a cross-platform data format
  • Database operations with Structured Query Language (SQL)
  • Descriptive statistics
  • Statistical hypothesis tests
  • Statistical modeling
  • Data visualization
  • Accessing census and ACS data
  • Grammar of graphics
  • Handling spatial data
  • Visualize spatial data in a Google map
  • Animation
  • Classification and regression trees
  • Support vector machine
  • Random forest
  • Naive Bayes
  • Bagging vs boosting
  • Neural networks
  • Deep learning
  • TensorFlow
  • Autoencoders
  • Reinforcement learning
  • Calling C/C++ from Python
  • Calling R from Python and vice versa
  • Developing a Python package

Please use the following table to sign up.

| Date  | Presenter        | Topic                                        |
|-------|------------------|----------------------------------------------|
| 02/07 | Matt Elliott     | Data science communication skills            |
| 02/12 | Dr. Haim Bar     | Database management                          |
| 02/19 | William Taki     | Visualization with matplotlib                |
| 02/19 | Joshua Lee       | Descriptive Statistics                       |
| 02/07 | Weijia Wu        | Visualization with matplotlib and seaborn    |
| 02/21 | Pratham Patel    | Handling spatial data with geopandas         |
| 02/21 | Olivia Massad    | Grammar of Graphics with plotnine            |
| 02/26 | Xingye Zhang     | Visualizing the NYC rodent dataset           |
| 02/28 | Jack Dennison    | Geographic Data Analysis                     |
| 02/28 | Isabelle Perez   | Statistical hypothesis tests with scipy.stats |
| 03/04 | Emmanuel Yankson | Random Forest                                |
| 03/04 | David Li         |                                              |
| 03/06 | Abigail Mori     | Accessing census and ACS data                |
| 03/06 | Leon Nguyen      | Statistical Modeling with statsmodels        |
| 03/25 | Alex Pugh        | Time Series Analysis                         |
| 03/25 | Charitarth Chugh | PyTorch                                      |
| 03/27 |                  |                                              |
| 03/27 | Ge Li            | Animation                                    |
| 04/01 | William Qualls   | Web Scraping                                 |
| 04/01 | Vincent Xie      | Database Operations with SQL                 |
| 04/03 | Braedon Hook     | Long short-term memory (LSTM) network        |
| 04/03 | Madison Woo      | Calling C/C++ from Python                    |
| 04/08 |                  |                                              |
| 04/08 |                  |                                              |
| 04/10 |                  |                                              |
| 04/10 |                  |                                              |

Final Project Presentation Schedule

For the undergraduate final presentations, we use the same order as for the topic presentations.

| Date  | Presenter                                                                |
|-------|--------------------------------------------------------------------------|
| 04/15 | Matt Elliott; Weijia Wu; William Taki; Joshua Lee; Pratham Patel         |
| 04/17 | Olivia Massad; Ge Li; Xingye Zhang; Isabelle Perez                       |
| 04/22 | Emmanuel Yankson; David Li; Abigail Mori; Leon Nguyen; Alex Pugh         |
| 04/24 | Jack Dennison; Charitarth Chugh; Vincent Xie; Madison Woo; Braedon Hook  |

Contributing to the Class Notes

Contributions to the class notes are made through a pull request.

  • Start a new branch and switch to the new branch.
  • On the new branch, add a qmd file for your presentation.
  • If using Python, create and activate a virtual environment with requirements.txt.
  • Edit _quarto.yml to add a line for your qmd file so that it is included in the notes.
  • Work on your qmd file and test it with quarto render.
  • When satisfied, commit and make a pull request with your quarto files and an updated requirements.txt.

I have added a template file mysection.qmd and a new line to _quarto.yml as an example.
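
The steps above roughly correspond to the command sequence below, written as a Python sketch that only builds and prints the commands rather than running them. The branch name, file name, and commit message are hypothetical, and `.venv` is an assumed location for the virtual environment; adapt them to your own setup.

```python
import shlex


def contribution_commands(branch, qmd, message):
    """Return the commands for one contribution, as argument lists."""
    return [
        # Start a new branch and switch to it.
        ["git", "switch", "-c", branch],
        # Create a virtual environment and install the listed packages
        # (Unix-style path; calling the environment's own pip means no
        # separate activation step is needed for the install).
        ["python", "-m", "venv", ".venv"],
        [".venv/bin/pip", "install", "-r", "requirements.txt"],
        # After editing the qmd file and _quarto.yml, test the build.
        ["quarto", "render"],
        # Commit the Quarto sources and the updated requirements.
        ["git", "add", qmd, "_quarto.yml", "requirements.txt"],
        ["git", "commit", "-m", message],
        # Push the branch; the pull request is then opened on GitHub.
        ["git", "push", "-u", "origin", branch],
    ]


# Hypothetical names, for illustration only.
for cmd in contribution_commands("my-topic", "mytopic.qmd", "Add my topic section"):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before typing them at the command line.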

For more detailed style guidance, please see my notes on statistical writing.

Plagiarism must be avoided. Remember that these class notes are publicly available online with your names attached. Here are some resources on [how to avoid plagiarism](https://usingsources.fas.harvard.edu/how-avoid-plagiarism). In particular, in our course, one convenient way to avoid plagiarism is to use our own data (e.g., NYC Open Data). Combined with your own explanation of the code chunks, it would be hard to plagiarize.

Homework Requirements

  • Use the repo from GitHub Classroom to submit your work. See Section 2  Project Management.
    • Keep the repo clean (do not track generated files).
    • Never “Upload” your files; use the git command line.
    • Make commit messages informative (think about the readers).
  • Use Quarto source only. See Section 3  Reproducible Data Science.
  • For the convenience of grading, add your html output to a release in your repo.
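
One way to check the rule about keeping the repo clean is to scan the list of tracked files for typical generated output. The sketch below is only an illustration: the patterns are assumptions about what a Quarto project commonly generates (rendered HTML, the `_book` output directory, `*_files` folders, and the `.quarto` cache), and the file list is hypothetical; in practice it could come from `git ls-files`.

```python
from fnmatch import fnmatch

# Assumed patterns for generated files; adjust them to your project.
GENERATED = ["*.html", "_book/*", "*_files/*", ".quarto/*"]


def generated_files(tracked):
    """Return the tracked paths that look like generated output."""
    return [p for p in tracked if any(fnmatch(p, pat) for pat in GENERATED)]


# Hypothetical output of `git ls-files`, for illustration only.
tracked = ["hw1.qmd", "hw1.html", "hw1_files/figure-html/fig1.png", "README.md"]
print(generated_files(tracked))
```

Any path this reports is a candidate for removal from version control and for an entry in `.gitignore`; the HTML output itself belongs in a release, as noted above.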

Practical Tips

Data analysis

  • Use an IDE so you can play with the data interactively.
  • Collect code that has been tested into a script for batch processing.
  • During data cleaning, keep in mind how each variable will be used later.
  • Do not keep large data files in a repo; agree on a reasonable location with your collaborators.

Presentation

  • Don’t forget to introduce yourself if there is no moderator.
  • Highlight your research questions and results, not code.
  • Give an outline, carry it out, and summarize.
  • Use your own examples to reduce the risk of plagiarism.

My Presentation Topic (Template)

Introduction

Put an overview here. Use Markdown syntax.

Sub Topic 1

Put materials on topic 1 here.

Python examples can be put into python code chunks:

import pandas as pd

# do something

Sub Topic 2

Put materials on topic 2 here.

Sub Topic 3

Put materials on topic 3 here.

Conclusion

Put summaries here.