These lecture notes for STAT 3255/5255 in Spring 2024 represent a collaborative effort between Professor Jun Yan and the students enrolled in the course. This cooperative approach to education was facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our Spring 2024 repository at https://github.com/statds/ids-s24.
Students contributed to the lecture notes by submitting pull requests to our dedicated GitHub repository. This method not only enriched the course material but also provided students with practical experience in collaborative software development and version control.
For those interested in exploring the lecture notes from previous years, the Spring 2023 and Spring 2022 editions are both publicly accessible. These archives offer valuable insights into the evolution of the course content and the different perspectives brought by successive student cohorts.
Students are encouraged to start designing their final projects from the beginning of the semester. Many open datasets are available for this purpose. Here is a list of data challenges that you may find useful:
In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):
When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.
This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and utilize the vast resources available to solve complex problems in data science.
Wishlist
This is a wish list from all members of the class (alphabetical order, last name first, comma, then first name). Add yours through a pull request; note the syntax of nested lists in Markdown.
Chugh, Charitarth
Get better at analyzing data/features
Learn more about XGBoost and gradient-boosted trees.
Dennison, Jack
Learn how to use Git and GitHub
Be able to apply my skills in Python and Git to data analytics tasks
Elliott, Matt
Facilitate myself into becoming a Data Scientist
Learn new skills such as Quarto and GitHub
Lee, Joshua
Improve model optimization techniques
Learn how to conduct better feature engineering
Learn how to perform better model selection and feature selection
Learn how to deploy ML models and processes to the cloud
Mori, Abigail
Become proficient using Git
Learn how to properly communicate statistical evidence and findings
Massad, Olivia
Be able to use Git effectively
Gain knowledge about Data Science and its importance
Nguyen, Leon
Become proficient in utilizing Git and GitHub workflow processes
Develop proficiency in Quarto and Python packages
Create a data science project start to finish for portfolio work
Patel, Pratham
Become more proficient and efficient with GitHub and Python
Get a deeper understanding and appreciation of the Data Science workflow
Understand collaboration and project creation on GitHub
Perez, Isabelle
Become comfortable working with Git and Quarto
Learn data management strategies and the relevant programming skills
Pugh, Alex
Increase my knowledge of Git and Python
Learn to efficiently clean a data set
Qualls, William
Better understand the Data Science Pipeline
Gain practical knowledge with tools such as GitHub that aren’t covered in other classes
Schober, Henry
Be more proficient in Git and Python
Deepen my understanding of Data Science
Taki, William
Get comfortable with Git and Python
Use the learnings from this class to help with STAT 33494W
Woo, Madison
Be able to comfortably use Git and Python
Learn about project management and data science
Xie, Vincent
Become more proficient with Git.
Learn how to create a proper data science project.
Be introduced to core concepts in data science.
Yan, Jun
Make data science more accessible to undergraduates
Co-develop a Quarto book in collaboration with the students
Train students to participate in real data science competitions
Yankson, Emmanuel
Get better with Python
Get an A in STAT 3255
Zhang, Xingye
Get better with computers.
Get an A in STAT 3255.
Presentation Orders
The topic presentation order is set up in class.
import random

# Read the undergraduate (3255) and graduate (5255) rosters, one name per line
with open('rosters/3255.txt', 'r') as file:
    ug = [line.strip() for line in file]
with open('rosters/5255.txt', 'r') as file:
    gr = [line.strip() for line in file]
presenters = ug + gr

random.seed(4737+8852+3196+2344+47) # jointly set by the class on 01/24/2024
random.sample(presenters, len(presenters))
## random.shuffle(presenters) # This would shuffle the list in place
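The reason for fixing a seed is that anyone can rerun the draw and obtain the same order. The following is a minimal sketch of that property, not the class's actual draw: the roster names and the seed value are hypothetical, and it uses a local `random.Random` generator rather than the module-level seed so the result does not depend on global state.

```python
import random

# Hypothetical roster; the real rosters are read from files in the course repo.
presenters = ["Ada", "Ben", "Cal", "Dee", "Eve"]


def draw_order(names, seed):
    """Return a new list with the names in a random order fixed by the seed."""
    rng = random.Random(seed)             # local generator, independent of global state
    return rng.sample(names, len(names))  # sampling all items gives a shuffled copy


first = draw_order(presenters, 2024)   # hypothetical seed
second = draw_order(presenters, 2024)

print(first == second)                 # the same seed reproduces the same order
print(presenters)                      # the input list itself is left untouched
```

Because `sample` returns a new list, the original roster stays in file order, whereas `shuffle` would rearrange it in place.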
Switching slots is allowed as long as you find someone who is willing to switch with you. In this case, make a pull request to switch the order and let me know.
You are welcome to choose the topic that interests you most, subject to some order restrictions. For example, decision trees should be presented before random forests or extreme gradient boosting. Such restrictions can justify a request to switch slots.
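One way to check a proposed schedule against such order restrictions is sketched below. The function name, the topic labels, and the list of prerequisite pairs are all hypothetical; the course does not prescribe any such tool.

```python
def order_violations(schedule, prerequisites):
    """Return the (earlier, later) pairs that the schedule puts in the wrong order.

    schedule: list of topic names in presentation order.
    prerequisites: list of (earlier, later) pairs; pairs with a topic that is
    not on the schedule are ignored.
    """
    position = {topic: i for i, topic in enumerate(schedule)}
    problems = []
    for earlier, later in prerequisites:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                problems.append((earlier, later))
    return problems


# Hypothetical schedule and restrictions for illustration
schedule = ["Random forest", "Decision tree", "Extreme gradient boosting"]
prerequisites = [
    ("Decision tree", "Random forest"),
    ("Decision tree", "Extreme gradient boosting"),
]

violations = order_violations(schedule, prerequisites)
print(violations)
```

A non-empty result points to the slots that would need to be switched.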
Course Logistics
Presentation Task Board
Here are some example tasks:
Data science ethics
Data science communication skills
Import/Export data
Arrow as a cross-platform data format
Database operations with Structured Query Language (SQL)
Descriptive statistics
Statistical hypothesis tests
Statistical modeling
Data visualization
Accessing census and ACS data
Grammar of graphics
Handling spatial data
Visualize spatial data in a Google map
Animation
Classification and regression trees
Support vector machine
Random forest
Naive Bayes
Bagging vs boosting
Neural networks
Deep learning
TensorFlow
Autoencoders
Reinforcement learning
Calling C/C++ from Python
Calling R from Python and vice versa
Developing a Python package
Please use the following table to sign up.
| Date | Presenter | Topic |
|------|-----------|-------|
| 02/07 | Matt Elliott | Data science communication skills |
| 02/12 | Dr. Haim Bar | Database management |
| 02/19 | William Taki | Visualization with matplotlib |
| 02/19 | Joshua Lee | Descriptive Statistics |
| 02/07 | Weijia Wu | Visualization with matplotlib and seaborn |
| 02/21 | Pratham Patel | Handling spatial data with geopandas |
| 02/21 | Olivia Massad | Grammar of Graphics with plotnine |
| 02/26 | Xingye Zhang | Data visualizing NYC rodent dataset |
| 02/28 | Jack Dennison | Geographic Data Analysis |
| 02/28 | Isabelle Perez | Statistical hypothesis tests with scipy.stats |
| 03/04 | Emmanuel Yankson | Random Forest |
| 03/04 | David Li | |
| 03/06 | Abigail Mori | Accessing census and ACS data |
| 03/06 | Leon Nguyen | Statistical Modeling with statsmodels |
| 03/25 | Alex Pugh | Time Series Analysis |
| 03/25 | Charitarth Chugh | PyTorch |
| 03/27 | | |
| 03/27 | Ge Li | Animation |
| 04/01 | William Qualls | Web Scraping |
| 04/01 | Vincent Xie | Database Operations with SQL |
| 04/03 | Braedon Hook | Long short-term memory (LSTM) network |
| 04/03 | Madison Woo | Calling C/C++ from Python |
| 04/08 | | |
| 04/08 | | |
| 04/10 | | |
| 04/10 | | |
Final Project Presentation Schedule
Undergraduate final presentations follow the same order as the topic presentations.
| Date | Presenter |
|------|-----------|
| 04/15 | Matt Elliott; Weijia Wu; William Taki; Joshua Lee; Pratham Patel |
| 04/17 | Olivia Massad; Ge Li; Xingye Zhang; Isabelle Perez |
| 04/22 | Emmanuel Yankson; David Li; Abigail Mori; Leon Nguyen; Alex Pugh |
| 04/24 | Jack Dennison; Charitarth Chugh; Vincent Xie; Madison Woo; Braedon Hook |
Contributing to the Class Notes
Contributions to the class notes are made through a pull request.
Start a new branch and switch to the new branch.
On the new branch, add a qmd file for your presentation.
If using Python, create and activate a virtual environment with requirements.txt.
Edit _quarto.yml to add a line for your qmd file so that it is included in the notes.
Work on your qmd file and test it with quarto render.
When satisfied, commit and make a pull request with your quarto files and an updated requirements.txt.
I have added a template file mysection.qmd and a new line to _quarto.yml as an example.
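Before opening a pull request, it can help to confirm that your qmd file really is listed in _quarto.yml. The sketch below does this with a plain-text scan rather than a YAML parser. The sample configuration is hypothetical and only assumes the usual Quarto book layout, in which chapters appear as list items under `book: chapters:`. The function names are illustrative as well.

```python
def listed_chapters(quarto_yml_text):
    """Return the .qmd paths that appear as list items in the given YAML text.

    This is a simple line scan, not a YAML parser, so it only handles
    the common one-chapter-per-line form.
    """
    chapters = []
    for raw in quarto_yml_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments and indentation
        if not line.startswith("- "):
            continue
        item = line[2:].strip().strip("'\"")  # remove the list marker and any quotes
        if item.endswith(".qmd"):
            chapters.append(item)
    return chapters


def is_listed(qmd_name, quarto_yml_text):
    """Check whether a qmd file is among the listed chapters."""
    return qmd_name in listed_chapters(quarto_yml_text)


# Hypothetical configuration for illustration; the real file lives in the course repo.
EXAMPLE_CONFIG = """\
project:
  type: book
book:
  title: "Example notes"   # hypothetical title
  chapters:
    - index.qmd
    - intro.qmd
    - mysection.qmd        # the template line mentioned in the text
"""

print(listed_chapters(EXAMPLE_CONFIG))
print(is_listed("mysection.qmd", EXAMPLE_CONFIG))
```

In practice the text would be read from the _quarto.yml on your branch, and `quarto render` remains the real test of whether the file is picked up.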
For more detailed style guidance, please see my notes on statistical writing.
Plagiarism must be avoided. Remember that these class notes are publicly available online with your names attached. Here are some resources on [how to avoid plagiarism](https://usingsources.fas.harvard.edu/how-avoid-plagiarism). In particular, in our course, one convenient way to avoid plagiarism is to use our own data (e.g., NYC Open Data). Combined with your own explanation of the code chunks, it would be hard to plagiarize.
Homework Requirements
Use the repo from GitHub Classroom to submit your work. See Section 2 Project Management.
Keep the repo clean (do not track generated files).
Never “Upload” your files; use the Git command line.
Make commit messages informative (think about the readers).
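As a rough aid for keeping a repo clean, the sketch below flags tracked paths that look like generated output. The pattern list is an assumption about what counts as generated in a typical Quarto and Python project (rendered HTML, `_book/`, `*_files/` folders, Python caches) and should be adjusted to your own homework; the function names are hypothetical.

```python
from fnmatch import fnmatch

# Assumed patterns for generated files; adjust to your own project.
GENERATED_PATTERNS = [
    "_book/*",         # rendered Quarto book output
    "_site/*",         # rendered Quarto website output
    ".quarto/*",       # Quarto's working directory
    "*_files/*",       # figures and assets produced by rendering
    "*.html",          # rendered documents
    "*__pycache__/*",  # Python bytecode caches
    "*.pyc",
]


def looks_generated(path, patterns=GENERATED_PATTERNS):
    """Return True if the path matches any of the generated-file patterns."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def flag_generated(tracked_paths, patterns=GENERATED_PATTERNS):
    """Return the tracked paths that look like generated files."""
    return [p for p in tracked_paths if looks_generated(p, patterns)]


# Hypothetical list of tracked files for illustration
tracked = [
    "hw1.qmd",
    "hw1.html",
    "_book/index.html",
    "data/rodents.csv",
    "hw1_files/figure-html/plot.png",
    "requirements.txt",
]

flagged = flag_generated(tracked)
print(flagged)
```

The list of tracked paths could be taken from the output of `git ls-files`; anything flagged is a candidate for removal from tracking and for an entry in .gitignore.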