1  Introduction

1.1 What Is Data Science?

One widely accepted concept is the three pillars of data science: mathematics/statistics, computer science, and domain knowledge.

In her 2014 Presidential Address, Prof. Bin Yu, then President of the Institute of Mathematical Statistics, gave an interesting definition: \[ \mbox{Data Science} = \mbox{S}\mbox{D}\mbox{C}^3, \] where S is Statistics, D is domain/science knowledge, and the three C’s are computing, collaboration/teamwork, and communication to outsiders.

1.2 Expectations from This Course

  • Proficiency in project management with Git.
  • Proficiency in project report with Quarto.
  • Hands-on experience with real-world data science project.
  • Competency in using Python and its extensions for data science.
  • Full grasp of the meaning of the results from data science algorithms.
  • Basic understanding the principles of the data science methods.

1.3 Computing Environment

All setups are operating system dependent. As soon as possible, stay away from Windows. Otherwise, good luck (you will need it).

1.3.1 Command Line Interface

On Linux or MacOS, simply open a terminal.

On Windows, several options can be considered.

To jump start, here is a tutorial: Ubunto Linux for beginners.

At least, you need to know how to handle files and traverse across directories. The tab completion and introspection supports are very useful.

1.3.2 Python

Set up Python on your computer:

  • Python 3.
  • Python package manager miniconda or pip.
  • Integrated Development Environment (IDE) (Jupyter Notebook; RStudio; VS Code; Emacs; etc.)

I will be using IPython and Jupyter Notebook in class.

Readability is important! Check your Python coding styles against the recommended styles: https://peps.python.org/pep-0008/. A good place to start is the Section on “Code Lay-out”.

Online books on Python for data science:

  1. “Python for Data Analysis: Data Wrangling with Pan- das, NumPy, and IPython.” Third Edition, by Wes McK- inney, O’Reilly Media, 2022.

1.4 Data Challenges

1.5 Wishlist

This is a wish list from all members of the class (alphabetical order). Add yours; note the syntax of nested list in Markdown.

  • Alsubai, Nadia
    • Become familiar with both machine learning and deep learning
    • Become proficient in at least one machine or deep learning library for Python
  • Bedard, Kaitlyn
    • Learn how to use Git proficiently
    • Gain practical experience using data science methods
    • Learn to use python libraries for data science
  • Cheu, Catherine
    • Learn more about command prompts in Git Bash
    • Learn more about data science principles
    • Learn better programming techniques in Python
    • Become better in simulation techniques.
  • Ho, Garrick
    • I want to be more confident with git
    • Be able to create a data science project
    • Learn more about data science
  • Jones, Courtney
    • Become proficient in Git and VS Code
    • Deepen my understanding of data science
    • Become comfortable with the Terminal on my computer
  • Karandikar, Shivaram
    • Understand the workflow and life cycle of a data science project.
    • Learn how to code efficiently.
  • Lunetta, Giovanni
    • Learn to properly clean a dataset
    • Become proficient in building a machine learning model
  • Mastrorilli, Ginamarie
    • Become more comfortable with git.
    • Increase my ability to learn new programs.
  • Nguyen, Christine
    • Become proficient in Git and apply it.
    • Build my Python programming skills further.
  • Nhan, Nathan
    • Properly learn how to use Git
    • Increase my understanding of data science and data collection
  • Noel, Luke
    • Become proficient in Git/Github
    • Learn data science techniques like deep/machine learning, etc.
  • Parchekani, Kian
    • Learn how to use Git, Quarto
    • Get an introduction to data science
    • Learn if a career in this field is right for me
    • Collaborate with others and gain project experience
  • Shen, Tong
    • Get familiar with machine learning
    • Learn how to deal with big secondary data
    • Gain some experiences with python
  • Sullivan, Collin
    • I would like to learn to be able to use data science well enough to get a job in the field
    • Discover if this area of statistics is one that I am passionate about
    • Gain some project experience that I can cite or reference in interviews
    • Be able to speak intelligently about data science and it’s facets
    • Gain practical experience
  • Wang, Chaoyang
    • Learn Deep Learning and application on Finance
  • Yan, Jun
    • Make data science more accessible to undergraduates
    • Co-develop a Quarto Book in collaboration with the students
  • Yang Kang, Chua
    • Learn more about git and github
    • Apply Data Science skill to my research
    • Develop a new machine learning model
  • Yeung, Shannon
    • Learn more about Git
    • get aa more well rounded understanding of data science
  • Yi, Guanghong
    • Know more about Data Science, and what Data Scientists do
    • Do one(or more) real life data science project, and gain some practical experience.
  • Zheng, Michael
    • Become more comfortable with git
    • Learn how to complete a data science project from start to finish

1.5.1 Presentation Orders

The topic presentation order is set up in class.

presenters = ["Alsubai, Nadia",
              "Bedard, Kaitlyn",
              "Cheu, Catherine",
              "Chua, Yang Kang",
              "Cummins, Patrick",
              "Ho, Garrick",
              "Jones, Courtney",
              "Karandikar, Shivaram",
              "Lunetta, Giovanni",
              "Mastrorilli, Ginamarie",
              "Nguyen, Christine",
              "Nhan, Nathan",
              "Noel, Luke",
              "Parchekani, Kian",
              "Shen, Tong",
              "Sullivan, Colin",
              "Wang, Chaoyang",
              "Whitney, William",
              "Yeung, Shannon",
              "Yi, Guanghong",
              "Zheng, Michael"]

import random
random.seed(71323498112697523) # jointly set by the class on 01/30/2023
random.sample(presenters, len(presenters))
['Cheu, Catherine',
 'Ho, Garrick',
 'Mastrorilli, Ginamarie',
 'Yi, Guanghong',
 'Karandikar, Shivaram',
 'Chua, Yang Kang',
 'Jones, Courtney',
 'Sullivan, Colin',
 'Shen, Tong',
 'Alsubai, Nadia',
 'Yeung, Shannon',
 'Bedard, Kaitlyn',
 'Nhan, Nathan',
 'Parchekani, Kian',
 'Noel, Luke',
 'Whitney, William',
 'Wang, Chaoyang',
 'Nguyen, Christine',
 'Cummins, Patrick',
 'Zheng, Michael',
 'Lunetta, Giovanni']

1.6 Presentation Task Board

Here are some example tasks:

  • Import/Export data
  • Descriptive statistics
  • Statistical hypothesis tests scypy.stats
  • Model formulas with patsy
  • Statistical models with statsmodels
  • Data visualization with matplotlib
  • Grammer of graphics for python plotnine
  • Handling spatial data with geopandas
  • Show your Data in a Google map with gmplot
  • Random forest
  • Naive Bayes
  • Bagging vs boosting
  • Calling C/C++ from Python
  • Calling R from Python and vice versa
  • Develop a Python module
  • Neural networks
  • Deep learning
  • TensorFlow
  • Autoencoders
  • Reinforcement learning

Please use the following table to sign up.

Date Presenter Topic
02/06 Cheu, Catherine Visualization with matplotlib
02/08 Ho, Garrick Pandas part 1
02/13 Mastrorilli, Ginamarie Pandas part 2
02/13 Yi, Guanghong Grammer of graphics for python plotnine
02/15 Karandikar, Shivaram Text processing with nltk
02/20 Chua, Yang Kang Support Vector Machine with scikit-learn
02/20 Jones, Courtney Descriptive Statistics
02/22 Sullivan, Colin Statistical hypothesis tests scypy.stats
02/27 Shen, Tong Decision tree with scikit-learn
03/01 Bedard, Kaitlyn Handling spatial data with geopandas
03/06 Nhan, Nathan Bagging vs boosting
03/08 Parchekani, Kian Naive Bayes
03/20 Noel, Luke Plotting on maps with gmplot
03/20 Whitney, William Autoencoder
03/27 Nguyen, Christine Calling R from Python and vice versa
03/27 Cummins, Patrick K-means clustering
03/29 Zheng, Michael Web Scraping with Selenium
04/03 Lunetta, Giovanni Softmax Regression & Neural Networks with TensorFlow

1.7 Final Project Presentation Schedule

We use the same order as the topic presentation for undergraduate final presentation.

Date Presenter
04/17 Ho, Garrick
04/17 Mastrorilli, Ginamarie
04/17 Yi, Guanghong
04/17 Karandikar, Shivaram
04/19 Jones, Courtney
04/19 Sullivan, Colin
04/19 Bedard, Kaitlyn
04/19 Nhan, Nathan
04/24 Parchekani, Kian
04/24 Noel, Luke
04/24 Whitney, William
04/24 Nguyen, Christine
04/26 Cummins, Patrick
04/26 Zheng, Michael
04/26 Lunetta, Giovanni

I encourage you to work on NYC open data or other open data for your projects and submit an abstract to the Government Advances in Statistical Programming (GASP) 2023 conference, June 14-15, 2023. The deadline for abstract submission is April 1.

1.8 Contribute to the Class Notes

  • Start a new branch and switch to the new branch.
  • On the new branch, add a qmd file for your presentation
  • Edit _quarto.yml add a line for your qmd file to include it in the notes.
  • Work on your qmd file, test with quarto render.
  • When satisfied, commit and make a pull request.

I have added a template file mysection.qmd and a new line to _quarto.yml as an example.

For more detailed style guidance, please see my notes on statistical wrirting.

1.9 Homework Requirements

  • Use the repo from Git Classroom to submit your work. See Chapter 2.
    • Keep the repo clean (no tracking generated files).
    • Never “Upload” your files; use the git command lines.
    • Make commit message informative (think about the readers).
  • Use quarto source only. See Chapter 3.
  • For the conveinence of greading, add your html output to a release in your repo.

1.10 My Presentation Topic (Template)

1.10.1 Introduction

Put an overview here. Use Markdown syntax.

1.10.2 Sub Topic 1

Put materials on topic 1 here

Python examples can be put into python code chunks:

import pandas as pd

# do something

1.10.3 Sub Topic 2

Put materials on topic 2 here.

1.10.4 Conclusion

Put sumaries here.

1.11 Practical Tips

1.11.1 Data analysis

  • Use an IDE so you can play with the data interactively
  • Collect codes that have tested out into a script for batch processing
  • During data cleaning, keep in mind how each variable will be used later
  • No keeping large data files in a repo; assume a reasonable location with your collaborators

1.11.2 Presentation

  • Don’t forget to introduce yourself if there is no moderator
  • Highlight your research questions and results, not code
  • Give an outline, carry it out, and summarize