Introduction to Data Science

1.1 What Is Data Science?

One widely accepted concept is the three pillars of data science: mathematics/statistics, computer science, and domain knowledge.

In her 2014 Presidential Address, Prof. Bin Yu, then President of the Institute of Mathematical Statistics, gave an interesting definition: \[ \mbox{Data Science} = \mbox{S}\mbox{D}\mbox{C}^3, \] where S is Statistics, D is domain/science knowledge, and the three C’s are computing, collaboration/teamwork, and communication to outsiders.

1.2 Expectations from This Course

Proficiency in project management with Git.
Proficiency in project reporting with Quarto.
Hands-on experience with a real-world data science project.
Competency in using Python and its extensions for data science.
Full grasp of the meaning of the results from data science algorithms.
Basic understanding of the principles of the data science methods.

1.3 Computing Environment

All setups are operating system dependent. As soon as possible, stay away from Windows. Otherwise, good luck (you will need it).

1.3.1 Command Line Interface

On Linux or macOS, simply open a terminal.

On Windows, several options can be considered.

Cygwin (with X): https://x.cygwin.com
Git Bash: https://www.gitkraken.com/blog/what-is-git-bash

To get a jump start, here is a tutorial: Ubuntu Linux for beginners.

At a minimum, you need to know how to handle files and navigate between directories. Tab completion and introspection support are very useful.

1.3.2 Python

Set up Python on your computer:

Python 3.
Python package manager miniconda or pip.
Integrated Development Environment (IDE) (Jupyter Notebook; RStudio; VS Code; Emacs; etc.)

I will be using IPython and Jupyter Notebook in class.

Readability is important! Check your Python coding style against the recommended style guide, PEP 8: https://peps.python.org/pep-0008/. A good place to start is the section on “Code Lay-out”.
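
As a small illustration of the “Code Lay-out” recommendations, the sketch below uses 4-space indentation, two blank lines around a top-level function, short lines, and a continuation line aligned with the opening bracket. The function `summarize` and the sample scores are made up for this example; they are not part of the course materials.

```python
"""A tiny module laid out following PEP 8 (illustrative only)."""

import statistics


def summarize(values, ndigits=2):
    """Return the mean and sample standard deviation, rounded to ndigits."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return round(mean, ndigits), round(sd, ndigits)


# Hypothetical data; the continuation line aligns with the opening bracket.
scores = [88.5, 92.0, 79.5,
          85.0, 90.5, 95.0]
mean_score, sd_score = summarize(scores)
print(f"mean = {mean_score}, sd = {sd_score}")
```

Layout alone does not make code correct, but a consistent layout makes it much easier for classmates to review.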

Online books on Python for data science:

“Python Data Science Handbook: Essential Tools for Working with Data,” First Edition, by Jake VanderPlas, O’Reilly Media, 2016.

“Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython,” Third Edition, by Wes McKinney, O’Reilly Media, 2022.

1.4 Data Challenges

1.5 Wishlist

This is a wish list from all members of the class (in alphabetical order). Add yours; note the syntax of nested lists in Markdown.

1.5.1 Presentation Orders

The topic presentation order is set up in class.

presenters = ["Alsubai, Nadia",
              "Bedard, Kaitlyn",
              "Cheu, Catherine",
              "Chua, Yang Kang",
              "Cummins, Patrick",
              "Ho, Garrick",
              "Jones, Courtney",
              "Karandikar, Shivaram",
              "Lunetta, Giovanni",
              "Mastrorilli, Ginamarie",
              "Nguyen, Christine",
              "Nhan, Nathan",
              "Noel, Luke",
              "Parchekani, Kian",
              "Shen, Tong",
              "Sullivan, Colin",
              "Wang, Chaoyang",
              "Whitney, William",
              "Yeung, Shannon",
              "Yi, Guanghong",
              "Zheng, Michael"]

import random
random.seed(71323498112697523) # jointly set by the class on 01/30/2023
random.sample(presenters, len(presenters))

['Cheu, Catherine',
 'Ho, Garrick',
 'Mastrorilli, Ginamarie',
 'Yi, Guanghong',
 'Karandikar, Shivaram',
 'Chua, Yang Kang',
 'Jones, Courtney',
 'Sullivan, Colin',
 'Shen, Tong',
 'Alsubai, Nadia',
 'Yeung, Shannon',
 'Bedard, Kaitlyn',
 'Nhan, Nathan',
 'Parchekani, Kian',
 'Noel, Luke',
 'Whitney, William',
 'Wang, Chaoyang',
 'Nguyen, Christine',
 'Cummins, Patrick',
 'Zheng, Michael',
 'Lunetta, Giovanni']
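
The point of fixing the seed is that anyone can rerun the draw and check the order. A minimal sketch of that idea follows, using a short made-up roster and an arbitrary seed (both are assumptions for illustration). It uses a local `random.Random` instance rather than the global generator, which is a design choice of this sketch, not what the class code above does.

```python
import random

# Hypothetical short roster; the seed value below is arbitrary.
names = ["Ada", "Ben", "Cara", "Dev", "Eli"]


def draw_order(names, seed):
    """Return a random permutation of names determined by seed."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return rng.sample(names, len(names))


first = draw_order(names, seed=2023)
second = draw_order(names, seed=2023)

print(first == second)                 # True: same seed, same order
print(sorted(first) == sorted(names))  # True: everyone appears exactly once
```

The exact order produced for a given seed may differ across Python versions, so a recorded order is best verified with the same Python version that generated it.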

1.6 Presentation Task Board

Here are some example tasks:

Import/Export data
Descriptive statistics
Statistical hypothesis tests with scipy.stats
Model formulas with patsy
Statistical models with statsmodels
Data visualization with matplotlib
Grammar of graphics for Python with plotnine
Handling spatial data with geopandas
Show your data on a Google map with gmplot
Random forest
Naive Bayes
Bagging vs boosting
Calling C/C++ from Python
Calling R from Python and vice versa
Develop a Python module
Neural networks
Deep learning
TensorFlow
Autoencoders
Reinforcement learning

Please use the following table to sign up.

Date	Presenter	Topic
02/06	Cheu, Catherine	Visualization with `matplotlib`
02/08	Ho, Garrick	Pandas part 1
02/13	Mastrorilli, Ginamarie	Pandas part 2
02/13	Yi, Guanghong	Grammar of graphics for Python with `plotnine`
02/15	Karandikar, Shivaram	Text processing with `nltk`
02/20	Chua, Yang Kang	Support Vector Machine with `scikit-learn`
02/20	Jones, Courtney	Descriptive Statistics
02/22	Sullivan, Colin	Statistical hypothesis tests with `scipy.stats`
02/27	Shen, Tong	Decision tree with `scikit-learn`
03/01	Bedard, Kaitlyn	Handling spatial data with `geopandas`
03/06	Nhan, Nathan	Bagging vs boosting
03/08	Parchekani, Kian	Naive Bayes
03/20	Noel, Luke	Plotting on maps with `gmplot`
03/20	Whitney, William	Autoencoder
03/27	Nguyen, Christine	Calling R from Python and vice versa
03/27	Cummins, Patrick	K-means clustering
03/29	Zheng, Michael	Web Scraping with `Selenium`
04/03	Lunetta, Giovanni	Softmax Regression & Neural Networks with `TensorFlow`

1.7 Final Project Presentation Schedule

For the undergraduate final presentations, we use the same order as for the topic presentations.

Date	Presenter
04/17	Ho, Garrick
04/17	Mastrorilli, Ginamarie
04/17	Yi, Guanghong
04/17	Karandikar, Shivaram
04/19	Jones, Courtney
04/19	Sullivan, Colin
04/19	Bedard, Kaitlyn
04/19	Nhan, Nathan
04/24	Parchekani, Kian
04/24	Noel, Luke
04/24	Whitney, William
04/24	Nguyen, Christine
04/26	Cummins, Patrick
04/26	Zheng, Michael
04/26	Lunetta, Giovanni

I encourage you to work on NYC open data or other open data for your projects and submit an abstract to the Government Advances in Statistical Programming (GASP) 2023 conference, June 14-15, 2023. The deadline for abstract submission is April 1.

1.8 Contribute to the Class Notes

Start a new branch and switch to the new branch.
On the new branch, add a qmd file for your presentation.
Edit _quarto.yml to add a line for your qmd file so that it is included in the notes.
Work on your qmd file and test with quarto render.
When satisfied, commit and make a pull request.
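
The steps above can be sketched as a command sequence. The block below writes the commands as Python argument lists and, by default, only prints them (a dry run). The branch name `my-topic`, the file name `my-topic.qmd`, the commit message, and the remote name `origin` are all hypothetical. It also assumes that the branch is pushed before the pull request is opened on the hosting site, and that `git switch` is available (Git 2.23 or later).

```python
import shlex
import subprocess


def contribution_commands(branch, qmd_file):
    """Return the commands for the contribution steps as argument lists."""
    return [
        ["git", "switch", "-c", branch],          # create and switch to branch
        ["quarto", "render"],                     # test that the notes build
        ["git", "add", qmd_file, "_quarto.yml"],  # stage only the source files
        ["git", "commit", "-m", f"Add notes: {qmd_file}"],
        ["git", "push", "-u", "origin", branch],  # then open a pull request
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run with hypothetical names; nothing is executed.
run_all(contribution_commands("my-topic", "my-topic.qmd"))
```

In practice you would type these commands in a terminal after creating and editing the files; the dry run only makes the order of operations explicit.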

I have added a template file mysection.qmd and a new line to _quarto.yml as an example.

For more detailed style guidance, please see my notes on statistical writing.

1.9 Homework Requirements

Use the repo from Git Classroom to submit your work. See Chapter 2.
- Keep the repo clean (no tracking generated files).
- Never “Upload” your files; use the git command lines.
- Make commit messages informative (think about the readers).
Use quarto source only. See Chapter 3.
For the convenience of grading, add your html output to a release in your repo.

1.10 My Presentation Topic (Template)

1.10.1 Introduction

Put an overview here. Use Markdown syntax.

1.10.2 Sub Topic 1

Put materials on topic 1 here.

Python examples can be put into python code chunks:

import pandas as pd

# do something

1.10.3 Sub Topic 2

Put materials on topic 2 here.

1.10.4 Conclusion

Put summaries here.

1.11 Practical Tips

1.11.1 Data analysis

Use an IDE so you can play with the data interactively
Collect code that has been tested into a script for batch processing
During data cleaning, keep in mind how each variable will be used later
Do not keep large data files in a repo; agree on a reasonable location with your collaborators
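
One possible way to keep large data files out of the repo is to read the data location from an environment variable, so that each collaborator can point the same script at their own copy. This is only a sketch: the variable name `PROJECT_DATA_DIR`, the default folder, and the file name `raw.csv` are hypothetical.

```python
import os
from pathlib import Path


def data_dir(env_var="PROJECT_DATA_DIR", default="~/data/my-project"):
    """Resolve the agreed data folder, which lives outside the repo."""
    return Path(os.environ.get(env_var, default)).expanduser()


def data_path(filename):
    """Build the full path to a data file without hard-coding the folder."""
    return data_dir() / filename


# Hypothetical file name; the printed path depends on your environment.
print(data_path("raw.csv"))
```

A batch script can then call `data_path` wherever it reads or writes data, and the repo tracks only the code.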

1.11.2 Presentation

Don’t forget to introduce yourself if there is no moderator
Highlight your research questions and results, not code
Give an outline, carry it out, and summarize