1. Introduction

1.1. What is Data Science?

One widely accepted concept is the three pillars of data science: mathematics/statistics, computer science, and domain knowledge.

In her 2014 Presidential Address, Prof. Bin Yu, then President of the Institute of Mathematical Statistics, gave an interesting definition:

\[\mbox{Data Science} = \mbox{S}\mbox{D}\mbox{C}^3,\]

where S is Statistics, D is domain/science knowledge, and the three C’s are computing, collaboration/teamwork, and communication to outsiders.

1.2. Computing Environment

All setups are operating system dependent.

As soon as possible, stay away from Windows. Otherwise, good luck (you need it).

1.2.1. Command line interface

Ubuntu Linux for beginners.

1.2.2. Python

Install a Python package manager (Miniconda or pip).
Install Python.
Install an IDE (Jupyter Notebook or VS Code).

1.2.3. A book project with Jupyter-book

Markdown for text
Jupyter notebook for code demo
Jupytext has nice features for handling Markdown source
Install jupyter-book with `pip install jupyter-book`

To build the book, go to the source directory of the book: under notes, run `jb build .`

1.2.4. MyST Markdown

Markedly Structured Text (MyST) examples:

Add my admonition

Adding my little admonition

Note

Initial

Warning

warning

Note

A note written in reStructuredText.

print("Hello!")

Hello!

Listing 1.1 This is my multi-line caption. It is pretty nifty

a = 2
print('my 1st line')
print(f'my {a}nd line')

Here’s my title

Here’s my admonition content

\[ax^{2} + bx + c \tag{1.1}\]

The basic quadratic expression, (1.1), allows for the construction of all kinds of parabolas.

1.3. Data Challenges in Action

1.4. NYC Open Data Week Event

1.4.1. Open Data in a Classroom

NYC Open Data provides opportunities for data scientists to demonstrate what data science can do in real life. Students taking Introduction to Data Science (STAT3255/5255) in Spring 2022 at UConn are required to work on a project of their choice with any dataset on NYC Open Data. The topics will be a mix of instructor recommendations and self-selections, covering transportation, construction, education, finance, and health, among others. Examples include motor vehicle collisions (crashes), DOB job application filings, and NYC leading causes of death. Five students will be selected from the class to present their work in a virtual panel. The presented projects will be made public in crowd-sourced open class notes, facilitating real-life open data projects in data science education.

1.4.2. Dates

Four students will present during our lecture hour on Tuesday, March 8, 2022, to the general public as part of the NYC Open Data Week. A Zoom meeting link will be shared later.

1.4.3. Preparation Problem

The random seed was set by the class on Tuesday, Feb. 22. The first seven students in the random permutation work on the NYC collision data, and the remaining eight work on the DOB job application data.

import random

presenters = ["Aimandi",  "Busa", "Campman", "Chandy", "Fodderwala",
              "Hughes", "Lin", "Mcclurg-Wong", "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]
random.seed(370812509)
random.sample(presenters, 15)

['Sharma',
 'Zheng',
 'Campman',
 'Mcclurg-Wong',
 'Fodderwala',
 'Aimandi',
 'Zeimbekakis',
 'Chandy',
 'Hughes',
 'Schoenfeld',
 'Busa',
 'Xu',
 'Taffe',
 'Shamirian',
 'Lin']
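The split into two groups can be sketched as follows. This is a minimal illustration that assumes the same presenter list and seed as the code above; the variable names `order`, `collision_group`, and `dob_group` are introduced here for illustration and are not part of the original notes.

```python
import random

presenters = ["Aimandi", "Busa", "Campman", "Chandy", "Fodderwala",
              "Hughes", "Lin", "Mcclurg-Wong", "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]

random.seed(370812509)  # seed set by the class
order = random.sample(presenters, len(presenters))  # random permutation

# Hypothetical names: the first seven and the remaining eight
collision_group = order[:7]   # work on the NYC collision data
dob_group = order[7:]         # work on the DOB job application data

print("NYC collision data:", collision_group)
print("DOB job application data:", dob_group)
```

Slicing the permutation keeps the assignment tied to the seeded draw, so anyone who reruns the code with the same seed can check the grouping.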

1.5. Wish List

This is a wish list from all members of the class (alphabetical order). Add yours; note the syntax of nested lists in Markdown.

Aimandi, Sakeena
- Complete a data science project to gain some meaningful insight and solve a real-world problem
Campman, Benjamin
- Develop more concrete ability and knowledge of Python
- Understand the basics of Git
Chandy, Mathew
- Learn the fundamentals of data science
- Become more comfortable with the command line interface
- Work on a data science project involving a meaningful topic
Hughes, Sam
- Become proficient with Git
- Learn data visualization techniques in Python
- Learn reinforcement learning through Python
- Learn K Nearest Neighbors Algorithm
McClurg, Taelor
- Gain hands-on experience using data science tools to analyze real data and answer interesting questions.
- Improve my comfort level with Python.
- Explore data visualization tools available in Python.
- Become proficient in Git.
Shamirian, Robbie
- Continue to grow my Python skills in relation to data science
- Understand the fundamentals of Git
Sharma, Sinchan
- Become adept at using Git and GitHub for projects
- Become more comfortable with using Python
- Learn data analysis and visualization techniques using Python
Taffe, Thalia
- Better understand cmd functionality
Xu, Zhenyu
- Get familiar with Git and GitHub like a professional data scientist
- Learn more Python skills
- Participate in a practical data science competition
Yan, Jun
- Make data science accessible to undergraduates
- Co-develop a notebook with Jupyter-book in collaboration with students
Zeimbekakis, Anthony
- Become proficient in using the command line with tools such as GitHub or even just file management
- Apply past knowledge of Python to be comfortable performing data science tasks

1.6. Task Board

Project management
- Markdown basics
- Jupyter notebook
- Jupyter book
- Jun Yan ~~Git basics~~
Python refresher
- Vectors, matrices, and arrays
- Distributions
- Optimization
- Data manipulation
Visualization
- Package matplotlib (Tu. Feb. 8: Taelor McClurg)
- Package cartopy/basemap (Th. Feb. 10: Sam Hughes)
- Package plotnine (ggplot2 equivalent) (Tu. Feb. 15: Mathew Chandy)
- Google Maps plot (Th. Feb. 17: Robbie Shamirian)
- Handling spatial data
Statistical tests and models
- Describing statistical models with package patsy (Th. Feb. 17: Anthony Zeimbekakis)
- Statistical models and hypothesis tests with package statsmodels
Supervised learning
- Decision trees (Th. Feb. 24: Ryan Schoenfeld)
- Support vector machine (Tu. Mar. 1: Peter Busa)
- Random forests (Th. Mar. 3: Sinchan Sharma)
- Nearest neighbor
- Ensemble methods
  - Bagging
  - Boosting
  - XGBoost
Unsupervised learning
- K-means clustering
- Gaussian mixture models
Neural networks and deep learning
Further interests
- MIT App Inventor tutorial
- Using R from Python and vice versa
- Building a Python module

1.7. First Topic Presentation Sign-up

Date	Presenter	Topic
02/08	Taelor McClurg	package `matplotlib`
02/10	Samuel Hughes	package `cartopy`
02/15	Mathew Chandy	package `plotnine`
02/17	Robert Shamirian	package `ggmap`
02/17	Anthony Zeimbekakis	package `patsy`
02/24	Ryan Schoenfeld	decision trees
03/03	Sinchan Sharma	random forest
03/10	Ben Campman	Markdown basics
03/22	Zhenyu Xu	K-means clustering
03/29	Peter Busa	support vector machine
03/29	Sakeena Aimandi	bagging and boosting in action
04/07	Thalia Taffe	k nearest neighbors
04/21	Juncheng Zheng	neural network

1.8. Second Topic Presentation Randomization

presenters = ["Aimandi",  "Busa", "Campman", "Chandy", 
              "Hughes", "Lin", # "Mcclurg-Wong", 
              "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]
import random
random.seed(4865973917) # jointly set by the class on March 23, 2022
random.sample(presenters, len(presenters))

['Xu',
 'Lin',
 'Busa',
 'Sharma',
 'Zheng',
 'Zeimbekakis',
 'Schoenfeld',
 'Shamirian',
 'Aimandi',
 'Hughes',
 'Taffe',
 'Campman',
 'Chandy']
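Because the seed is fixed, the draw is reproducible by anyone in the class. The sketch below illustrates this under one assumption that differs from the code above: it uses a local `random.Random` instance instead of the module-level generator, so other random draws in the session are unaffected. The helper name `draw_order` is hypothetical.

```python
import random

presenters = ["Aimandi", "Busa", "Campman", "Chandy",
              "Hughes", "Lin",
              "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]


def draw_order(names, seed):
    """Return a seeded random permutation of names (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return rng.sample(names, len(names))


first = draw_order(presenters, 4865973917)
second = draw_order(presenters, 4865973917)

# The same seed gives the same order on every call
print(first == second)
# Sampling without replacement: each presenter appears exactly once
print(sorted(first) == sorted(presenters))
```

Both checks should print `True`: the permutation depends only on the seed and the input list.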

Please identify a topic one week before your presentation and push your material one day before the presentation so I can give some feedback.

Date	Presenter	Topic
03/29	Taelor McClurg	mapping areal data
03/31	Zhenyu Xu	Gaussian Mixture Models
04/05	Sinchan Sharma	Interactive Visualizations
04/07	Peter Busa	MIT app inventor
04/07	Ryan Schoenfeld	XGBoost
04/12	Robbie Shamirian	Pygal Visualizations
04/12	Sam Hughes	Generalized additive model
04/12	Anthony Zeimbekakis	Building a Python module
04/14	Juncheng Zheng	Using R from Python and vice versa
04/14	Sakeena Aimandi
04/14	Thalia Taffe	Tensorflow
04/14	Benjamin Campman	Self-Made Stepwise Regression
04/14	Mathew Chandy	Web Scraping

1.9. Final Presentation Randomization

Only undergraduates are required to do a presentation on the final project. Graduate students submit a final report.

presenters = ["Aimandi",  "Busa", "Campman", "Chandy", 
              "Hughes", "Lin",
              "Schoenfeld", "Shamirian",
              "Taffe", "Zeimbekakis", "Zheng"]
import random
random.seed(64017) # to be set by the class
random.sample(presenters, len(presenters))

['Zheng',
 'Shamirian',
 'Campman',
 'Chandy',
 'Lin',
 'Aimandi',
 'Taffe',
 'Busa',
 'Zeimbekakis',
 'Schoenfeld',
 'Hughes']

Introduction to Data Science, Spring 2022
