1. Introduction
1.1. What is Data Science?
One widely accepted concept is the three pillars of data science: mathematics/statistics, computer science, and domain knowledge.
In her 2014 Presidential Address, Prof. Bin Yu, then President of the Institute of Mathematical Statistics, gave an interesting definition:
$$
\text{Data Science} = \text{S}\,\text{D}\,\text{C}^3,
$$
where S is Statistics, D is domain/science knowledge, and the three C’s are computing, collaboration/teamwork, and communication to outsiders.
1.2. Computing Environment
All setups are operating system dependent.
Move away from Windows as soon as you can. Otherwise, good luck (you will need it).
1.2.1. Command line interface
1.2.2. Python
Install a Python package manager, miniconda or pip.
Install Python
Install an IDE (Jupyter Notebook or VS Code)
1.2.3. A book project with Jupyter-book
Markdown for text
Jupyter notebook for code demo
Jupytext has nice features handling markdown source
Install `jupyter-book` with `pip install jupyter-book`
To build the book, go to the source directory of the book; under `notes`, run `jb build .`
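The build step can also be scripted from Python, which may be convenient when the book is rebuilt often. The sketch below is only an illustration: the helper names `build_command` and `build_book` are made up for this example, and it assumes the classic Jupyter Book command line, where `jb` is a short alias of the `jupyter-book` executable installed by `pip install jupyter-book`.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(source_dir="."):
    """Return the command equivalent to `jb build <source_dir>` (hypothetical helper)."""
    return ["jupyter-book", "build", str(Path(source_dir))]


def build_book(source_dir="."):
    """Run the build if jupyter-book is on PATH; return True on success."""
    if shutil.which("jupyter-book") is None:
        print("jupyter-book not found; try: pip install jupyter-book")
        return False
    # check=False: report failure through the return value instead of raising
    result = subprocess.run(build_command(source_dir), check=False)
    return result.returncode == 0


# Show the command that would be run for a book whose source lives in notes/
print(build_command("notes"))
```

Calling `build_book("notes")` from the directory above `notes` would then attempt the actual build.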
1.2.4. MyST Markdown
Markedly Structured Text (MyST) examples:
Add my admonition
Adding my little admonition
Note
Initial
Warning
warning
Note
A note written in reStructuredText.
print("Hello!")
Hello!
a = 2
print('my 1st line')
print(f'my {a}nd line')
Here’s my title
Here’s my admonition content
The basic quadratic equation, (1.1), allows for the construction of all kinds of parabolas
1.3. Data Challenges in Action
1.4. NYC Open Data Week Event
1.4.1. Open Data in a Classroom
NYC Open Data provides opportunities for data scientists to demonstrate what data science can do in real life. Students taking Introduction to Data Science (STAT3255/5255) in Spring 2022 at UConn are required to work on a project of their choice with any dataset on NYC Open Data. The topics will be a mix of instructor recommendations and student selections, covering transportation, construction, education, finance, and health, among others. Examples are the motor vehicle collisions (crashes) data, the DOB job application filings, and the NYC leading causes of death. The event is intended for students in data science and those interested in data science education. Five students will be selected from the class to present their work in a virtual panel. The presented projects will be made public in crowd-sourced open class notes, facilitating real-life open data projects in data science education.
1.4.2. Dates
Four students will present during our lecture hour on Tuesday, March 8, 2022, to the general public as part of the NYC Open Data Week. A Zoom meeting link will be shared later.
1.4.3. Preparation Problem
The random seed was set by the class on Tuesday, Feb. 22. After the random permutation, the first seven students work on the NYC collision data and the remaining eight work on the DOB job application data.
import random
presenters = ["Aimandi", "Busa", "Campman", "Chandy", "Fodderwala",
              "Hughes", "Lin", "Mcclurg-Wong", "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]
random.seed(370812509)
random.sample(presenters, 15)
['Sharma',
'Zheng',
'Campman',
'Mcclurg-Wong',
'Fodderwala',
'Aimandi',
'Zeimbekakis',
'Chandy',
'Hughes',
'Schoenfeld',
'Busa',
'Xu',
'Taffe',
'Shamirian',
'Lin']
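The split into two project groups can be read off the permutation by slicing. The following is a minimal sketch of that step; the variable names `order`, `collision_group`, and `dob_group` are introduced here for illustration only.

```python
import random

presenters = ["Aimandi", "Busa", "Campman", "Chandy", "Fodderwala",
              "Hughes", "Lin", "Mcclurg-Wong", "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]

random.seed(370812509)  # the seed set by the class
order = random.sample(presenters, len(presenters))  # random permutation

collision_group = order[:7]  # first seven: NYC collision data
dob_group = order[7:]        # remaining eight: DOB job application data

print("Collision data:", collision_group)
print("DOB job application data:", dob_group)
```

Sampling all 15 names without replacement is what makes the result a permutation, so every student lands in exactly one group.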
1.5. Wish List
This is a wish list from all members of the class (alphabetical order). Add yours; note the syntax of nested lists in Markdown.
Aimandi, Sakeena
Complete a data science project to gain some meaningful insight and solve a real-world problem
Campman, Benjamin
Develop more concrete ability and knowledge of Python
Understand the basics of Git
Chandy, Mathew
Learn the fundamentals of data science
Become more comfortable with the command line interface
Work on a data science project involving a meaningful topic
Hughes, Sam
Become proficient with Git
Learn data visualization techniques in Python
Learn reinforcement learning through Python
Learn K Nearest Neighbors Algorithm
McClurg, Taelor
Gain hands-on experience using data science tools to analyze real data and answer interesting questions
Improve my comfort level with Python
Explore data visualization tools available in Python
Become proficient in Git
Shamirian, Robbie
Continue to grow my Python skills in relation to data science
Understand the fundamentals of Git
Sharma, Sinchan
Become adept at using Git and GitHub for projects
Become more comfortable with using Python
Learn data analysis and visualization techniques using Python
Taffe, Thalia
Better understand cmd functionality
Xu, Zhenyu
Get familiar with Git and GitHub like a professional data scientist
Learn more Python skills
Participate in a practical data science competition
Yan, Jun
Make data science accessible to undergraduates
Co-develop a notebook with Jupyter-book in collaboration with students
Zeimbekakis, Anthony
Become proficient in using the command line with tools such as GitHub or even just file management
Apply past knowledge of Python to be comfortable performing data science tasks
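As a reminder of the nested list syntax, the sketch below generates a two-level Markdown list from a small dictionary. The function `to_nested_markdown` and the four-space indent are choices made for this illustration; the two entries are copied from the list above.

```python
def to_nested_markdown(wishes, indent=4):
    """Render {name: [wish, ...]} as a two-level Markdown bullet list."""
    lines = []
    for name in sorted(wishes):  # alphabetical order by name
        lines.append(f"- {name}")
        for wish in wishes[name]:
            # Indenting a bullet under its parent makes it a nested item.
            lines.append(" " * indent + f"- {wish}")
    return "\n".join(lines)


wishes = {
    "Yan, Jun": ["Make data science accessible to undergraduates"],
    "Taffe, Thalia": ["Better understand cmd functionality"],
}
print(to_nested_markdown(wishes))
```

Each name is a top-level bullet and each wish is an indented bullet beneath it, which is the pattern to follow when adding an entry by hand.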
1.6. Task Board
Project management
Markdown basics
Jupyter notebook
Jupyter book
Jun Yan ~~Git basics~~
Python refreshment
Vectors, matrices, and arrays
Distributions
Optimization
Data manipulation
Visualization
Package matplotlib (Tu. Feb. 8: Taelor McClurg)
Package cartopy/basemap (Th. Feb. 10: Sam Hughes)
Package plotnine (ggplot2 equivalent) (Tu. Feb. 15: Mathew Chandy)
Google Maps Plot (Th. Feb. 17: Robbie Shamirian)
Handling spatial data
Statistical tests and models
Describing statistical models with package patsy (Th. Feb. 17: Anthony Zeimbekakis)
Statistical models and hypothesis tests with package statsmodels
Supervised learning
Decision trees (Th. Feb. 24: Ryan Schoenfeld)
Support vector machine (Tu. Mar. 1: Peter Busa)
Random forests (Th. Mar. 3: Sinchan Sharma)
Nearest neighbor
Ensemble methods
Bagging
Boosting
XGBoost
Unsupervised learning
K-means clustering
Gaussian mixture models
Neural networks and deep learning
Further interests
MIT App Inventor tutorial
Using R from Python and vice versa
Building a Python module
1.7. First Topic Presentation Sign-up
Date | Presenter | Topic |
---|---|---|
02/08 | Taelor McClurg | package matplotlib |
02/10 | Samuel Hughes | package cartopy/basemap |
02/15 | Mathew Chandy | package plotnine |
02/17 | Robert Shamirian | package (Google Maps plot) |
02/17 | Anthony Zeimbekakis | package patsy |
02/24 | Ryan Schoenfeld | decision trees |
03/03 | Sinchan Sharma | random forest |
03/10 | Ben Campman | Markdown basics |
03/22 | Zhenyu Xu | K-means clustering |
03/29 | Peter Busa | support vector machine |
03/29 | Sakeena Aimandi | bagging and boosting in action |
04/07 | Thalia Taffe | k nearest neighbor |
04/21 | Juncheng Zheng | neural network |
1.8. Second Topic Presentation Randomization
import random
presenters = ["Aimandi", "Busa", "Campman", "Chandy",
              "Hughes", "Lin",  # "Mcclurg-Wong",
              "Schoenfeld", "Shamirian",
              "Sharma", "Taffe", "Xu", "Zeimbekakis", "Zheng"]
random.seed(4865973917)  # jointly set by the class on March 23, 2022
random.sample(presenters, len(presenters))
['Xu',
'Lin',
'Busa',
'Sharma',
'Zheng',
'Zeimbekakis',
'Schoenfeld',
'Shamirian',
'Aimandi',
'Hughes',
'Taffe',
'Campman',
'Chandy']
Please identify a topic one week before your presentation and push your material one day before the presentation so I can give some feedback.
Date | Presenter | Topic |
---|---|---|
03/29 | Taelor McClurg | mapping areal data |
03/31 | Zhenyu Xu | Gaussian Mixture Models |
04/05 | Sinchan Sharma | Interactive Visualizations |
04/07 | Peter Busa | MIT App Inventor |
04/07 | Ryan Schoenfeld | XGBoost |
04/12 | Robbie Shamirian | Pygal Visualizations |
04/12 | Sam Hughes | Generalized additive model |
04/12 | Anthony Zeimbekakis | Building a Python module |
04/14 | Juncheng Zheng | Using R from Python and vice versa |
04/14 | Sakeena Aimandi | |
04/14 | Thalia Taffe | TensorFlow |
04/14 | Benjamin Campman | Self-Made Stepwise Regression |
04/14 | Mathew Chandy | Web Scraping |
1.9. Final Presentation Randomization
Only undergraduates are required to do a presentation on the final project. Graduate students submit a final report.
import random
presenters = ["Aimandi", "Busa", "Campman", "Chandy",
              "Hughes", "Lin",
              "Schoenfeld", "Shamirian",
              "Taffe", "Zeimbekakis", "Zheng"]
random.seed(64017)  # to be set by the class
random.sample(presenters, len(presenters))
['Zheng',
'Shamirian',
'Campman',
'Chandy',
'Lin',
'Aimandi',
'Taffe',
'Busa',
'Zeimbekakis',
'Schoenfeld',
'Hughes']
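Fixing the seed is what lets anyone in the class rerun the draw and check the order. The sketch below illustrates that property with a hypothetical helper, `draw_order`, which uses its own `random.Random` instance so that the global generator is left untouched; it is an illustration of the idea, not the code used for the draws above.

```python
import random


def draw_order(names, seed):
    """Return a random permutation of names determined by seed (hypothetical helper)."""
    rng = random.Random(seed)  # private generator seeded for this draw only
    return rng.sample(names, len(names))


presenters = ["Aimandi", "Busa", "Campman", "Chandy",
              "Hughes", "Lin",
              "Schoenfeld", "Shamirian",
              "Taffe", "Zeimbekakis", "Zheng"]

first = draw_order(presenters, 64017)
second = draw_order(presenters, 64017)
print(first == second)  # prints True: same seed, same order
```

Changing the seed generally changes the order, which is why the class agrees on the seed before the draw is run.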