1 Introduction
1.1 What Is Data Science?
Data science is a multifaceted field, often conceptualized as resting on three fundamental pillars: mathematics/statistics, computer science, and domain-specific knowledge. This framework helps to underscore the interdisciplinary nature of data science, where expertise in one area is often complemented by foundational knowledge in the others.
A compelling definition was offered by Prof. Bin Yu in her 2014 Presidential Address to the Institute of Mathematical Statistics. She defines \[\begin{equation*} \mbox{Data Science} = \mbox{S}\mbox{D}\mbox{C}^3, \end{equation*}\] where
- ‘S’ represents Statistics, signifying the crucial role of statistical methods in understanding and interpreting data;
- ‘D’ stands for domain or science knowledge, indicating the importance of specialized expertise in a particular field of study;
- the three ’C’s denote computing, collaboration/teamwork, and communication to outsiders.
Computing underscores the need for proficiency in programming and algorithmic thinking, collaboration/teamwork reflects the inherently collaborative nature of data science projects, often requiring teams with diverse skill sets, and communication to outsiders emphasizes the importance of translating complex data insights into understandable and actionable information for non-experts.
This definition neatly captures the essence of data science, emphasizing a balance between technical skills, teamwork, and the ability to communicate effectively.
1.2 Expectations from This Course
In this course, students will be expected to achieve the following outcomes:
Proficiency in Project Management with Git: Develop a solid understanding of Git for efficient and effective project management. This involves mastering version control, branching, and collaboration through this powerful tool.
Proficiency in Project Reporting with Quarto: Gain expertise in using Quarto for professional-grade project reporting. This encompasses creating comprehensive and visually appealing reports that effectively communicate your findings.
Hands-On Experience with Real-World Data Science Projects: Engage in practical data science projects that reflect real-world scenarios. This hands-on approach is designed to provide you with direct experience in tackling actual data science challenges.
Competency in Using Python and Its Extensions for Data Science: Build strong skills in Python, focusing on its extensions relevant to data science. This includes libraries like Pandas, NumPy, and Matplotlib, among others, which are critical for data analysis and visualization.
Full Grasp of the Meaning of Results from Data Science Algorithms: Learn to not only apply data science algorithms but also to deeply understand the implications and meanings of their results. This is crucial for making informed decisions based on these outcomes.
Basic Understanding of the Principles of Data Science Methods: Acquire a foundational knowledge of the underlying principles of various data science methods. This understanding is key to effectively applying these methods in practice.
Commitment to the Ethics of Data Science: Emphasize the importance of ethical considerations in data science. This includes understanding data privacy, bias in data and algorithms, and the broader social implications of data science work.
1.3 Computing Environment
All setups are operating system dependent. As soon as possible, stay away from Windows. Otherwise, good luck (you will need it).
1.3.1 Operating System
Your computer has an operating system (OS), which is responsible for managing the software packages on your computer. Each operating system has its own package management system. For example:
Linux: Linux distributions have a variety of package managers depending on the distribution. For instance, Ubuntu uses APT (Advanced Package Tool), Fedora uses DNF (Dandified Yum), and Arch Linux uses Pacman. These package managers are integral to the Linux experience, allowing users to install, update, and manage software packages easily from repositories.
macOS: macOS uses Homebrew as its primary package manager. Homebrew simplifies the installation of software and tools that aren’t included in the standard macOS installation, using simple commands in the terminal.
Windows: Windows users often rely on the Microsoft Store for apps and software. For more developer-focused package management, tools like Chocolatey and Windows Package Manager (Winget) are used. Additionally, recent versions of Windows have introduced the Windows Subsystem for Linux (WSL). WSL allows Windows users to run a Linux environment directly on Windows, unifying Windows and Linux applications and tools. This is particularly useful for developers and data scientists who need to run Linux-specific software or scripts. It saves a lot of trouble Windows users used to have that users previously faced before WSL was introduced.
Understanding the package management system of your operating system is crucial for effectively managing and installing software, especially for data science tools and applications.
1.3.2 File System
A file system is a fundamental aspect of a computer’s operating system, responsible for managing how data is stored and retrieved on a storage device, such as a hard drive, SSD, or USB flash drive. Essentially, it provides a way for the OS and users to organize and keep track of files. Different operating systems typically use different file systems. For instance, NTFS and FAT32 are common in Windows, APFS and HFS+ in macOS, and Ext4 in many Linux distributions. Each file system has its own set of rules for controlling the allocation of space on the drive and the naming, storage, and access of files, which impacts performance, security, and compatibility. Understanding file systems is crucial for tasks such as data recovery, disk partitioning, and managing file permissions, making it an important concept for anyone working with computers, especially in data science and IT fields.
Navigating through folders in the command line, especially in Unix-like environments such as Linux or macOS, and Windows Subsystem for Linux (WSL), is an essential skill for effective file management. The command cd (change directory) is central to this process. To move into a specific directory, you use cd followed by the directory name, like cd Documents. To go up one level in the directory hierarchy, you use cd ... To return to the home directory, simply typing cd or cd ~ will suffice. The ls command lists all files and folders in the current directory, providing a clear view of your options for navigation. Mastering these commands, along with others like pwd (print working directory), which displays your current directory, equips you with the basics of moving around the file system in the command line, an indispensable skill for a wide range of computing tasks in Unix-like systems.
1.3.3 Command Line Interface
On Linux or MacOS, simply open a terminal.
On Windows, several options can be considered.
- Windows Subsystem Linux (WSL): https://learn.microsoft.com/en-us/windows/wsl/
- Cygwin (with X): https://x.cygwin.com
- Git Bash: https://www.gitkraken.com/blog/what-is-git-bash
To jump-start, here is a tutorial: Ubuntu Linux for beginners.
At least, you need to know how to handle files and traverse across directories. The tab completion and introspection supports are very useful.
Here are several commonly used shell commands:
cd: change directory;..means parent directory.pwd: present working directory.ls: list the content of a folder;-llong version;-ashow hidden files;-tordered by modification time.mkdir: create a new directory.cp: copy file/folder from a source to a target.mv: move file/folder from a source to a target.rm: remove a file a folder.
1.3.4 Python
Set up Python on your computer:
- Python 3.
- Python package manager miniconda or pip.
- Integrated Development Environment (IDE) (Jupyter Notebook; RStudio; VS Code; Emacs; etc.)
I will be using VS Code in class.
Readability is important! Check your Python coding style against the recommended styles: https://peps.python.org/pep-0008/. A good place to start is the Section on “Code Lay-out”.
Online books on Python for data science:
- “Python Data Science Handbook: Essential Tools for Working with Data,” First Edition, by Jake VanderPlas, O’Reilly Media, 2016.
- “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.” Third Edition, by Wes McKinney, O’Reilly Media, 2022.