Data Science and Society with R
DSDA/STAT 1010 - Quarto Book
Preliminaries
Welcome! This Quarto book hosts all lecture notes, in-class activities, and weekly reading for DSDA/STAT 1010. It is designed for first-year students with no prerequisites.
The notes were developed with Quarto; for details about Quarto, visit https://quarto.org/docs/books.
This book is free and is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.
Sources at GitHub
These lecture notes for STAT/DSDA 1010 in Fall 2025 are developed by Professor Jun Yan, with help from generative AI and the students enrolled in the course. This cooperative approach to education was facilitated through the use of GitHub, a platform that encourages collaborative coding and content development. To view these contributions and the lecture notes in their entirety, please visit our GitHub repository at https://github.com/statds/1010f25.
Students are welcome to contribute to the lecture notes by submitting pull requests to our GitHub repository. This method not only enriches the course material but also gives students practical experience in collaborative software development and version control.
Adapting to Rapid Skill Acquisition
In this course, students are expected to rapidly acquire new skills, a critical aspect of data science. To emphasize this, consider this insightful quote from VanderPlas (2016):
When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it’s less a matter of knowing the answer as much as knowing how to quickly find an unknown answer. In data science it’s the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you’ve found yourself searching before. Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don’t know, whether through a web search engine or another means.
This quote captures the essence of what we aim to develop in our students: the ability to swiftly navigate and use the vast resources available to solve complex problems in data science. Example tasks include installing needed software (or even hardware) and searching for solutions to problems as they arise.
Course Tools
- R & RStudio for analysis
- Quarto for reproducible documents and dashboards
- Git & GitHub for version control and project management
- Command line for automation and efficiency
Policies & Syllabus
See the course syllabus on HuskyCT.
Key reminders: academic integrity, no AI-generated text in graded submissions, and professional email etiquette.
Grading Rubrics
Baseline (C level work)
- Your .qmd file knits to HTML without errors.
- You answer questions correctly but do not use complete sentences.
- There are typos and ‘junk code’ throughout the document.
- You do not put much thought or effort into the reflection answers.
- You do not follow good style in your use of R, Quarto, and Git.
Average (B level work)
- You use complete sentences to answer questions.
- You attempt every exercise/question.
Advanced (A level work)
- Your code is simple and concise.
- Unnecessary messages from R are hidden from being displayed in the HTML.
- Your document is typo-free.
- You follow good style throughout your use of R, Quarto, and Git.
- At the discretion of the instructor, you give exceptionally thoughtful or insightful responses.
Homework Logistics
Workflow of Submitting Homework Assignment
- Click the GitHub Classroom assignment link in the HuskyCT announcement.
- Accept the assignment and follow the instructions to create an empty repository.
- Clone the repo into an appropriate folder on your own computer with git clone.
- Go to this folder, add your qmd source, work on it, and group your changes into separate, meaningful commits.
- Push your work to your GitHub repo with git push.
- Create a new release and attach the generated pdf file to it for ease of grading.
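The clone, commit, and push steps above can be sketched on the command line as follows. This is a minimal illustration, not the official course procedure: a local bare repository (remote.git) stands in for the empty GitHub Classroom repo so the sketch runs offline, and the folder name hw1, the file name hw1.qmd, the user identity, and the commit messages are all hypothetical. In real use, the argument to git clone would be the repository URL shown on GitHub.

```shell
set -eu

# Work in a scratch directory so nothing existing is touched.
work=$(mktemp -d)
cd "$work"

# Stand-in for the empty GitHub Classroom repo (assumption: in practice
# this is a URL copied from GitHub, not a local path).
git init --bare --quiet remote.git

# Clone the (empty) repo to a folder on your own computer.
git clone --quiet remote.git hw1
cd hw1

# Hypothetical identity; normally configured once with `git config --global`.
git config user.name "Student Name"
git config user.email "student@example.com"

# Add the qmd source and record it as a first, meaningful commit.
printf -- '---\ntitle: "Homework 1"\nformat: pdf\n---\n' > hw1.qmd
git add hw1.qmd
git commit --quiet -m "Add Quarto source with YAML header"

# Keep working; group each logical change into its own small commit.
printf '\n## Exercise 1\n\nAnswer in complete sentences.\n' >> hw1.qmd
git commit --quiet -am "Draft answer to Exercise 1"

# Push the current branch to the remote repo.
git push --quiet origin HEAD

# Review the history that was pushed.
git log --oneline
```

Pushing HEAD rather than a named branch keeps the sketch independent of whether the default branch is called main or master.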
Homework Requirements
- Use the repo from GitHub Classroom to submit your work. See Section 2 Project Management with Git.
- Keep the repo clean (do not track generated files).
- Never “Upload” your files; use the Git command line.
- Make commit messages informative (think about the readers).
- Make at least 10 commits and build a habit of frequent, small commits.
- Use quarto source only. See Install R, Positron (or RStudio), and Quarto.
- For the convenience of grading, add your standalone pdf output to a release in your repo.
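One way to keep generated files out of the repo is a .gitignore file, sketched below. The ignore patterns, the file names hw1.qmd and hw1.pdf, and the user identity are assumptions for illustration; adjust them to whatever your own project actually generates. The sketch runs in a scratch repository so it is safe to try.

```shell
set -eu

# Scratch repository for demonstration only.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Student Name"
git config user.email "student@example.com"

# Tell Git which generated files to leave untracked
# (hypothetical patterns; match them to your own output).
cat > .gitignore <<'EOF'
# Rendered output and Quarto working files
*.pdf
*.html
*_files/
.quarto/
EOF

# Source file plus a stand-in for its rendered output.
printf -- '---\ntitle: "Homework 1"\nformat: pdf\n---\n' > hw1.qmd
touch hw1.pdf

# Track only the source and the ignore rules.
git add .gitignore hw1.qmd
git commit --quiet -m "Add Quarto source and ignore rendered output"

# Confirm that the pdf is ignored and see which files are tracked.
git check-ignore -v hw1.pdf
git ls-files

# Count commits so far (the requirement above asks for at least 10).
git rev-list --count HEAD
```

Because the pdf is never committed, it reaches the grader only through the release. The release itself can be created on the GitHub website.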
Final Project Logistics
Many data science professionals work independently while collaborating across teams. This final project will give you experience completing a full data analysis cycle on your own, following the Tidy Data Workflow to transform questions into a well-designed product that communicates clear findings.
For your final project, you will work individually to identify a data set and complete a data analysis that answers one or more questions. Your project should generally follow the process of the Tidy Data Workflow. As this charge can be interpreted broadly, more information is outlined below.
This project is designed as a capstone experience, bringing together most of what you have learned this semester. You will complete all aspects of the project yourself—planning, analysis, visualization, and communication—to simulate the end-to-end workflow of a professional data scientist. The goal is to illuminate the issue through visualization of data and encourage exploration by providing a user interface in an R Shiny application.
Start identifying your data source early. A good data set is rich, clean enough to work with, and aligned with your interests.
Goals
- Produce an end-to-end data science project
- Take full ownership of your analysis and communication
- Identify interesting data sets, develop questions that can be answered with available data, use data science techniques to draw insights, and present those insights clearly and convincingly
Deliverables
- Project Proposal
- Exploratory Data Analysis
- Final Presentation
- Final Report
Rationale for the Individual Project
- Practices independent data science work from start to finish.
- Encourages personal accountability and self-management.
- Provides flexibility to pursue topics reflecting individual interests and strengths.
- Gives practice communicating insights in multiple formats.
- Ensures that each student demonstrates mastery of all phases of the data science workflow.
Project Grading
- Total project grade out of 40 points.
- Project Proposal = 5 points.
- Exploratory Data Analysis / Proof of Concept = 5 points.
- Final Presentation = 20 points.
- Final User Report = 10 points.
To eliminate subjectivity from grading:
- Follow instructions.
- Meet deadlines.
- Address all rubric items.
- Produce a cohesive, reproducible workflow.
Areas of subjectivity include verbal and written communication, graphics grammar, product design, and text mechanics.
Follow each rubric item carefully and proofread your materials before submission.
Project Expectations
You are responsible for managing your own timeline, ensuring that each deliverable is completed on time, and seeking feedback from the instructor when necessary. You may informally discuss ideas with peers, but all coding, analysis, and writing must be your own work.
Part 1: Proposal
- Select and use one publicly available data source.
- Submit two initial questions you intend to answer.
- Innovative thought
- Non-trivial
- Data can be used to answer your questions
- Schedule a 15-minute meeting with me by Nov. 14 to present your proposal.
- Describe chosen data sources
- Outline variables and types
- Present initial research questions and variables of interest
- Keep it concise
Here is a template for your project proposal.
Proposal meeting by Nov. 14. Bring a short document and be prepared to explain your data and questions.
Part 2: Exploratory Data Analysis
- Choose two of your initial questions to explore in depth
- Provide an overview of the data, describe “things to explore,” and interpret results
- Display a map with at least one UI control in your dashboard
- Display at least three different types of plots (geoms) responding to one or more UI controls
- Written summary (edited, proofread HTML document)
- At least one page
- Describe your questions and data
- Depict and discuss plots and relationships
- Provide conclusions drawn from the analysis
Part 3: Presentation and Report
- Present your interactive R Shiny dashboard.
- Presentation length: 15 minutes.
- Explain the data used, show compelling visuals, discuss methods, and summarize your accompanying report.
Here is a template for presentation slides with Quarto.
Advice
- Choose data that are interesting and rich enough for meaningful analysis
- Formulate questions that can lead to clear, data-driven answers
- Plan your work schedule early and stick to it
- Edit, proofread, and polish all deliverables carefully
- Reach out to me early if challenges arise
Reserve the last week for polishing your Shiny app and final report.
Finding (non-trivial) Data
Schedule and Readings
- Computing environment
- R4DS Ch 28-29
- HGR Ch 20-23
- Jump start with R
- R4DS Ch 4-8; Ch 20-24
- Visualization
- R4DS Ch 1; Ch 9; Ch 11
- Data visualization in R
- Data manipulation
- R4DS Ch 3
- Exploring data
- R4DS Ch 10; Ch 12-13
- Tidy data
- R4DS Ch 5; Ch 19.1-19.2
- Relational Data
- R4DS Ch 14; Ch 16; Ch 17; Ch 19.3-19.6
- Geospatial Data
- USDR Ch 1; Ch 3.1-3.6
- GCR Ch 1; Ch 8.1-8.4
- Shiny
- MSR Ch 1-4; Ch 12; Ch 5-6