Data analysis begins with importing data and ends with exporting results. Before creating visualizations or running models, the dataset must be loaded correctly into memory. Small mistakes such as incorrect file paths, wrong delimiters or unparsed date columns can change how the data behaves. If these issues are not identified early, they affect every step of the workflow.
This presentation focuses on building a reproducible importing and exporting workflow using pandas in Python.
6.1.2 Pandas
Pandas is a Python library used for working with tabular data. It allows data to be stored in a DataFrame, which is similar to a spreadsheet or SQL table.
6.1.3 Importing CSV Files
CSV (Comma Separated Values) files are the most common format for storing structured data. In pandas, CSV files are imported using the read_csv() function.
6.1.3.1 Real-World Example
The dataset used in this example is nyc_crashes_lbdwk_2025.csv, located in the data folder of the course repository. This file contains weekly crash records and has not been cleaned in order to demonstrate common issues that arise during the importing process.
import pandas as pd

# Import CSV file into a DataFrame
df = pd.read_csv("data/nyc_crashes_lbdwk_2025.csv")

# Show the first 10 rows of the dataset
df.head(10)
   CRASH DATE CRASH TIME    BOROUGH  ZIP CODE   LATITUDE  LONGITUDE  ...  VEHICLE TYPE CODE 5
0  08/31/2025      12:49     QUEENS   11101.0  40.753113 -73.933700  ...                  NaN
1  08/31/2025      15:30  MANHATTAN   10022.0  40.760600 -73.964320  ...                  NaN
2  08/31/2025      19:00        NaN       NaN  40.734234 -73.722750  ...                  NaN
3  08/31/2025       1:19   BROOKLYN   11220.0  40.648075 -74.007034  ...                  NaN
4  08/31/2025       2:41  MANHATTAN   10036.0  40.756560 -73.986110  ...                  NaN
5  08/31/2025      19:42      BRONX   10466.0  40.887672 -73.847046  ...                  NaN
6  08/31/2025       8:21   BROOKLYN   11221.0  40.689217 -73.917650  ...                  NaN
7  08/31/2025       2:50     QUEENS   11411.0  40.685990 -73.728730  ...                  NaN
8  08/31/2025      21:45     QUEENS   11434.0  40.673023 -73.776566  ...                  NaN
9  08/31/2025      10:25     QUEENS   11004.0  40.736750 -73.711030  ...                  NaN

10 rows × 29 columns
Importing the file is only the first step. After loading the dataset, it is important to verify its structure before proceeding with analysis.
6.1.3.2 Verifying the Structure
Importing a dataset doesn’t guarantee that it was interpreted correctly. Immediately after loading the file, the structure should be verified by checking the dimensions, the data types of each column, and the overall summary.
df.shape
(1487, 29)
The shape attribute will confirm the number of rows and columns. This helps detect incomplete or truncated files.
df.dtypes
CRASH DATE object
CRASH TIME object
BOROUGH object
ZIP CODE float64
LATITUDE float64
LONGITUDE float64
LOCATION object
ON STREET NAME object
CROSS STREET NAME object
OFF STREET NAME object
NUMBER OF PERSONS INJURED int64
NUMBER OF PERSONS KILLED int64
NUMBER OF PEDESTRIANS INJURED int64
NUMBER OF PEDESTRIANS KILLED int64
NUMBER OF CYCLIST INJURED int64
NUMBER OF CYCLIST KILLED int64
NUMBER OF MOTORIST INJURED int64
NUMBER OF MOTORIST KILLED int64
CONTRIBUTING FACTOR VEHICLE 1 object
CONTRIBUTING FACTOR VEHICLE 2 object
CONTRIBUTING FACTOR VEHICLE 3 object
CONTRIBUTING FACTOR VEHICLE 4 object
CONTRIBUTING FACTOR VEHICLE 5 object
COLLISION_ID int64
VEHICLE TYPE CODE 1 object
VEHICLE TYPE CODE 2 object
VEHICLE TYPE CODE 3 object
VEHICLE TYPE CODE 4 object
VEHICLE TYPE CODE 5 object
dtype: object
The dtypes attribute shows how pandas interpreted each column. This is important because data type determines how a variable behaves during filtering and modeling.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1487 entries, 0 to 1486
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRASH DATE 1487 non-null object
1 CRASH TIME 1487 non-null object
2 BOROUGH 1203 non-null object
3 ZIP CODE 1203 non-null float64
4 LATITUDE 1475 non-null float64
5 LONGITUDE 1475 non-null float64
6 LOCATION 1475 non-null object
7 ON STREET NAME 1031 non-null object
8 CROSS STREET NAME 900 non-null object
9 OFF STREET NAME 456 non-null object
10 NUMBER OF PERSONS INJURED 1487 non-null int64
11 NUMBER OF PERSONS KILLED 1487 non-null int64
12 NUMBER OF PEDESTRIANS INJURED 1487 non-null int64
13 NUMBER OF PEDESTRIANS KILLED 1487 non-null int64
14 NUMBER OF CYCLIST INJURED 1487 non-null int64
15 NUMBER OF CYCLIST KILLED 1487 non-null int64
16 NUMBER OF MOTORIST INJURED 1487 non-null int64
17 NUMBER OF MOTORIST KILLED 1487 non-null int64
18 CONTRIBUTING FACTOR VEHICLE 1 1478 non-null object
19 CONTRIBUTING FACTOR VEHICLE 2 1132 non-null object
20 CONTRIBUTING FACTOR VEHICLE 3 129 non-null object
21 CONTRIBUTING FACTOR VEHICLE 4 40 non-null object
22 CONTRIBUTING FACTOR VEHICLE 5 13 non-null object
23 COLLISION_ID 1487 non-null int64
24 VEHICLE TYPE CODE 1 1470 non-null object
25 VEHICLE TYPE CODE 2 1012 non-null object
26 VEHICLE TYPE CODE 3 124 non-null object
27 VEHICLE TYPE CODE 4 35 non-null object
28 VEHICLE TYPE CODE 5 13 non-null object
dtypes: float64(3), int64(9), object(17)
memory usage: 337.0+ KB
The info() function provides a summary of the counts and data types. This helps identify columns with missing values and detect unexpected type assignments.
6.1.4 Common Mistakes When Importing CSV Files
Importing a dataset without errors doesn’t guarantee that it was interpreted correctly. Many issues occur silently, meaning the dataset loads without error but behaves incorrectly during analysis.
The following examples demonstrate common mistakes and how to identify and fix them.
6.1.4.1 Incorrect File Path
A very common issue that arises when importing data is an incorrect file path.
FileNotFoundError: [Errno 2] No such file or directory: 'nyc_crashes_lbdwk_2025.csv'
This error occurs because Python searches for the file in the current working directory. Since the file is stored inside the data folder, the relative path must include that folder name.
The working directory determines where Python searches for files when using relative paths. If the file is stored inside a folder, the path must reflect that structure.
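A minimal sketch of a defensive import that surfaces path problems early. The path below is the one used above; checking existence with pathlib and printing the resolved absolute path shows exactly where Python searched.

```python
import pandas as pd
from pathlib import Path

# Relative path including the data folder, as used above
path = Path("data/nyc_crashes_lbdwk_2025.csv")

if path.exists():
    df = pd.read_csv(path)
else:
    # Print the resolved absolute path to see where Python actually searched
    print(f"Not found, searched: {path.resolve()}")
```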
In some files, the separator between values is not a comma. If the delimiter is incorrect, the entire dataset may appear as a single column. In those cases, the correct separator must be specified.
# Example of an incorrect delimiter
df_wrong = pd.read_csv("data/nyc_crashes_lbdwk_2025.csv", sep=";")
df_wrong.head(10)
   CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,...
0  08/31/2025,12:49,QUEENS,11101,40.753113,-73.93...
1  08/31/2025,15:30,MANHATTAN,10022,40.7606,-73.9...
2  08/31/2025,19:00,,,40.734234,-73.72275,"(40.73...
3  08/31/2025,1:19,BROOKLYN,11220,40.648075,-74.0...
4  08/31/2025,2:41,MANHATTAN,10036,40.75656,-73.9...
5  08/31/2025,19:42,BRONX,10466,40.887672,-73.847...
6  08/31/2025,8:21,BROOKLYN,11221,40.689217,-73.9...
7  08/31/2025,2:50,QUEENS,11411,40.68599,-73.7287...
8  08/31/2025,21:45,QUEENS,11434,40.673023,-73.77...
9  08/31/2025,10:25,QUEENS,11004,40.73675,-73.711...
In this case, the entire dataset loads into a single column. This happens because the file actually uses commas, not semicolons.
The correct import removes the incorrect separator argument:
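A self-contained toy sketch of the same mistake and its fix, using a made-up two-column CSV rather than the crash file, so the effect on the shape is easy to see:

```python
import pandas as pd
from io import StringIO

# Toy two-column file; stands in for the crash dataset
csv_text = "a,b\n1,2\n"

df_wrong = pd.read_csv(StringIO(csv_text), sep=";")  # everything collapses into one column
df_right = pd.read_csv(StringIO(csv_text))           # default sep="," splits correctly

print(df_wrong.shape)  # (1, 1)
print(df_right.shape)  # (1, 2)
```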
When parsing dates with pd.to_datetime(), the argument errors="coerce" converts invalid values into missing values (NaT) rather than raising an error.
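A small sketch with made-up values in the dataset's MM/DD/YYYY style; the invalid entry becomes NaT instead of stopping the import pipeline:

```python
import pandas as pd

# Toy values; the second one is not a valid date
dates = pd.Series(["08/31/2025", "not a date"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
print(parsed)  # the invalid entry becomes NaT rather than raising an error
```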
6.1.4.5 Missing Values
Missing values affect summary statistics and modeling decisions. They should be inspected early.
df.isnull().sum()
CRASH DATE 0
CRASH TIME 0
BOROUGH 284
ZIP CODE 284
LATITUDE 12
LONGITUDE 12
LOCATION 12
ON STREET NAME 456
CROSS STREET NAME 587
OFF STREET NAME 1031
NUMBER OF PERSONS INJURED 0
NUMBER OF PERSONS KILLED 0
NUMBER OF PEDESTRIANS INJURED 0
NUMBER OF PEDESTRIANS KILLED 0
NUMBER OF CYCLIST INJURED 0
NUMBER OF CYCLIST KILLED 0
NUMBER OF MOTORIST INJURED 0
NUMBER OF MOTORIST KILLED 0
CONTRIBUTING FACTOR VEHICLE 1 9
CONTRIBUTING FACTOR VEHICLE 2 355
CONTRIBUTING FACTOR VEHICLE 3 1358
CONTRIBUTING FACTOR VEHICLE 4 1447
CONTRIBUTING FACTOR VEHICLE 5 1474
COLLISION_ID 0
VEHICLE TYPE CODE 1 17
VEHICLE TYPE CODE 2 475
VEHICLE TYPE CODE 3 1363
VEHICLE TYPE CODE 4 1452
VEHICLE TYPE CODE 5 1474
dtype: int64
Checking missing values early helps identify which variables may require cleaning before modeling.
6.1.5 Importing Other Common Formats
While CSV is the most common format, pandas supports many other file types used in practice.
6.1.5.1 Excel Files
Excel files are frequently used in business settings. They can be imported using read_excel().
Excel files may contain multiple sheets, so specifying sheet_name is often necessary. Similar to CSV, the structure should be verified immediately after import using .shape, .info(), and .dtypes.
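A minimal round-trip sketch, assuming an Excel engine such as openpyxl is installed; the filename and sheet name are made up for illustration:

```python
import pandas as pd

# Write a toy workbook, then read a specific sheet back
# (filename and sheet name are hypothetical)
df_out = pd.DataFrame({"owner": ["Lily", "Summer"], "sales": [500, 800]})
df_out.to_excel("example.xlsx", sheet_name="Sales", index=False)

df_in = pd.read_excel("example.xlsx", sheet_name="Sales")
print(df_in.shape)
```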
6.1.5.2 JSON Files
JSON files are commonly used in web applications and APIs. They can be imported using read_json().
df_json = pd.read_json("data/example.json")
Unlike CSV and Excel, JSON files may contain nested structures. In those cases, additional processing such as normalization may be required before analysis.
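A sketch of that normalization step using hypothetical nested records, as might come from an API response; pd.json_normalize flattens nested fields into dotted column names:

```python
import pandas as pd

# Hypothetical nested records
records = [
    {"id": 1, "location": {"borough": "QUEENS", "zip": "11101"}},
    {"id": 2, "location": {"borough": "MANHATTAN", "zip": "10022"}},
]

df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'location.borough', 'location.zip']
```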
6.1.6 Comparing CSV and Feather Formats
CSV is the most common format for tabular data, but it is not always the most efficient. Feather is a binary format designed for speed and type preservation.
Feather files require the pyarrow dependency. In some environments, attempting to read a Feather file without pyarrow installed will result in an ImportError. This highlights an important reproducibility lesson: certain file formats require additional dependencies that must be installed before use.
Feather files are optimized for performance and type preservation, making them well-suited for large-scale workflows. However, because they are binary and not human-readable, CSV files remain more portable and easier to inspect manually.
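A sketch of a Feather round-trip on a toy DataFrame, guarded against the missing-dependency case described above; the filename is made up:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})

try:
    # Feather round-trip: binary, fast, and preserves column dtypes exactly
    df.to_feather("example.feather")
    df_back = pd.read_feather("example.feather")
    print(df_back.dtypes)
except ImportError:
    # pyarrow is missing; install it before using Feather files
    print("Install pyarrow to read and write Feather files")
```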
6.1.7 Exporting Data
After verifying and cleaning the dataset, the results can be exported for reuse or collaboration. Exporting ensures that the cleaned version of the data can be shared or reused without repeating the cleaning steps.
For example, the cleaned DataFrame can be exported to a new CSV file.
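A minimal sketch of the export, using a toy DataFrame in place of the cleaned dataset and a hypothetical output filename:

```python
import pandas as pd

# Stand-in for the cleaned crash DataFrame; the filename is hypothetical
df = pd.DataFrame({"BOROUGH": ["QUEENS", "BRONX"], "CRASHES": [3, 1]})
df.to_csv("nyc_crashes_clean.csv", index=False)

# Re-importing shows no extra "Unnamed: 0" column, thanks to index=False
df_check = pd.read_csv("nyc_crashes_clean.csv")
print(df_check.columns.tolist())  # ['BOROUGH', 'CRASHES']
```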
This command creates a new CSV file containing the cleaned dataset. The argument index=False prevents pandas from writing the DataFrame index as an extra column. Forgetting this argument often results in unwanted columns such as “Unnamed: 0” when the file is imported again.
6.1.7.1 Exporting Figures
import matplotlib.pyplot as plt
import os

plt.figure()
df["ZIP CODE"].value_counts().head(10).plot(kind="bar")
plt.title("Top ZIP Codes by Crash Count")
plt.ylabel("Count")
plt.xticks(rotation=45)

os.makedirs("figures", exist_ok=True)
plt.savefig("figures/top_zipcodes.png", dpi=300, bbox_inches="tight")
plt.show()
plt.close()
The savefig() function saves the visualization to a file:
dpi=300 ensures high resolution for reports.
bbox_inches="tight" removes excess whitespace.
Saving figures to a figures/ folder improves project organization.
6.1.8 Best Practices for Importing and Exporting Data
The following practices help prevent silent errors:
Inspect the shape, data types, and missing values immediately after importing a dataset.
Use relative file paths within a repository to maintain portability.
Parse date columns during import instead of converting them later.
Avoid manually editing raw data files after they are stored.
Separate raw data from processed outputs within the project directory.
Always verify structure before beginning analysis.
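The date-parsing practice above can be sketched with a toy CSV in the dataset's date style; parse_dates converts the column during import rather than afterwards:

```python
import pandas as pd
from io import StringIO

# Toy CSV standing in for the crash file
csv_text = "CRASH DATE,COUNT\n08/31/2025,3\n09/01/2025,5\n"
df = pd.read_csv(StringIO(csv_text), parse_dates=["CRASH DATE"])
print(df.dtypes)  # CRASH DATE is datetime64, not object
```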
# Create a cursor variable that will help run all SQL queries
cursor = sql_database_connection.cursor()

# Execute the SQL query to create the table
cursor.execute('CREATE TABLE lemon_types_new (organic_or_not TEXT, GMO_or_NOT TEXT)')

# Execute the SQL query to insert into the table
cursor.execute('INSERT INTO lemon_types_new VALUES ("Yes", "Yes")')
<sqlite3.Cursor at 0x117ce2740>
To store your SQL query, create a variable.
sql_query = """SELECT * FROM lemon_types_new"""
To execute the sql_query, use the pandas function pd.read_sql_query(), which returns a DataFrame. Its two arguments are the sql_query and the sql_database_connection.
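A self-contained sketch of that call, using an in-memory SQLite database in place of the course database:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the course database
sql_database_connection = sqlite3.connect(":memory:")
cursor = sql_database_connection.cursor()
cursor.execute('CREATE TABLE lemon_types_new (organic_or_not TEXT, GMO_or_NOT TEXT)')
cursor.execute('INSERT INTO lemon_types_new VALUES ("Yes", "Yes")')

sql_query = """SELECT * FROM lemon_types_new"""
df = pd.read_sql_query(sql_query, sql_database_connection)
print(df)
```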
Important note: If the database has not been created yet, the sqlite3.connect() command will create a new database.
To convert a data frame to a table in SQL, use the command df.to_sql(). This command takes FOUR main arguments:
The first argument is name, which specifies the name of the SQL table.
The second argument is con, which specifies the connection to a specific database.
The third argument, which is optional, is if_exists="replace". This argument ensures that if a table with the same name already exists, the older SQL table is replaced by the newer one.
The last argument, which is also optional, is index=True or index=False. If index=True, the data frame's numerical index is written as a column; if index=False, it is not.
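The four arguments above can be sketched as follows, using an in-memory database and a toy DataFrame:

```python
import sqlite3
import pandas as pd

sql_database_connection = sqlite3.connect(":memory:")
df = pd.DataFrame({"owner": ["Lavender"], "weekly_sales_avg": [300]})

# name, con, if_exists, and index as described above
df.to_sql(name="new_lemonade_stand", con=sql_database_connection,
          if_exists="replace", index=False)

# Read the table back to confirm it was written
print(pd.read_sql_query("SELECT * FROM new_lemonade_stand", sql_database_connection))
```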
Next, create an object, called cursor, to run SQL queries. This object is the result of calling the connection's .cursor() method.
cursor = sql_database_connection.cursor()
Afterwards, execute a SQL command using cursor.execute("SQL command")
The table has columns named owner and weekly_sales_avg
The data type for the owner column is TEXT
The data type for the weekly_sales_avg column is INT
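The statement that creates this table is not shown in the source; a sketch of what it likely looks like, using an in-memory database:

```python
import sqlite3

sql_database_connection = sqlite3.connect(":memory:")
cursor = sql_database_connection.cursor()

# Table with the two columns described above
cursor.execute('CREATE TABLE new_lemonade_stand (owner TEXT, weekly_sales_avg INT)')
```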
cursor.execute('INSERT INTO new_lemonade_stand VALUES ("Lavender", 300)')
<sqlite3.Cursor at 0x117de0ec0>
Adds data to the new_lemonade_stand table, using the command INSERT INTO
It is important to save any alterations to the database by using the sql_database_connection.commit() command.
sql_database_connection.commit()
To view the result of a SQL query, use the cursor.fetchall() method:
cursor.execute('SELECT * FROM new_lemonade_stand')
sql_table_result = cursor.fetchall()
print(sql_table_result)
[('Lavender', 300)]
6.2.5 How to display cursor.fetchall() results in a pandas data frame
cursor.execute('SELECT * FROM new_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner weekly_sales_avg
0 Lavender 300
6.2.6 Types of Commands In SQL
6.2.6.1 Data Definition Language (DDL):
SQL statements that create, change the structure of, and eliminate objects in a database
6.2.6.2 Data Control Language (DCL):
SQL statements that control who has access to database data.
6.2.6.3 Data Manipulation Language (DML):
SQL statements that do not impact database structure. Instead, these statements change/access objects inside a database.
6.2.6.4 Data Transaction Control (DTC):
SQL statements that track how Data Manipulation Language commands change objects in a database by logging these changes as transactions.
6.2.7 Exploring Types of Data Manipulation Language (DML) Commands
Selection: Involves selecting specific rows from a table
Projection: Involves selecting specific columns from a table
Joining: Involves joining tables together
6.2.7.1 Exploring SELECTION with organic_lemonade_stand data table:
The code below creates the organic_lemonade_stand data table:
cursor.execute('CREATE TABLE organic_lemonade_stand (owner TEXT, location TEXT, weekly_expenses_avg INT, weekly_sales_avg INT)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Lily", "Florida", 100, 500)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Summer", "California", 300, 800)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Layla", "New York", 200, 250)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Beatrice", "Alaska", 50, 60)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Carla", "Massachusetts", 250, 300)')
sql_database_connection.commit()
The code below selects all rows in the organic_lemonade_stand data table:
cursor.execute('SELECT * FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Lily Florida 100 500
1 Summer California 300 800
2 Layla New York 200 250
3 Beatrice Alaska 50 60
4 Carla Massachusetts 250 300
6.2.7.2 Exploring PROJECTION with organic_lemonade_stand database
The owner column is being selected
cursor.execute('SELECT owner FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner
0 Lily
1 Summer
2 Layla
3 Beatrice
4 Carla
6.2.7.3 Exploring SELECTION & PROJECTION combined
The owner and weekly_expenses_avg columns are selected (projection), and only rows where weekly_expenses_avg is greater than 100 are returned (selection).
cursor.execute('SELECT owner, weekly_expenses_avg FROM organic_lemonade_stand WHERE weekly_expenses_avg > 100')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
6.2.7.4 Exploring JOINING with organic_lemonade_stand data table and sparkling_lemon_water data table
The below code creates the sparkling_lemon_water data table:
cursor.execute('CREATE TABLE sparkling_lemon_water (owner TEXT, location TEXT, weekly_expenses_avg INT, weekly_sales_avg INT)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Daisy", "Florida", 20, 65)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Kathy", "California", 65, 160)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Samantha", "New York", 85, 280)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Samantha", "New York", 85, 280)')
sql_database_connection.commit()
6.2.7.4.1 INNER JOIN: A type of SQL join where only rows with matching join values in both tables appear in the output.
cursor.execute('SELECT * FROM organic_lemonade_stand INNER JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg owner \
0 Lily Florida 100 500 Daisy
1 Summer California 300 800 Kathy
2 Layla New York 200 250 Samantha
3 Layla New York 200 250 Samantha
location weekly_expenses_avg weekly_sales_avg
0 Florida 20 65
1 California 65 160
2 New York 85 280
3 New York 85 280
6.2.7.4.2 LEFT JOIN: A type of SQL join where every row in the left table is shown in the output, along with rows in the right table that match the left table
cursor.execute('SELECT * FROM organic_lemonade_stand LEFT JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg owner \
0 Lily Florida 100 500 Daisy
1 Summer California 300 800 Kathy
2 Layla New York 200 250 Samantha
3 Layla New York 200 250 Samantha
4 Beatrice Alaska 50 60 None
5 Carla Massachusetts 250 300 None
location weekly_expenses_avg weekly_sales_avg
0 Florida 20.0 65.0
1 California 65.0 160.0
2 New York 85.0 280.0
3 New York 85.0 280.0
4 None NaN NaN
5 None NaN NaN
6.2.7.4.3 RIGHT JOIN: A type of SQL join where every row in the right table is shown in the output, along with rows in the left table that match the right table.
cursor.execute('SELECT * FROM organic_lemonade_stand RIGHT JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg owner \
0 Lily Florida 100 500 Daisy
1 Summer California 300 800 Kathy
2 Layla New York 200 250 Samantha
3 Layla New York 200 250 Samantha
location weekly_expenses_avg weekly_sales_avg
0 Florida 20 65
1 California 65 160
2 New York 85 280
3 New York 85 280
6.2.8 Using Aliases In DML Statements
To make code more readable, a table can be given an "alias", a shortened name used to reference it. Below is the JOIN query from above, rewritten with table aliases:
cursor.execute('SELECT * FROM organic_lemonade_stand lemon INNER JOIN sparkling_lemon_water sparkling ON lemon.location = sparkling.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg owner \
0 Lily Florida 100 500 Daisy
1 Summer California 300 800 Kathy
2 Layla New York 200 250 Samantha
3 Layla New York 200 250 Samantha
location weekly_expenses_avg weekly_sales_avg
0 Florida 20 65
1 California 65 160
2 New York 85 280
3 New York 85 280
6.2.9 Using DISTINCT in DML
The sparkling_lemon_water table contains duplicate rows for the owner Samantha. To return only one copy of each distinct row, use the DISTINCT keyword.
cursor.execute('SELECT DISTINCT owner, location, weekly_expenses_avg, weekly_sales_avg FROM sparkling_lemon_water')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Daisy Florida 20 65
1 Kathy California 65 160
2 Samantha New York 85 280
6.2.10 Logical Operators
AND: If both conditions are TRUE, the entire expression is TRUE
OR: If at least one condition is TRUE, the entire expression is TRUE
NOT: Inverts a condition; returns TRUE when the condition is FALSE, and FALSE when it is TRUE
6.2.10.1 Example of AND - Logical Operators
cursor.execute('SELECT * FROM organic_lemonade_stand WHERE weekly_expenses_avg < 300 AND weekly_sales_avg > 200')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Lily Florida 100 500
1 Layla New York 200 250
2 Carla Massachusetts 250 300
6.2.10.2 Example of OR - Logical Operators
cursor.execute('SELECT * FROM organic_lemonade_stand WHERE weekly_expenses_avg < 300 OR weekly_sales_avg > 500')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Lily Florida 100 500
1 Summer California 300 800
2 Layla New York 200 250
3 Beatrice Alaska 50 60
4 Carla Massachusetts 250 300
6.2.10.3 Example of NOT - Logical Operators
cursor.execute('SELECT * FROM organic_lemonade_stand WHERE NOT weekly_expenses_avg < 100')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Lily Florida 100 500
1 Summer California 300 800
2 Layla New York 200 250
3 Carla Massachusetts 250 300
6.2.11 Arithmetic Operators
We can create a new column, aliased weekly_profits_avg, by subtracting the weekly_expenses_avg column from the weekly_sales_avg column using an arithmetic operator.
cursor.execute('SELECT weekly_sales_avg - weekly_expenses_avg AS weekly_profits_avg FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
weekly_profits_avg
0 400
1 500
2 50
3 10
4 50
6.2.12 Concatenation Operators
We can concatenate two columns together using the || concatenation operator:
cursor.execute("SELECT owner||': '||location AS 'owner location' FROM organic_lemonade_stand")
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location
0 Lily: Florida
1 Summer: California
2 Layla: New York
3 Beatrice: Alaska
4 Carla: Massachusetts
In this example, the owner and location columns are being concatenated to form the owner location column.
6.2.13 ORDER BY and DESC
We can find which owner is the most profitable by using the ORDER BY operator. ORDER BY sorts entries from least to greatest by default. To sort from greatest to least, append DESC after the column name: ORDER BY column_name DESC.
ORDER BY Example:
cursor.execute('SELECT * FROM organic_lemonade_stand ORDER BY weekly_expenses_avg')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Beatrice Alaska 50 60
1 Lily Florida 100 500
2 Layla New York 200 250
3 Carla Massachusetts 250 300
4 Summer California 300 800
ORDER BY DESC Example:
cursor.execute('SELECT * FROM organic_lemonade_stand ORDER BY weekly_expenses_avg DESC')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
owner location weekly_expenses_avg weekly_sales_avg
0 Summer California 300 800
1 Carla Massachusetts 250 300
2 Layla New York 200 250
3 Lily Florida 100 500
4 Beatrice Alaska 50 60
6.2.14 Set Operators
UNION: Produces a set combining the outputs of two SELECT queries. Duplicate rows are removed from the result.
UNION ALL: Similar to UNION, this combines the outputs of two SELECT queries. However, duplicate rows are kept in the result.
INTERSECT: Finds rows shared by both SELECT queries. The output contains distinct rows.
MINUS/EXCEPT: Finds rows that the first SELECT query returns but the second does not. These rows are distinct. Note that MINUS is Oracle syntax and is not supported in SQLite; SQLite uses EXCEPT.
Below is code that creates the numerical_data table:
cursor.execute('CREATE TABLE numerical_data (numbers INT)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (2)')
cursor.execute('INSERT INTO numerical_data VALUES (3)')
cursor.execute('INSERT INTO numerical_data VALUES (4)')
cursor.execute('INSERT INTO numerical_data VALUES (10)')
<sqlite3.Cursor at 0x117de0ec0>
Below is code that creates the cool_numbers_data table:
cursor.execute('CREATE TABLE cool_numbers_data (cool_numbers INT)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (1)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (2)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (3)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (4)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (5)')
<sqlite3.Cursor at 0x117de0ec0>
Example of the UNION OPERATOR: Finds the union of the data tables, numerical_data and cool_numbers_data:
cursor.execute('SELECT numbers FROM numerical_data UNION SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
numbers
0 1
1 2
2 3
3 4
4 5
5 10
Example of the UNION ALL OPERATOR: Finds the UNION ALL of the data tables, numerical_data and cool_numbers_data:
cursor.execute('SELECT numbers FROM numerical_data UNION ALL SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
Example of the INTERSECT OPERATOR: Finds the intersection of the data tables, numerical_data and cool_numbers_data:
cursor.execute('SELECT numbers FROM numerical_data INTERSECT SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
numbers
0 1
1 2
2 3
3 4
Example of the EXCEPT OPERATOR: Finds the rows in numerical_data that are not in cool_numbers_data:
cursor.execute('SELECT numbers FROM numerical_data EXCEPT SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
numbers
0 10
6.2.15 Subqueries
Subqueries are smaller queries within a larger query. When a subquery is evaluated, its result acts as a filter that can be applied to the larger (outer) query.
6.2.15.1 Two Types of Subqueries
The two types of subqueries are Inline Views and Nested Subqueries
6.2.15.1.1 Inline Views
When a SELECT statement has a FROM clause that contains a subquery, this is called an Inline View.
Example:
cursor.execute('SELECT owner, weekly_sales_avg, weekly_profit_avg/5 AS daily_profit_avg FROM (SELECT owner, weekly_sales_avg, weekly_sales_avg - weekly_expenses_avg AS weekly_profit_avg FROM organic_lemonade_stand) ORDER BY daily_profit_avg')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
6.2.15.1.2 Nested Subqueries
When a SELECT statement has a WHERE clause that contains a subquery, this is called a Nested Subquery.
Example 1:
cursor.execute('SELECT owner, weekly_expenses_avg, weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg > (SELECT weekly_sales_avg FROM sparkling_lemon_water)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
Example 2:
cursor.execute('SELECT owner, weekly_expenses_avg, weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg < (SELECT weekly_sales_avg FROM sparkling_lemon_water)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
6.2.15.2 Three Classes of Subqueries
Subqueries can be further categorized into three classes: Single-row subqueries, Multiple-row subqueries, and Correlated subqueries
6.2.15.2.1 Single Row Subqueries
Single-row subqueries: Subqueries that output a single row, which is then evaluated by the outer query
Example:
cursor.execute('SELECT owner, weekly_sales_avg, weekly_expenses_avg FROM organic_lemonade_stand WHERE weekly_expenses_avg <= 300 AND weekly_sales_avg > (SELECT AVG(weekly_sales_avg) AS mean_of_weekly_sales_avg FROM organic_lemonade_stand)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
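6.2.15.2.2 Multiple Row Subqueries
Multiple-row subqueries: Subqueries that output multiple rows, which the outer query evaluates with a set operator such as IN
Example: a minimal, self-contained sketch. The inspected_stands table, the owner names, and the inserted values are made up here for illustration; they are not part of the course database.

```python
import sqlite3
import pandas as pd

# Made-up, in-memory tables for illustration only.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute('CREATE TABLE organic_lemonade_stand (owner TEXT, weekly_sales_avg INTEGER)')
cursor.executemany('INSERT INTO organic_lemonade_stand VALUES (?, ?)',
                   [("Ada", 500), ("Ben", 800), ("Cam", 250), ("Dee", 300)])
cursor.execute('CREATE TABLE inspected_stands (owner TEXT)')
cursor.executemany('INSERT INTO inspected_stands VALUES (?)', [("Ada",), ("Cam",)])

# The subquery outputs multiple rows, so the outer query evaluates it with IN.
cursor.execute('SELECT owner, weekly_sales_avg FROM organic_lemonade_stand '
               'WHERE owner IN (SELECT owner FROM inspected_stands)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
```

Only the stands whose owner appears in the subquery's multi-row result are kept.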
6.2.15.2.3 Correlated Subqueries
Correlated subqueries: Subqueries that depend on the outer query because they reference column information from the outer query
Example:
cursor.execute('SELECT weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg > (SELECT weekly_sales_avg FROM sparkling_lemon_water WHERE organic_lemonade_stand.weekly_expenses_avg > sparkling_lemon_water.weekly_expenses_avg)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
weekly_sales_avg
0 500
1 800
2 250
3 300
6.2.16 Data Visualization Using SQL
How to Present SQL table result using matplotlib:
First, import matplotlib
import matplotlib.pyplot as plt
Create a SQL query that you would like to visualize via matplotlib
data_frame_to_be_visualized = pd.read_sql_query("""SELECT location, weekly_expenses_avg FROM organic_lemonade_stand""", sql_database_connection)
print(data_frame_to_be_visualized)
location weekly_expenses_avg
0 Florida 100
1 California 300
2 New York 200
3 Alaska 50
4 Massachusetts 250
Use matplotlib commands to construct and display the appropriate visual:
plt.bar(data_frame_to_be_visualized["location"], data_frame_to_be_visualized["weekly_expenses_avg"])
plt.xlabel("Location")
plt.ylabel("Average Weekly Expenses")
plt.title("Average Weekly Expenses by Location")
plt.show()
6.3.1 What Is Web Scraping?
Web scraping is essentially using code to automatically pull information from websites, rather than copy-pasting it by hand.
Websites organize their data using HTML, which is just the basic structure behind what you see on a webpage.
Scraping lets us take information from a website and turn it into an organized table of data that we can actually work with and analyze.
6.3.2 Why It Matters in Data Science
A lot of real-world data isn’t readily available as a downloadable CSV file.
Web scraping helps us to:
Gather information that’s publicly available online
Create our own datasets from websites
Save time by collecting data automatically
It helps turn raw information from the web into something we can analyze and visualize.
6.3.3 How Web Pages Work
6.3.4 HTML Basics
Web pages are structured using HTML tags such as:
<div> → Used to group sections of a page
<p> → Used for paragraphs of text
<a> → Used for links
<table> → Used to display data in rows and columns
When we scrape a website, we look for the specific tag that contains the information we want and pull that content out.
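For example, given a small HTML snippet (made up here for illustration), BeautifulSoup can pull out exactly the tags we care about:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a downloaded page.
html = """
<div>
  <p>Weekly crash summary</p>
  <a href="https://example.com/data">Download the data</a>
  <table><tr><td>QUEENS</td><td>12</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("p").get_text())   # text inside the <p> tag
print(soup.find("a")["href"])      # the link target from the <a> tag
print(soup.find("td").get_text())  # first cell of the <table>
```

Each find() call targets one tag type and pulls out only that content.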
6.3.5 Static vs Dynamic Websites
6.3.5.1 Static websites (easier to scrape)
The data is already in the HTML you downloaded.
When you request the page, the content you need is usually right there.
Tools like requests, pandas.read_html(), and BeautifulSoup work well.
6.3.5.2 Dynamic websites (harder to scrape)
The page loads first, then JavaScript pulls in the data afterward.
The first HTML you download might be missing the info you want.
You may need Selenium/Playwright, or sometimes an API that the page is calling.
6.3.5.3 Why this matters
If a site is dynamic, your scraper might “work” but return empty results.
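A quick sanity check is to look for the data you expect in the raw HTML before parsing; if it is absent, the page is likely filling it in with JavaScript. The two HTML strings below are made up for illustration:

```python
# Two made-up responses: a static page that ships its data in the HTML,
# and a dynamic shell whose data arrives later via JavaScript.
static_html = "<html><body><table><tr><td>California</td></tr></table></body></html>"
dynamic_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

def has_expected_data(html: str, marker: str) -> bool:
    # If the marker text isn't in the downloaded HTML, plain scraping will come up empty.
    return marker in html

print(has_expected_data(static_html, "California"))   # True: scrapeable as-is
print(has_expected_data(dynamic_html, "California"))  # False: likely needs Selenium or an API
```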
6.3.6 The Basic Workflow
1. Pick a website: make sure the data is public and allowed to be collected.
2. Download the page: send a request to get the website’s HTML.
3. Extract the data: pull the specific tables, text, or elements you need.
4. Clean and analyze: organize the data into a usable format and create visuals or summaries.
6.3.7 Using BeautifulSoup
6.3.7.1 What is BeautifulSoup
BeautifulSoup is a Python tool that helps you look through and pull information from a webpage’s HTML code. While pandas.read_html() is useful for grabbing whole tables, BeautifulSoup is better when you want specific things like text, links, or section titles. It works by organizing the HTML into a structure that makes it easier to search and extract exactly what you need.
6.3.7.2 Why use BeautifulSoup?
Get text that isn’t in a table
Grab links or other small pieces of information
Look for specific parts of a page, like div, p, or span
Work with more complicated pages
It’s most helpful when the data isn’t neatly organized in a table.
For structured tables → use pandas.read_html()
For flexible HTML parsing → use BeautifulSoup
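The split can be seen on one snippet. Assuming a small made-up HTML string, pandas.read_html() grabs the structured table in one call, while BeautifulSoup picks out the loose pieces like the link and heading:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML containing both a table and non-table content.
html = """
<h2>Populations</h2>
<a href="https://example.com/source">Source</a>
<table>
  <tr><th>state</th><th>population</th></tr>
  <tr><td>Florida</td><td>23462518</td></tr>
</table>
"""

# Structured table: pandas.read_html() returns a list of DataFrames.
table = pd.read_html(StringIO(html))[0]
print(table)

# Loose pieces (links, headings): BeautifulSoup.
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a")["href"])
print(soup.find("h2").get_text())
```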
6.3.8 Python Setup
6.3.8.1 Required Libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
6.3.9 Example: Scraping a Public Table (Wikipedia)
6.3.9.1 Step 1: Choose a Page
We’ll scrape a public population table from Wikipedia:
6.3.9.2 Step 2: Download the Page
Sets up a header so the website knows who is making the request
Sends a request to the webpage
Waits up to 30 seconds for a response
Saves the webpage’s response in a variable called response
Prints the status code from the website
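The steps above can be sketched as follows. The URL is an assumption (it is not shown in the text; this is the Wikipedia article whose "State or territory" population table matches the output in Step 3), and the User-Agent string is a made-up placeholder:

```python
import requests

# Assumed URL: the Wikipedia article with the state population table used below.
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

# Made-up User-Agent header so the site knows who is making the request.
headers = {"User-Agent": "course-demo-scraper/0.1 (educational use)"}

# Send the request; timeout=30 waits up to 30 seconds for a response.
response = requests.get(url, headers=headers, timeout=30)

# 200 means the request succeeded.
print(response.status_code)
```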
6.3.9.3 Step 3: Extract the Table into a DataFrame
Instead of going through each table row and cell one by one, we can use pandas.read_html() to automatically grab the whole table from the webpage.
from io import StringIO
tables = pd.read_html(StringIO(response.text))
len(tables)
6
Wikipedia pages usually have several tables on them. We’ll just pick the main sortable table that has the “State or territory” column in it.
df = None
for t in tables:
    cols = [str(c).lower() for c in t.columns]
    if any("state or territory" in c for c in cols):
        df = t
        break
df.head()
The table has a two-level header: Census population[8][9][a] spans the July 1, 2025 (est.) and April 1, 2020 columns, and House seats[b] spans Seats and %. Flattened, the first five rows are:

   State or territory  July 1, 2025 (est.)  April 1, 2020  Seats       %  Pop. per elec. vote (2020)[c]  Pop. per seat (2020)[a]  % US (2020)  % EC (2020)
0  California                   39355309.0       39538223     52  11.95%                         732189                   760350      11.800%       10.04%
1  Texas                        31709821.0       29145505     38   8.74%                         728638                   766987       8.698%        7.43%
2  Florida                      23462518.0       21538187     28   6.44%                         717940                   769221       6.428%        5.58%
3  New York                     20002427.0       20201249     26   5.98%                         721473                   776971       6.029%        5.20%
4  Pennsylvania                 13059432.0       13002700     17   3.91%                         684353                   764865       3.881%        3.53%
6.3.9.4 Step 4: Clean the Data
This table usually includes a few “summary” rows that aren’t actual states. We’ll remove them and clean the population column.
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
state_candidates = [c for c in df.columns if "state" in c and "territory" in c]
if not state_candidates:
    state_candidates = [c for c in df.columns if "state" in c]
if not state_candidates:
    raise ValueError(f"Couldn't find a state column. Columns are: {df.columns.tolist()}")
state_col = state_candidates[0]
pop_candidates = [c for c in df.columns if "population" in c]
if not pop_candidates:
    raise ValueError(f"Couldn't find a population column. Columns are: {df.columns.tolist()}")
pop_col = pop_candidates[0]
df["population"] = (
    df[pop_col].astype(str)
    .str.replace(",", "", regex=False)
)
df["population"] = pd.to_numeric(df["population"], errors="coerce")
df["state_clean"] = (
    df[state_col].astype(str).str.strip().str.lower().str.replace(" ", "_", regex=False)
)
bad_rows = {"the_50_states_and_d.c.", "the_50_states", "contiguous_united_states"}
df_clean = df[~df["state_clean"].isin(bad_rows)].copy()
df_clean = df_clean.dropna(subset=["population"])
df_clean[[state_col, "population"]].head()
   ('state_or_territory',_'state_or_territory')  population
0                                    California  39355309.0
1                                         Texas  31709821.0
2                                       Florida  23462518.0
3                                      New York  20002427.0
4                                  Pennsylvania  13059432.0
This section cleans and prepares the data so it’s ready to analyze. It standardizes the column names, finds the correct state and population columns, and converts the population values into numbers. It also removes summary rows and any missing values so we are only working with actual states and usable data.
6.3.10 Example Visualization
6.3.10.1 Top 10 States by Population
top10 = df_clean.sort_values("population", ascending=False).head(10)
plt.figure(figsize=(10, 6))
plt.barh(top10[state_col].astype(str)[::-1], top10["population"][::-1])
plt.title("Top 10 U.S. States by Population")
plt.xlabel("Population")
plt.ylabel("")
plt.tight_layout()
plt.show()
6.3.12.1 Example: Pull a Few Headings with BeautifulSoup
This shows that scraping isn’t just for tables.
soup = BeautifulSoup(response.text, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])]
headings[:10]
['Contents',
'Method',
'Electoral apportionment',
'House of Representatives',
'Electoral College',
'State and territory rankings',
'Summary of population by region',
'See also',
'Notes',
'References']
6.3.13 Scraping Responsibly
Look at the website’s robots.txt file and its Terms of Service, and follow their rules before scraping.
Don’t send too many requests too quickly → add pauses if you’re looping.
Stick to publicly available information.
Be considerate and make sure your scraping doesn’t slow the site down.
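The rules above can be sketched in code. Python’s standard-library urllib.robotparser reads robots.txt rules, and time.sleep adds pauses between requests. The robots.txt content here is made up for illustration; normally you would fetch it from the site’s own /robots.txt:

```python
import time
import urllib.robotparser

# A made-up robots.txt, parsed locally for illustration.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a path is allowed before requesting it.
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False

# Honor the site's requested delay when looping over many pages.
delay = parser.crawl_delay("*") or 1
for page in ["page1", "page2"]:
    # ... request the page here ...
    time.sleep(delay)  # pause so we don't overload the site
```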
6.3.14 Limitations
Websites can change their layout, which can cause your scraping code to stop working.
Some sites load content using JavaScript, so you might need a tool like Selenium or an API instead
There are legal and ethical rules you have to follow when scraping.
6.3.15 Conclusion
Web scraping helps you collect public data from websites automatically.
It’s useful when the data isn’t already in a CSV.
Tables can be pulled with pandas.read_html(), and text/links can be pulled with BeautifulSoup.
APIs are usually more stable when they exist.
Scraping can break if a website changes, so it needs maintenance.