6  Data Acquisition & Engineering

6.1 Import and Export Data

This presentation was prepared by Zaynab Faisal.

6.1.1 Introduction

Data analysis begins with importing data and ends with exporting results. Before creating visualizations or running models, the dataset must be loaded correctly into memory. Small mistakes such as incorrect file paths, wrong delimiters or unparsed date columns can change how the data behaves. If these issues are not identified early, they affect every step of the workflow.

This presentation focuses on building a reproducible importing and exporting workflow using pandas in Python.

6.1.2 Pandas

Pandas is a Python library used for working with tabular data. It allows data to be stored in a DataFrame, which is similar to a spreadsheet or SQL table.
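As a minimal illustration (the column names and values below are made up), a DataFrame can be built directly from a Python dictionary:

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
df = pd.DataFrame({
    "borough": ["QUEENS", "MANHATTAN", "BROOKLYN"],
    "crashes": [120, 95, 143],
})
print(df.shape)   # (3, 2)
```

Once data is in a DataFrame, the same attributes and methods used throughout this presentation (shape, dtypes, info(), head()) apply regardless of the original file format.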

6.1.3 Importing CSV Files

CSV (Comma Separated Values) files are the most common format for storing structured data. In pandas, CSV files are imported using the read_csv() function.

6.1.3.1 Real-World Example

The dataset used in this example is nyc_crashes_lbdwk_2025.csv, located in the data folder of the course repository. This file contains weekly crash records and has not been cleaned in order to demonstrate common issues that arise during the importing process.

import pandas as pd

# Import the CSV file into a DataFrame
df = pd.read_csv("data/nyc_crashes_lbdwk_2025.csv")

# Show the first 10 rows of the dataset
df.head(10)
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/31/2025 12:49 QUEENS 11101.0 40.753113 -73.933700 (40.753113, -73.9337) 30 ST 39 AVE NaN ... NaN NaN NaN NaN 4838875 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
1 08/31/2025 15:30 MANHATTAN 10022.0 40.760600 -73.964320 (40.7606, -73.96432) E 59 ST 2 AVE NaN ... NaN NaN NaN NaN 4839110 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
2 08/31/2025 19:00 NaN NaN 40.734234 -73.722750 (40.734234, -73.72275) CROSS ISLAND PARKWAY HILLSIDE AVENUE NaN ... Unspecified Unspecified NaN NaN 4838966 Sedan Sedan NaN NaN NaN
3 08/31/2025 1:19 BROOKLYN 11220.0 40.648075 -74.007034 (40.648075, -74.007034) NaN NaN 4415 5 AVE ... Unspecified NaN NaN NaN 4838563 Sedan E-Bike NaN NaN NaN
4 08/31/2025 2:41 MANHATTAN 10036.0 40.756560 -73.986110 (40.75656, -73.98611) W 43 ST BROADWAY NaN ... Unspecified NaN NaN NaN 4838922 Station Wagon/Sport Utility Vehicle Bike NaN NaN NaN
5 08/31/2025 19:42 BRONX 10466.0 40.887672 -73.847046 (40.887672, -73.847046) LACONIA AVE GRENADA PL NaN ... NaN NaN NaN NaN 4839184 NaN NaN NaN NaN NaN
6 08/31/2025 8:21 BROOKLYN 11221.0 40.689217 -73.917650 (40.689217, -73.91765) BUSHWICK AVE PUTNAM AVE NaN ... Unspecified Unspecified NaN NaN 4839066 Sedan Sedan Station Wagon/Sport Utility Vehicle NaN NaN
7 08/31/2025 2:50 QUEENS 11411.0 40.685990 -73.728730 (40.68599, -73.72873) 120 AVE 236 ST NaN ... Unspecified Unspecified Unspecified NaN 4838591 Sedan Sedan Sedan Station Wagon/Sport Utility Vehicle NaN
8 08/31/2025 21:45 QUEENS 11434.0 40.673023 -73.776566 (40.673023, -73.776566) 161 ST 134 AVE NaN ... Unspecified Unspecified Unspecified NaN 4839302 Sedan Sedan Sedan Sedan NaN
9 08/31/2025 10:25 QUEENS 11004.0 40.736750 -73.711030 (40.73675, -73.71103) HILLSIDE AVE 257 ST NaN ... Unspecified NaN NaN NaN 4838641 Station Wagon/Sport Utility Vehicle Sedan NaN NaN NaN

10 rows × 29 columns

Importing the file is only the first step. After loading the dataset, it is important to verify its structure before proceeding with analysis.

6.1.3.2 Verifying the Structure

Importing a dataset doesn’t guarantee that it was interpreted correctly, so the structure should be verified immediately after loading the file. Two quick checks are the dimensions of the DataFrame, available through the shape attribute, and the data type of each column, available through the dtypes attribute.

df.shape
(1487, 29)

The shape attribute will confirm the number of rows and columns. This helps detect incomplete or truncated files.

df.dtypes
CRASH DATE                        object
CRASH TIME                        object
BOROUGH                           object
ZIP CODE                         float64
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED          int64
NUMBER OF PERSONS KILLED           int64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLIST INJURED          int64
NUMBER OF CYCLIST KILLED           int64
NUMBER OF MOTORIST INJURED         int64
NUMBER OF MOTORIST KILLED          int64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR VEHICLE 3     object
CONTRIBUTING FACTOR VEHICLE 4     object
CONTRIBUTING FACTOR VEHICLE 5     object
COLLISION_ID                       int64
VEHICLE TYPE CODE 1               object
VEHICLE TYPE CODE 2               object
VEHICLE TYPE CODE 3               object
VEHICLE TYPE CODE 4               object
VEHICLE TYPE CODE 5               object
dtype: object

The dtypes attribute shows how pandas interpreted each column. This is important because the data type determines how a variable behaves during filtering and modeling.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1487 entries, 0 to 1486
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   CRASH DATE                     1487 non-null   object 
 1   CRASH TIME                     1487 non-null   object 
 2   BOROUGH                        1203 non-null   object 
 3   ZIP CODE                       1203 non-null   float64
 4   LATITUDE                       1475 non-null   float64
 5   LONGITUDE                      1475 non-null   float64
 6   LOCATION                       1475 non-null   object 
 7   ON STREET NAME                 1031 non-null   object 
 8   CROSS STREET NAME              900 non-null    object 
 9   OFF STREET NAME                456 non-null    object 
 10  NUMBER OF PERSONS INJURED      1487 non-null   int64  
 11  NUMBER OF PERSONS KILLED       1487 non-null   int64  
 12  NUMBER OF PEDESTRIANS INJURED  1487 non-null   int64  
 13  NUMBER OF PEDESTRIANS KILLED   1487 non-null   int64  
 14  NUMBER OF CYCLIST INJURED      1487 non-null   int64  
 15  NUMBER OF CYCLIST KILLED       1487 non-null   int64  
 16  NUMBER OF MOTORIST INJURED     1487 non-null   int64  
 17  NUMBER OF MOTORIST KILLED      1487 non-null   int64  
 18  CONTRIBUTING FACTOR VEHICLE 1  1478 non-null   object 
 19  CONTRIBUTING FACTOR VEHICLE 2  1132 non-null   object 
 20  CONTRIBUTING FACTOR VEHICLE 3  129 non-null    object 
 21  CONTRIBUTING FACTOR VEHICLE 4  40 non-null     object 
 22  CONTRIBUTING FACTOR VEHICLE 5  13 non-null     object 
 23  COLLISION_ID                   1487 non-null   int64  
 24  VEHICLE TYPE CODE 1            1470 non-null   object 
 25  VEHICLE TYPE CODE 2            1012 non-null   object 
 26  VEHICLE TYPE CODE 3            124 non-null    object 
 27  VEHICLE TYPE CODE 4            35 non-null     object 
 28  VEHICLE TYPE CODE 5            13 non-null     object 
dtypes: float64(3), int64(9), object(17)
memory usage: 337.0+ KB

The info() method summarizes non-null counts and data types in a single view. This helps identify columns with missing values and detect unexpected type assignments.

6.1.4 Common Mistakes When Importing CSV Files

Importing a dataset does not guarantee that it was interpreted correctly. Many issues occur silently, meaning the dataset loads without error but behaves incorrectly during analysis.

The following examples demonstrate common mistakes and how to identify and fix them.

6.1.4.1 Incorrect File Path

A very common issue that arises when importing data is an incorrect file path.

try:
    pd.read_csv("nyc_crashes_lbdwk_2025.csv")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
FileNotFoundError: [Errno 2] No such file or directory: 'nyc_crashes_lbdwk_2025.csv'

This error occurs because Python searches for the file in the current working directory. Since the file is stored inside the data folder, the relative path must include that folder name.

In order to confirm the working directory:

import os
os.getcwd()
'/Users/kun.chen/Library/CloudStorage/Dropbox/Teaching/Spring-2026/STAT3255/ids-s26'

The working directory determines where Python searches for files when using relative paths. If the file is stored inside a folder, the path must reflect that structure.
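One defensive pattern, sketched here with pathlib from the standard library, is to build the path explicitly and check that the file exists before reading it:

```python
from pathlib import Path

import pandas as pd

# Build the path relative to the current working directory
csv_path = Path("data") / "nyc_crashes_lbdwk_2025.csv"

# Checking existence first turns a confusing traceback into a clear message
if csv_path.exists():
    df = pd.read_csv(csv_path)
else:
    print(f"File not found: {csv_path.resolve()}")
```

Printing the resolved absolute path makes it easy to see exactly where Python looked for the file.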

6.1.4.2 Fixing the Path

The correct path to the file would be as follows:

pd.read_csv("data/nyc_crashes_lbdwk_2025.csv").head()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/31/2025 12:49 QUEENS 11101.0 40.753113 -73.933700 (40.753113, -73.9337) 30 ST 39 AVE NaN ... NaN NaN NaN NaN 4838875 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
1 08/31/2025 15:30 MANHATTAN 10022.0 40.760600 -73.964320 (40.7606, -73.96432) E 59 ST 2 AVE NaN ... NaN NaN NaN NaN 4839110 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
2 08/31/2025 19:00 NaN NaN 40.734234 -73.722750 (40.734234, -73.72275) CROSS ISLAND PARKWAY HILLSIDE AVENUE NaN ... Unspecified Unspecified NaN NaN 4838966 Sedan Sedan NaN NaN NaN
3 08/31/2025 1:19 BROOKLYN 11220.0 40.648075 -74.007034 (40.648075, -74.007034) NaN NaN 4415 5 AVE ... Unspecified NaN NaN NaN 4838563 Sedan E-Bike NaN NaN NaN
4 08/31/2025 2:41 MANHATTAN 10036.0 40.756560 -73.986110 (40.75656, -73.98611) W 43 ST BROADWAY NaN ... Unspecified NaN NaN NaN 4838922 Station Wagon/Sport Utility Vehicle Bike NaN NaN NaN

5 rows × 29 columns

6.1.4.3 Wrong Delimiter

In some files, the separator between values is not a comma. If the delimiter is incorrect, the entire dataset may appear as a single column. In those cases, the correct separator must be specified.

# Example of incorrect delimiter
df_wrong = pd.read_csv("data/nyc_crashes_lbdwk_2025.csv", sep=";")
df_wrong.head(10)
CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0 08/31/2025,12:49,QUEENS,11101,40.753113,-73.93...
1 08/31/2025,15:30,MANHATTAN,10022,40.7606,-73.9...
2 08/31/2025,19:00,,,40.734234,-73.72275,"(40.73...
3 08/31/2025,1:19,BROOKLYN,11220,40.648075,-74.0...
4 08/31/2025,2:41,MANHATTAN,10036,40.75656,-73.9...
5 08/31/2025,19:42,BRONX,10466,40.887672,-73.847...
6 08/31/2025,8:21,BROOKLYN,11221,40.689217,-73.9...
7 08/31/2025,2:50,QUEENS,11411,40.68599,-73.7287...
8 08/31/2025,21:45,QUEENS,11434,40.673023,-73.77...
9 08/31/2025,10:25,QUEENS,11004,40.73675,-73.711...

In this case, the entire dataset loads into a single column. This happens because the file actually uses commas, not semicolons.

The correct import removes the incorrect separator argument:

df_correct = pd.read_csv("data/nyc_crashes_lbdwk_2025.csv")
df_correct.head()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/31/2025 12:49 QUEENS 11101.0 40.753113 -73.933700 (40.753113, -73.9337) 30 ST 39 AVE NaN ... NaN NaN NaN NaN 4838875 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
1 08/31/2025 15:30 MANHATTAN 10022.0 40.760600 -73.964320 (40.7606, -73.96432) E 59 ST 2 AVE NaN ... NaN NaN NaN NaN 4839110 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
2 08/31/2025 19:00 NaN NaN 40.734234 -73.722750 (40.734234, -73.72275) CROSS ISLAND PARKWAY HILLSIDE AVENUE NaN ... Unspecified Unspecified NaN NaN 4838966 Sedan Sedan NaN NaN NaN
3 08/31/2025 1:19 BROOKLYN 11220.0 40.648075 -74.007034 (40.648075, -74.007034) NaN NaN 4415 5 AVE ... Unspecified NaN NaN NaN 4838563 Sedan E-Bike NaN NaN NaN
4 08/31/2025 2:41 MANHATTAN 10036.0 40.756560 -73.986110 (40.75656, -73.98611) W 43 ST BROADWAY NaN ... Unspecified NaN NaN NaN 4838922 Station Wagon/Sport Utility Vehicle Bike NaN NaN NaN

5 rows × 29 columns
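When the delimiter is unknown, the standard library's csv.Sniffer can guess it from a sample of the text. The sketch below uses a small, made-up semicolon-delimited sample:

```python
import csv
import io

import pandas as pd

# A small semicolon-delimited sample, made up for illustration
raw = "owner;city;sales\nLily;Florida;500\nSummer;California;800\n"

# csv.Sniffer guesses the delimiter from a sample of the text
dialect = csv.Sniffer().sniff(raw)
print(dialect.delimiter)

df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)
print(df.shape)
```

In practice, the sample passed to sniff() would be the first few lines read from the file.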

6.1.4.4 Unexpected Data Types

Numeric columns may be imported as the object dtype if they contain inconsistent values or non-numeric characters.

df["ZIP CODE"].dtype
dtype('float64')

To convert the column safely:

df["ZIP CODE"] = pd.to_numeric(df["ZIP CODE"], errors="coerce")
df["ZIP CODE"].dtype
dtype('float64')

The argument errors="coerce" converts invalid values into missing values (NaN) rather than raising an error. In this dataset, ZIP CODE was already parsed as float64, so the conversion leaves it unchanged; the same call protects against files where the column contains stray text.
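Because ZIP CODE was already numeric in this dataset, a small synthetic example shows the coercion actually at work:

```python
import pandas as pd

# A synthetic column mixing numbers with a non-numeric entry
s = pd.Series(["11101", "10022", "N/A"])
print(s.dtype)  # object

# The invalid entry "N/A" becomes NaN instead of raising an error
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned)
```

After coercion, the column is float64 and the single invalid entry is missing rather than silently kept as text.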

6.1.4.5 Missing Values

Missing values affect summary statistics and modeling decisions. They should be inspected early.

df.isnull().sum()
CRASH DATE                          0
CRASH TIME                          0
BOROUGH                           284
ZIP CODE                          284
LATITUDE                           12
LONGITUDE                          12
LOCATION                           12
ON STREET NAME                    456
CROSS STREET NAME                 587
OFF STREET NAME                  1031
NUMBER OF PERSONS INJURED           0
NUMBER OF PERSONS KILLED            0
NUMBER OF PEDESTRIANS INJURED       0
NUMBER OF PEDESTRIANS KILLED        0
NUMBER OF CYCLIST INJURED           0
NUMBER OF CYCLIST KILLED            0
NUMBER OF MOTORIST INJURED          0
NUMBER OF MOTORIST KILLED           0
CONTRIBUTING FACTOR VEHICLE 1       9
CONTRIBUTING FACTOR VEHICLE 2     355
CONTRIBUTING FACTOR VEHICLE 3    1358
CONTRIBUTING FACTOR VEHICLE 4    1447
CONTRIBUTING FACTOR VEHICLE 5    1474
COLLISION_ID                        0
VEHICLE TYPE CODE 1                17
VEHICLE TYPE CODE 2               475
VEHICLE TYPE CODE 3              1363
VEHICLE TYPE CODE 4              1452
VEHICLE TYPE CODE 5              1474
dtype: int64

Checking missing values early helps identify which variables may require cleaning before modeling.

6.1.5 Importing Other Common Formats

While CSV is the most common format, pandas supports many other file types used in practice.

6.1.5.1 Excel Files

Excel files are frequently used in business settings. They can be imported using read_excel().

df_excel = pd.read_excel("data/example.xlsx", sheet_name="Sheet1")

Excel files may contain multiple sheets, so specifying sheet_name is often necessary. Similar to CSV, the structure should be verified immediately after import using .shape, .info(), and .dtypes.

6.1.5.2 JSON Files

JSON files are commonly used in web applications and APIs. They can be imported using read_json().

df_json = pd.read_json("data/example.json")

Unlike CSV and Excel, JSON files may contain nested structures. In those cases, additional processing such as normalization may be required before analysis.
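As a sketch with made-up records, pd.json_normalize() flattens such nested structures into dotted column names:

```python
import pandas as pd

# A nested record structure, made up for illustration
records = [
    {"id": 1, "location": {"borough": "QUEENS", "zip": "11101"}},
    {"id": 2, "location": {"borough": "BRONX", "zip": "10466"}},
]

# json_normalize flattens nested fields into dotted column names
flat = pd.json_normalize(records)
print(flat.columns.tolist())
# ['id', 'location.borough', 'location.zip']
```

In practice the records would come from json.load() or an API response rather than a hand-written list.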

6.1.6 Comparing CSV and Feather Formats

CSV is the most common format for tabular data, but it is not always the most efficient. Feather is a binary format designed for speed and type preservation.

Feather files require the pyarrow dependency. In some environments, attempting to read a Feather file without pyarrow installed will result in an ImportError. This highlights an important reproducibility lesson: certain file formats require additional dependencies that must be installed before use.

Feather files are optimized for performance and type preservation, making them well-suited for large-scale workflows. However, because they are binary and not human-readable, CSV files remain more portable and easier to inspect manually.
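A defensive sketch (the file names are illustrative) checks for pyarrow before choosing Feather, falling back to CSV when the dependency is missing:

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

def can_use_feather() -> bool:
    """Return True only when the pyarrow dependency is importable."""
    return importlib.util.find_spec("pyarrow") is not None

df = pd.DataFrame({"x": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp:
    if can_use_feather():
        path = Path(tmp) / "example.feather"
        df.to_feather(path)            # binary, type-preserving
        df_back = pd.read_feather(path)
    else:
        # Fall back to CSV so the workflow still runs without pyarrow
        path = Path(tmp) / "example.csv"
        df.to_csv(path, index=False)
        df_back = pd.read_csv(path)

print(df_back.shape)
```

Checking for the dependency up front turns a mid-pipeline ImportError into an explicit, recoverable branch.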

6.1.7 Exporting Data

After verifying and cleaning the dataset, the results can be exported for reuse or collaboration. Exporting ensures that the cleaned version of the data can be shared or reused without repeating the cleaning steps.

For example, the cleaned DataFrame can be exported to a new CSV file.

df.to_csv("data/nyc_crashes_cleaned_export.csv", index=False)

This command creates a new CSV file containing the cleaned dataset. The argument index=False prevents pandas from writing the DataFrame index as an extra column. Forgetting this argument often results in unwanted columns such as "Unnamed: 0" when the file is imported again.
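The round trip can be demonstrated in memory with io.StringIO (the small DataFrame below is made up for illustration):

```python
import io

import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Writing WITH the index (the default) adds an unnamed leading column
buf = io.StringIO()
df_small.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(with_index.columns.tolist())   # ['Unnamed: 0', 'a', 'b']

# Writing with index=False keeps only the real columns
buf = io.StringIO()
df_small.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(without_index.columns.tolist())  # ['a', 'b']
```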

6.1.7.1 Exporting Figures

import matplotlib.pyplot as plt
import os

plt.figure()
df["ZIP CODE"].value_counts().head(10).plot(kind="bar")

plt.title("Top ZIP Codes by Crash Count")
plt.ylabel("Count")
plt.xticks(rotation=45)

os.makedirs("figures", exist_ok=True)

plt.savefig("figures/top_zipcodes.png", dpi=300, bbox_inches="tight")
plt.show()
plt.close()

The savefig() function saves the visualization to a file:

  • dpi=300 ensures high resolution for reports.
  • bbox_inches="tight" removes excess whitespace.
  • Saving figures to a figures/ folder improves project organization.

6.1.8 Best Practices for Importing and Exporting Data

The following practices help prevent silent errors:

  • Inspect the shape, data types, and missing values immediately after importing a dataset.
  • Use relative file paths within a repository to maintain portability.
  • Parse date columns during import instead of converting them later.
  • Avoid manually editing raw data files after they are stored.
  • Separate raw data from processed outputs within the project directory.
  • Always verify structure before beginning analysis.
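The date-parsing practice above can be sketched as follows; the column names mirror the crash dataset, but the sample rows are made up:

```python
import io

import pandas as pd

raw = "CRASH DATE,CRASH TIME\n08/31/2025,12:49\n08/30/2025,15:30\n"

# Without parse_dates the date column stays as plain strings (object)
df_raw = pd.read_csv(io.StringIO(raw))
print(df_raw["CRASH DATE"].dtype)

# parse_dates converts the column to datetime64 during import
df_dates = pd.read_csv(io.StringIO(raw), parse_dates=["CRASH DATE"])
print(df_dates["CRASH DATE"].dtype)
```

Parsing dates at import time means every later step (filtering by week, resampling, plotting) works with real timestamps instead of strings.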

6.1.9 Further Readings

  • https://realpython.com/python-csv/
  • https://miamioh.edu/centers-institutes/center-for-analytics-data-science/students/coding-tutorials/python/reading-data-and-writing-files.html
  • https://pandas.pydata.org/docs/user_guide/10min.html
  • https://www.sqlshack.com/exporting-data-with-pandas-in-python/
  • https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html


6.2 Database Operation with Structured Query Language (SQL)

This presentation was prepared by Shreya Sudarsanam.

6.2.1 How To Use SQL In Python With Pandas Using a SQL-Created Table

  • First, we will import pandas
  • Then, we will import sqlite3
import pandas as pd
import sqlite3
  • Next, we will create a connection to a SQL database using sqlite3.connect("name_of_database") and store it in a variable.
sql_database_connection = sqlite3.connect(':memory:')
  • Now, we will create a cursor object using the connection's .cursor() method:
# Create a cursor object used to run SQL queries
cursor = sql_database_connection.cursor()
# Execute the SQL query that creates the table
cursor.execute('CREATE TABLE lemon_types_new (organic_or_not TEXT, GMO_or_NOT TEXT)')
# Execute the SQL query that inserts a row into the table
cursor.execute('INSERT INTO lemon_types_new VALUES ("Yes", "Yes")')
<sqlite3.Cursor at 0x117ce2740>
  • To store the SQL query, assign it to a variable.
sql_query = """SELECT * FROM lemon_types_new"""
  • To execute the query, use pd.read_sql_query(), which returns a pandas DataFrame. Its two arguments are the query string (sql_query) and the database connection (sql_database_connection).
pandas_data_frame = pd.read_sql_query(sql_query, sql_database_connection)
print(pandas_data_frame)
  organic_or_not GMO_or_NOT
0            Yes        Yes
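As a safety-minded variation on the steps above (the table contents are made up), pd.read_sql_query() also accepts a params argument that fills ? placeholders, which avoids pasting values directly into the query string:

```python
import sqlite3

import pandas as pd

# Build a small throwaway table in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemon_types (organic_or_not TEXT, GMO_or_NOT TEXT)")
conn.execute("INSERT INTO lemon_types VALUES ('Yes', 'Yes')")
conn.execute("INSERT INTO lemon_types VALUES ('No', 'Yes')")
conn.commit()

# params fills the ? placeholder, keeping values out of the SQL text
df = pd.read_sql_query(
    "SELECT * FROM lemon_types WHERE organic_or_not = ?",
    conn,
    params=("Yes",),
)
print(len(df))
conn.close()
```

Parameter placeholders are the standard way to pass user-supplied values into a query, since they prevent SQL injection and quoting mistakes.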

6.2.2 How To Use SQL In Python With Pandas Data Frame

  • First, we will import pandas
  • Then, we will import sqlite3
import pandas as pd
import sqlite3
  • Next, we create a pandas data frame with stored values:
pandas_data_frame_with_values = pd.DataFrame({"owners_of_lemonade_stands": ["Kathy", "Lily", "Lavender", "Samantha", "Rose", "Rebecca"],
                                             "week_one_sales_dollars": [100, 120, 300, 650, 200, 210],
                                             "week_two_sales_dollars": [120, 140, 380, 780, 220, 260],
                                             "week_three_sales_dollars": [20, 40, 80, 100, 50, 60],
                                             "week_four_sales_dollars": [200, 240, 600, 1310, 400, 420],
                                             "temperature_degrees_fareheit": [60, 70, 30, 80, 100, 120]})
  • Next, we will create a connection to a SQL database using sqlite3.connect("name_of_database") and store it in a variable.
sql_database_connection = sqlite3.connect("lemonade_stand_business.db")

Important note: If the database has not been created yet, the sqlite3.connect() command will create a new database.

  • To convert the DataFrame to a SQL table, use df.to_sql(). This method takes the following arguments:

  • The first argument is name, which specifies the name of the SQL table to create.

  • The second argument is con, which specifies the connection to a specific database.

  • The third argument, which is optional, is if_exists="replace". This ensures that if a SQL table with the same name already exists, the older table is replaced by the newer one.

  • The last argument, which is also optional, is index=True or index=False. If index=True, the DataFrame's numerical index is written as a column; if index=False, the index is omitted.

pandas_data_frame_with_values.to_sql("lemonade_stand_data", sql_database_connection, if_exists="replace", index=True)
6
The value returned by to_sql(), 6 here, is the number of rows written to the table.
  • To store the SQL query, assign it to a variable.
sql_query_new = """SELECT * from lemonade_stand_data"""
  • To create a data frame that captures a view of your original SQL table, create a new data frame variable, and apply the .read_sql_query() function:
data_frame_select_all = pd.read_sql_query(sql_query_new, sql_database_connection)
print(data_frame_select_all)
   index owners_of_lemonade_stands  week_one_sales_dollars  \
0      0                     Kathy                     100   
1      1                      Lily                     120   
2      2                  Lavender                     300   
3      3                  Samantha                     650   
4      4                      Rose                     200   
5      5                   Rebecca                     210   

   week_two_sales_dollars  week_three_sales_dollars  week_four_sales_dollars  \
0                     120                        20                      200   
1                     140                        40                      240   
2                     380                        80                      600   
3                     780                       100                     1310   
4                     220                        50                      400   
5                     260                        60                      420   

   temperature_degrees_fareheit  
0                            60  
1                            70  
2                            30  
3                            80  
4                           100  
5                           120  

6.2.3 Different Ways To Establish SQL Database Connections

SQL database connections can be established in two ways: you can either create a temporary database in memory, or create a named database.

Example of creating a temporary database in memory:

sql_database_connection = sqlite3.connect(':memory:')

Example of creating a named database that can be saved:

sql_database_connection = sqlite3.connect("lemonade_stand_business.db")

6.2.4 How To Use SQLite 3 In Python Without Pandas

  • Import sqlite3
import sqlite3
  • Establish a connection to a SQL database. The following example establishes a connection to a temporary, in-memory database:
sql_database_connection = sqlite3.connect(':memory:')
  • Next, create a cursor object to run SQL queries by calling the connection's .cursor() method.
cursor = sql_database_connection.cursor()
  • Afterwards, execute a SQL command using cursor.execute("SQL command")
cursor.execute('CREATE TABLE new_lemonade_stand (owner TEXT, weekly_sales_avg INT)')
<sqlite3.Cursor at 0x117de0ec0>
  • Creates a table named new_lemonade_stand
  • The table has columns named owner and weekly_sales_avg
  • The data type for the owner column is TEXT
  • The data type for the weekly_sales_avg column is INT
cursor.execute('INSERT INTO new_lemonade_stand VALUES ("Lavender", 300)')
<sqlite3.Cursor at 0x117de0ec0>
  • Adds data to the new_lemonade_stand table, using the INSERT INTO command

  • It is important to save any alterations to the database by calling sql_database_connection.commit().

sql_database_connection.commit()
  • To view the result of a SQL query, call the cursor.fetchall() method:
cursor.execute('SELECT * FROM new_lemonade_stand')
sql_table_result = cursor.fetchall()
print(sql_table_result)
[('Lavender', 300)]
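To insert several rows at once, cursor.executemany() runs the same statement once per tuple of values; the ? placeholders below are standard sqlite3 parameter markers (the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE new_lemonade_stand (owner TEXT, weekly_sales_avg INT)")

# executemany runs the same INSERT once for each tuple of values;
# the ? placeholders keep the data safely separated from the SQL text
rows = [("Lavender", 300), ("Kathy", 150), ("Rose", 220)]
cur.executemany("INSERT INTO new_lemonade_stand VALUES (?, ?)", rows)
conn.commit()

cur.execute("SELECT COUNT(*) FROM new_lemonade_stand")
count = cur.fetchone()[0]
print(count)
conn.close()
```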

6.2.5 How to display cursor.fetchall() results in a pandas data frame

cursor.execute('SELECT * FROM new_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner  weekly_sales_avg
0  Lavender               300

6.2.6 Types of Commands In SQL

6.2.6.1 Data Definition Language (DDL):

    1. SQL statements that name, change the structure of, and eliminate objects in a database

6.2.6.2 Data Control Language (DCL):

    1. SQL statements that control who has access to database data.

6.2.6.3 Data Manipulation Language (DML):

    1. SQL statements that do not impact database structure. Instead, these statements change/access objects inside a database.

6.2.6.4 Data Transaction Control (DTC):

    1. SQL statements that track how Data Manipulation Language commands change objects in a database by logging these changes as transactions.

6.2.7 Exploring Types of Data Manipulation Language (DML) Commands

    1. Selection: Involves selecting specific rows from a table
    2. Projection: Involves selecting specific columns from a table
    3. Joining: Involves joining tables together

6.2.7.1 Exploring SELECTION with organic_lemonade_stand data table:

The below code helps create the organic_lemonade_stand data table:

cursor.execute('CREATE TABLE organic_lemonade_stand (owner TEXT, location TEXT, weekly_expenses_avg INT, weekly_sales_avg INT)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Lily", "Florida", 100, 500)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Summer", "California", 300, 800)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Layla", "New York", 200, 250)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Beatrice", "Alaska", 50, 60)')
cursor.execute('INSERT INTO organic_lemonade_stand VALUES ("Carla", "Massachusetts", 250, 300)')
sql_database_connection.commit()

The below code helps select all rows in the organic_lemonade_stand data table:

cursor.execute('SELECT * FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner       location  weekly_expenses_avg  weekly_sales_avg
0      Lily        Florida                  100               500
1    Summer     California                  300               800
2     Layla       New York                  200               250
3  Beatrice         Alaska                   50                60
4     Carla  Massachusetts                  250               300

6.2.7.2 Exploring PROJECTION with organic_lemonade_stand database

The owner column is being selected

cursor.execute('SELECT owner FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner
0      Lily
1    Summer
2     Layla
3  Beatrice
4     Carla

6.2.7.3 Exploring SELECTION & PROJECTION combined

The owner and weekly_expenses_avg columns are being selected (projection), and rows where weekly_expenses_avg is greater than 100 are also selected (selection).

cursor.execute('SELECT owner, weekly_expenses_avg FROM organic_lemonade_stand WHERE weekly_expenses_avg > 100')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner  weekly_expenses_avg
0  Summer                  300
1   Layla                  200
2   Carla                  250

6.2.7.4 Exploring JOINING with organic_lemonade_stand data table and sparkling_lemon_water data table

The below code creates the sparkling_lemon_water data table:

cursor.execute('CREATE TABLE sparkling_lemon_water (owner TEXT, location TEXT, weekly_expenses_avg INT, weekly_sales_avg INT)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Daisy", "Florida", 20, 65)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Kathy", "California", 65, 160)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Samantha", "New York", 85, 280)')
cursor.execute('INSERT INTO sparkling_lemon_water VALUES ("Samantha", "New York", 85, 280)')
sql_database_connection.commit()
6.2.7.4.1 INNER JOIN: A type of SQL join where only rows whose join-key values match in both tables are presented in the output.
cursor.execute('SELECT * FROM organic_lemonade_stand INNER JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner    location  weekly_expenses_avg  weekly_sales_avg     owner  \
0    Lily     Florida                  100               500     Daisy   
1  Summer  California                  300               800     Kathy   
2   Layla    New York                  200               250  Samantha   
3   Layla    New York                  200               250  Samantha   

     location  weekly_expenses_avg  weekly_sales_avg  
0     Florida                   20                65  
1  California                   65               160  
2    New York                   85               280  
3    New York                   85               280  
6.2.7.4.2 LEFT JOIN: A type of SQL join where every row in the left table is shown in the output, along with rows in the right table that match the left table
cursor.execute('SELECT * FROM organic_lemonade_stand LEFT JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner       location  weekly_expenses_avg  weekly_sales_avg     owner  \
0      Lily        Florida                  100               500     Daisy   
1    Summer     California                  300               800     Kathy   
2     Layla       New York                  200               250  Samantha   
3     Layla       New York                  200               250  Samantha   
4  Beatrice         Alaska                   50                60      None   
5     Carla  Massachusetts                  250               300      None   

     location  weekly_expenses_avg  weekly_sales_avg  
0     Florida                 20.0              65.0  
1  California                 65.0             160.0  
2    New York                 85.0             280.0  
3    New York                 85.0             280.0  
4        None                  NaN               NaN  
5        None                  NaN               NaN  
6.2.7.4.3 RIGHT JOIN: A type of SQL join where every row in the right table appears in the output, along with the matching rows from the left table. (Note: SQLite supports RIGHT JOIN only from version 3.39.0 onward.)
cursor.execute('SELECT * FROM organic_lemonade_stand RIGHT JOIN sparkling_lemon_water ON organic_lemonade_stand.location = sparkling_lemon_water.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner    location  weekly_expenses_avg  weekly_sales_avg     owner  \
0    Lily     Florida                  100               500     Daisy   
1  Summer  California                  300               800     Kathy   
2   Layla    New York                  200               250  Samantha   
3   Layla    New York                  200               250  Samantha   

     location  weekly_expenses_avg  weekly_sales_avg  
0     Florida                   20                65  
1  California                   65               160  
2    New York                   85               280  
3    New York                   85               280  
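
For comparison with the SQL joins above, the same logic can be expressed in pandas with DataFrame.merge. This is a small illustrative sketch: the frames below mirror only the owner and location columns of the two lemonade tables (the other columns are omitted for brevity):

```python
import pandas as pd

# Minimal versions of the two tables (owner and location columns only).
organic = pd.DataFrame({
    "owner": ["Lily", "Summer", "Layla", "Beatrice", "Carla"],
    "location": ["Florida", "California", "New York", "Alaska", "Massachusetts"],
})
sparkling = pd.DataFrame({
    "owner": ["Daisy", "Kathy", "Samantha", "Samantha"],
    "location": ["Florida", "California", "New York", "New York"],
})

# how="inner" keeps only matching locations; how="left" keeps every
# organic row and fills unmatched sparkling columns with NaN.
inner = organic.merge(sparkling, on="location", how="inner")
left = organic.merge(sparkling, on="location", how="left")
print(len(inner), len(left))  # 4 6
```

The row counts match the SQL outputs above: four rows for the INNER JOIN (New York matches twice because of the duplicate Samantha row) and six for the LEFT JOIN.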

6.2.8 Using Aliases In DML Statements

To reference a table more concisely, we can “alias” it, i.e., give it a shorter name, which makes the code more readable. Below is the INNER JOIN query from above, rewritten with table aliases:

cursor.execute('SELECT * FROM organic_lemonade_stand lemon INNER JOIN sparkling_lemon_water sparkling ON lemon.location = sparkling.location')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner    location  weekly_expenses_avg  weekly_sales_avg     owner  \
0    Lily     Florida                  100               500     Daisy   
1  Summer  California                  300               800     Kathy   
2   Layla    New York                  200               250  Samantha   
3   Layla    New York                  200               250  Samantha   

     location  weekly_expenses_avg  weekly_sales_avg  
0     Florida                   20                65  
1  California                   65               160  
2    New York                   85               280  
3    New York                   85               280  

6.2.9 Using DISTINCT in DML

The sparkling_lemon_water table contains duplicate rows for the owner Samantha. To return each unique row only once, we use the DISTINCT keyword.

cursor.execute('SELECT DISTINCT owner, location, weekly_expenses_avg, weekly_sales_avg FROM sparkling_lemon_water')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner    location  weekly_expenses_avg  weekly_sales_avg
0     Daisy     Florida                   20                65
1     Kathy  California                   65               160
2  Samantha    New York                   85               280

6.2.10 Logical Operators

    1. AND: If both conditions are TRUE, the entire expression is TRUE
    2. OR: If at least one condition is TRUE, the entire expression is TRUE
    3. NOT: Inverts a condition, returning TRUE when the condition is FALSE and FALSE when it is TRUE

6.2.10.1 Example of AND - Logical Operators

cursor.execute('SELECT * FROM organic_lemonade_stand WHERE weekly_expenses_avg < 300 AND weekly_sales_avg > 200')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   owner       location  weekly_expenses_avg  weekly_sales_avg
0   Lily        Florida                  100               500
1  Layla       New York                  200               250
2  Carla  Massachusetts                  250               300

6.2.10.2 Example of OR - Logical Operators

cursor.execute('SELECT * FROM organic_lemonade_stand WHERE weekly_expenses_avg < 300 OR weekly_sales_avg > 500')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner       location  weekly_expenses_avg  weekly_sales_avg
0      Lily        Florida                  100               500
1    Summer     California                  300               800
2     Layla       New York                  200               250
3  Beatrice         Alaska                   50                60
4     Carla  Massachusetts                  250               300

6.2.10.3 Example of NOT - Logical Operators

cursor.execute('SELECT * FROM organic_lemonade_stand WHERE NOT weekly_expenses_avg < 100')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner       location  weekly_expenses_avg  weekly_sales_avg
0    Lily        Florida                  100               500
1  Summer     California                  300               800
2   Layla       New York                  200               250
3   Carla  Massachusetts                  250               300

6.2.11 Arithmetic Operators

We can compute a new column, aliased weekly_profits_avg, by subtracting the weekly_expenses_avg column from the weekly_sales_avg column using arithmetic operators.

cursor.execute('SELECT weekly_sales_avg - weekly_expenses_avg AS weekly_profits_avg FROM organic_lemonade_stand')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   weekly_profits_avg
0                 400
1                 500
2                  50
3                  10
4                  50

6.2.12 Concatenation Operators

We can concatenate two columns together using a special concatenation operator:

cursor.execute("SELECT owner||': '||location AS 'owner location' FROM organic_lemonade_stand")
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
         owner location
0         Lily: Florida
1    Summer: California
2       Layla: New York
3      Beatrice: Alaska
4  Carla: Massachusetts

In this example, the owner and location columns are being concatenated to form the owner location column.

6.2.13 ORDER BY and DESC

We can find which owner is the most profitable by using the ORDER BY clause. By default, ORDER BY sorts rows in ascending order (least to greatest). Appending DESC, as in ORDER BY column_name DESC, sorts the rows in descending order (greatest to least).

ORDER BY Example:

cursor.execute('SELECT * FROM organic_lemonade_stand ORDER BY weekly_expenses_avg')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner       location  weekly_expenses_avg  weekly_sales_avg
0  Beatrice         Alaska                   50                60
1      Lily        Florida                  100               500
2     Layla       New York                  200               250
3     Carla  Massachusetts                  250               300
4    Summer     California                  300               800

ORDER BY DESC Example:

cursor.execute('SELECT * FROM organic_lemonade_stand ORDER BY weekly_expenses_avg DESC')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner       location  weekly_expenses_avg  weekly_sales_avg
0    Summer     California                  300               800
1     Carla  Massachusetts                  250               300
2     Layla       New York                  200               250
3      Lily        Florida                  100               500
4  Beatrice         Alaska                   50                60

6.2.14 Set Operators

    1. UNION: Combines the outputs of two SELECT queries into one set. Duplicate rows are removed from the result.
    2. UNION ALL: Like UNION, this combines the outputs of two SELECT queries, but duplicate rows are kept in the result.
    3. INTERSECT: Finds the rows shared by both tables/SELECT queries. The output contains distinct rows.
    4. MINUS/EXCEPT: Finds the rows that the first SELECT query returns but the second does not. These rows are distinct. Note that MINUS is Oracle-specific; SQLite uses EXCEPT instead.

Below is code that creates the numerical_data table:

cursor.execute('CREATE TABLE numerical_data (numbers INT)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (1)')
cursor.execute('INSERT INTO numerical_data VALUES (2)')
cursor.execute('INSERT INTO numerical_data VALUES (3)')
cursor.execute('INSERT INTO numerical_data VALUES (4)')
cursor.execute('INSERT INTO numerical_data VALUES (10)')
<sqlite3.Cursor at 0x117de0ec0>

Below is code that creates the cool_numbers_data table:

cursor.execute('CREATE TABLE cool_numbers_data (cool_numbers INT)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (1)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (2)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (3)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (4)')
cursor.execute('INSERT INTO cool_numbers_data VALUES (5)')
<sqlite3.Cursor at 0x117de0ec0>

Example of the UNION OPERATOR: Finds the union of the data tables, numerical_data and cool_numbers_data:

cursor.execute('SELECT numbers FROM numerical_data UNION SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   numbers
0        1
1        2
2        3
3        4
4        5
5       10

Example of the UNION ALL OPERATOR: Finds the UNION ALL of the data tables, numerical_data and cool_numbers_data:

cursor.execute('SELECT numbers FROM numerical_data UNION ALL SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    numbers
0         1
1         1
2         1
3         2
4         3
5         4
6        10
7         1
8         2
9         3
10        4
11        5

Example of the INTERSECT OPERATOR: Finds the intersection of the data tables, numerical_data and cool_numbers_data:

cursor.execute('SELECT numbers FROM numerical_data INTERSECT SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   numbers
0        1
1        2
2        3
3        4

Example of the EXCEPT OPERATOR: Finds the rows in numerical_data that do not appear in cool_numbers_data:

cursor.execute('SELECT numbers FROM numerical_data EXCEPT SELECT cool_numbers FROM cool_numbers_data')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   numbers
0       10

6.2.15 Subqueries

  • Subqueries are smaller queries nested within a larger query. When a subquery is evaluated, it helps filter the data, and this filter can be applied to the larger query.

6.2.15.1 Two Types of Subqueries

  • The two types of subqueries are Inline Views and Nested Subqueries
6.2.15.1.1 Inline Views
  • When a SELECT statement has a FROM clause that contains a subquery, this is called an Inline View.

Example:

cursor.execute('SELECT owner, weekly_sales_avg, weekly_profit_avg/5 AS daily_profit_avg FROM (SELECT owner, weekly_sales_avg, weekly_sales_avg - weekly_expenses_avg AS weekly_profit_avg FROM organic_lemonade_stand) ORDER BY daily_profit_avg')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner  weekly_sales_avg  daily_profit_avg
0  Beatrice                60                 2
1     Layla               250                10
2     Carla               300                10
3      Lily               500                80
4    Summer               800               100
6.2.15.1.2 Nested Subquery
  • When a SELECT statement has a WHERE clause that contains a subquery, this is called a Nested Subquery.

Example 1 (note that the subquery here returns multiple rows; when it is compared with a plain operator such as >, SQLite treats it as a scalar and uses only the first row of its result):

cursor.execute('SELECT owner, weekly_expenses_avg, weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg > (SELECT weekly_sales_avg FROM sparkling_lemon_water)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner  weekly_expenses_avg  weekly_sales_avg
0    Lily                  100               500
1  Summer                  300               800
2   Layla                  200               250
3   Carla                  250               300

Example 2:

cursor.execute('SELECT owner, weekly_expenses_avg, weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg < (SELECT weekly_sales_avg FROM sparkling_lemon_water)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
      owner  weekly_expenses_avg  weekly_sales_avg
0  Beatrice                   50                60

6.2.15.2 Classes of Subqueries

  • Subqueries can be categorized further into three classes.
  • The three classes of Subqueries are as follows: Single-row subqueries, Multiple-row subqueries, and Correlated subqueries
6.2.15.2.1 Single Row Subqueries
  1. Single-row subqueries: Subqueries that output a single row, which is then evaluated by the outer query

Example:

cursor.execute('SELECT owner, weekly_sales_avg, weekly_expenses_avg FROM organic_lemonade_stand WHERE weekly_expenses_avg <= 300 AND weekly_sales_avg > (SELECT AVG(weekly_sales_avg) AS mean_of_weekly_sales_avg FROM organic_lemonade_stand)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
    owner  weekly_sales_avg  weekly_expenses_avg
0    Lily               500                  100
1  Summer               800                  300
6.2.15.2.2 Multiple-Row Subqueries
  2. Multiple-row subqueries: Subqueries that output many rows, which are then evaluated by the outer query

Example:

cursor.execute('CREATE TABLE successful_organic_business_owners (successful_owner_name TEXT, weekly_revenue INT)')
sql_database_connection.commit()

cursor.execute('INSERT INTO successful_organic_business_owners SELECT owner, weekly_sales_avg - weekly_expenses_avg AS weekly_profits_avg FROM organic_lemonade_stand WHERE weekly_sales_avg - weekly_expenses_avg > 200')
cursor.execute('SELECT * FROM successful_organic_business_owners')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
  successful_owner_name  weekly_revenue
0                  Lily             400
1                Summer             500
6.2.15.2.3 Correlated Subqueries
  3. Correlated subqueries: Subqueries that depend on the outer query because they reference column information from the outer query

Example:

cursor.execute('SELECT weekly_sales_avg FROM organic_lemonade_stand WHERE weekly_sales_avg > (SELECT weekly_sales_avg FROM sparkling_lemon_water WHERE organic_lemonade_stand.weekly_expenses_avg > sparkling_lemon_water.weekly_expenses_avg)')
rows_fetched = cursor.fetchall()
columns_fetched = [metadata_description[0] for metadata_description in cursor.description]
data_frame_fetched = pd.DataFrame(rows_fetched, columns=columns_fetched)
print(data_frame_fetched)
   weekly_sales_avg
0               500
1               800
2               250
3               300

6.2.16 Data Visualization Using SQL

How to present a SQL query result using matplotlib:

  • First, import matplotlib
import matplotlib.pyplot as plt
  • Create a SQL query that you would like to visualize via matplotlib
data_frame_to_be_visualized = pd.read_sql_query("""SELECT location, weekly_expenses_avg FROM organic_lemonade_stand""", sql_database_connection)
print(data_frame_to_be_visualized)
        location  weekly_expenses_avg
0        Florida                  100
1     California                  300
2       New York                  200
3         Alaska                   50
4  Massachusetts                  250
  • Use matplotlib commands to construct and display the appropriate visual:
plt.bar(data_frame_to_be_visualized["location"], data_frame_to_be_visualized["weekly_expenses_avg"])
plt.xlabel("Location")
plt.ylabel("Average Weekly Expenses")
plt.title("Average Weekly Expenses by Location")
plt.show()

6.2.17 How to Close a SQL Database Connection

sql_database_connection.close()
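
If you want the connection to be closed automatically even when an error occurs, contextlib.closing can wrap it. This is an optional pattern, sketched here with a throwaway in-memory database; note that using a sqlite3 connection directly as a context manager manages transactions (commit/rollback), not closing:

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if the body raises.
with closing(sqlite3.connect(":memory:")) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT 1 + 1")
    result = cursor.fetchone()[0]
print(result)  # 2
```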

6.2.17.1 Further Resources:

6.2.17.2 Books:

  • Oracle Database 12C SQL by Matthew Morris
  • Geeks for Geeks SQL Manual
  • Learning SQL by Alan Beaulieu

6.2.17.3 Websites:

  • https://www.geeksforgeeks.org/python/python-sqlite-connecting-to-database/
  • https://docs.python.org/3/library/sqlite3.html
  • https://datacarpentry.github.io/python-ecology-lesson/instructor/09-working-with-sql.html
  • https://www.geeksforgeeks.org/python/python-sqlite/
  • https://datacarpentry.github.io/python-socialsci/instructor/14-sqlite.html

6.3 Web Scraping

This presentation was prepared by Sean Tessman.

6.3.1 Introduction

6.3.1.1 What is Web Scraping?

  • Web Scraping is essentially using code to automatically pull information from websites, rather than copy-pasting it by hand.
  • Websites organize their data using HTML, which is just the basic structure behind what you see on a webpage.
  • Scraping lets us take information from a website and turn it into an organized table of data that we can actually work with and analyze.

6.3.2 Why It Matters in Data Science

  • A lot of real-world data isn’t readily available as a downloadable CSV file.
  • Web scraping helps us to:
    • Gather information that’s publicly available online.
    • Create our own datasets from websites
    • Save us time by collecting data automatically
  • It helps turn raw information from the web into something we can analyze and visualize.

6.3.3 How Web Pages Work

6.3.4 HTML Basics

Web pages are structured using HTML tags such as:

  • <div> –> Used to group sections of a page
  • <p> –> Used for paragraphs of text
  • <a> –> Used for links
  • <table> –> Used to display data in rows and columns

When we scrape a website, we look for the specific tag that contains the information we want and pull that content out.
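
As a small illustration of pulling content out of specific tags, here is a hypothetical HTML snippet parsed with BeautifulSoup (introduced in a later section):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment using the tags listed above.
html = """
<div>
  <p>Opening hours: 9-5</p>
  <a href="https://example.com">More info</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
paragraph_text = soup.find("p").get_text()  # text inside the <p> tag
link_target = soup.find("a")["href"]        # href attribute of the <a> tag
print(paragraph_text)  # Opening hours: 9-5
print(link_target)     # https://example.com
```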


6.3.5 Static vs Dynamic Websites

6.3.5.1 Static websites (easier to scrape)

  • The data is already in the HTML you downloaded.
  • When you request the page, the content you need is usually right there.
  • Tools like requests, pandas.read_html(), and BeautifulSoup work well.

6.3.5.2 Dynamic websites (harder to scrape)

  • The page loads, then JavaScript pulls in the data after.
  • The first HTML you download might be missing the info you want.
  • You may need Selenium/Playwright, or sometimes an API that the page is calling.

6.3.5.3 Why this matters

If a site is dynamic, your scraper might “work” but return empty results.


6.3.6 The Basic Workflow

  1. Pick a website
    Make sure the data is public and allowed to be collected.

  2. Download the page
    Send a request to get the website’s HTML.

  3. Extract the data
    Pull the specific tables, text, or elements you need.

  4. Clean and analyze
    Organize the data into a usable format and create visuals or summaries.
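
Steps 2–4 above can be sketched in a few lines. To keep the sketch self-contained, a small canned HTML table stands in for a real download; in practice the HTML would come from requests.get(url).text:

```python
import pandas as pd
from io import StringIO

# Stand-in for the downloaded page (step 2). The city names and numbers
# are made up for illustration.
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
  <tr><td>Total</td><td>55000</td></tr>
</table>
"""

# Step 3: extract the table into a DataFrame.
df = pd.read_html(StringIO(html))[0]

# Step 4: clean - drop the summary row so only real cities remain.
df_clean = df[df["city"] != "Total"]
print(df_clean["population"].sum())  # 55000
```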


6.3.7 Using BeautifulSoup

6.3.7.1 What is BeautifulSoup

BeautifulSoup is a Python tool that helps you look through and pull information from a webpage’s HTML code. While pandas.read_html() is useful for grabbing whole tables, BeautifulSoup is better when you want specific things like text, links, or section titles. It works by organizing the HTML into a structure that makes it easier to search and extract exactly what you need.

6.3.7.2 Why use BeautifulSoup?

  • Get text that isn’t in a table
  • Grab links or other small pieces of information
  • Look for specific parts of a page, like div, p, or span
  • Work with more complicated pages

It’s most helpful when the data isn’t neatly organized in a table.

For structured tables –> use pandas.read_html() 

For flexible HTML parsing –> use BeautifulSoup


6.3.8 Python Setup

6.3.8.1 Required Libraries

import requests
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

6.3.9 Example: Scraping a Public Table (Wikipedia)

6.3.9.1 Step 1: Choose a Page

We’ll scrape a public population table from Wikipedia:

url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
url
'https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population'

6.3.9.2 Step 2: Request the Page

headers = {
    "User-Agent": "DataSciencePresentaion-WebScraping"
}

response = requests.get(url, headers=headers, timeout=30)
response.status_code
200
6.3.9.2.1 What The Code Does:
  • Sets up a header for the website to know who is making the request

  • Sends a request to the webpage

  • Waits up to 30 seconds for a response

  • Saves the webpage’s response in a variable called response

  • Prints the status code from the website


6.3.9.3 Step 3: Extract the Table into a DataFrame

Instead of going through each table row and cell one by one, we can use pandas.read_html() to automatically grab the whole table from the webpage.

from io import StringIO

tables = pd.read_html(StringIO(response.text))
len(tables)
6

Wikipedia pages usually have several tables on them. We’ll just pick the main sortable table that has the “State or territory” column in it.

df = None
for t in tables:
    cols = [str(c).lower() for c in t.columns]
    if any("state or territory" in c for c in cols):
        df = t
        break

df.head()
State or territory Census population[8][9][a] House seats[b] Pop. per elec. vote (2020)[c] Pop. per seat (2020)[a] % US (2020) % EC (2020)
State or territory July 1, 2025 (est.) April 1, 2020 Seats % Pop. per elec. vote (2020)[c] Pop. per seat (2020)[a] % US (2020) % EC (2020)
0 California 39355309.0 39538223 52 11.95% 732189 760350 11.800% 10.04%
1 Texas 31709821.0 29145505 38 8.74% 728638 766987 8.698% 7.43%
2 Florida 23462518.0 21538187 28 6.44% 717940 769221 6.428% 5.58%
3 New York 20002427.0 20201249 26 5.98% 721473 776971 6.029% 5.20%
4 Pennsylvania 13059432.0 13002700 17 3.91% 684353 764865 3.881% 3.53%

6.3.9.4 Step 4: Clean the Data

This table usually includes a few “summary” rows that aren’t actual states. We’ll remove them and clean the population column.

df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

state_candidates = [c for c in df.columns if "state" in c and "territory" in c]
if not state_candidates:
    state_candidates = [c for c in df.columns if "state" in c]

if not state_candidates:
    raise ValueError(f"Couldn't find a state column. Columns are: {df.columns.tolist()}")

state_col = state_candidates[0]

pop_candidates = [c for c in df.columns if "population" in c]
if not pop_candidates:
    raise ValueError(f"Couldn't find a population column. Columns are: {df.columns.tolist()}")

pop_col = pop_candidates[0]

df["population"] = (
    df[pop_col].astype(str)
      .str.replace(",", "", regex=False)
)

df["population"] = pd.to_numeric(df["population"], errors="coerce")

df["state_clean"] = (
    df[state_col].astype(str).str.strip().str.lower().str.replace(" ", "_", regex=False)
)

bad_rows = {"the_50_states_and_d.c.", "the_50_states", "contiguous_united_states"}
df_clean = df[~df["state_clean"].isin(bad_rows)].copy()

df_clean = df_clean.dropna(subset=["population"])

df_clean[[state_col, "population"]].head()
('state_or_territory',_'state_or_territory') population
0 California 39355309.0
1 Texas 31709821.0
2 Florida 23462518.0
3 New York 20002427.0
4 Pennsylvania 13059432.0

This section cleans and prepares the data so it’s ready to analyze. It standardizes the column names, finds the correct state and population columns, and converts the population values into numbers. It also removes summary rows and any missing values so we are only working with actual states and usable data.


6.3.10 Example Visualization

6.3.10.1 Top 10 States by Population

top10 = df_clean.sort_values("population", ascending=False).head(10)

plt.figure(figsize=(10, 6))

plt.barh(top10[state_col].astype(str)[::-1], top10["population"][::-1])

plt.title("Top 10 U.S. States by Population")
plt.xlabel("Population")
plt.ylabel("")
plt.tight_layout()
plt.show()


6.3.11 Saving the Data

6.3.11.1 Export to CSV

df_clean.to_csv("scraped_population_table.csv", index=False)
"Saved: scraped_population_table.csv"
'Saved: scraped_population_table.csv'

6.3.12 Scraping Text (Not Just Tables)

6.3.12.1 Example: Pull a Few Headings with BeautifulSoup

This shows that scraping isn’t just for tables.

soup = BeautifulSoup(response.text, "html.parser")

headings = [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])]

headings[:10]
['Contents',
 'Method',
 'Electoral apportionment',
 'House of Representatives',
 'Electoral College',
 'State and territory rankings',
 'Summary of population by region',
 'See also',
 'Notes',
 'References']

6.3.13 Scraping Responsibly

  • Look at the website’s robots.txt file and its Terms of Service, and follow their rules before scraping.
  • Don’t send too many requests too quickly –> add pauses if you’re looping.
  • Stick to publicly available information.
  • Be considerate and make sure your scraping doesn’t slow the site down.
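
The “add pauses” advice can be sketched with time.sleep. The URL list here is hypothetical, and the fetch itself is left as a placeholder comment:

```python
import time

# Hypothetical list of pages; requests.get(url) would go where the
# placeholder comment is.
urls = ["https://example.com/page1", "https://example.com/page2"]

start = time.monotonic()
for url in urls:
    # ... fetch and parse the page here ...
    time.sleep(0.5)  # pause between requests so we don't hammer the server
elapsed = time.monotonic() - start
print(round(elapsed, 1))  # roughly 1.0 second for two requests
```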

6.3.14 Limitations

  • Websites can change their layout, which can cause your scraping code to stop working.
  • Some sites load content using JavaScript, so you might need a tool like Selenium or an API instead.
  • There are legal and ethical rules you have to follow when scraping.

6.3.15 Conclusion

  • Web scraping helps you collect public data from websites automatically.
  • It’s useful when the data isn’t already in a CSV.
  • Tables can be pulled with pandas.read_html(), and text/links can be pulled with BeautifulSoup.
  • APIs are usually more stable when they exist.
  • Scraping can break if a website changes, so it needs maintenance.

6.3.16 References

Wikipedia contributors. (2025). List of U.S. states and territories by population. Wikipedia.
https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

CCS Learning Academy. (2024). What is web scraping?
https://www.ccslearningacademy.com/what-is-web-scraping/

Real Python. (2024). Beautiful Soup: Build a web scraper with Python.
https://realpython.com/beautiful-soup-web-scraper-python/