7  Data Wrangling & Visualization

7.1 Data Manipulation with Pandas

This presentation was prepared by Jinha.

7.1.1 What is Pandas?

  • Pandas is a Python library for data manipulation and analysis.
  • It is designed to work with structured or tabular data.
  • It provides flexible and efficient data structures.
  • It is widely used in data science and research.

7.1.2 Why is Data Manipulation Important?

  • Real-world data is often incomplete or inconsistent.
  • Datasets may contain missing values or duplicate entries.
  • Raw data is usually not ready for analysis.
  • Poor data quality can lead to incorrect conclusions.
  • Data manipulation improves accuracy and reliability.

7.1.3 Why Use Pandas for Data Manipulation?

  • Pandas provides powerful and intuitive tools for cleaning data.
  • It simplifies complex data operations into a few lines of code.
  • It is more efficient and scalable than spreadsheet software.
  • It integrates smoothly with other Python libraries.
  • It supports reproducible and automated workflows.

7.1.4 Core Concepts

7.1.4.1 DataFrame and Series

  • A DataFrame is a two-dimensional table with rows and columns.
  • A Series is a one-dimensional labeled array.
  • Each row represents an observation.
  • Each column represents a variable.
  • DataFrames allow us to perform operations across rows and columns.
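As a minimal sketch (the column names here are illustrative), a DataFrame can be built from a dictionary, and each of its columns is a Series:

```python
import pandas as pd

# Build a small DataFrame: each key becomes a column (variable),
# each row is one observation.
df = pd.DataFrame({
    "age": [18, 21, 19],
    "score": [85, 92, 78],
})

print(type(df).__name__)           # DataFrame
print(type(df["score"]).__name__)  # Series: a single column
print(df.shape)                    # (3, 2): 3 rows, 2 columns
```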

7.1.4.2 Importing Data

  • Pandas allows us to import data from various file formats.
  • The most common format is CSV.
  • After importing, the data is stored in a DataFrame.
import pandas as pd

df = pd.read_csv("data.csv")
df.head()

7.1.5 Key Operations

7.1.5.1 Selecting Data

  • Selecting allows us to focus on specific variables.
  • This reduces unnecessary complexity in the dataset.
  • It improves clarity and efficiency.
df["score"]

df[["age", "score"]]
  • df["score"] selects a single column.
  • df[["age", "score"]] selects multiple columns.

7.1.5.2 Filtering Rows

  • Filtering allows us to select observations that meet specific conditions.
  • This is useful when we want to analyze a specific group.
  • For example, we may only want students above a certain score.
  • Filtering helps reduce noise in the dataset.
df[df["score"] > 80]
  • df[df["score"] > 80] selects rows where score is greater than 80.
  • We can also combine multiple conditions.
df[(df["score"] > 80) & (df["age"] > 18)]
  • df[(df["score"] > 80) & (df["age"] > 18)] applies multiple conditions at the same time.

Filtering helps focus on relevant observations.

7.1.5.3 Grouping and Aggregation

  • Grouping allows us to divide data into categories.
  • Aggregation summarizes values within each group.
  • Common functions include mean, sum, and count.
df.groupby("gender")["score"].mean()

df.groupby("region")["sales"].sum()
  • df.groupby("gender")["score"].mean() groups data by gender and calculates the average score.
  • df.groupby("region")["sales"].sum() groups data by region and calculates total sales.
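Several summaries can also be applied in one call with .agg. A small sketch with toy data (the column names and values are illustrative):

```python
import pandas as pd

# Toy dataset for illustration
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "score": [90, 80, 70, 60],
})

# mean, sum, and count for each group in a single call
summary = df.groupby("gender")["score"].agg(["mean", "sum", "count"])
print(summary)
```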

7.1.5.4 Handling Missing Data

  • Missing values are common in real datasets.
  • They must be handled before analysis.
  • Pandas provides built-in functions for detecting and treating missing data.
  • Unhandled missing values may bias statistical results.
df.isna().sum()

df.dropna()

df.fillna(0)
  • df.isna() flags missing values; chaining .sum() counts them per column.
  • df.dropna() removes rows with missing values.
  • df.fillna(0) replaces missing values with a specified value (in this case, 0).

7.1.5.5 Sorting Data

  • Sorting helps organize data in ascending or descending order.
  • It is useful for identifying highest or lowest values.
df.sort_values("score", ascending=False)
  • df.sort_values("score", ascending=False) sorts the dataset by score in descending order.

7.1.5.6 Common Mistakes in Data Manipulation

  • Ignoring missing values
  • Filtering too aggressively
  • Misinterpreting grouped results
  • Not checking for data types
  • Assuming the data is clean without verification

7.1.5.7 How to Avoid Common Mistakes

  • Always check for missing values before analysis
  • Filter carefully and review the remaining data
  • Verify group sizes when using groupby
  • Inspect data types using df.dtypes
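A quick sketch of the data-type check (the column names are illustrative; numeric columns arriving as strings is a common pitfall with CSV files):

```python
import pandas as pd

# Toy data: "age" arrives as strings, as often happens when reading CSVs
df = pd.DataFrame({
    "age": ["18", "21", "19"],
    "score": [85, 92, 78],
})

print(df.dtypes)                      # "age" shows up as object, not a number
df["age"] = pd.to_numeric(df["age"])  # convert before doing arithmetic
print(df["age"].mean())
```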

7.1.6 Example Workflow

7.1.6.1 From Raw Data to Summary

  1. Load the dataset.
  2. Inspect the structure of the data.
  3. Handle missing values.
  4. Filter relevant observations.
  5. Group and summarize results.
df = pd.read_csv("data.csv")

df = df.dropna()

filtered = df[df["score"] > 80]

summary = filtered.groupby("gender")["score"].mean()

summary

7.1.7 Why Pandas Is Important

7.1.7.1 Applications

  • Data preprocessing
  • Exploratory data analysis
  • Preparing datasets for machine learning
  • Academic research and industry analytics

Pandas is typically the first step in a data science workflow, as it allows us to clean and prepare our data for further analysis.

7.1.8 Conclusion

7.1.8.1 Summary

This presentation covered:

  • The role of Pandas in data manipulation and analysis
  • How Pandas works with structured tabular data
  • Key operations such as selecting, filtering, grouping, and handling missing values
  • Common mistakes and best practices in data manipulation

Effective data manipulation leads to more reliable and accurate analysis, which is crucial for making informed decisions based on data.

7.1.8.2 Further Reading

For more information, see the official Pandas documentation at pandas.pydata.org.

Thank you for reading.

7.2 Grammar of Graphics

This presentation was prepared by Joseph Landolphi.

This presentation explains what the Grammar of Graphics is, why it is important in modern data science, and how it is used in sports analytics such as baseball.

7.2.1 Introduction

Data visualization is a fundamental part of modern data science. Analysts must interpret large datasets and communicate insights clearly. Instead of thinking about visualization as simply choosing a chart type, the Grammar of Graphics provides a structured framework for building visualizations from components.

7.2.2 What is the Grammar of Graphics?

The Grammar of Graphics was introduced by Leland Wilkinson and describes visualization as a layered system. Rather than asking “what chart should I use?”, analysts ask:

  • What data am I using?
  • What variables should be mapped to visual properties?
  • What geometric shapes should represent observations?
  • What scales and coordinate systems improve interpretation?

This approach makes visualization systematic and reproducible.
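As a rough sketch with toy data: matplotlib does not implement the grammar directly (ggplot2 is the canonical implementation), but each call below can be read as an answer to one of the four questions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Data: what am I using? (toy values for illustration)
data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 5, 8]})

fig, ax = plt.subplots()
ax.scatter(data["x"], data["y"])  # geometry: points represent observations
ax.set_xlabel("x")                # mapping: variable x -> horizontal position
ax.set_ylabel("y")                # mapping: variable y -> vertical position
ax.set_yscale("log")              # scale: how values translate to positions
ax.set_title("One plot, four grammar decisions")
plt.show()
```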

7.2.3 Why the Grammar of Graphics Matters in Modern Data Science

Key benefits include:

  • Structured thinking about visualization
  • Reproducibility
  • Ability to layer information
  • Scalability for large datasets
  • Integration into dashboards and automated workflows

Many modern visualization libraries are influenced by this framework.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

data = pd.DataFrame({
    "exit_velocity": np.random.normal(90, 5, 300),
    "launch_angle": np.random.normal(15, 10, 300),
    "hit_distance": np.random.normal(380, 30, 300),
})

data.head()
   exit_velocity  launch_angle  hit_distance
0      92.483571      6.710050    402.709658
1      89.308678      9.398190    352.335040
2      93.238443     22.472936    406.088178
3      97.615149     21.103703    420.669136
4      88.829233     14.790984    392.403047

7.2.4 Component 1: Data

Every visualization begins with a dataset.

Modern baseball tracking systems collect variables such as:

  • Exit velocity
  • Launch angle
  • Hit distance
  • Pitch location
  • Player identity

These variables form the foundation of visual analysis.

plt.figure()
plt.scatter(data["exit_velocity"], data["launch_angle"])
plt.xlabel("Exit Velocity (mph)")
plt.ylabel("Launch Angle (degrees)")
plt.title("Baseball Contact Profile")
plt.show()

7.2.5 Component 2: Aesthetic Mappings

Aesthetic mappings connect data variables to visual properties such as:

  • Position
  • Color
  • Size

In this example:

  • Exit velocity is mapped to the x-axis
  • Launch angle is mapped to the y-axis

Each point represents a batted ball.

plt.figure()
plt.scatter(
    data["exit_velocity"],
    data["launch_angle"],
    c=data["hit_distance"],
)
plt.xlabel("Exit Velocity")
plt.ylabel("Launch Angle")
plt.title("Contact Profile Colored by Hit Distance")
plt.colorbar(label="Hit Distance (ft)")
plt.show()

7.2.6 Component 3: Scales

Scales determine how data values are translated into visual values.

Here, hit distance is represented using a color gradient. This allows multiple variables to be displayed simultaneously, increasing information density.

plt.figure()
plt.scatter(data["launch_angle"], data["hit_distance"])
plt.xlabel("Launch Angle (degrees)")
plt.ylabel("Hit Distance (feet)")
plt.title("Relationship Between Launch Angle and Hit Distance")
plt.show()

7.2.7 Component 4: Geometric Objects

Geometric objects define the shapes used to represent data.

Examples include:

  • Points for scatter plots
  • Bars for comparisons
  • Lines for trends

This example adds a trend line to the raw observations.

x = data["exit_velocity"]
y = data["launch_angle"]

plt.figure()
plt.scatter(x, y)

# Fit a first-degree polynomial (a straight line) to the observations
z = np.polyfit(x, y, 1)
p = np.poly1d(z)

# Sort x before plotting so the trend line is drawn left to right
xs = np.sort(x)
plt.plot(xs, p(xs))
plt.title("Layered Visualization with Trend Line")
plt.show()

7.2.8 Component 5: Layering

One of the most powerful ideas in the Grammar of Graphics is layering.

Visualizations can combine:

  • Raw observations
  • Statistical summaries

This allows deeper insight into relationships between variables.

import plotly.express as px

fig = px.scatter_3d(
    data,
    x="exit_velocity",
    y="launch_angle",
    z="hit_distance",
    title="Interactive 3D Scatter Plot",
    opacity=0.8,
)

fig.show()
angle = np.random.uniform(-45, 45, 300)
distance = np.random.normal(350, 40, 300)

x = distance * np.cos(np.radians(angle))
y = distance * np.sin(np.radians(angle))

plt.figure()
plt.scatter(x, y)
plt.title("Baseball Spray Chart")
plt.axis("equal")
plt.show()

7.2.9 Component 6: Coordinate Systems

Coordinate systems determine how data are positioned in space.

In baseball analytics, spray charts help analysts study:

  • Hitting tendencies
  • Defensive positioning
  • Player development
pitch_x = np.random.normal(0, 0.8, 500)
pitch_y = np.random.normal(2.5, 0.7, 500)

plt.figure()
plt.hist2d(pitch_x, pitch_y, bins=30)
plt.colorbar(label="Pitch Density")
plt.title("Pitch Location Heatmap")
plt.show()

7.2.10 Modern Baseball Analytics Application

Heatmaps like this are used to analyze:

  • Pitch tendencies
  • Strike zone control
  • Batter weaknesses

Grammar of Graphics principles allow teams to build consistent visual tools for decision-making.

7.2.11 Conclusion

The Grammar of Graphics provides a structured framework for building visualizations.

7.2.11.1 Key Takeaways

  • Visualizations are built from modular components
  • Data variables are mapped into visual aesthetics
  • Geometric objects represent observations
  • Layering increases analytical depth
  • This framework is widely used in modern data science and sports analytics

By using these principles, analysts can transform complex datasets into actionable insights.