Visualizing various types of data

Lecture 5

Dr. Greg Chism

University of Arizona
INFO 511

Setup

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins

# Load the penguins dataset
penguins = load_penguins()

# Set theme
sns.set_theme(style="whitegrid")

Data Visualization

Examining data visualization

Discuss the following for the visualization in the #lecture-discussions Slack Channel.

Violin plots

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
plt.title('Violin Plot of Body Mass by Species')
plt.show()

Multiple geoms

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=False, color='black')
plt.title('Violin Plot with Points of Body Mass by Species')
plt.show()

Multiple geoms

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, color='black')
plt.title('Violin Plot with Points of Body Mass by Species')
plt.show()

Multiple geoms + aesthetics

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species')
plt.title('Violin Plot with Jittered Points and Color by Species')
plt.legend(title='Species')
plt.show()

Multiple geoms + aesthetics

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species')
plt.title('Violin Plot with Jittered Points, Color by Species, and No Legend')
plt.legend(title='Species').remove()
plt.show()

Multiple geoms + aesthetics

plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins, palette='colorblind')
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species', palette='colorblind')
plt.title('Violin Plot with Jittered Points, Color by Species, No Legend, and Colorblind Palette')
plt.legend(title='Species').remove()
plt.show()

Visualizing various types of data

The way data is displayed matters

What do these three plots show?

Visualizing `penguins`

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins

penguins = load_penguins()

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

Univariate analysis

Analyzing a single variable:

Numerical: histogram, box plot, density plot, etc.
Categorical: bar plot, pie chart, etc.
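
For the categorical case in the list above, a bar plot of category counts is a common starting point. The following is a minimal sketch, not part of the lecture code: the `plot_category_counts` helper and the `toy` data frame are hypothetical stand-ins, and the rows in `toy` are made up so the block runs without the penguins data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(df, column):
    """Draw one bar per category, with height equal to the number of rows."""
    counts = df[column].value_counts()  # sorted from most to least frequent
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.bar(counts.index.tolist(), counts.to_numpy())
    ax.set_title(f"Bar Plot of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Count")
    return ax


# Made-up rows for illustration only; swap in the real `penguins` data frame.
toy = pd.DataFrame(
    {"species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"]}
)
ax = plot_category_counts(toy, "species")
```

Calling `plot_category_counts(penguins, "species")` followed by `plt.show()` should give the equivalent plot for the real data.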

Histogram

plt.figure(figsize=(8, 6))
sns.histplot(penguins['body_mass_g'], bins=30)
plt.title('Histogram of Penguin Body Mass')
plt.xlabel('Body Mass (g)')
plt.ylabel('Count')
plt.show()

Boxplot

plt.figure(figsize=(8, 6))
sns.boxplot(y=penguins['body_mass_g'])
plt.title('Boxplot of Penguin Body Mass')
plt.ylabel('Body Mass (g)')
plt.show()

Density plot

plt.figure(figsize=(10, 6))
sns.kdeplot(penguins['body_mass_g'], fill=True)
plt.title('Density Plot of Penguin Body Mass')
plt.xlabel('Body Mass (g)')
plt.ylabel('Density')
plt.show()

Bivariate analysis

Analyzing the relationship between two variables:

Numerical + numerical: scatterplot
Numerical + categorical: side-by-side box plots, violin plots, etc.
Categorical + categorical: stacked bar plots
Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
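
The categorical + categorical case in the list above can be sketched with a cross-tabulation followed by a stacked bar plot. This is one possible approach under stated assumptions, not the lecture's own code: the `toy` data frame is hypothetical and its rows are made up, standing in for two categorical columns such as `species` and `island` in `penguins`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; swap in the real `penguins` data frame.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"],
    }
)

# Count rows for every species/island combination
counts = pd.crosstab(toy["species"], toy["island"])

# One bar per species, split into stacked segments by island
ax = counts.plot(kind="bar", stacked=True, figsize=(8, 6))
ax.set_title("Stacked Bar Plot of Island by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")
ax.legend(title="Island")
```

Here the second variable is mapped to the fill color of each segment, which is the "aesthetic" idea from the last item in the list.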

Side-by-side box plots

plt.figure(figsize=(8, 6))
sns.boxplot(x="species", y="body_mass_g", data=penguins)
plt.title('Side-by-side Box Plots of Body Mass by Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()

Density plots

plt.figure(figsize=(8, 6))
sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True)
plt.title('Density Plot of Body Mass by Species')
plt.xlabel('Body Mass (g)')
plt.ylabel('Density')
plt.show()

Multivariate analysis

Bechdel Test

The Bechdel test, also known as the Bechdel-Wallace test, is a measure of the representation of women in film and other fiction. The test asks whether a work features at least two female characters who have a conversation about something other than a man. Some versions of the test also require that those two female characters have names.

Load Bechdel test data

Load the Bechdel test data with pd.read_csv():

bechdel = pd.read_csv("data/bechdel.csv")

List the column names of the bechdel data with list() and .columns:

list(bechdel.columns)

['title', 'year', 'gross_2013', 'budget_2013', 'roi', 'binary', 'clean_test']

ROI by test result

What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

plt.figure(figsize=(8, 4))
sns.stripplot(x='roi', y='clean_test', hue='binary', data=bechdel)
plt.title('ROI by Bechdel Test Result')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()

Movies with high ROI

Which movies have the highest ROI?

high_roi_movies = bechdel[bechdel['roi'] > 400][['title', 'roi', 'budget_2013', 'gross_2013', 'year', 'clean_test']]
print(high_roi_movies)

                        title         roi  budget_2013   gross_2013  year  \
703       Paranormal Activity  671.336857       505595  339424558.0  2007   
1319  The Blair Witch Project  648.065333       839077  543776715.0  1999   
1575              El Mariachi  583.285665        11622    6778946.0  1992   

     clean_test  
703     dubious  
1319         ok  
1575    nowomen
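
A note on scale may help when reading these numbers. The slides do not define `roi`, but the three rows printed above are consistent with it being gross divided by budget (both in 2013 dollars). The sketch below checks that assumption against those three rows only, copied by hand from the output; the `roi_check` column is a hypothetical name, and the check says nothing about the rest of the file.

```python
import pandas as pd

# The three rows printed above, copied by hand from the output
high_roi = pd.DataFrame(
    {
        "title": ["Paranormal Activity", "The Blair Witch Project", "El Mariachi"],
        "roi": [671.336857, 648.065333, 583.285665],
        "budget_2013": [505595, 839077, 11622],
        "gross_2013": [339424558.0, 543776715.0, 6778946.0],
    }
)

# Assumed definition: ROI = gross / budget
high_roi["roi_check"] = high_roi["gross_2013"] / high_roi["budget_2013"]

# Largest gap between the reported ROI and the recomputed one
max_gap = (high_roi["roi"] - high_roi["roi_check"]).abs().max()
print(high_roi[["title", "roi", "roi_check"]])
```

Under that reading, an ROI of 671 means the film grossed roughly 671 times its budget, which is why a handful of very low-budget films stretch the ROI axis so far.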

ROI by test result

Zoom in: What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

plt.figure(figsize=(8, 4))
sns.boxplot(x='roi', y='clean_test', hue='binary', data=bechdel)
plt.xlim(0, 15)
plt.title('Zoomed in ROI by Bechdel Test Result')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()

Sneak preview…

to next week’s data wrangling pipelines…

Median ROI

median_roi = bechdel['roi'].median()
print(f"Median ROI: {median_roi}")

Median ROI: 3.9051317558839456

Median ROI by test result

median_roi_by_test = bechdel.groupby('clean_test')['roi'].median().reset_index()
print(median_roi_by_test)

  clean_test       roi
0    dubious  3.795816
1        men  3.964457
2     notalk  3.688260
3    nowomen  3.265901
4         ok  4.211049

ROI by test result with median line

What does this plot say about return-on-investment on movies that pass the Bechdel test?

plt.figure(figsize=(8, 4))
sns.boxplot(x='roi', y='clean_test', hue='binary', data=bechdel)
# 4.21 is the median ROI of films that pass the test (clean_test == 'ok'), rounded from 4.211049
plt.axvline(x=4.21, color='red', linestyle='--')
plt.xlim(0, 15)
plt.title('ROI by Bechdel Test Result with Median Line')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()
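
The reference line in the code above is typed in by hand as 4.21. As a hedged alternative, the value could be looked up from the grouped medians so the line stays in sync with the data. The `median_roi_for` helper and the `toy` data frame below are hypothetical, and the rows in `toy` are made up so the block runs without the Bechdel file.

```python
import pandas as pd


def median_roi_for(df, result):
    """Return the median ROI among rows whose clean_test equals `result`."""
    medians = df.groupby("clean_test")["roi"].median()
    return float(medians.loc[result])


# Made-up rows for illustration only; swap in the real `bechdel` data frame.
toy = pd.DataFrame(
    {
        "clean_test": ["ok", "ok", "ok", "nowomen", "nowomen"],
        "roi": [2.0, 4.0, 9.0, 1.0, 3.0],
    }
)
ok_median = median_roi_for(toy, "ok")
print(ok_median)  # 4.0
```

With the real data, `median_roi_for(bechdel, "ok")` could be passed to `plt.axvline(x=...)` in place of the literal value.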
