Visualizing various types of data

Lecture 5

Dr. Greg Chism

University of Arizona
INFO 511 - Spring 2025

Setup

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins

# Load the penguins dataset
penguins = load_penguins()

# Set theme
sns.set_theme(style="whitegrid")

Data Visualization

Examining data visualization

Discuss the following for the visualization in the #lecture-discussions Slack Channel.

Source: Twitter

Violin plots

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
plt.title('Violin Plot of Body Mass by Species')
plt.show()

Multiple geoms

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=False, color='black')
plt.title('Violin Plot with Points of Body Mass by Species')
plt.show()

Multiple geoms

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, color='black')
plt.title('Violin Plot with Points of Body Mass by Species')
plt.show()

Multiple geoms + aesthetics

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species')
plt.title('Violin Plot with Jittered Points and Color by Species')
plt.legend(title='Species')
plt.show()

Multiple geoms + aesthetics

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins)
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species')
plt.title('Violin Plot with Jittered Points, Color by Species, and No Legend')
plt.legend(title='Species').remove()
plt.show()

Multiple geoms + aesthetics

Code
plt.figure(figsize=(8, 6))
sns.violinplot(x="species", y="body_mass_g", data=penguins, palette='colorblind')
sns.stripplot(x="species", y="body_mass_g", data=penguins, jitter=True, hue='species', palette='colorblind')
plt.title('Violin Plot with Jittered Points, Color by Species, No Legend, and Colorblind Palette')
plt.legend(title='Species').remove()
plt.show()

Visualizing various types of data

The way data is displayed matters

What do these three plots show?

Source: #barbarplots

Visualizing penguins

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins

penguins = load_penguins()

penguins.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

Artwork by @allison_horst.

Univariate analysis

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
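The slides that follow cover the numerical case; for the categorical case, a bar plot of category counts could be sketched as below. This is a minimal sketch, not part of the original deck: the `toy` DataFrame is a handful of made-up rows standing in for the penguins data, and `sns.countplot` is one of several ways to draw the bars.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the penguins data (made-up rows, NOT the real dataset)
toy = pd.DataFrame(
    {"species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]}
)

# Tabulate how many observations fall in each category
counts = toy["species"].value_counts()
print(counts)

# Draw one bar per category, with bar height equal to the count
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(x="species", data=toy, ax=ax)
ax.set_title("Bar Plot of Species Counts (toy data)")
ax.set_xlabel("Species")
ax.set_ylabel("Count")
# Call plt.show() here to display the figure when running as a script
```

With the real data, replacing `toy` with `penguins` should give the same kind of plot.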

Histogram

Code
plt.figure(figsize=(8, 6))
sns.histplot(penguins['body_mass_g'], bins=30)
plt.title('Histogram of Penguin Body Mass')
plt.xlabel('Body Mass (g)')
plt.ylabel('Count')
plt.show()

Boxplot

Code
plt.figure(figsize=(8, 6))
sns.boxplot(y=penguins['body_mass_g'])
plt.title('Boxplot of Penguin Body Mass')
plt.ylabel('Body Mass (g)')
plt.show()

Density plot

Code
plt.figure(figsize=(10, 6))
sns.kdeplot(penguins['body_mass_g'], fill=True)
plt.title('Density Plot of Penguin Body Mass')
plt.xlabel('Body Mass (g)')
plt.ylabel('Density')
plt.show()

Bivariate analysis

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
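Two of the combinations listed above have no example in the slides that follow, so here is a rough sketch of both: a scatterplot for two numerical variables (with color as an extra aesthetic) and a stacked bar plot for two categorical variables. The `toy` DataFrame is made up for illustration and only mimics the penguins column names; building the stacked bars from `pd.crosstab` is one possible approach, not necessarily the one used in the course.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up rows that mimic the penguins columns (NOT the real dataset)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Biscoe", "Dream", "Dream"],
    "flipper_length_mm": [181.0, 190.0, 215.0, 220.0, 195.0, 200.0],
    "body_mass_g": [3750.0, 3600.0, 5000.0, 5400.0, 3700.0, 3900.0],
})

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))

# Numerical + numerical: scatterplot, with species mapped to color
sns.scatterplot(
    x="flipper_length_mm", y="body_mass_g", hue="species", data=toy, ax=ax_scatter
)
ax_scatter.set_title("Body Mass vs. Flipper Length (toy data)")
ax_scatter.set_xlabel("Flipper Length (mm)")
ax_scatter.set_ylabel("Body Mass (g)")

# Categorical + categorical: cross-tabulate counts, then stack the bars
table = pd.crosstab(toy["island"], toy["species"])
table.plot(kind="bar", stacked=True, ax=ax_bar)
ax_bar.set_title("Species Counts by Island (toy data)")
ax_bar.set_xlabel("Island")
ax_bar.set_ylabel("Count")

fig.tight_layout()
# Call plt.show() here to display the figure when running as a script
```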

Side-by-side box plots

Code
plt.figure(figsize=(8, 6))
sns.boxplot(x="species", y="body_mass_g", data=penguins)
plt.title('Side-by-side Box Plots of Body Mass by Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (g)')
plt.show()

Density plots

Code
plt.figure(figsize=(8, 6))
sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True)
plt.title('Density Plot of Body Mass by Species')
plt.xlabel('Body Mass (g)')
plt.ylabel('Density')
plt.show()

Multivariate analysis

Bechdel Test

The Bechdel test, also known as the Bechdel-Wallace test, is a measure of the representation of women in film and other fiction. The test asks whether a work features at least two female characters who have a conversation about something other than a man. Some versions of the test also require that those two female characters have names.

Load Bechdel test data

Load the Bechdel test data with pd.read_csv()

bechdel = pd.read_csv("data/bechdel.csv")

List the column names of the bechdel data with list() and .columns:

list(bechdel.columns)
['title', 'year', 'gross_2013', 'budget_2013', 'roi', 'binary', 'clean_test']

ROI by test result

What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

Code
plt.figure(figsize=(8, 4))
sns.stripplot(x='roi', y='clean_test', hue='binary', data=bechdel)
plt.title('ROI by Bechdel Test Result')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()

Movies with high ROI

What are the movies with highest ROI?

high_roi_movies = bechdel[bechdel['roi'] > 400][['title', 'roi', 'budget_2013', 'gross_2013', 'year', 'clean_test']]
print(high_roi_movies)
                        title         roi  budget_2013   gross_2013  year  \
703       Paranormal Activity  671.336857       505595  339424558.0  2007   
1319  The Blair Witch Project  648.065333       839077  543776715.0  1999   
1575              El Mariachi  583.285665        11622    6778946.0  1992   

     clean_test  
703     dubious  
1319         ok  
1575    nowomen  
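The three rows printed above suggest that `roi` is simply `gross_2013` divided by `budget_2013`. This is an inference from the printed values, not a documented definition of the column, and the check below re-enters just those three rows by hand to test it.

```python
import pandas as pd

# The three rows printed above, re-entered by hand
rows = pd.DataFrame({
    "title": ["Paranormal Activity", "The Blair Witch Project", "El Mariachi"],
    "budget_2013": [505595, 839077, 11622],
    "gross_2013": [339424558.0, 543776715.0, 6778946.0],
    "roi": [671.336857, 648.065333, 583.285665],
})

# Assumed definition: ROI = gross / budget (both in 2013 dollars)
rows["roi_check"] = rows["gross_2013"] / rows["budget_2013"]

# Compare the recomputed ratio with the printed roi values
rows["abs_diff"] = (rows["roi_check"] - rows["roi"]).abs()
print(rows[["title", "roi", "roi_check", "abs_diff"]])
```

For these rows the recomputed ratio agrees with the printed `roi` to about three decimal places, so these films earned several hundred times their budgets, which is why they dominate the unzoomed plot.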

ROI by test result

Zoom in: What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

Code
plt.figure(figsize=(8, 4))
sns.boxplot(x='roi', y='clean_test', hue='binary', data=bechdel)
plt.xlim(0, 15)
plt.title('Zoomed in ROI by Bechdel Test Result')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()

Sneak preview…



to next week’s data wrangling pipelines…

Median ROI

median_roi = bechdel['roi'].median()
print(f"Median ROI: {median_roi}")
Median ROI: 3.9051317558839456

Median ROI by test result

median_roi_by_test = bechdel.groupby('clean_test')['roi'].median().reset_index()
print(median_roi_by_test)
  clean_test       roi
0    dubious  3.795816
1        men  3.964457
2     notalk  3.688260
3    nowomen  3.265901
4         ok  4.211049

ROI by test result with median line

What does this plot say about return-on-investment on movies that pass the Bechdel test?

Code
plt.figure(figsize=(8, 4))
sns.boxplot(x='roi', y='clean_test', hue='binary', data=bechdel)
# Dashed red line at 4.21, the median ROI of the 'ok' group from the table above
plt.axvline(x=4.21, color='red', linestyle='--')
plt.xlim(0, 15)
plt.title('ROI by Bechdel Test Result with Median Line')
plt.xlabel('ROI')
plt.ylabel('Bechdel Test Result')
plt.legend(title='Binary')
plt.show()
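The reference line above is typed in by hand from the median table. As a hedged alternative, the value could be computed from the data so the line stays correct if the data change. The sketch below uses a made-up `toy` DataFrame with the same column names as `bechdel`; the `"PASS"`/`"FAIL"` labels in `binary` are an assumption, since the slides do not show that column's values.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up rows with the bechdel column names (NOT the real dataset);
# the "PASS"/"FAIL" labels are assumed for illustration
toy = pd.DataFrame({
    "clean_test": ["ok", "ok", "ok", "notalk", "notalk", "nowomen"],
    "binary": ["PASS", "PASS", "PASS", "FAIL", "FAIL", "FAIL"],
    "roi": [2.0, 4.0, 9.0, 1.0, 3.0, 5.0],
})

# Median ROI among films that pass the test, computed rather than hard-coded
median_ok = toy.loc[toy["clean_test"] == "ok", "roi"].median()

fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(x="roi", y="clean_test", hue="binary", data=toy, ax=ax)

# Vertical reference line at the computed median
median_line = ax.axvline(x=median_ok, color="red", linestyle="--")

ax.set_title("ROI by Bechdel Test Result with Median Line (toy data)")
ax.set_xlabel("ROI")
ax.set_ylabel("Bechdel Test Result")
# Call plt.show() here to display the figure when running as a script
```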

🔗 datasciaz.netlify.app
