AE 05: Wrangling College Majors

Suggested answers

Important

These are suggested answers. This document should be used as a reference only; it’s not designed to be an exhaustive key.

Goal

Our ultimate goal in this application exercise is to make the following data visualization.

Data

For this exercise you will work with data on the number of students graduating from College of Information Science degree programs between 2019 and 2023. The dataset you will use is in your data/ folder and it’s called degrees.csv.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

infosci = pd.read_csv("data/degrees.csv")

And let’s take a look at the data.

infosci.head()
                                degree  2019  2020  2021  2022  2023
0  Information Science & eSociety (BA)  63.0  61.0  67.0    71    38
1             Information Science (BS)   NaN   NaN   NaN    16    57
2                    Information (PhD)   2.0   3.0   1.0     1     1
3   Library & Information Science (MA)  47.0  57.0  72.0    42    58
4                     Information (MS)   8.0  10.0  13.0     5     2

Pivoting

  • Pivot the infosci data frame longer such that each row represents a degree / year combination, and year and the number of graduates for that year are columns in the data frame.
infosci_long = infosci.melt(id_vars='degree', var_name='year', value_name='n')

infosci_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   degree  30 non-null     object 
 1   year    30 non-null     object 
 2   n       24 non-null     float64
dtypes: float64(1), object(2)
memory usage: 852.0+ bytes
  • Question: What is the type of the year variable? Why? What should it be?

It’s an object variable because its values came from the column names of the original data frame, and pandas cannot know that these character strings represent years. The variable type should be numeric (int64).
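To see why this happens, here is a minimal sketch on a small made-up data frame (the names toy and toy_long and the values are illustrative, not the real data). melt() copies the original column labels into the new year column, and since those labels are strings, the values stay strings until they are converted.

```python
import pandas as pd

# Toy data frame (made-up values) whose year columns have string labels,
# as they do after reading a CSV header.
toy = pd.DataFrame({
    "degree": ["A (BA)", "B (BS)"],
    "2019": [1.0, None],
    "2020": [2.0, 3.0],
})

toy_long = toy.melt(id_vars="degree", var_name="year", value_name="n")

# The former column labels are now values in "year", still stored as strings.
print(toy_long["year"].tolist())
print(toy_long["year"].dtype)

# Converting them gives an integer column.
toy_long["year"] = pd.to_numeric(toy_long["year"])
print(toy_long["year"].dtype)
```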

  • Start over with pivoting, and this time also make sure year is a numerical variable in the resulting data frame.
infosci_long = infosci.melt(id_vars='degree', var_name='year', value_name='n')
infosci_long['year'] = pd.to_numeric(infosci_long['year'])

infosci_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   degree  30 non-null     object 
 1   year    30 non-null     int64  
 2   n       24 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 852.0+ bytes
  • Question: What does a NaN mean in this context? Hint: The data come from the university registrar, who has records on every single graduate, so there shouldn’t be anything “unknown” to them about who graduated when.

NaNs should actually be 0s: a missing value here means that no students graduated with that degree in that year.

  • Add on to your pipeline that you started with pivoting and convert NaNs in n to 0s.
infosci_long = infosci.melt(id_vars='degree', var_name='year', value_name='n')
infosci_long['year'] = pd.to_numeric(infosci_long['year'])
infosci_long['n'] = infosci_long['n'].fillna(0)

infosci_long.isna().sum()
degree    0
year      0
n         0
dtype: int64
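The n column is stored as float64 because NaN cannot live in a plain integer column, and fillna(0) keeps that dtype. Since n is a count, one optional extra step (not part of the exercise, so treat it as a suggestion) is to cast it to integers after filling. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up counts with a missing value; the NaN forces the column to float64.
n = pd.Series([63.0, None, 2.0], name="n")

# Fill the gap with 0, then cast to integers since these are counts.
n_filled = n.fillna(0).astype(int)

print(n_filled.tolist())  # [63, 0, 2]
```

The cast has to come after fillna(0), because astype(int) raises an error while NaNs are still present.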
  • In our plot the degree types are BA, BS, MA, MS, and PhD. This information is in our dataset, in the degree column, but this column also has additional characters we don’t need. Create a new column called degree_type with levels BA, BS, MA, MS, and PhD (in this order) based on degree. Do this by adding on to your pipeline from earlier.
infosci_long = infosci.melt(id_vars='degree', var_name='year', value_name='n')
infosci_long['year'] = pd.to_numeric(infosci_long['year'])
infosci_long['n'] = infosci_long['n'].fillna(0)

# Raw strings (r'...') keep the regex backslashes from being read as invalid string escapes
infosci_long[['major', 'degree_type']] = infosci_long['degree'].str.split(r' \(', expand=True)
infosci_long['degree_type'] = infosci_long['degree_type'].str.replace(r'\)', '', regex=True)

degree_order = pd.CategoricalDtype(categories=["BA", "BS", "MA", "MS", "PhD"], ordered=True)
infosci_long['degree_type'] = infosci_long['degree_type'].astype(degree_order)

infosci_long.head()
                                degree  year     n                           major  degree_type
0  Information Science & eSociety (BA)  2019  63.0  Information Science & eSociety           BA
1             Information Science (BS)  2019   0.0             Information Science           BS
2                    Information (PhD)  2019   2.0                     Information          PhD
3   Library & Information Science (MA)  2019  47.0   Library & Information Science           MA
4                     Information (MS)  2019   8.0                     Information           MS
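As an alternative to splitting and then stripping the closing parenthesis, both pieces can be pulled out in one step with str.extract and named capture groups. This is only a sketch on made-up labels that follow the same "Major (TYPE)" pattern. The pattern assumes every label ends with a single parenthesized degree type.

```python
import pandas as pd

# Made-up labels following the "Major (TYPE)" pattern seen in the degree column.
degree = pd.Series(["Information Science (BS)", "Information (PhD)"])

# Two named groups: the text before " (", and the text inside the parentheses.
parts = degree.str.extract(r"^(?P<major>.+) \((?P<degree_type>[^)]+)\)$")

print(parts)
```

The named groups become the column names of the result, so parts has a major column and a degree_type column.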
  • Now we start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.
sns.set_style("darkgrid")

plt.figure(figsize=(8, 6))
# errorbar=None replaces ci=None, which is deprecated in seaborn 0.12 and later
sns.lineplot(data=infosci_long, x='year', y='n', hue='degree_type', errorbar=None, marker='o')
plt.legend()
plt.show()

  • What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise?

    • x-axis scale: ticks should fall on each whole year from 2019 to 2023

    • line colors

    • axis labels: title, x, y

    • theme

  • Update the x-axis scale so that it shows each year from 2019 to 2023 exactly once. Do this by adding on to your pipeline from earlier.

plt.figure(figsize=(8, 6))
sns.lineplot(data=infosci_long, x='year', y='n', hue='degree_type', errorbar=None, marker='o')
plt.xticks(ticks=infosci_long['year'].unique(), labels=infosci_long['year'].unique())
plt.legend()
plt.show()

  • Update line colors using the following level / color assignments. Once again, do this by adding on to your pipeline from earlier.

    • BA: “#53868B”

    • BS: “#7AC5CD”

    • MA: “#89a285”

    • MS: “#8B814C”

    • PhD: “#CDBE70”

custom_palette = {
    "BA": "#53868B",
    "BS": "#7AC5CD",
    "MA": "#89a285",
    "MS": "#8B814C",
    "PhD": "#CDBE70"
}

plt.figure(figsize=(8, 6))
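# Pass the palette so the custom colors are actually applied to the lines
sns.lineplot(data=infosci_long, x='year', y='n', hue='degree_type', errorbar=None, marker='o', palette=custom_palette)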
plt.xticks(ticks=infosci_long['year'].unique(), labels=infosci_long['year'].unique())
plt.legend()
plt.show()
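When a palette is given as a dictionary, seaborn needs a color for every level of the hue variable and, as far as we know, raises an error if a key is missing. A quick check like the sketch below can catch a mismatch before plotting. The name missing is illustrative; the colors and levels are copied from the exercise.

```python
import pandas as pd

# Colors copied from the exercise.
custom_palette = {
    "BA": "#53868B",
    "BS": "#7AC5CD",
    "MA": "#89a285",
    "MS": "#8B814C",
    "PhD": "#CDBE70",
}

# Same ordered levels as degree_type in the exercise.
degree_order = pd.CategoricalDtype(
    categories=["BA", "BS", "MA", "MS", "PhD"], ordered=True
)

# Levels that have no color assigned; an empty set means the palette is complete.
missing = set(degree_order.categories) - set(custom_palette)
print(missing)
```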

  • Update the plot labels (title, x, and y) and use sns.set_style("whitegrid"). Once again, do this by adding on to your pipeline from earlier.
sns.set_style("whitegrid")

custom_palette = {
    "BA": "#53868B",
    "BS": "#7AC5CD",
    "MA": "#89a285",
    "MS": "#8B814C",
    "PhD": "#CDBE70"
}

plt.figure(figsize=(8, 6))
sns.lineplot(data=infosci_long, x='year', y='n', hue='degree_type', errorbar=None, marker='o', palette=custom_palette)
plt.title('College of Information Science degrees over the years\nAcademic years 2019 - 2023')
plt.xticks(ticks=infosci_long['year'].unique(), labels=infosci_long['year'].unique())
plt.xlabel('Graduation year')
plt.ylabel('Number of students graduating')
plt.legend(title='Degree type')
plt.grid(True)
plt.show()