import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
degrees = pd.read_csv("data/degrees.csv")
AE 05: Wrangling College Majors
Goal
Our ultimate goal in this application exercise is to make the following data visualization.
Data
For this exercise you will work with data on the number of graduates from several degree programs between 2019 and 2023. The dataset you will use is in your `data/` folder and it’s called `degrees.csv`.
And let’s take a look at the data.
degrees.head()
| | degree | 2019 | 2020 | 2021 | 2022 | 2023 |
|---|---|---|---|---|---|---|
| 0 | Information Science & eSociety (BA) | 63.0 | 61.0 | 67.0 | 71 | 38 |
| 1 | Information Science (BS) | NaN | NaN | NaN | 16 | 57 |
| 2 | Information (PhD) | 2.0 | 3.0 | 1.0 | 1 | 1 |
| 3 | Library & Information Science (MA) | 47.0 | 57.0 | 72.0 | 42 | 58 |
| 4 | Information (MS) | 8.0 | 10.0 | 13.0 | 5 | 2 |
Pivoting
- Pivot the `degrees` data frame longer such that each row represents a degree type / year combination, and `year` and `n` (the number of graduates for that year) are columns in the data frame.
# add your code here
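One possible approach uses `DataFrame.melt()`. The sketch below runs on a small stand-in frame typed in from the rows shown by `degrees.head()`, so it does not depend on the CSV file. The name `degrees_long` is an assumption, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Stand-in for data/degrees.csv, typed in from the degrees.head() output.
degrees = pd.DataFrame({
    "degree": [
        "Information Science & eSociety (BA)",
        "Information Science (BS)",
        "Information (PhD)",
        "Library & Information Science (MA)",
        "Information (MS)",
    ],
    "2019": [63.0, np.nan, 2.0, 47.0, 8.0],
    "2020": [61.0, np.nan, 3.0, 57.0, 10.0],
    "2021": [67.0, np.nan, 1.0, 72.0, 13.0],
    "2022": [71, 16, 1, 42, 5],
    "2023": [38, 57, 1, 58, 2],
})

# Keep `degree` as the identifier and stack the year columns
# into two new columns: `year` (old column label) and `n` (cell value).
degrees_long = degrees.melt(id_vars="degree", var_name="year", value_name="n")

print(degrees_long.head())
print(degrees_long.dtypes)
```

Inspecting `dtypes` after the pivot is a quick way to check what each new column holds before building on the result.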
- Question: What is the type of the `year` variable? Why? What should it be?
Add your response here.
- Start over with pivoting, and this time also make sure `year` is a numerical variable in the resulting data frame.
# add your code here
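A minimal sketch of the conversion, on a stand-in frame built from two of the rows shown by `degrees.head()`. Column labels read from a CSV header are strings, so the pivoted `year` values need an explicit cast; `astype(int)` inside `assign()` is one way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Stand-in for data/degrees.csv (two rows from the degrees.head() output).
degrees = pd.DataFrame({
    "degree": ["Information Science & eSociety (BA)", "Information Science (BS)"],
    "2019": [63.0, np.nan],
    "2020": [61.0, np.nan],
    "2021": [67.0, np.nan],
    "2022": [71, 16],
    "2023": [38, 57],
})

degrees_long = (
    degrees
    .melt(id_vars="degree", var_name="year", value_name="n")
    # The year labels are strings such as "2019"; cast them to integers.
    .assign(year=lambda df: df["year"].astype(int))
)

print(degrees_long.dtypes)
```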
- Question: What would an `NA` mean in this context? Hint: The data come from the university registrars, who have records on every single graduate, so there shouldn’t be anything “unknown” to them about who graduated when.
Add your response here.
- Add on to your pipeline that you started with pivoting and convert `NA`s in `n` to `0`s.
# add your code here
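One way to extend the pipeline is `Series.fillna(0)`. The sketch below uses a stand-in frame built from two of the rows shown by `degrees.head()`. The extra `astype(int)` on `n` is an optional assumption: once no values are missing, the counts can be stored as integers.

```python
import numpy as np
import pandas as pd

# Stand-in for data/degrees.csv (two rows from the degrees.head() output).
degrees = pd.DataFrame({
    "degree": ["Information Science & eSociety (BA)", "Information Science (BS)"],
    "2019": [63.0, np.nan],
    "2020": [61.0, np.nan],
    "2021": [67.0, np.nan],
    "2022": [71, 16],
    "2023": [38, 57],
})

degrees_long = (
    degrees
    .melt(id_vars="degree", var_name="year", value_name="n")
    .assign(
        year=lambda df: df["year"].astype(int),
        # Replace missing counts with 0, then store the counts as integers.
        n=lambda df: df["n"].fillna(0).astype(int),
    )
)

print(degrees_long[degrees_long["degree"] == "Information Science (BS)"])
```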
- In our plot the degree types are BA, BS, MA, MS, and PhD. This information is in our dataset, in the `degree` column, but this column also has additional characters we don’t need. Create a new column called `degree_type` with levels BA, BS, MA, MS, and PhD (in this order) based on `degree`. Do this by adding on to your pipeline from earlier.
# add your code here
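A sketch of one approach: pull the text between the parentheses with `Series.str.extract()` and wrap it in an ordered `pd.Categorical` so the levels keep the requested order. The stand-in frame uses the five degree names and two of the year columns shown by `degrees.head()`; the regular expression and the name `degree_levels` are assumptions made for this illustration.

```python
import pandas as pd

# Stand-in for data/degrees.csv (degree names and two year columns
# from the degrees.head() output).
degrees = pd.DataFrame({
    "degree": [
        "Information Science & eSociety (BA)",
        "Information Science (BS)",
        "Information (PhD)",
        "Library & Information Science (MA)",
        "Information (MS)",
    ],
    "2022": [71, 16, 1, 42, 5],
    "2023": [38, 57, 1, 58, 2],
})

degree_levels = ["BA", "BS", "MA", "MS", "PhD"]

degrees_long = (
    degrees
    .melt(id_vars="degree", var_name="year", value_name="n")
    .assign(
        year=lambda df: df["year"].astype(int),
        # Capture the text inside the parentheses, e.g. "(BA)" -> "BA",
        # and store it as an ordered categorical with the desired levels.
        degree_type=lambda df: pd.Categorical(
            df["degree"].str.extract(r"\((\w+)\)", expand=False),
            categories=degree_levels,
            ordered=True,
        ),
    )
)

print(degrees_long[["degree", "degree_type"]].drop_duplicates())
```

Any degree name without a parenthesized suffix would end up as a missing `degree_type`, so checking for missing values after this step is worthwhile on the full dataset.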
Now we start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.
Hint: Make sure to state `ci=None` when using `sns.lineplot()`.
# add your code here
- What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise?
Add your response here.
- Update the x-axis scale so that the years displayed run from 2019 to 2023, with one tick per year. Do this by adding on to your pipeline from earlier.
# add your code here
Update line colors using the following level / color assignments. Once again, do this by adding on to your pipeline from earlier.
BA: "#53868B"
BS: "#7AC5CD"
MA: "#89a285"
MS: "#8B814C"
PhD: "#CDBE70"
# add your code here
- Update the plot labels (`title`, `x`, and `y`) and use `sns.set_style("whitegrid")`. Once again, do this by adding on to your pipeline from earlier.
# add your code here