Exam 1 review

Lecture 9

Dr. Greg Chism

University of Arizona
INFO 511 - Fall 2024

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", font_scale=1.2)

Type coercion

Explicit vs. implicit type coercion

  • Explicit type coercion: You ask Python to change the type of a variable

  • Implicit type coercion: Python changes / makes assumptions for you about the type of a variable without you asking for it

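    • This happens because a Series has a single dtype: when values of different types are combined, pandas picks one dtype that can hold them all (for example, float64 for integers mixed with NaN, or object for numbers mixed with strings)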
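A minimal sketch of the difference, using small made-up Series (the names `ints`, `with_missing`, and `mixed` are illustrative and not from the lecture):

```python
import numpy as np
import pandas as pd

# Implicit coercion: pandas picks one dtype for the whole Series.
ints = pd.Series([1, 2, 3])               # all integers -> int64
with_missing = pd.Series([1, 2, np.nan])  # NaN is a float, so the integers become float64
mixed = pd.Series([1, "a", True])         # no common numeric type -> object

print(ints.dtype, with_missing.dtype, mixed.dtype)

# Explicit coercion: you request the conversion yourself.
as_text = ints.astype(str)
print(as_text.tolist())
```

The first three conversions happen without being asked for; only `astype(str)` is a conversion you requested.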

Vectors

  • A vector is a collection of values

    • A pandas Series has a single dtype for all of its values (mixed values fall back to the generic object dtype)

    • Python lists can contain values of different types

  • Why do we care? Because each column of a DataFrame is a Series with a single dtype.

df = pd.DataFrame({
    'x': [1, 2, 3],          # numeric (int)
    'y': ['a', 'b', 'c'],    # string
    'z': [True, False, True] # boolean
})
df
x y z
0 1 a True
1 2 b False
2 3 c True
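To see which type pandas assigned to each column, you can inspect `dtypes`. The sketch below rebuilds the same DataFrame so that it runs on its own, and contrasts it with a plain Python list (the name `row` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c'],
    'z': [True, False, True],
})

# One dtype per column: each column is a Series with a single dtype.
print(df.dtypes)

# A plain Python list, by contrast, keeps each element's own type.
row = [1, 'a', True]
row_types = [type(v).__name__ for v in row]
print(row_types)
```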

Explicit coercion

✅ From numeric to character

df['x_new'] = df['x'].astype(str)
df
x y z x_new
0 1 a True 1
1 2 b False 2
2 3 c True 3

Explicit coercion

❌ From character to numeric

df['y_new'] = pd.to_numeric(df['y'], errors='coerce')
df
x y z x_new y_new
0 1 a True 1 NaN
1 2 b False 2 NaN
2 3 c True 3 NaN
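The `errors='coerce'` argument matters here. As a short sketch (with an illustrative Series named `letters`), the default behavior of `pd.to_numeric` is to raise an error on values it cannot parse, while `errors='coerce'` turns them into NaN:

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'])

# Default behavior (errors='raise'): unparseable values raise a ValueError.
try:
    pd.to_numeric(letters)
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric failed:", exc)

# errors='coerce' replaces unparseable values with NaN instead of failing.
coerced = pd.to_numeric(letters, errors='coerce')
print(coerced.isna().all())
```

Either way the letters do not become meaningful numbers; the only choice is whether the failure is loud (an exception) or quiet (NaN).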

Implicit coercion

Which columns had their types implicitly coerced?

df = pd.DataFrame({
    'w': [1, 2, 3],
    'x': ['a', 'b', 4],
    'y': ['c', 'd', np.nan],
    'z': [5, 6, np.nan],
})
df
w x y z
0 1 a c 5.0
1 2 b d 6.0
2 3 4 NaN NaN
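One way to check your answer is to print the dtypes. The sketch below rebuilds the DataFrame so it runs on its own; `element_types` is an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'w': [1, 2, 3],
    'x': ['a', 'b', 4],
    'y': ['c', 'd', np.nan],
    'z': [5, 6, np.nan],
})

# Column-level dtypes chosen by pandas.
print(df.dtypes)

# Element-level Python types inside the mixed column 'x'.
element_types = df['x'].map(lambda v: type(v).__name__).tolist()
print(element_types)
```

Column `z` shows the clearest case: the integers 5 and 6 are stored as floats because NaN is a float. Column `x` is stored with the generic object dtype, where each element keeps its own Python type.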

Collecting data

Suppose you conduct a survey and ask students for their student ID number and the number of credits they're taking this semester. What is the type of each variable?

survey_raw = pd.DataFrame({
    'student_id': [273674, 298765, 287129, "I don't remember"],
    'n_credits': [4, 4.5, "I'm not sure yet", "2 - underloading"]
})
survey_raw
student_id n_credits
0 273674 4
1 298765 4.5
2 287129 I'm not sure yet
3 I don't remember 2 - underloading
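Before cleaning, it can help to find the responses that prevent a column from being numeric. A possible sketch, assuming the same `survey_raw` as above (the names `parsed` and `problem_values` are illustrative):

```python
import pandas as pd

survey_raw = pd.DataFrame({
    'student_id': [273674, 298765, 287129, "I don't remember"],
    'n_credits': [4, 4.5, "I'm not sure yet", "2 - underloading"]
})

# Both columns mix numbers and strings, so neither is numeric.
print(survey_raw.dtypes)

# Values that fail numeric parsing become NaN; use that to locate them.
parsed = pd.to_numeric(survey_raw['n_credits'], errors='coerce')
problem_values = survey_raw.loc[parsed.isna(), 'n_credits'].tolist()
print(problem_values)
```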

Cleaning data

survey = survey_raw.copy()
survey['student_id'] = survey['student_id'].replace("I don't remember", np.nan)
survey['n_credits'] = survey['n_credits'].replace({
    "I'm not sure yet": np.nan,
    "2 - underloading": "2"
})
survey['n_credits'] = pd.to_numeric(survey['n_credits'])
survey
student_id n_credits
0 273674.0 4.0
1 298765.0 4.5
2 287129.0 NaN
3 NaN 2.0

Cleaning data – alternative

survey = survey_raw.copy()
survey['student_id'] = pd.to_numeric(survey['student_id'], errors='coerce')
survey['n_credits'] = pd.to_numeric(survey['n_credits'], errors='coerce')
survey
student_id n_credits
0 273674.0 4.0
1 298765.0 4.5
2 287129.0 NaN
3 NaN NaN
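Note that this alternative discards the "2 - underloading" response, which the manual cleaning kept as 2. A middle ground, sketched here under the assumption that a response beginning with a number should count as that number, is to extract the leading number with a regular expression before converting (the pattern and the name `leading_number` are illustrative choices):

```python
import pandas as pd

survey_raw = pd.DataFrame({
    'student_id': [273674, 298765, 287129, "I don't remember"],
    'n_credits': [4, 4.5, "I'm not sure yet", "2 - underloading"]
})
survey = survey_raw.copy()

# Pull a leading integer or decimal out of each response, if there is one.
leading_number = (
    survey['n_credits']
    .astype(str)
    .str.extract(r'^\s*(\d+(?:\.\d+)?)', expand=False)
)

# Responses without a leading number are missing and stay NaN.
survey['n_credits'] = pd.to_numeric(leading_number, errors='coerce')
print(survey['n_credits'].tolist())
```

Whether "2 - underloading" should really be recorded as 2 credits is a judgment call about the data, not something the code can decide for you.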

Recap: Type coercion

  • If a column in a DataFrame has multiple types of values, pandas will coerce them into a single dtype, which may or may not be what you want.

  • If the default is not what you want, use explicit coercion functions such as pd.to_numeric() or astype() to get the types you need. This generally also involves cleaning up the values that caused the unwanted implicit coercion in the first place.
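As an example of an unwanted default, the cleaned student IDs earlier print as 273674.0 because NaN forces a float column. One option, not covered in the slides, is the pandas nullable integer dtype `Int64`. A minimal sketch (the name `ids` is illustrative):

```python
import pandas as pd

ids = pd.Series([273674, 298765, 287129, "I don't remember"])

# errors='coerce' yields float64 because the missing value is stored as NaN.
as_float = pd.to_numeric(ids, errors='coerce')

# The nullable 'Int64' dtype keeps whole numbers and marks the gap as <NA>.
as_nullable_int = as_float.astype('Int64')

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int.tolist())
```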

Aesthetic mappings

loan50 example DataFrame

loan50 = pd.read_csv("data/loan50.csv")
loan50.head()
state emp_length term homeownership annual_income verified_income debt_to_income total_credit_limit total_credit_utilized num_cc_carrying_balance loan_purpose loan_amount grade interest_rate public_record_bankrupt loan_status has_second_income total_income
0 NJ 3.0 60 rent 59000.0 Not Verified 0.557525 95131 32894 8 debt_consolidation 22000 B 10.90 0 Current False 59000.0
1 CA 10.0 36 rent 60000.0 Not Verified 1.305683 51929 78341 2 credit_card 6000 B 9.92 1 Current False 60000.0
2 SC NaN 36 mortgage 75000.0 Verified 1.056280 301373 79221 14 debt_consolidation 25000 E 26.30 0 Current False 75000.0
3 CA 0.0 36 rent 75000.0 Not Verified 0.574347 59890 43076 10 credit_card 6000 B 9.92 0 Current False 75000.0
4 OH 4.0 60 mortgage 254000.0 Not Verified 0.238150 422619 60490 2 home_improvement 25000 B 9.43 0 Current False 254000.0

Aesthetic mappings

What will the following code result in?

plt.figure(figsize=(8, 6))
sns.scatterplot(data=loan50, x='annual_income', y='interest_rate', hue='homeownership', style='homeownership', palette='colorblind')
plt.show()

Multiple plot layers

What will the following code result in?

plt.figure(figsize=(8, 6))
sns.scatterplot(data=loan50, x='annual_income', y='interest_rate', hue='homeownership', style='homeownership', palette='colorblind')
sns.lineplot(data=loan50, x='annual_income', y='interest_rate', hue='homeownership', legend=False, palette='colorblind')
plt.show()

Mapping vs. setting

What will the following code result in?

plt.figure(figsize=(8, 6))
sns.scatterplot(data=loan50, x='annual_income', y='interest_rate', hue='homeownership', palette='colorblind')
sns.lineplot(data=loan50, x='annual_income', y='interest_rate', color='red', legend=False)
plt.show()

Recap: Aesthetic mappings

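  • Aesthetic mappings defined at the local level (passed to a single plotting call, such as hue= in sns.scatterplot()) are used only by the layer they're defined for.

  • Setting a color (color='red') applies one fixed value that you choose, while mapping (hue='homeownership') assigns colors automatically based on the values of a variable.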

Aside: Legends

plt.figure(figsize=(8, 6))
sns.scatterplot(data=loan50, x='annual_income', y='interest_rate', hue='homeownership', style='homeownership')
plt.legend(title='Home ownership')
plt.show()

Categories

Categorical

  • Categorical variables are variables that have a fixed and known set of possible values. In pandas they are represented by the Categorical type (category dtype).
  • They are also useful when you want to display string values in a non-alphabetical order.
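A small sketch of the ordering point, using made-up size labels (`sizes` and `sizes_cat` are illustrative names):

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Plain strings sort alphabetically.
alphabetical = sorted(sizes)
print(alphabetical)

# A Categorical sorts in the order in which its categories were declared.
sizes_cat = pd.Series(pd.Categorical(
    sizes, categories=['small', 'medium', 'large'], ordered=True
))
by_category = sizes_cat.sort_values().tolist()
print(by_category)
```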

Bar plot

plt.figure(figsize=(8, 6))
sns.countplot(data=loan50, x='homeownership')
plt.show()

Bar plot - reordered

loan50['homeownership'] = pd.Categorical(loan50['homeownership'], categories=['mortgage', 'rent', 'own'])
plt.figure(figsize=(8, 6))
sns.countplot(data=loan50, x='homeownership')
plt.show()

Frequency table

loan50['homeownership'].value_counts()
homeownership
mortgage    26
rent        21
own          3
Name: count, dtype: int64
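By default `value_counts()` sorts by frequency. If you would rather follow the declared category order, or report proportions, a sketch on a small made-up Series (not the loan50 data) could look like this:

```python
import pandas as pd

# Made-up values, for illustration only.
ownership = pd.Series(pd.Categorical(
    ['rent', 'mortgage', 'rent', 'own', 'mortgage', 'rent'],
    categories=['mortgage', 'rent', 'own'],
))

by_count = ownership.value_counts()               # most frequent first
by_category = ownership.value_counts(sort=False)  # declared category order
shares = ownership.value_counts(sort=False, normalize=True)  # proportions

print(by_count)
print(by_category)
print(shares)
```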

Under the hood

print(type(loan50['homeownership']))
<class 'pandas.core.series.Series'>
print(loan50['homeownership'].dtype)
category
print(loan50['homeownership'].cat.categories)
Index(['mortgage', 'rent', 'own'], dtype='object')
print(loan50['homeownership'])
0         rent
1         rent
2     mortgage
3         rent
4     mortgage
5     mortgage
6         rent
7     mortgage
8         rent
9     mortgage
10        rent
11    mortgage
12        rent
13    mortgage
14        rent
15    mortgage
16        rent
17        rent
18        rent
19    mortgage
20    mortgage
21    mortgage
22    mortgage
23        rent
24    mortgage
25        rent
26    mortgage
27         own
28    mortgage
29    mortgage
30        rent
31    mortgage
32    mortgage
33        rent
34        rent
35         own
36    mortgage
37        rent
38    mortgage
39        rent
40    mortgage
41        rent
42        rent
43    mortgage
44    mortgage
45    mortgage
46    mortgage
47        rent
48         own
49    mortgage
Name: homeownership, dtype: category
Categories (3, object): ['mortgage', 'rent', 'own']
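Internally, a categorical column stores an integer code for each value plus the list of categories. A brief sketch with a made-up Series shows the codes and how reordering the categories changes the codes but not the values:

```python
import pandas as pd

# Made-up values, for illustration only.
ownership = pd.Series(pd.Categorical(
    ['rent', 'rent', 'mortgage', 'own'],
    categories=['mortgage', 'rent', 'own'],
))

# Each code is the position of the value in the category list.
codes = ownership.cat.codes.tolist()
print(codes)

# Reordering categories relabels the codes; the visible values stay the same.
reordered = ownership.cat.reorder_categories(['own', 'rent', 'mortgage'])
new_codes = reordered.cat.codes.tolist()
print(new_codes)
print(reordered.tolist())
```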

Recap: Categorical

  • The pandas.Categorical type is useful for dealing with categorical data and their categories (levels).

  • Categories and their order are relevant for displays (tables, plots) and they'll be relevant for modeling (later in the course).

  • Categorical is a data type in pandas; a column that stores it has dtype category.

Aside: ==

loan50['homeownership_new'] = loan50['homeownership'].apply(lambda x: "don't own" if x == 'rent' else x)
loan50[['homeownership', 'homeownership_new']].drop_duplicates()
homeownership homeownership_new
0 rent don't own
2 mortgage mortgage
27 own own
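The `==` comparison also works on a whole column at once, which avoids the row-by-row `apply`. A sketch of that vectorized form, assuming a plain string Series (made-up values) rather than the categorical column above:

```python
import pandas as pd

# Made-up values, for illustration only.
ownership = pd.Series(['rent', 'mortgage', 'own', 'rent'])

# == compares every element and returns a boolean Series.
is_rent = ownership == 'rent'
print(is_rent.tolist())

# where() keeps values where the condition is True and replaces the rest.
relabelled = ownership.where(~is_rent, "don't own")
print(relabelled.tolist())
```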

Aside: Filtering

loan50['homeownership_new'] = loan50['homeownership'].apply(lambda x: "don't own" if x in ['rent', 'mortgage'] else x)
loan50[['homeownership', 'homeownership_new']].drop_duplicates()
homeownership homeownership_new
0 rent don't own
2 mortgage don't own
27 own own
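The column-wise counterpart of `in` is `isin()`, which returns a boolean mask that can be used to filter rows. A minimal sketch on a made-up DataFrame (`toy` and its values are illustrative):

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'homeownership': ['rent', 'mortgage', 'own', 'rent'],
    'loan_amount': [6000, 25000, 10000, 22000],
})

# isin() tests each element for membership in the list.
not_owner = toy['homeownership'].isin(['rent', 'mortgage'])
print(not_owner.tolist())

# Filtering: keep only the rows where the mask is True.
filtered = toy[not_owner]
print(filtered)
```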

Other questions?