import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler
import numpy as np
from nycflights13 import flights
AE 04: NYC flights + data preprocessing
Application exercise
Exercise 1 - Load data
Fill in the blanks:
# add code here
The flights data frame has __ rows. Each row represents a __.
Exercise 2 - Data cleaning
Remove rows with missing values in the arr_delay and distance columns.
What are the names of the variables in flights?
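As a reference for the cleaning step, here is a minimal sketch on a made-up toy frame (not the flights data). It assumes that "missing" means NaN/None and uses `DataFrame.dropna` with `subset` so that only the listed columns decide which rows are dropped. The name `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for flights; the values are made up for illustration.
toy = pd.DataFrame({
    "arr_delay": [11.0, np.nan, -5.0, 20.0],
    "distance": [1400.0, 1416.0, np.nan, 762.0],
    "carrier": ["UA", "UA", "AA", None],
})

# Drop rows where either listed column is missing.
# Missing values in other columns (carrier) do not cause a drop.
toy_clean = toy.dropna(subset=["arr_delay", "distance"])

print(toy_clean.shape)             # (rows, columns) after cleaning
print(toy_clean.columns.tolist())  # variable names
```

The same pattern should carry over to flights by swapping in the real data frame.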
# add code here
Exercise 3 - Original Data Distribution
- Plot the original distributions of arr_delay and distance.
# add code here
Exercise 4 - Check for Skewness
- Calculate and print the skewness of arr_delay and distance.
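For orientation, skewness can be read off a pandas Series with `.skew()`. The sketch below uses two made-up Series (hypothetical names and values) to show how the sign behaves: a long right tail gives a positive value, and symmetric data gives a value near zero.

```python
import pandas as pd

# Made-up data: one long right tail, one symmetric.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 50])
symmetric = pd.Series([1, 2, 3, 4, 5])

skew_right = right_tailed.skew()
skew_sym = symmetric.skew()

print(f"right-tailed skewness: {skew_right:.2f}")  # positive
print(f"symmetric skewness:    {skew_sym:.2f}")    # approximately zero
```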
# add code here
Exercise 5 - Scaling
- Check the summary statistics of arr_delay and distance to see if scaling is necessary.
# add code here
# add code here
- Question: Do arr_delay and distance need to be scaled? Why?
Add response here.
- Apply standard scaling, maximum absolute scaling, and min-max scaling to the transformed arr_delay and distance.
- Hint: assign the scaled columns with the framework df_clean.loc[:, ['arr_delay_minmax', 'distance_minmax']] to prevent errors.
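The `.loc` framework from the hint can be sketched on a small made-up frame. This is an illustration under assumptions, not the exercise solution: `toy` and its values are hypothetical, and the `_std` and `_maxabs` suffixes are naming choices made here to mirror the `_minmax` suffix in the hint.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Toy stand-in for df_clean; the values are made up for illustration.
toy = pd.DataFrame({
    "arr_delay": [-10.0, 0.0, 30.0],
    "distance": [100.0, 500.0, 900.0],
})
cols = ["arr_delay", "distance"]

# Each scaler is fit on the two original columns. Assigning through
# .loc[:, [...]] adds the new columns to the frame itself.
toy.loc[:, ["arr_delay_std", "distance_std"]] = StandardScaler().fit_transform(toy[cols])
toy.loc[:, ["arr_delay_maxabs", "distance_maxabs"]] = MaxAbsScaler().fit_transform(toy[cols])
toy.loc[:, ["arr_delay_minmax", "distance_minmax"]] = MinMaxScaler().fit_transform(toy[cols])

print(toy.round(2))
```

Standard scaling centers each column at 0 with unit variance, maximum absolute scaling divides by the largest absolute value, and min-max scaling maps each column onto [0, 1].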
# add code here
- Question: What are two pros and two cons of standardizing data?
Add response here.
Exercise 6 - Transformation
- Check the summary statistics again with your min-max scaled columns.
# add code here
# add code here
- Question: Why should you use the min-max scaled data instead of a different scaling for the transformations (hint: especially the log transformation)?
Add response here.
Apply a log transformation to arr_delay if it is positively skewed and apply a square root transformation to distance if it is negatively skewed (use if/else statements).
Hint: Logical operators in Python:
operator       definition
<              is less than?
<=             is less than or equal to?
>              is greater than?
>=             is greater than or equal to?
==             is exactly equal to?
!=             is not equal to?
x and y        is x AND y?
x or y         is x OR y?
pd.isna(x)     is x NA?
~pd.isna(x)    is x not NA?
x in y         is x in y?
x not in y     is x not in y?
not x          is not x? (only makes sense if x is True or False)
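The conditional logic described above could be structured as in the following sketch. It runs on a made-up column that is already on a [0, 1] scale; the names `toy`, `x_minmax`, `x_transformed`, and `applied` are hypothetical, and the choice to leave the column unchanged when skewness is exactly zero is an assumption made here.

```python
import numpy as np
import pandas as pd

# Made-up column on a [0, 1] scale with a long right tail.
toy = pd.DataFrame({"x_minmax": [0.0, 0.01, 0.02, 0.05, 1.0]})

skew_x = toy["x_minmax"].skew()

if skew_x > 0:
    # Positively skewed: log transformation, computed as log(1 + x).
    toy["x_transformed"] = np.log1p(toy["x_minmax"])
    applied = "log1p"
elif skew_x < 0:
    # Negatively skewed: square root transformation.
    toy["x_transformed"] = np.sqrt(toy["x_minmax"])
    applied = "sqrt"
else:
    # No skew: keep the values unchanged (an assumption for this sketch).
    toy["x_transformed"] = toy["x_minmax"]
    applied = "none"

print(f"skewness = {skew_x:.2f}, applied: {applied}")
```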
# add code here
- Question: Why do we have to add a constant when we perform a log or square-root transformation (e.g., np.log(df['column'] + 1), which is what np.log1p(df['column']) computes)?
Add response here.
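As a note on the notation in the question, `np.log1p(x)` computes log(1 + x), so the constant is already built into the function and should not be added a second time. A minimal check on made-up values (the array `x` is hypothetical):

```python
import numpy as np

# Made-up non-negative values, such as a min-max scaled column.
x = np.array([0.0, 0.5, 1.0])

shifted_log = np.log(x + 1)   # add the constant, then take the log
log1p_values = np.log1p(x)    # the same computation in one call

print(np.allclose(shifted_log, log1p_values))
```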