import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler
import numpy as np
from nycflights13 import flights
AE 04: NYC flights + data preprocessing
Application exercise
Exercise 1 - Load data
Fill in the blanks:
# add code here
The flights data frame has __ rows. Each row represents a __.
Exercise 2 - Data cleaning
Remove rows with missing values in the arr_delay and distance columns.
What are the names of the variables in flights?
# add code here
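One possible way to approach this step is sketched below on a small made-up data frame rather than on flights itself. The frame `toy`, its values, and the name `toy_clean` are hypothetical; the sketch assumes `DataFrame.dropna` with the `subset` argument is an acceptable tool here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for flights (values are made up).
toy = pd.DataFrame({
    "arr_delay": [11.0, np.nan, -25.0, 33.0],
    "distance": [1400.0, 1416.0, np.nan, 762.0],
    "carrier": ["UA", "UA", "AA", None],
})

# Drop only the rows where arr_delay or distance is missing;
# missing values in other columns (e.g. carrier) are left alone.
toy_clean = toy.dropna(subset=["arr_delay", "distance"])

# Variable (column) names as a plain Python list.
column_names = toy_clean.columns.tolist()
print(len(toy_clean), column_names)
```

Restricting `subset` to the two columns of interest keeps rows that are incomplete elsewhere, which a bare `dropna()` would discard.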
Exercise 3 - Original Data Distribution
- Plot the original distributions of arr_delay and distance.
# add code here
Exercise 4 - Check for Skewness
- Calculate and print the skewness of arr_delay and distance.
# add code here
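As a rough illustration of the skewness check, the sketch below uses `Series.skew()` from pandas on hypothetical toy values (the frame `toy` and its numbers are made up, not drawn from flights). A positive result points to a longer right tail, a negative one to a longer left tail.

```python
import pandas as pd

# Hypothetical toy values: arr_delay has one large value (right tail),
# distance is perfectly symmetric around its mean.
toy = pd.DataFrame({
    "arr_delay": [0.0, 1.0, 2.0, 3.0, 50.0],
    "distance": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Series.skew() returns the sample skewness of each column.
skew_arr_delay = toy["arr_delay"].skew()
skew_distance = toy["distance"].skew()

print(f"arr_delay skewness: {skew_arr_delay:.2f}")
print(f"distance skewness: {skew_distance:.2f}")
```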
Exercise 5 - Scaling
- Check the summary statistics of arr_delay and distance to see if scaling is necessary.
# add code here
# add code here
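A minimal sketch of the summary-statistics check, assuming `DataFrame.describe()` is a suitable tool. The frame `toy` and its values are hypothetical. Comparing the ranges of the two columns is one way to see whether they sit on very different scales.

```python
import pandas as pd

# Hypothetical toy values on deliberately different scales.
toy = pd.DataFrame({
    "arr_delay": [-10.0, 0.0, 5.0, 120.0],
    "distance": [200.0, 800.0, 1400.0, 2500.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = toy[["arr_delay", "distance"]].describe()
print(summary)

# Range (max - min) of each column, for a quick scale comparison.
value_range = summary.loc["max"] - summary.loc["min"]
print(value_range)
```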
- Question: Do arr_delay and distance need to be scaled? Why?
Add response here.
- Apply standard scaling, maximum absolute scaling, and min-max scaling to the cleaned arr_delay and distance.
  - Hint: use the framework df_clean.loc[:, ['arr_delay_minmax', 'distance_minmax']] when assigning the scaled columns to prevent errors.
# add code here
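The three scalers can be applied in a loop, as in the sketch below. It runs on a hypothetical frame `toy_clean` with made-up values, and the column suffixes `_standard` and `_maxabs` are assumptions modeled on the `_minmax` names in the hint. The `.loc[:, [...]]` assignment follows the hint's pattern.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Hypothetical stand-in for the cleaned data (values are made up).
toy_clean = pd.DataFrame({
    "arr_delay": [-10.0, 0.0, 10.0, 30.0],
    "distance": [200.0, 400.0, 600.0, 1000.0],
})
cols = ["arr_delay", "distance"]

# Suffix -> scaler; the suffix names are an assumption for illustration.
scalers = {
    "standard": StandardScaler(),  # zero mean, unit variance
    "maxabs": MaxAbsScaler(),      # divide by the largest absolute value
    "minmax": MinMaxScaler(),      # map to the interval [0, 1]
}

for suffix, scaler in scalers.items():
    new_cols = [f"{c}_{suffix}" for c in cols]
    # fit_transform returns a 2-D NumPy array, one column per input column.
    toy_clean.loc[:, new_cols] = scaler.fit_transform(toy_clean[cols])

print(toy_clean.round(3))
```

Each scaler is fitted on both columns at once but scales every column independently, so arr_delay and distance do not influence each other's result.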
- Question: What are two pros and two cons of standardizing data?
Add response here.
Exercise 6 - Transformation
- Check the summary statistics again with your min-max scaled columns.
# add code here
# add code here
- Question: Why should you use the min-max scaled data instead of a different scaling for the transformations (hint: especially the log transformation)?
Add response here.
Apply a log transformation to arr_delay if it is positively skewed and apply a square root transformation to distance if it is negatively skewed (use if/else statements).

Hint: Logical operators in Python:

operator       definition
<              is less than?
<=             is less than or equal to?
>              is greater than?
>=             is greater than or equal to?
==             is exactly equal to?
!=             is not equal to?
x and y        is x AND y?
x or y         is x OR y?
pd.isna(x)     is x NA?
~pd.isna(x)    is x not NA?
x in y         is x in y?
x not in y     is x not in y?
not x          is not x? (only makes sense if x is True or False)
# add code here
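The conditional logic described above might look like the sketch below. It uses hypothetical min-max scaled toy columns with made-up values, and the output column names `arr_delay_transformed` and `distance_transformed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical min-max scaled columns in [0, 1] (values are made up).
toy = pd.DataFrame({
    "arr_delay_minmax": [0.0, 0.02, 0.05, 0.1, 1.0],  # long right tail
    "distance_minmax": [0.0, 0.9, 0.95, 0.98, 1.0],   # long left tail
})

arr_skew = toy["arr_delay_minmax"].skew()
dist_skew = toy["distance_minmax"].skew()

# Log-transform arr_delay only if it is positively skewed.
# np.log1p(x) computes log(1 + x).
if arr_skew > 0:
    toy.loc[:, "arr_delay_transformed"] = np.log1p(toy["arr_delay_minmax"])
else:
    toy.loc[:, "arr_delay_transformed"] = toy["arr_delay_minmax"]

# Square-root-transform distance only if it is negatively skewed.
if dist_skew < 0:
    toy.loc[:, "distance_transformed"] = np.sqrt(toy["distance_minmax"])
else:
    toy.loc[:, "distance_transformed"] = toy["distance_minmax"]

print(toy.round(3))
```

Each `else` branch copies the column unchanged, so the transformed columns exist whichever way the skewness check goes.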
- Question: Why do we have to add a constant when we perform a log or square-root transformation (e.g., np.log(df['column'] + 1), which is what np.log1p(df['column']) computes)?
Add response here.