```python
# Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
import statsmodels.api as sm
import scipy.stats as stats

# Increase font size of all Seaborn plot elements
sns.set(font_scale=1.25)

# Set Seaborn theme
sns.set_theme(style="whitegrid")
```
Data preprocessing
Data preprocessing refers to the manipulation, filtration, or augmentation of data before it is analyzed. It is a crucial step in the data science process.
It’s essentially data cleaning.
Dataset
Human Freedom Index
The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through variables for many countries around the globe.
Question
What trends are there within human freedom indices in different regions?
```
      hf_score_above_threshold       countries
5                         True       Australia
27                        True          Canada
41                        True         Denmark
63                        True       Hong Kong
70                        True         Ireland
...                        ...             ...
1359                      True       Hong Kong
1403                      True     New Zealand
1407                      True          Norway
1436                      True     Switzerland
1450                      True  United Kingdom

[83 rows x 2 columns]
```
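The code that produced this table is not shown above; below is a minimal sketch of one way to build such a filtered view. The file path, the column name hf_score, and the 8.0 cutoff are illustrative assumptions, not values taken from the original analysis.

```python
# Hypothetical sketch: flag countries whose hf_score exceeds a chosen cutoff.
# The file path, the "hf_score" column name, and the 8.0 threshold are assumptions.
import pandas as pd

hfi = pd.read_csv("data/hfi.csv")  # assumed location of the Human Freedom Index data

hfi["hf_score_above_threshold"] = hfi["hf_score"] > 8.0

# Keep only the flagged rows, mirroring the output shown above
hfi.loc[hfi["hf_score_above_threshold"], ["hf_score_above_threshold", "countries"]]
```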
Pros + cons (StandardScaler)
Pros:
Normalization of Variance: Centers data around zero with a standard deviation of one, suitable for algorithms that assume roughly normally distributed inputs (e.g., linear regression, logistic regression, neural networks).
Preserves Relationships: Maintains ratios and differences between data points.
Cons:
Sensitive to Outliers: Outliers can distort scaled values.
Assumes Normality: Most meaningful when the data are approximately Gaussian; standardizing does not change the shape of a skewed distribution.
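A minimal sketch of StandardScaler on made-up values (not the HFI data), showing the zero-mean, unit-variance result described above:

```python
# Sketch: z-score scaling with StandardScaler on a toy column
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[6.2], [7.1], [8.4], [5.9], [9.0]])  # made-up score-like values

x_std = StandardScaler().fit_transform(x)

# Each value becomes (x - mean) / std, so the result has mean ~0 and std ~1
print(x_std.round(2))
print(x_std.mean().round(2), x_std.std().round(2))
```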
Pros + cons (MaxAbsScaler)
Pros:
Outlier Resistant: Less sensitive to outliers, scales based on the absolute maximum value.
Preserves Sparsity: Does not center data, preserving the sparsity pattern.
Cons:
Scale Limitation: Scales to the range [-1, 1], which may not suit all algorithms.
Not Zero-Centered: May be a limitation for algorithms preferring zero-centered data.
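A minimal sketch of MaxAbsScaler on a made-up column containing zeros, showing the [-1, 1] range and the preserved sparsity noted above:

```python
# Sketch: MaxAbsScaler divides each feature by its maximum absolute value,
# mapping values into [-1, 1] while leaving zeros (sparsity) untouched
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

x = np.array([[0.0], [-2.0], [4.0], [0.0], [8.0]])  # made-up sparse-ish column

x_scaled = MaxAbsScaler().fit_transform(x)

print(x_scaled.ravel())  # [ 0.   -0.25  0.5   0.    1.  ] -- zeros stay zero
```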
Pros + cons (MinMaxScaler)
Pros:
Fixed Range: Scales data to a fixed range (usually [0, 1]), beneficial for algorithms sensitive to feature scales (e.g., neural networks, k-nearest neighbors).
Preserves Relationships: Maintains relationships between data points.
Cons:
Sensitive to Outliers: Outliers can skew scaled values.
Range Dependence: Scaling depends on the min and max values in the training data, which may not generalize well to new data.
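A minimal sketch of MinMaxScaler on made-up values with one outlier, showing the [0, 1] range and the outlier sensitivity noted above:

```python
# Sketch: MinMaxScaler maps each feature to [0, 1] via (x - min) / (max - min);
# a single large outlier compresses the remaining values toward 0
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[5.0], [6.0], [7.0], [8.0], [100.0]])  # made-up values with one outlier

x_scaled = MinMaxScaler().fit_transform(x)

print(x_scaled.ravel())  # the outlier maps to 1.0; the other values are squeezed near 0
```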