AE 10: Modelling fish

Suggested answers

Application exercise
Answers
Important

These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.

For this application exercise, we will work with data on fish. The dataset we will use, called fish, is on two common fish species in fish market sales.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

fish = pd.read_csv("data/fish.csv")
print(fish.head())
  species  weight  length_vertical  length_diagonal  length_cross   height  \
0   Bream     242             23.2             25.4          30.0  11.5200   
1   Bream     290             24.0             26.3          31.2  12.4800   
2   Bream     340             23.9             26.5          31.1  12.3778   
3   Bream     363             26.3             29.0          33.5  12.7300   
4   Bream     430             26.5             29.0          34.0  12.4440   

    width  
0  4.0200  
1  4.3056  
2  4.6961  
3  4.4555  
4  5.1340  

The data dictionary is below:

variable description
species Species name of fish
weight Weight, in grams
length_vertical Vertical length, in cm
length_diagonal Diagonal length, in cm
length_cross Cross length, in cm
height Height, in cm
width Diagonal width, in cm

Visualizing the model

We’re going to investigate the relationship between the weights and heights of fish.

  • Create an appropriate plot to investigate this relationship. Add appropriate labels to the plot.
plt.figure(figsize=(8, 6))
sns.scatterplot(x='height', y='weight', data=fish, alpha=0.5)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (g)")
plt.title("Relationship between Heights and Weights of Fish")
plt.show()

  • If you were to draw a a straight line to best represent the relationship between the heights and weights of fish, where would it go? Why?

    Start from the bottom and go up Identify the first and last point and draw a line through most the others.

    # Fit a linear model
    X = fish[['height']]
    y = fish['weight']
    model = LinearRegression()
    model.fit(X, y)
    
    # Predict values
    predictions = model.predict(X)
    
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x='height', y='weight', data=fish, alpha=0.5)
    plt.plot(fish['height'], predictions, color="#325b74", linewidth=2)
    plt.xlabel("Height (cm)")
    plt.ylabel("Weight (g)")
    plt.title("Relationship between Heights and Weights of Fish with Linear Regression Line")
    plt.show()

    • What types of questions can this plot help answer?

      Is there a relationship between fish heights and weights of fish?

    We can use this line to make predictions. Predict what you think the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm. Which prediction is considered extrapolation?

    • At 10 cm, we estimate a weight of 375 grams. At 15 cm, we estimate a weight of 600 grams At 20 cm, we estimate a weight of 975 grams. 20 cm would be considered extrapolation.

    • What is a residual?

      Difference between predicted and observed.

Model fitting

  • Fit a model to predict fish weights from their heights.
model = LinearRegression()
model.fit(X, y)

fish['predicted_weight'] = model.predict(X)
fish['residual'] = fish['weight'] - fish['predicted_weight']
  • Predict what the weight of a fish would be with a height of 10 cm, 15 cm, and 20 cm using this model.
heights = np.array([[10], [15], [20]])
predicted_weights = model.predict(heights)
for height, weight in zip(heights, predicted_weights):
    print(f"Predicted weight for height {height[0]} cm: {weight:.2f} g")
Predicted weight for height 10 cm: 320.74 g
Predicted weight for height 15 cm: 625.32 g
Predicted weight for height 20 cm: 929.90 g
  • Calculate predicted weights for all fish in the data and visualize the residuals under this model.
fish['predicted_weight'] = model.predict(X)
fish['residual'] = fish['weight'] - fish['predicted_weight']

plt.figure(figsize=(8, 6))
sns.scatterplot(x='height', y='residual', data=fish, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Height (cm)")
plt.ylabel("Residual (g)")
plt.title("Residuals of the Linear Regression Model")
plt.show()

Model summary

  • Display the model summary including estimates for the slope and intercept along with measurements of uncertainty around them. Show how you can extract these values from the model output.
slope = model.coef_[0]
intercept = model.intercept_
print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
Slope: 60.92
Intercept: -288.42
  • Write out your model using mathematical notation.

\(\widehat{weight} = -288 + 60.9 \times height\)

Correlation

We can also assess correlation between two quantitative variables.

correlation = fish['height'].corr(fish['weight'])
print(f"Correlation between height and weight: {correlation:.2f}")
Correlation between height and weight: 0.95

Adding a third variable

  • Does the relationship between heights and weights of fish change if we take into consideration species? Plot two separate straight lines for the Bream and Roach species.
plt.figure(figsize=(8, 6))
sns.lmplot(x='height', y='weight', hue='species', data=fish, markers=["o", "s"], ci=None, palette="muted", height=6, aspect=1.5)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (g)")
plt.title("Relationship between Heights and Weights of Fish by Species")
plt.show()
<Figure size 768x576 with 0 Axes>

Fitting other models

  • We can fit more models than just a straight line. Change the sns.regplot() code you previously used to include lowess=True. What is different from the plot created before?
plt.figure(figsize=(8, 6))
sns.scatterplot(x='height', y='weight', data=fish, alpha=0.5)
sns.regplot(x='height', y='weight', data=fish, lowess=True, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.xlabel("Height (cm)")
plt.ylabel("Weight (g)")
plt.title("Relationship between Heights and Weights of Fish with LOESS Smoothing")
plt.show()