Flood Risk Prediction Challenge
Overview
Flooding is one of the most common and destructive natural disasters in the United States, causing billions of dollars in damage each year. Accurate flood risk prediction is essential for disaster preparedness, resource allocation, and mitigation efforts.
This project challenges you to analyze a real-world dataset of environmental and geospatial features and build a machine learning model that predicts flood risk scores for various regions. You will apply your skills in data cleaning, exploratory data analysis (EDA), feature engineering, and predictive modeling to solve this practical problem.
Objectives
In this project, you will:
Explore and Analyze: Conduct exploratory data analysis to uncover patterns and relationships in the dataset.
Clean and Prepare: Preprocess the data to handle missing values, outliers, and inconsistencies.
Engineer Features: Create new features to improve model performance.
Build a Predictive Model: Train a regression model to predict flood risk scores.
Evaluate and Communicate: Assess your model’s performance using metrics like Mean Absolute Error (MAE) and present your findings.
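For reference, MAE is the average absolute difference between the predicted and the true scores. The sketch below uses made-up values (not taken from the project dataset) and compares a manual NumPy computation with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative values only; these are not taken from the project dataset.
y_true = np.array([0.20, 0.50, 0.90, 0.40])
y_pred = np.array([0.25, 0.40, 0.80, 0.40])

# MAE = mean(|y_true - y_pred|)
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(f"manual MAE:  {mae_manual:.4f}")
print(f"sklearn MAE: {mae_sklearn:.4f}")
```

Because the target is a score between 0 and 1, an MAE of 0.06 means predictions are off by about six points on a 100-point scale on average.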
Dataset
The dataset is subject to change until the project is announced.
The dataset includes environmental and historical data about regions in the United States. Each row represents a region, and the columns contain features relevant to flood risk prediction:
Feature Name | Description |
---|---|
region_id | Unique identifier for each region |
rainfall | Average annual rainfall (mm) |
river_flow | Average river flow rate (m³/s) |
urban_density | Percentage of urban area in the region |
elevation | Average elevation above sea level (meters) |
historical_floods | Number of floods recorded in the last decade |
flood_risk_score | (Target variable) Normalized flood risk score (0-1) |
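A quick schema check after loading can catch missing columns or out-of-range targets early. The sketch below is one possible approach, not a required step: it reads two made-up rows from an in-memory buffer, and the file name mentioned in the comment is hypothetical, since the brief does not name the data files.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = [
    "region_id", "rainfall", "river_flow", "urban_density",
    "elevation", "historical_floods", "flood_risk_score",
]

# Stand-in for the real file: two made-up rows in the documented layout.
# With the actual data you would pass a path (e.g. a hypothetical "train.csv").
sample_csv = io.StringIO(
    "region_id,rainfall,river_flow,urban_density,elevation,"
    "historical_floods,flood_risk_score\n"
    "1,1200.5,340.2,45.0,12.0,3,0.71\n"
    "2,640.0,,12.5,830.0,0,0.08\n"
)
df = pd.read_csv(sample_csv)

# Fail fast if any documented column is absent.
missing_columns = [c for c in EXPECTED_COLUMNS if c not in df.columns]
if missing_columns:
    raise ValueError(f"Missing expected columns: {missing_columns}")

# Count missing values per column and check the target's documented 0-1 range.
na_counts = df.isna().sum()
target_in_range = df["flood_risk_score"].between(0, 1).all()

print(df.dtypes)
print(na_counts)
print("target within [0, 1]:", target_in_range)
```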
Deliverables
You will submit the following:
Jupyter Notebook:
Document your workflow with clear explanations and visualizations.
Include data preprocessing, EDA, feature engineering, and model development.
Prediction CSV File:
File containing your predictions for the test dataset in the required format.
Format: region_id,flood_risk_score.
Summary of Findings:
- A concise explanation of your approach, insights, and key results (can be included in the notebook).
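One way to produce a prediction file with the two required columns is sketched below. The IDs and predictions are made up, the output file name in the comment is hypothetical, and clipping to the 0-1 range is an assumption based on the target's documented range rather than a stated submission rule.

```python
import io

import numpy as np
import pandas as pd

# Made-up region IDs and predictions standing in for real model output.
test_ids = [101, 102, 103]
predictions = np.array([0.12, 1.04, -0.03])

submission = pd.DataFrame({
    "region_id": test_ids,
    # Assumption: scores should stay in the documented 0-1 range, so clip.
    "flood_risk_score": np.clip(predictions, 0.0, 1.0),
})

# Write to an in-memory buffer here; pass a file path (e.g. a hypothetical
# "predictions.csv") to to_csv when producing the real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps pandas from writing its row index as an extra, unnamed first column.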
Evaluation Criteria
Your submission will be graded based on the following criteria:
Data Preprocessing (25%): Handling of missing values, outliers, and overall data cleaning process.
Exploratory Data Analysis (20%): Quality of visualizations and insights derived from the data.
Feature Engineering (20%): Creativity and relevance of new features and their impact on model performance.
Model Development (25%): Accuracy of predictions based on MAE, justification of chosen model, and optimization techniques.
Communication (10%): Clarity of the notebook, organization of the workflow, and explanation of findings.
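To illustrate the preprocessing and feature engineering criteria, here is a minimal sketch on made-up rows. Median imputation and IQR clipping are common choices, not the required ones, and the `flow_per_elevation` feature is a hypothetical example rather than a recommended feature.

```python
import numpy as np
import pandas as pd

# Made-up rows using three of the documented columns, with one missing
# value and one implausibly large rainfall reading.
df = pd.DataFrame({
    "rainfall": [800.0, 950.0, 1100.0, 1020.0, 9000.0],
    "river_flow": [120.0, np.nan, 310.0, 280.0, 150.0],
    "elevation": [250.0, 40.0, 5.0, 12.0, 600.0],
})

# Missing values: fill with the column median (one option among several).
df["river_flow"] = df["river_flow"].fillna(df["river_flow"].median())

# Outliers: clip rainfall to the 1.5 * IQR fences.
q1, q3 = df["rainfall"].quantile([0.25, 0.75])
iqr = q3 - q1
df["rainfall"] = df["rainfall"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Engineered feature (hypothetical): river flow relative to elevation.
# Adding 1 avoids division by zero for regions at sea level.
df["flow_per_elevation"] = df["river_flow"] / (df["elevation"] + 1.0)

print(df)
```

Whatever choices you make, compute imputation values and clipping thresholds on the training data only and reuse them on the test data, so that no test information leaks into the model.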
Bonus Points
Cross-Validation: Applying cross-validation for more robust model evaluation.
Feature Importance Analysis: Using SHAP or feature importance tools to interpret your model.
Advanced Techniques: Exploring innovative methods to improve predictions.
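The first two bonus items can be sketched with scikit-learn alone. The example below runs on synthetic data, uses a random forest purely as a placeholder model, and substitutes scikit-learn's `permutation_importance` for SHAP to avoid an extra dependency.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 3 features; only the first two drive y.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn maximizes scores, so MAE is exposed as its negative.
neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -neg_mae
print(f"5-fold MAE: {cv_mae.mean():.4f} +/- {cv_mae.std():.4f}")

# Permutation importance: drop in score when one feature is shuffled.
model.fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print("mean importances:", result.importances_mean)
```

For brevity the importances here are computed on the same data the model was fit on; computing them on a held-out split gives a more honest picture of which features help generalization.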
A general breakdown of scoring is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all data science concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and sometimes communicates results unclearly.
- 60%-69%: Struggling effort. Student is making some effort but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
Reproducibility + organization
All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
Points for reproducibility + organization will be based on the reproducibility of the write-up and the organization of the project GitHub repo. The repo should be neatly organized as described above, it should contain no extraneous files, and all text in the README should be easily readable.
Getting Started
Download the Dataset: Access the training and test datasets from the project repository.
Set Up Your Environment: Ensure you have Python and essential libraries installed (e.g., Pandas, NumPy, Scikit-learn, Matplotlib).
Follow the Workflow: Preprocess the data, conduct EDA, engineer features, train your model, and make predictions.
Submit Your Work: Upload your notebook and predictions to the provided submission platform.
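The workflow above might look roughly like the following baseline. This is a sketch under stated assumptions, not a reference solution: the training data is synthetic, the target formula is invented so the example runs, and gradient boosting is just one reasonable starting model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

FEATURES = ["rainfall", "river_flow", "urban_density", "elevation", "historical_floods"]
TARGET = "flood_risk_score"

# Synthetic stand-in for the training data, using the documented column names.
rng = np.random.default_rng(42)
n = 300
train = pd.DataFrame({
    "region_id": np.arange(n),
    "rainfall": rng.uniform(200, 2000, n),
    "river_flow": rng.uniform(10, 500, n),
    "urban_density": rng.uniform(0, 100, n),
    "elevation": rng.uniform(0, 1500, n),
    "historical_floods": rng.integers(0, 10, n),
})
# Made-up target so the example runs; it says nothing about real flood risk.
train[TARGET] = (
    0.5 * train["rainfall"] / 2000 + 0.5 * train["historical_floods"] / 10
).clip(0, 1)

# Hold out 20% of the rows for validation; region_id is not used as a feature.
X_train, X_val, y_train, y_val = train_test_split(
    train[FEATURES], train[TARGET], test_size=0.2, random_state=42
)

# Imputation lives inside the pipeline so it is fit on training rows only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    GradientBoostingRegressor(random_state=42),
)
model.fit(X_train, y_train)

# Clip predictions to the documented 0-1 range before scoring.
val_pred = np.clip(model.predict(X_val), 0.0, 1.0)
val_mae = mean_absolute_error(y_val, val_pred)
print(f"validation MAE: {val_mae:.4f}")
```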
Timeline
Start Date: TBD
Submission Deadline: TBD
Tools and Resources
Programming Language: Python
Key Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
Resources: Documentation on regression models, MAE, and data visualization techniques.
Late work policy
No late work is accepted on this project. Be sure to turn in your work early to avoid any technical mishaps.