Healthcare Equity Explorer
Overview
Chronic conditions like hypertension, diabetes, and chronic pain affect millions of people and are a key driver of hospital readmissions.
In this project, you'll work with a synthetic healthcare dataset generated with [Synthea](https://synthetichealth.github.io/synthea/) and modeled on Arizona patients. Your goal is to build a predictive model that identifies patients at high risk of being readmitted to the hospital within 30 days.
This challenge centers on the ethical and practical dimensions of healthcare disparities: you will explore real-world clinical and demographic factors and build a machine learning model to support equitable healthcare outcomes.
Objectives
In this project, you will:
- Explore and Analyze: Conduct exploratory data analysis (EDA) to understand patient encounters, demographics, and comorbidities.
- Clean and Prepare: Handle missing values, encode categorical variables, and standardize numerical inputs.
- Engineer Features: Enrich the dataset with new features such as chronic condition flags, cost aggregations, or social determinants.
- Build a Predictive Model: Train a classifier to predict whether a patient will be readmitted within 30 days of discharge.
- Evaluate and Communicate: Measure model performance using ROC AUC and present a clear rationale for your decisions.
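The cleaning, modeling, and evaluation objectives above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the provided baseline: the toy data frame is randomly generated (it is not Synthea data), and the column groupings and the choice of logistic regression are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the training file: random values, NOT the real dataset.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "num_meds": rng.integers(0, 15, n).astype(float),
    "pain_score": rng.integers(0, 11, n).astype(float),
    "payer_type": rng.choice(["private", "government", "uninsured"], n),
    "has_chronic_pain": rng.integers(0, 2, n),
})
# Synthetic target loosely tied to age and medication count (an assumption).
logit = 0.03 * (df["age"] - 50) + 0.15 * df["num_meds"] - 1.5
df["readmitted_within_30_days"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Inject some missing values so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "pain_score"] = np.nan

numeric_cols = ["age", "num_meds", "pain_score"]
categorical_cols = ["payer_type"]
flag_cols = ["has_chronic_pain"]

preprocess = ColumnTransformer([
    # Median-impute, then standardize numeric inputs.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # One-hot encode categoricals; unseen categories at predict time are ignored.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Binary flags are passed through unchanged.
    ("flag", "passthrough", flag_cols),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X = df.drop(columns="readmitted_within_30_days")
y = df["readmitted_within_30_days"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]  # probability of readmission
val_auc = roc_auc_score(y_val, val_scores)
print(f"validation ROC AUC: {val_auc:.3f}")
```

Keeping preprocessing inside the pipeline means the imputer and scaler are fit on the training split only, which avoids leaking validation statistics into training.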
Dataset
The dataset has been derived from a simulation of Synthea EHR data focused on Arizona residents. Each row represents a hospital encounter, and columns represent patient demographics, costs, conditions, vitals, medications, and more.
Feature Name | Description |
---|---|
`encounter_id` | Unique ID for the patient encounter |
`patient_id` | Unique ID for the patient |
`age` | Age of the patient at time of encounter |
`gender`, `race`, etc. | Demographic characteristics |
`payer_type` | Insurance coverage type (e.g., private, government, uninsured) |
`num_meds`, `total_med_cost` | Medication burden and cost during the encounter |
`has_chronic_pain` | Binary flag for chronic pain diagnosis |
`pain_score` | Reported pain level (0–10) |
`readmitted_within_30_days` | Target variable indicating a readmission within 30 days |
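As a rough illustration of how these columns might feed feature engineering and a first equity-oriented EDA check, the sketch below derives three features and a readmission rate by payer type. The five rows are made up for the example, and the derived names (`cost_per_med`, `patient_encounter_count`, `severe_pain`) and the pain threshold of 7 are assumptions, not part of the dataset.

```python
import pandas as pd

# Hand-written toy rows using the documented column names; NOT real data.
encounters = pd.DataFrame({
    "encounter_id": ["e1", "e2", "e3", "e4", "e5"],
    "patient_id": ["p1", "p1", "p2", "p3", "p3"],
    "payer_type": ["private", "private", "uninsured", "government", "government"],
    "num_meds": [2, 0, 5, 1, 4],
    "total_med_cost": [100.0, 0.0, 250.0, 30.0, 400.0],
    "has_chronic_pain": [0, 0, 1, 1, 1],
    "pain_score": [2, 1, 8, 7, 9],
    "readmitted_within_30_days": [0, 0, 1, 0, 1],
})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative (hypothetical) features."""
    out = df.copy()
    # Average cost per medication; encounters with no medications get 0.
    meds = out["num_meds"].where(out["num_meds"] > 0)  # 0 -> NaN to avoid divide-by-zero
    out["cost_per_med"] = (out["total_med_cost"] / meds).fillna(0.0)
    # Number of encounters recorded for the same patient in this table.
    out["patient_encounter_count"] = (
        out.groupby("patient_id")["encounter_id"].transform("count")
    )
    # Flag for high reported pain; the cutoff of 7 is an arbitrary assumption.
    out["severe_pain"] = (out["pain_score"] >= 7).astype(int)
    return out


features = add_features(encounters)
print(features[["encounter_id", "cost_per_med", "patient_encounter_count", "severe_pain"]])

# Observed readmission rate per insurance group, a simple disparity check.
rate_by_payer = features.groupby("payer_type")["readmitted_within_30_days"].mean()
print(rate_by_payer)
```

Because a patient can appear in several rows, per-patient aggregates like the encounter count should be computed on training data only, or they may leak information from held-out encounters.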
Deliverables
You will submit the following:
- Jupyter Notebook:
  - Full notebook that documents your approach, including EDA, modeling, and results.
  - Well-commented and structured for reproducibility.
- Prediction CSV File:
  - Format: `encounter_id,readmitted_within_30_days`
  - Predictions for the test set in Codabench.
- Summary of Findings:
  - A brief write-up summarizing your strategy, key decisions, and insights.
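One possible way to produce the prediction CSV with the two required columns is sketched below. The helper `build_submission`, the example IDs, and the scores are hypothetical, and the sketch assumes the second column holds a predicted probability; check the Codabench instructions for whether probabilities or 0/1 labels are expected.

```python
import pandas as pd


def build_submission(encounter_ids, scores) -> pd.DataFrame:
    """Assemble a two-column submission table and run basic sanity checks."""
    if len(encounter_ids) != len(scores):
        raise ValueError("encounter_ids and scores must have the same length")
    submission = pd.DataFrame({
        "encounter_id": list(encounter_ids),
        "readmitted_within_30_days": list(scores),
    })
    if submission["encounter_id"].duplicated().any():
        raise ValueError("each encounter_id should appear exactly once")
    if submission["readmitted_within_30_days"].isna().any():
        raise ValueError("predictions must not contain missing values")
    return submission


# Made-up IDs and scores, for illustration only.
test_ids = ["enc-001", "enc-002", "enc-003"]
test_scores = [0.12, 0.87, 0.4]

submission = build_submission(test_ids, test_scores)
csv_text = submission.to_csv(index=False)  # index=False keeps only the two columns
print(csv_text)

# To write a file instead (the file name here is an assumption):
# submission.to_csv("predictions.csv", index=False)
```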
Evaluation Criteria
Your submission will be evaluated based on:
- Preprocessing (20%) – Data wrangling, feature cleanup, handling missingness.
- EDA (20%) – Insightful visualizations and meaningful hypotheses.
- Modeling (30%) – Performance (ROC AUC) and appropriate modeling decisions.
- Feature Engineering (15%) – Inclusion of thoughtful, useful features.
- Communication (15%) – Clarity and structure of the notebook and analysis.
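Since modeling performance is measured with ROC AUC, it may help to see what the metric rewards. The small example below uses made-up labels and scores to show that ROC AUC equals the fraction of (readmitted, not readmitted) pairs that the model ranks correctly, and that thresholding scores into hard 0/1 labels before scoring can lower it.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1]
y_score = [0.2, 0.6, 0.4, 0.7, 0.9]

auc_from_scores = roc_auc_score(y_true, y_score)

# Same quantity by hand: share of positive/negative pairs where the positive
# case receives the higher score (ties count as half).
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairwise = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)
) / (len(pos) * len(neg))

# Thresholding at 0.5 discards ranking information within each class.
y_label = [int(s >= 0.5) for s in y_score]
auc_from_labels = roc_auc_score(y_true, y_label)

print(f"AUC from probabilities: {auc_from_scores:.3f}")  # 0.833
print(f"AUC from pair counting: {pairwise:.3f}")         # 0.833
print(f"AUC from hard labels:   {auc_from_labels:.3f}")  # 0.583
```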
Bonus Points
- Use of cross-validation or calibration techniques.
- Thoughtful commentary on ethical implications of modeling patient data.
Top 5 leaderboard entries on the Test Phase will receive +5% extra credit toward the final course grade!
Reproducibility + Organization
All deliverables must be reproducible and committed to a clean GitHub repository. Points will be awarded for:
- Clean folder structure (`/src`, `/data`, `/notebooks`, etc.)
- Descriptive `README.md`
- Minimal extraneous files or outputs
Getting Started
- Download the Dataset: Get the training and dev data from the Codabench site.
- Train Your Model: Use the provided baseline models (`logistic_baseline.py`, `train_predict.py`) or start from scratch.
- Evaluate and Submit: Submit predictions to Codabench during both the dev and test phases.
- Track Your Progress: Watch your rank on the leaderboard!
Timeline
- Launch: March 25, 2025
- Development Phase Ends: April 15, 2025
- Test Phase: April 16–17, 2025
Tools and Resources
- Language: Python
- Libraries: pandas, scikit-learn, xgboost, matplotlib, seaborn
- Platform: Codabench
- Resources: Synthea schema, scikit-learn docs, ROC AUC scoring guides
Late Work Policy
Late work is not accepted for this project.
Make sure to upload early and verify your submissions on Codabench.