Healthcare Equity Explorer

Overview

Chronic conditions such as hypertension, diabetes, and chronic pain affect millions of people and are key drivers of hospital readmissions.

In this project, you’ll work with a synthetic healthcare dataset generated with Synthea and modeled on Arizona patients. Your goal is to build a predictive model that identifies patients at high risk of being readmitted to the hospital within 30 days.

This challenge centers on the ethical and practical dimensions of healthcare disparities. You will explore real-world clinical and demographic factors and build a machine learning model to support equitable healthcare outcomes.

Objectives

In this project, you will:

Explore and Analyze: Conduct exploratory data analysis (EDA) to understand patient encounters, demographics, and comorbidities.
Clean and Prepare: Handle missing values, encode categorical variables, and standardize numerical inputs.
Engineer Features: Enrich the dataset with new features such as chronic condition flags, cost aggregations, or social determinants.
Build a Predictive Model: Train a classifier to predict whether a patient will be readmitted within 30 days of discharge.
Evaluate and Communicate: Measure model performance using ROC AUC and present a clear rationale for your decisions.
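
The objectives above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the provided baseline: the data is a randomly generated stand-in, the column subset is borrowed from the Dataset table, and the choices of median imputation, one-hot encoding, and logistic regression are assumptions you are free to replace.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the training data (random values, for illustration only).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "num_meds": rng.integers(0, 15, n).astype(float),
    "pain_score": rng.integers(0, 11, n).astype(float),
    "payer_type": rng.choice(["private", "government", "uninsured"], n),
    "gender": rng.choice(["F", "M"], n),
})
# Inject some missing pain scores so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "pain_score"] = np.nan
# Hypothetical target: readmission risk rises with medication count.
risk = 1 / (1 + np.exp(-(0.25 * df["num_meds"] - 2.0)))
df["readmitted_within_30_days"] = (rng.random(n) < risk).astype(int)

numeric_cols = ["age", "num_meds", "pain_score"]
categorical_cols = ["payer_type", "gender"]

# Clean and prepare: impute, scale numeric inputs, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df[numeric_cols + categorical_cols]
y = df["readmitted_within_30_days"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Build and evaluate: fit on the training split, score the held-out split.
model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
print(f"Validation ROC AUC: {val_auc:.3f}")
```

Keeping preprocessing inside the pipeline means the imputer and scaler are fit only on the training split, which avoids leaking validation statistics into the model.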

Dataset

The dataset is derived from Synthea-simulated EHR data focused on Arizona residents. Each row represents a hospital encounter, and the columns describe patient demographics, costs, conditions, vitals, medications, and more.

Feature Name	Description
`encounter_id`	Unique ID for the patient encounter
`patient_id`	Unique ID for the patient
`age`	Age of the patient at time of encounter
`gender`, `race`, etc.	Demographic characteristics
`payer_type`	Insurance coverage type (e.g., private, government, uninsured)
`num_meds`, `total_med_cost`	Medication burden and cost during the encounter
`has_chronic_pain`	Binary flag for chronic pain diagnosis
`pain_score`	Reported pain level (0–10)
`readmitted_within_30_days`	Target variable indicating a readmission within 30 days
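
As a starting point for EDA, the columns above lend themselves to simple group comparisons. The sketch below uses a tiny hand-written frame (the values are invented for illustration, not taken from the real data) to show how readmission rates by `payer_type` and per-column missingness might be summarized.

```python
import pandas as pd

# Hand-written toy encounters; real data comes from the Codabench download.
encounters = pd.DataFrame({
    "encounter_id": ["e1", "e2", "e3", "e4", "e5", "e6"],
    "payer_type": ["private", "private", "government",
                   "government", "uninsured", "uninsured"],
    "pain_score": [2, None, 5, 8, 7, 9],
    "readmitted_within_30_days": [0, 0, 0, 1, 1, 1],
})

# Readmission rate and encounter count per insurance type.
rate_by_payer = (
    encounters
    .groupby("payer_type")["readmitted_within_30_days"]
    .agg(readmit_rate="mean", n_encounters="count")
)
print(rate_by_payer)

# Share of missing values in each column.
missing_share = encounters.isna().mean()
print(missing_share)
```

Reporting the group size next to each rate helps avoid over-reading differences that rest on only a handful of encounters.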

Deliverables

You will submit the following:

Jupyter Notebook:
- Full notebook that documents your approach, including EDA, modeling, and results.
- Well-commented and structured for reproducibility.
Prediction CSV File:
- Format: encounter_id, readmitted_within_30_days
- Predictions for the test set in Codabench.
Summary of Findings:
- A brief write-up summarizing your strategy, key decisions, and insights.
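
The prediction file format listed above can be produced with pandas. In this sketch the helper `build_submission`, the encounter IDs, and the scores are all hypothetical. Whether Codabench expects probabilities or 0/1 labels is not stated here, so check the competition page before submitting.

```python
import pandas as pd


def build_submission(encounter_ids, predictions):
    """Assemble a two-column frame in the required column order."""
    submission = pd.DataFrame({
        "encounter_id": encounter_ids,
        "readmitted_within_30_days": predictions,
    })
    # Each test encounter should appear exactly once.
    if not submission["encounter_id"].is_unique:
        raise ValueError("duplicate encounter_id values in submission")
    return submission


# Hypothetical IDs and scores, for illustration only.
submission = build_submission(
    ["enc_001", "enc_002", "enc_003"],
    [0.82, 0.10, 0.47],
)

# index=False keeps the pandas row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, pass a path as the first argument to `to_csv`.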

Evaluation Criteria

Your submission will be evaluated based on:

Preprocessing (20%) – Data wrangling, feature cleanup, handling missingness.
EDA (20%) – Insightful visualizations and meaningful hypotheses.
Modeling (30%) – Performance (ROC AUC) and appropriate modeling decisions.
Feature Engineering (15%) – Inclusion of thoughtful, useful features.
Communication (15%) – Clarity and structure of the notebook and analysis.
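
Because modeling performance is measured with ROC AUC, it helps to see what the metric computes: the fraction of (readmitted, not readmitted) pairs in which the readmitted encounter receives the higher score, with ties counted as half. The small example below uses made-up labels and scores, and the helper `pairwise_auc` is written here only to illustrate the definition.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores for five encounters.
y_true = [0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.45, 0.80]


def pairwise_auc(labels, values):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [v for y, v in zip(labels, values) if y == 1]
    neg = [v for y, v in zip(labels, values) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


auc_from_scores = roc_auc_score(y_true, scores)
auc_by_pairs = pairwise_auc(y_true, scores)
print(f"{auc_from_scores:.3f}")  # 0.833

# Thresholding at 0.5 first discards ranking information.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(f"{auc_from_labels:.3f}")  # 0.667
```

In this toy case, scoring with probabilities gives a higher AUC than scoring with thresholded labels, since the metric depends only on how encounters are ranked.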

Bonus Points

Use of cross-validation or calibration techniques.
Thoughtful commentary on ethical implications of modeling patient data.

The top 5 leaderboard entries in the Test Phase will receive 5% extra credit toward the final course grade!

Reproducibility + Organization

All deliverables must be reproducible and committed to a clean GitHub repository. Points will be awarded for:

Clean folder structure (/src, /data, /notebooks, etc.)
Descriptive README.md
Minimal extraneous files or outputs

Getting Started

Download the Dataset: Get the training and dev data from the Codabench site.
Train Your Model: Use the provided baseline models (logistic_baseline.py, train_predict.py) or start from scratch.
Evaluate and Submit: Submit predictions to Codabench during both the dev and test phases.
Track Your Progress: Watch your rank on the leaderboard!

Timeline

Launch: Oct 06, 2025
Development Phase Ends: Dec 15, 2025
Test Phase: Dec 15–17, 2025

Tools and Resources

Language: Python
Libraries: pandas, scikit-learn, xgboost, matplotlib, seaborn
Platform: Codabench
Resources: Synthea schema, scikit-learn docs, ROC AUC scoring guides

Late Work Policy

Late work will not be accepted for this project.
Make sure to upload early and verify your submissions on Codabench.