Healthcare Equity Explorer

Overview

Chronic conditions like hypertension, diabetes, and chronic pain affect millions of people and are a key driver of hospital readmissions.

In this project, you’ll work with a synthetic healthcare dataset generated with [Synthea](https://synthetichealth.github.io/synthea/) and modeled on Arizona patients. Your goal is to build a predictive model that identifies patients at high risk of being readmitted to the hospital within 30 days.

This challenge centers on the ethical and practical dimensions of healthcare disparities: you will explore real-world clinical and demographic factors and build a machine learning model to support equitable healthcare outcomes.

Objectives

In this project, you will:

  1. Explore and Analyze: Conduct exploratory data analysis (EDA) to understand patient encounters, demographics, and comorbidities.

  2. Clean and Prepare: Handle missing values, encode categorical variables, and standardize numerical inputs.

  3. Engineer Features: Enrich the dataset with new features such as chronic condition flags, cost aggregations, or social determinants.

  4. Build a Predictive Model: Train a classifier to predict whether a patient will be readmitted within 30 days of discharge.

  5. Evaluate and Communicate: Measure model performance using ROC AUC and present a clear rationale for your decisions.
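
As a rough illustration of how objectives 2, 4, and 5 can fit together, here is a minimal sketch using scikit-learn. The data is randomly generated toy data that only borrows a few column names from this brief, and the choice of logistic regression, median imputation, and a 75/25 split are assumptions made for the example, not requirements and not necessarily what the provided baselines do.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in data: randomly generated, NOT the project dataset.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "num_meds": rng.integers(0, 15, n).astype(float),
    "pain_score": rng.integers(0, 11, n).astype(float),
    "payer_type": rng.choice(["private", "government", "uninsured"], n),
})
# Synthetic target loosely tied to num_meds and pain_score so there is some signal.
logit = 0.25 * df["num_meds"] + 0.2 * df["pain_score"] - 3.0
df["readmitted_within_30_days"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Blank out a few values to mimic missingness in a numeric column.
df.loc[rng.choice(n, 20, replace=False), "pain_score"] = np.nan

numeric = ["age", "num_meds", "pain_score"]
categorical = ["payer_type"]

# Impute and standardize numeric inputs; one-hot encode categorical inputs.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X = df[numeric + categorical]
y = df["readmitted_within_30_days"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model.fit(X_train, y_train)
# ROC AUC is computed from predicted probabilities, not hard 0/1 labels.
val_scores = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_scores)
print(f"Validation ROC AUC: {auc:.3f}")
```

Wrapping preprocessing and the classifier in one Pipeline means the imputer and scaler are fit on the training split only, which helps avoid leaking validation statistics into training.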

Dataset

The dataset is derived from Synthea-simulated EHR data for Arizona residents. Each row represents a hospital encounter, and columns represent patient demographics, costs, conditions, vitals, medications, and more.

Feature Name                  Description
encounter_id                  Unique ID for the patient encounter
patient_id                    Unique ID for the patient
age                           Age of the patient at time of encounter
gender, race, etc.            Demographic characteristics
payer_type                    Insurance coverage type (e.g., private, government, uninsured)
num_meds, total_med_cost      Medication burden and cost during the encounter
has_chronic_pain              Binary flag for chronic pain diagnosis
pain_score                    Reported pain level (0–10)
readmitted_within_30_days     Target variable indicating a readmission within 30 days
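
One way to start exploring these columns is to compare the target across demographic or coverage groups and to derive a simple per-patient aggregate. The sketch below uses a tiny hand-made table with the column names listed above. The rows and the new column name patient_total_med_cost are illustrative assumptions, not part of the real data.

```python
import pandas as pd

# Hand-made toy rows using the documented column names; values are illustrative only.
encounters = pd.DataFrame({
    "encounter_id": ["e1", "e2", "e3", "e4", "e5", "e6"],
    "patient_id": ["p1", "p1", "p2", "p3", "p3", "p4"],
    "payer_type": ["private", "private", "government",
                   "uninsured", "uninsured", "government"],
    "total_med_cost": [120.0, 80.0, 300.0, 50.0, 75.0, 210.0],
    "readmitted_within_30_days": [0, 1, 0, 1, 1, 0],
})

# Readmission rate and encounter count per insurance group.
rate_by_payer = (
    encounters.groupby("payer_type")["readmitted_within_30_days"]
    .agg(rate="mean", n="size")
    .sort_values("rate", ascending=False)
)
print(rate_by_payer)

# Hypothetical engineered feature: each patient's total medication cost,
# broadcast back onto every one of that patient's encounter rows.
encounters["patient_total_med_cost"] = (
    encounters.groupby("patient_id")["total_med_cost"].transform("sum")
)
print(encounters[["encounter_id", "patient_id", "patient_total_med_cost"]])
```

Reporting the group size next to each rate is a small safeguard: a large gap between groups means little if one group has only a handful of encounters.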

Deliverables

You will submit the following:

  1. Jupyter Notebook:
    • Full notebook that documents your approach, including EDA, modeling, and results.
    • Well-commented and structured for reproducibility.
  2. Prediction CSV File:
    • Format: encounter_id, readmitted_within_30_days
    • Predictions for the test set in Codabench.
  3. Summary of Findings:
    • A brief write-up summarizing your strategy, key decisions, and insights.
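
For the prediction file, a minimal sketch of producing the two required columns with pandas is shown below. The encounter IDs and predictions are made-up placeholders, and the brief does not say whether Codabench expects hard labels or probabilities, so check the competition page before submitting.

```python
import io

import pandas as pd

# Hypothetical test-set IDs and predictions; in practice these come from your model.
test_ids = ["enc_001", "enc_002", "enc_003"]
predictions = [0, 1, 0]

submission = pd.DataFrame({
    "encounter_id": test_ids,
    "readmitted_within_30_days": predictions,
})

# Written to an in-memory buffer here so the example has no side effects;
# pass a filename (for example "predictions.csv", an assumed name) to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps pandas from adding an unnamed index column, which would otherwise break the two-column format.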

Evaluation Criteria

Your submission will be evaluated based on:

  1. Preprocessing (20%) – Data wrangling, feature cleanup, handling missingness.
  2. EDA (20%) – Insightful visualizations and meaningful hypotheses.
  3. Modeling (30%) – Performance (ROC AUC) and appropriate modeling decisions.
  4. Feature Engineering (15%) – Inclusion of thoughtful, useful features.
  5. Communication (15%) – Clarity and structure of the notebook and analysis.

Bonus Points

  • Use of cross-validation or calibration techniques.
  • Thoughtful commentary on ethical implications of modeling patient data.
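
As a sketch of what the cross-validation and calibration bonus could look like, the example below scores a classifier with stratified 5-fold ROC AUC and then fits a calibrated version of it. It runs on scikit-learn's make_classification toy data rather than the project dataset, and the model, fold count, and sigmoid calibration method are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Imbalanced toy data standing in for the real features and target.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(f"ROC AUC per fold: {np.round(cv_auc, 3)}")
print(f"Mean ROC AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# Calibration: fit on a training split, check probability quality on held-out data.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
)
calibrated.fit(X_train, y_train)
hold_probs = calibrated.predict_proba(X_hold)[:, 1]
# Brier score: mean squared error of predicted probabilities (lower is better).
print(f"Held-out Brier score: {brier_score_loss(y_hold, hold_probs):.3f}")
```

Because one patient can appear in several encounters, grouping folds by patient_id (for instance with scikit-learn's GroupKFold) may give a more honest estimate than splitting rows independently.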

The top 5 leaderboard entries in the Test Phase will receive +5% extra credit toward the final course grade!

Reproducibility + Organization

All deliverables must be reproducible and committed to a clean GitHub repository. Points will be awarded for:

  • Clean folder structure (/src, /data, /notebooks, etc.)
  • Descriptive README.md
  • Minimal extraneous files or outputs

Getting Started

  1. Download the Dataset: Get the training and dev data from the Codabench site.

  2. Train Your Model: Use the provided baseline models (logistic_baseline.py, train_predict.py) or start from scratch.

  3. Evaluate and Submit: Submit predictions to Codabench during both the dev and test phases.

  4. Track Your Progress: Watch your rank on the leaderboard!

Timeline

  • Launch: March 25, 2025
  • Development Phase Ends: April 15, 2025
  • Test Phase: April 16–17, 2025

Tools and Resources

  • Language: Python
  • Libraries: pandas, scikit-learn, xgboost, matplotlib, seaborn
  • Platform: Codabench
  • Resources: Synthea schema, scikit-learn docs, ROC AUC scoring guides

Late Work Policy

Late work will not be accepted for this project.
Upload early and verify your submissions on Codabench.