HealthLake Data Pipeline

End-to-End Healthcare EHR Analytics on Delta Lake

Summary

This project constructs a comprehensive Lakehouse architecture using synthetic Electronic Health Record (EHR) data to demonstrate modern data engineering and analytics capabilities. It processes realistic patient data through a Medallion Architecture, transforming raw CSV files into highly structured, analytics-ready formats. By leveraging PySpark and Delta Lake, the pipeline ensures ACID compliance, time travel functionality, and robust schema enforcement. The progressively refined data supports both business intelligence dashboards and machine learning models designed for predicting 30-day patient readmission risk. The batch processing pipeline ultimately integrates with a real-time Svelte and Firebase application, providing a live monitoring interface for patient vitals and diagnostic aggregates. This end-to-end implementation bridges raw medical data ingestion with downstream clinical decision support, providing a scalable blueprint for healthcare data infrastructure.

Introduction

Layman Overview

Purpose: To build a system that takes messy, raw medical records (like patient visits, diagnoses, and lab results) and cleans them up step-by-step so healthcare administrators can view trends and predict which patients are at high risk of returning to the hospital.
Tech Stack: Google Colab (for running the code), Python (for processing logic), and Delta Lake (a smart storage system that prevents data corruption and remembers past versions of the files).

Technical Overview

Purpose: To implement a scalable Medallion Lakehouse architecture processing the Synthea synthetic EHR dataset. The pipeline demonstrates automated data ingestion, transformation, feature engineering, and orchestration, ultimately supporting a Business Intelligence (BI) dashboard and Machine Learning (ML) models for readmission prediction.
Tech Stack: PySpark (data processing), Delta Lake (storage layer), Python 3 (scripting), Google Colab (compute environment), Firebase (Realtime Database & Firestore for serving data), and Svelte (frontend dashboard).

HealthLake Dashboard

This is the integration test for the HealthLake interactive components.

Live Vitals

Analytics Overview

Machine Learning Predictions

Methods

Why Databricks Was Bypassed

I thought of using Databricks Free Edition (Community edition is obsolete as of this writing) but sign-up process aggressively routes users toward a 14-day Premium cloud trial. The initial workspace was inadvertently provisioned under this paid tier. Once the trial credits expired, standard all-purpose compute cluster creation was locked, leaving only paid Serverless SQL Warehouses available.

Google Colab was utilized as a zero-cost, immediate alternative. Because PySpark and Delta Lake are open-source technologies, the underlying pipeline logic executed flawlessly in Colab, proving that the architecture is environment-agnostic and not strictly tethered to Databricks proprietary compute.

Phase 1 Overview

Phase 1 focuses on data ingestion and transformation. It establishes the foundational Medallion Architecture by converting raw source files into structured, optimized Delta tables ready for downstream analysis.

Notebook 01: ETL & Delta Lake Operations

Data Generation: Programmatically generated a realistic 1,000-patient synthetic EHR dataset (demographics, encounters, conditions, medications, observations) using the Faker library.
Bronze Layer: Ingested the raw CSVs into Delta format as append-only tables, attaching ingestion timestamps and source file metadata to maintain a strict audit trail.
Silver Layer: Applied business logic to clean the data. This included enforcing data types (casting dates and integers), standardizing schemas to snake_case, deduplicating records, and calculating derived columns (e.g., patient age, encounter duration).
Gold Layer: Aggregated the cleaned data into specialized tables (patient_features, encounter_summary, diagnosis_counts) optimized for direct consumption by BI tools and ML algorithms.
Optimization: Executed Delta Lake commands like OPTIMIZE and ZORDER to compact files and co-locate data, significantly improving read performance for future queries.

Databricks Cluster Failure Analysis

The initial attempt to execute this pipeline on Databricks failed because the workspace was inadvertently provisioned under a Premium cloud trial with expired credits, rather than the intended Databricks Community Edition. This restriction disabled standard all-purpose cluster creation, limiting the compute interface to paid Serverless SQL Warehouses, which are incompatible with the free-tier project requirements.

Alternative Approach Rationality

The strategy was shifted to Google Colab to provide a free, isolated, and immediate compute environment. By installing pyspark and delta-spark directly via pip and mounting Google Drive for persistent storage, the pipeline successfully bypassed the locked Databricks cluster dependency. This alternative retained 100% of the required PySpark logic and Delta Lake capabilities without requiring a paid cloud infrastructure.

Phase 2 Overview

Phase 2 shifts the focus from data engineering to data analytics and serving. It utilizes the highly structured Gold tables established in Phase 1 to generate business intelligence insights and prepares those insights for external consumption by a frontend application.

Notebook 02: Business Intelligence

Environment Reconnection: Configured the Spark session within Google Colab to reconnect to the existing Medallion architecture (Silver and Gold Delta tables) stored on Google Drive.
Core BI Queries: Executed distributed Spark SQL queries to extract actionable clinical and financial metrics. This included identifying the top 10 most prevalent diagnoses, calculating average encounter costs by payer, computing 30-day readmission rates across age groups, and analyzing the financial impact of severe comorbidities.
Advanced SQL Patterns: Implemented complex data transformations using Window Functions (for tracking year-to-date encounter volumes) and Pivot Tables (to transpose encounter classes across age brackets for easier reporting).
Serving Layer Preparation (Dry Run): Extracted the queried data from Spark DataFrames and formatted it into structured JSON payloads. While the physical push to the Firebase database was bypassed due to the Colab environment constraints, the logic successfully validated the exact data structure required by the downstream Svelte dashboard.

Phase 3 Overview: AI & Data Science with MLflow

Overall Purpose
This phase demonstrates a complete machine learning lifecycle (MLOps) within the HealthLake project. The primary goal is to predict 30-day patient readmission by engineering features from Delta Lake tables, tracking experiments with MLflow, and deploying a production-ready model for batch inference.

Technical Stack

Data Processing: PySpark, Delta Lake (Delta Spark), Pandas
Machine Learning: Scikit-Learn (Logistic Regression, Random Forest), Spark ML
MLOps: MLflow (Tracking, Model Registry)
Cloud/Storage: Google Drive (DBFS equivalent), Delta Lake format

Execution Summary

Section 1: Setup & Feature Engineering: Initialized the Spark session with Delta support, mounted storage, and configured MLflow. Transformed raw Gold Delta tables into a feature matrix, engineered new ratios (inpatient/emergency), and prepared the data for Scikit-Learn to create a reproducible ML environment.
Section 2: MLflow Experiment Tracking: Trained two models (Logistic Regression and Random Forest). Logged hyperparameters, cross-validation metrics, and test performance (AUC-ROC/PR) to MLflow runs to compare algorithms objectively and keep a record of every experiment.
Section 3: MLflow Model Registry: Queried the MLflow Registry to find the best-performing model (Random Forest, AUC: 0.843) and programmatically promoted it to the ‘Production’ stage to manage the model lifecycle.
Section 4: Batch Inference: Used the production model to score all 1,000 patients, categorized them into risk tiers (Low/Medium/High), and wrote the results back to a new Delta Gold table (healthlake.ml_predictions) to generate actionable insights and persist predictions.
Section 5: Spark ML vs. Scikit-Learn: Re-implemented the Logistic Regression model using Spark ML (distributed) and achieved an AUC-ROC of 0.857, demonstrating how to scale the pipeline to millions of rows when data exceeds local memory limits.

Technical Interventions & Corrections

Resolved PySpark CANNOT_INFER_TYPE_FOR_FIELD: PySpark failed to infer the data type for the risktier column due to NumPy-specific str objects. Modified the logic to explicitly apply a lambda function (str(x) if pd.notnull(x) else None) to ensure the data consisted of native Python strings for correct Spark mapping.
Resolved Spark SCHEMA_NOT_FOUND (AnalysisException): The final table registration query failed because the database did not exist in the temporary Spark session’s metastore. Added spark.sql("CREATE DATABASE IF NOT EXISTS healthlake") prior to registration to ensure the namespace was available.
Corrected Feature Consistency Error: Mitigated the risk of duplicate columns or incorrect feature order during batch inference by ensuring the final inference DataFrame (X_all) used a strict selection based on the exact FEATURE_COLS_FINAL list established during training.

Phase 4 Overview: Real-Time Analytics & Orchestration

Purpose: This final phase transitions the architecture from static batch processing to live data streaming and automated orchestration. It demonstrates how Spark Structured Streaming processes simulated medical IoT data (patient vitals) with low latency to provide immediate clinical insights.

Technical Stack:

Processing: Apache Spark (Structured Streaming)
Storage: Delta Lake
Orchestration: Databricks Jobs & Workflows REST API concepts
Environment: Google Colab (Local Simulation)

Execution Summary:

Setup & Schema Definition: Initialized the streaming environment and defined explicit schemas (vitals_schema) required to ensure data integrity during real-time ingestion.
Vitals Simulation: Executed a local producer script to generate synthetic patient IoT data (heart rate, SpO2, BP) and write CSV files to a watched directory, triggering Spark micro-batches.
Streaming Consumer: Deployed readStream to ingest data, applied transformation logic to flag abnormal vitals (tachycardia/hypoxia), and utilized writeStream to append results to a Delta table with checkpointing for fault tolerance.
Windowed Aggregations: Calculated 5-minute rolling averages of patient vitals to monitor clinical trends rather than isolated anomalies.
Orchestration Definition: Programmatically defined a multi-step production DAG (ETL → BI → ML) using a JSON payload structured for the Databricks Jobs API.

Technical Interventions:

Resolved Orchestration SyntaxError: Reconstructed a broken try…except block in the workflow cell. Added a robust fallback mechanism to handle the missing Databricks configuration object. The code now defaults to LOCAL_COLAB_INSTANCE, allowing the workflow payload logic to execute successfully outside a Databricks workspace.

Project Retrospective: The 4 Phases

Phase 1 (Data Engineering): Established a Medallion architecture (Bronze, Silver, Gold), transforming raw Synthetic EHR CSVs into optimized Delta Lake tables.
Phase 2 (Business Intelligence): Executed distributed Spark SQL queries on Gold tables to extract readmission rates and financial metrics, formatting the output for frontend consumption.
Phase 3 (Data Science): Engineered features and tracked Logistic Regression and Random Forest models using MLflow, applying the best model to generate batch readmission predictions.
Phase 4 (Streaming): Implemented Structured Streaming to process simulated real-time IoT patient vitals and defined automated pipeline orchestration.

Phase 5: Frontend Integration & Deployment Readiness

The data infrastructure is complete. By shifting real-time data generation to a client-side simulation architecture, reliance on a continuous Python backend worker and the Firebase Realtime Database has been eliminated. The Svelte application now operates as a robust, standalone unit.

Analogy: like swapping a live chef for a vending machine — stocked once, serves itself forever.

Execution & Rendering Sequence

Initialize the Application — Execute npm run dev to launch the SvelteKit frontend locally.
Static Data Hydration — On load, the dashboard runs read-only queries against the Firebase Firestore (default) database to fetch the batch-processed BI metrics and ML predictions from Phases 2 and 3.
Live Telemetry Execution — The VitalsChart.svelte component mounts and triggers a local JavaScript setInterval loop. This simulates a live WebSocket stream, generating continuous cardiovascular parameters that adhere to physiological constraints and biological thresholds from clinical domain expertise.

Why It Matters

This decoupled architecture ensures the project functions permanently as a portfolio showcase. The dashboard delivers zero-latency interactivity for visitors without requiring persistent backend server maintenance, active cloud workers, or ongoing hosting costs.

From Data to Diagnosis