Real-Time Clinical Intelligence Engine

Summary

This project presents a production-grade Real-Time Clinical Intelligence Engine that tackles "alert fatigue" by embedding directly into existing Electronic Health Record (EHR) workflows via the HL7 CDS Hooks 1.0 protocol. Instead of requiring clinicians to use a separate dashboard, the system runs entirely in the background, passing FHIR R4 prefetch data through an XGBoost model to predict in-hospital mortality risk. Explainability and governance are built into every inference: the engine generates per-patient SHAP attributions, immutable audit logs, and an FDA GMLP Model Card so that every alert is both interpretable and accountable.

The Problem With How Hospitals Currently Use AI

Every year, sepsis kills approximately 270,000 Americans — more than prostate cancer, breast cancer, and AIDS combined. It is simultaneously one of the most lethal and one of the most preventable conditions in modern medicine. The clinical evidence is unambiguous: for every hour that appropriate antibiotic therapy is delayed after sepsis onset, in-hospital mortality rises by roughly 7%. Hours matter. Sometimes minutes matter.

The healthcare industry recognized this urgency and responded the way it typically does — by building dashboards. Sepsis dashboards. Early warning score dashboards. Risk stratification dashboards. Hospitals invested millions of dollars in these systems, and the outcomes were deeply disappointing. Not because the underlying models were wrong. Not because the data was unavailable. But because the tools required clinicians to change their behavior: to remember to open a separate application, to log into a different system, to manually check a score that wasn’t part of their natural workflow. Research consistently shows that standalone clinical AI tools achieve less than 10% sustained adoption after six months in production. The technology worked. The workflow integration didn’t.

The deeper problem is what researchers call alert fatigue — a well-documented cognitive phenomenon in which clinicians, bombarded by hundreds of passive EHR notifications per shift, develop a systematic habit of dismissing alerts without reading them. Studies have found that in some hospital systems, physicians override more than 90% of drug interaction alerts. When every notification competes equally for attention, the truly critical ones get lost in the noise. A sepsis warning that fires at the same visual priority as a reminder to update a patient’s allergy list is not a sepsis warning — it is background noise wearing clinical clothing.

This project is built on a fundamentally different premise: the most effective clinical AI is the one that requires the least behavioral change from the clinician. Instead of asking physicians to come to the intelligence, this system brings the intelligence to the physician — embedded directly inside the EHR interface, at the exact clinical moment when it is most actionable, without requiring a single additional login, tab switch, or manual query.

What This System Actually Does

At its core, this project is a fully specification-compliant HL7 CDS Hooks 1.0 service — a production-standard protocol that Epic, Oracle Health (Cerner), MEDITECH, and athenahealth all support natively in their EHR platforms today. CDS Hooks defines a precise contract between an EHR and an external intelligence service: the EHR fires a structured webhook at specific clinical workflow moments (called “hooks”), the external service processes the request and returns a formatted “CDS Card,” and that card appears inline in the clinician’s existing interface — no new window, no separate login, no workflow disruption.
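The request/response contract described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual handler: the identifiers, the prefetch keys (patient, observations), and the card text are all made up for the example. Only the top-level field names (hook, hookInstance, context, prefetch, cards, summary, indicator, source) follow the CDS Hooks specification.

```python
import json

# Hypothetical patient-view request. All identifiers below are invented.
example_request = {
    "hook": "patient-view",
    "hookInstance": "d1577c69-dfbe-44ad-ba6d-3e05e953b2ea",
    "context": {"userId": "Practitioner/example", "patientId": "example-patient"},
    "prefetch": {
        # Prefetch keys are chosen by the service; these names are assumptions.
        "patient": {"resourceType": "Patient", "id": "example-patient"},
        "observations": {"resourceType": "Bundle", "entry": []},
    },
}


def handle_hook(request: dict) -> dict:
    """Return a CDS Hooks response envelope: an object holding a 'cards' list."""
    if request.get("hook") != "patient-view":
        # An empty list tells the EHR there is nothing to display.
        return {"cards": []}
    card = {
        "summary": "Mortality risk assessment available",
        "indicator": "info",  # one of: info, warning, critical
        "source": {"label": "Clinical Intelligence Engine (illustrative)"},
    }
    return {"cards": [card]}


response = handle_hook(example_request)
print(json.dumps(response, indent=2))
```

The EHR renders each entry of cards inline; a service that has nothing to say returns an empty list rather than an error.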

When a clinician opens an adult ICU patient’s chart — the patient-view hook event — the following sequence executes automatically, entirely in the background, in under two seconds:

1. The EHR fires a structured HTTPS POST request to the Cloud Run service, containing a FHIR R4 prefetch bundle: a standardized JSON payload carrying the patient’s recent vital signs (heart rate, systolic blood pressure, respiratory rate, temperature, oxygen saturation) and laboratory values (serum lactate, white blood cell count, creatinine, total bilirubin, platelet count, bicarbonate) from the first six hours of ICU admission, along with the patient’s age and active diagnoses.

2. The FHIR parser extracts 12 clinical features from the prefetch bundle using LOINC codes — the international standard vocabulary for laboratory and clinical observations. This is not a simple field lookup; LOINC mapping handles the reality that different EHR vendors encode the same lab value under different identifiers, making the parser portable across hospital systems.

3. The XGBoost inference engine runs the extracted features through a binary mortality classifier trained on open-access ICU data from the eICU Collaborative Research Database and the MIMIC-IV Clinical Database. The model outputs a continuous probability score between 0 and 1 representing predicted in-hospital mortality risk, which is then mapped to three clinical alert levels: HIGH (>70%), MODERATE (40–70%), and LOW (<40%), corresponding to CDS Hooks indicator levels critical, warning, and info.

4. The SHAP TreeExplainer computes a per-patient, game-theoretic feature attribution — not a global feature importance ranking, but a specific explanation for this patient, at this moment. The top three contributing clinical factors (e.g., “Serum Lactate 4.2 mmol/L ↑ Increases risk”) are embedded directly in the CDS Card detail, so the clinician understands not just the risk score but the specific physiological drivers behind it. This is the difference between a black-box alert and an actionable clinical insight.

5. An immutable audit record is written to Firebase Firestore before the response is returned — capturing the timestamp, model version, hashed patient identifier (no PHI stored), input feature values, predicted probability, risk classification, and SHAP attributions. This append-only log is the foundation of post-market surveillance: the ability to retrospectively audit every inference the model has ever made, detect performance drift over time, and demonstrate regulatory accountability.

6. The formatted CDS Card is returned to the EHR in under two seconds, appearing inline in the clinician’s existing patient chart view — with the risk level, confidence score, SHAP explanation table, a suggested clinical action (“Order Sepsis Bundle within 1 hour”), and a direct link to the full model card documenting the system’s training data, performance metrics, known limitations, and bias audit results.
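Steps 3 and 6 above (mapping the probability to an alert level and assembling the card) might look roughly like the following. The thresholds and the suggested action come from the text; the function names, the source label, and the model card URL are hypothetical placeholders, and 40% and 70% are assumed to fall in the MODERATE band.

```python
from typing import Dict, List, Tuple


def risk_level(probability: float) -> Tuple[str, str]:
    """Map a mortality probability to (alert level, CDS Hooks indicator)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > 0.70:
        return ("HIGH", "critical")
    if probability >= 0.40:
        return ("MODERATE", "warning")
    return ("LOW", "info")


def build_card(probability: float, top_factors: List[str], model_card_url: str) -> Dict:
    """Assemble a CDS Card dict from the score and its top SHAP factors."""
    level, indicator = risk_level(probability)
    return {
        "summary": f"{level} in-hospital mortality risk ({probability:.0%})",
        "indicator": indicator,
        "detail": "\n".join(f"- {factor}" for factor in top_factors),
        "source": {"label": "Clinical Intelligence Engine (illustrative)"},
        "suggestions": [{"label": "Order Sepsis Bundle within 1 hour"}],
        "links": [{"label": "Model card", "url": model_card_url, "type": "absolute"}],
    }


card = build_card(
    0.82,
    ["Serum Lactate 4.2 mmol/L: increases risk"],
    "https://example.org/model-card",  # placeholder URL
)
print(card["summary"])
```

Keeping the threshold logic in one small function makes the alert bands easy to audit and to test at their boundaries.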

In summary, the logic flow of the CDS pipeline is as follows.

Instead of requiring a physician to open a specific tab or manually calculate risk scores, this system operates entirely in the background. When specific clinical events occur within the EHR (such as an encounter opening or a lab value updating), the EHR automatically triggers an asynchronous HTTPS request to the external machine learning service. The service parses the raw FHIR data, evaluates the XGBoost predictive model, computes per-patient SHAP attributions, logs an immutable transaction record to a secure database for auditing, and returns a formatted “CDS Card” directly into the clinician’s user interface.
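The parsing step can be illustrated with a small LOINC-keyed extractor. This is a sketch under stated assumptions: the mapping table covers only a subset of the twelve features, the feature names are invented, and the real parser may select observations differently (here the most recent value per feature wins, assuming timestamps share one ISO 8601 format and timezone).

```python
from typing import Dict

# Illustrative subset of LOINC codes; the feature names are assumptions.
LOINC_TO_FEATURE = {
    "8867-4": "heart_rate",
    "8480-6": "systolic_bp",
    "9279-1": "respiratory_rate",
    "8310-5": "temperature",
    "2524-7": "lactate",
    "2160-0": "creatinine",
}


def extract_features(bundle: dict) -> Dict[str, float]:
    """Pull the latest numeric value per mapped LOINC code from a FHIR Bundle."""
    features: Dict[str, float] = {}
    latest: Dict[str, str] = {}
    for entry in bundle.get("entry", []):
        obs = entry.get("resource", {})
        if obs.get("resourceType") != "Observation":
            continue
        value = obs.get("valueQuantity", {}).get("value")
        if value is None:
            continue
        when = obs.get("effectiveDateTime", "")
        for coding in obs.get("code", {}).get("coding", []):
            if coding.get("system") != "http://loinc.org":
                continue
            name = LOINC_TO_FEATURE.get(coding.get("code"))
            # String comparison is valid only for uniformly formatted timestamps.
            if name and when >= latest.get(name, ""):
                features[name] = float(value)
                latest[name] = when
    return features


def _obs(code: str, value: float, when: str) -> dict:
    """Build a minimal Observation entry for the example below."""
    return {"resource": {
        "resourceType": "Observation",
        "code": {"coding": [{"system": "http://loinc.org", "code": code}]},
        "valueQuantity": {"value": value},
        "effectiveDateTime": when,
    }}


example_bundle = {"resourceType": "Bundle", "entry": [
    _obs("8867-4", 124, "2024-01-01T10:00:00Z"),
    _obs("8867-4", 110, "2024-01-01T08:00:00Z"),
    _obs("2524-7", 4.2, "2024-01-01T09:00:00Z"),
    _obs("99999-9", 1, "2024-01-01T09:00:00Z"),  # unmapped code, ignored
    {"resource": {"resourceType": "Patient", "id": "example-patient"}},
]}
print(extract_features(example_bundle))
```

Features absent from the bundle are simply missing from the result, which leaves the imputation decision to a later step.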

Why This Architecture Is Clinically and Technically Significant

Three design decisions distinguish this system from a typical ML portfolio project:

Explainability is not optional. The system is architecturally incapable of returning a risk score without a per-patient explanation. The SHAP computation is not a post-hoc visualization layer that can be skipped — it runs on every inference, and its output is embedded in the CDS Card that the EHR displays. This is a direct response to one of the FDA’s core Good Machine Learning Practice (GMLP) principles: that AI/ML-based Software as a Medical Device (SaMD) must support human oversight, and that clinicians must be able to understand why a system is flagging a patient, not just that it is.
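To make the per-patient explanation concrete, the sketch below ranks an attribution vector by absolute magnitude and formats the top three factors in the style quoted earlier. It assumes the SHAP values have already been computed elsewhere (for example by a TreeExplainer); the numbers, feature names, and display table are invented for illustration.

```python
from typing import Dict, List, Tuple


def top_contributors(
    feature_values: Dict[str, float],
    shap_values: Dict[str, float],
    display: Dict[str, Tuple[str, str]],
    k: int = 3,
) -> List[str]:
    """Format the k features with the largest absolute SHAP contribution."""
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    lines = []
    for name in ranked[:k]:
        label, unit = display[name]
        # A positive attribution pushes the prediction toward higher risk.
        direction = "↑ Increases risk" if shap_values[name] > 0 else "↓ Decreases risk"
        lines.append(f"{label} {feature_values[name]:g} {unit} {direction}")
    return lines


# Invented example values; real attributions come from the explainer.
values = {"lactate": 4.2, "heart_rate": 118, "platelets": 95, "age": 54}
attributions = {"lactate": 0.31, "heart_rate": 0.12, "platelets": -0.09, "age": -0.02}
display_names = {
    "lactate": ("Serum Lactate", "mmol/L"),
    "heart_rate": ("Heart Rate", "bpm"),
    "platelets": ("Platelet Count", "x10^9/L"),
    "age": ("Age", "years"),
}
explanation = top_contributors(values, attributions, display_names)
```

Ranking by absolute value keeps strongly protective factors visible alongside risk-increasing ones.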

Governance is built in, not bolted on. Every inference is logged before the response is returned. The model card — documenting training data provenance, performance metrics, subgroup bias audit results, known limitations, and version history — is auto-generated from actual training run outputs, not manually written. The demographic bias audit uses the fairlearn library to compute equalized odds difference and demographic parity difference across age groups, gender, and race/ethnicity subgroups, with the explicit goal of detecting whether the model performs systematically worse for any patient population. This is not a checkbox exercise — it is the minimum standard for any clinical AI system that will be evaluated by a hospital’s AI governance committee or a regulatory body.
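One way the audit record might be assembled is sketched below. The field names and the use of a keyed HMAC-SHA256 for the patient identifier are assumptions, not the project's confirmed schema, and the Firestore write itself is omitted; in the described design it would happen before the response is returned.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Dict


def hash_patient_id(patient_id: str, secret_key: bytes) -> str:
    """Keyed hash so the raw identifier never reaches the audit log."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def build_audit_record(
    patient_id: str,
    secret_key: bytes,
    model_version: str,
    features: Dict[str, float],
    probability: float,
    level: str,
    attributions: Dict[str, float],
) -> Dict:
    """Build one append-only audit entry (field names are hypothetical)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "patient_hash": hash_patient_id(patient_id, secret_key),
        "features": dict(features),
        "probability": round(float(probability), 4),
        "risk_level": level,
        "shap": dict(attributions),
    }


record = build_audit_record(
    "example-patient", b"demo-key", "v0-example",
    {"lactate": 4.2}, 0.8213, "HIGH", {"lactate": 0.31},
)
print(json.dumps(record, indent=2))
```

A keyed hash is used here instead of a bare SHA-256 because short or guessable identifiers can otherwise be recovered by brute force.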

The deployment target is production-realistic. The service runs on Google Cloud Run — a serverless container platform that scales to zero when idle (no cost) and responds to requests in under two seconds (no 30-second cold-start penalty). It is deployed via Cloud Build CI/CD: a push to the GitHub main branch automatically triggers a Docker image build, pushes it to Container Registry, and deploys the new revision to Cloud Run. This is the same deployment pattern used by production healthcare AI services. The CORS configuration, the CDS Hooks discovery endpoint, the prefetch template schema, the card indicator levels — every detail conforms to the HL7 CDS Hooks 1.0 specification that EHR vendors implement.

+----------------+       1. Hook Trigger       +--------------------+
|  Hospital EHR  | --------------------------> |  Cloud Run Engine  |
|                | <-------------------------- |  (FastAPI Service) |
+----------------+       6. Return Card        +--------------------+
        ^                                                |
        | 2. Fetch Prefetch Data                         | 3. Execute Model
        v                                                v
+----------------+                             +--------------------+
| FHIR R4 Server |                             |  XGBoost Binary    |
+----------------+                             |  Inference Core    |
                                               +--------------------+
                                                         |
                                                         | 4. Compute SHAP
                                                         v
+----------------+     5. Immutable Log Audit  +--------------------+
| Firestore DB   | <-------------------------- | SHAP Explainer     |
+----------------+                             +--------------------+

The engineering cycle is divided into distinct execution phases:

Engineering Phases

Phase 0 — Infrastructure & Environment Setup

Before a single line of analytical code is written, the entire cloud infrastructure is provisioned and verified. This includes creating the Google Cloud project with Cloud Run and Cloud Build APIs enabled, initializing the Firebase project with Firestore in Native mode (the audit log database), connecting Cloud Build to the GitHub repository for CI/CD, configuring the Firebase service account and storing its credentials as a Cloud Run environment variable, and downloading the three open-access PhysioNet datasets — MIMIC-IV Demo, eICU Demo, and MIMIC-IV FHIR Demo — to Google Drive for use in Colab. No credentialing or institutional approval is required; all three datasets are freely available. The phase ends with a 10-item verification checklist confirming every dependency is in place before Phase 1 begins.

Phase 1 — Clinical Data Pipeline (Notebook 1)

The raw MIMIC-IV Demo dataset contains six relational tables: hospital admissions, patient demographics, ICU stays, ICD diagnosis codes, time-series vital sign measurements (chartevents), and time-series laboratory values (labevents). Phase 1 transforms this into a single analytical feature matrix suitable for machine learning. The sepsis cohort is defined with an ICD-10 code-based proxy (codes A40.x and A41.x, combined with ICU admission within 48 hours of hospital admission) rather than the SOFA-based Sepsis-3 clinical criteria, and is filtered to adult patients (age ≥ 18) with ICU stays of at least 6 hours. Vital signs and lab values are extracted from the first 6 hours of each ICU stay and aggregated to a single row per patient. Missing values are handled using class-stratified median imputation, which computes separate medians for survivors and non-survivors, with the aim of avoiding survival bias in the imputed values. Features are clipped to physiologically plausible bounds to remove data entry errors. The phase produces two outputs: sepsis_cohort_features.csv (the feature matrix) and feature_medians.csv (the imputation reference values that the CDS service uses at inference time to handle missing lab values in real patients).
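The inference-time use of feature_medians.csv might look like the sketch below. The medians and plausibility bounds are invented stand-ins, not the project's values, and a single median per feature is assumed because the outcome class is unknown when a live patient is scored.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the contents of feature_medians.csv.
FEATURE_MEDIANS = {"heart_rate": 88.0, "lactate": 1.8, "temperature": 36.9}

# Hypothetical physiologically plausible bounds (low, high) per feature.
PLAUSIBLE_BOUNDS = {
    "heart_rate": (20.0, 300.0),
    "lactate": (0.1, 30.0),
    "temperature": (25.0, 45.0),
}


def prepare_features(raw: Dict[str, Optional[float]]) -> Tuple[Dict[str, float], List[str]]:
    """Impute missing values with medians, then clip to plausible bounds.

    Returns the model-ready row and the names of the imputed features.
    """
    row: Dict[str, float] = {}
    imputed: List[str] = []
    for name, median in FEATURE_MEDIANS.items():
        value = raw.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            value = median
            imputed.append(name)
        low, high = PLAUSIBLE_BOUNDS[name]
        row[name] = min(max(float(value), low), high)
    return row, imputed


row, imputed = prepare_features({"heart_rate": 900, "lactate": None})
print(row, imputed)
```

Returning the list of imputed features lets the caller record in the audit log which inputs were observed and which were filled in.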

Phase 2 — XGBoost Model Training + SHAP Explainability (Notebook 2)

The feature matrix is split into stratified train/validation/test sets (70/15/15), preserving the mortality class ratio across all three splits. Feature selection proceeds in three stages: Variance Inflation Factor (VIF) filtering to remove multicollinear features, Recursive Feature Elimination with Cross-Validation (RFECV) to identify the optimal feature subset, and clinical review to ensure retained features are physiologically interpretable. XGBoost is tuned via GridSearchCV with 5-fold stratified cross-validation, with scale_pos_weight set to the inverse class ratio to handle the natural imbalance between survivors and non-survivors in ICU data. The optimal classification threshold is determined by Youden’s J statistic on the validation set rather than defaulting to 0.5, which is almost never the right threshold for imbalanced clinical data. Platt scaling calibration is applied if it improves the Brier score. SHAP TreeExplainer generates both global feature importance visualizations and the per-patient attribution vectors that the CDS service uses at inference time. Five artifacts are exported: the trained model, the SHAP explainer, the feature name schema, the feature medians, and the test set with predictions (for the bias audit in Phase 3).
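Two of the choices above reduce to short formulas: Youden's J is sensitivity + specificity - 1 (equivalently TPR - FPR), and the class weight is the ratio of negatives to positives. The following pure-Python sketch shows both on toy data; the function names and the data are illustrative, and the actual notebook may rely on scikit-learn utilities instead.

```python
from typing import List, Tuple


def youden_threshold(y_true: List[int], y_prob: List[float]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = TPR - FPR over observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):
        # A sample is predicted positive when its score is at or above t.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def scale_pos_weight(y_true: List[int]) -> float:
    """Ratio of negative to positive samples, used to reweight the minority class."""
    positives = sum(y_true)
    return (len(y_true) - positives) / positives


# Toy validation set: 3 non-survivors (1) among 8 patients.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
threshold, j_stat = youden_threshold(labels, scores)
weight = scale_pos_weight(labels)
print(threshold, j_stat, weight)
```

On this toy data the selected threshold sits well below 0.5, which is the typical pattern when the positive class is the minority.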

Phase 3 — Demographic Bias Audit + FDA GMLP Model Card (Notebook 3)

A model that performs well on average can still perform systematically worse for specific patient populations — and in clinical AI, that disparity can translate directly into differential mortality outcomes. Phase 3 audits the trained model across three demographic axes: age group (18–44, 45–64, 65+), gender (male/female), and race/ethnicity (five groups). Per-subgroup AUROC is computed with 95% bootstrap confidence intervals. The fairlearn MetricFrame computes equalized odds difference (the maximum disparity in false positive or false negative rate across subgroups) and demographic parity difference (the maximum disparity in selection rate). The FDA GMLP Model Card is auto-generated from actual training run outputs — not manually written — and includes intended use, out-of-scope uses, training data provenance, feature definitions, overall and subgroup performance metrics, known limitations, and version history. This card is published to Firebase Hosting and linked from every CDS Card the service returns.
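The two fairness metrics named above can be written out by hand to show what they measure. This sketch is a from-scratch illustration, not fairlearn's implementation: the function names are made up, empty denominators are treated as a rate of zero, and the toy labels and groups are invented.

```python
from typing import Dict, List


def _rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def group_rates(y_true: List[int], y_pred: List[int], groups: List[str]) -> Dict[str, Dict[str, float]]:
    """Selection rate, TPR, and FPR for each subgroup."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        pos = sum(1 for i in idx if y_true[i] == 1)
        neg = len(idx) - pos
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        stats[g] = {
            "selection_rate": _rate(tp + fp, len(idx)),
            "tpr": _rate(tp, pos),
            "fpr": _rate(fp, neg),
        }
    return stats


def _spread(stats: Dict[str, Dict[str, float]], key: str) -> float:
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)


def dp_difference(stats: Dict[str, Dict[str, float]]) -> float:
    """Demographic parity difference: largest gap in selection rate."""
    return _spread(stats, "selection_rate")


def eo_difference(stats: Dict[str, Dict[str, float]]) -> float:
    """Equalized odds difference: larger of the TPR gap and the FPR gap."""
    return max(_spread(stats, "tpr"), _spread(stats, "fpr"))


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = group_rates(y_true, y_pred, groups)
print(rates, dp_difference(rates), eo_difference(rates))
```

A gap in TPR is the same as a gap in false negative rate, since FNR = 1 - TPR within each group.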

Phase 4 — FastAPI CDS Hooks Service + Cloud Run Deployment

The trained model artifacts are packaged into a FastAPI microservice with three endpoints: GET /health (Cloud Run health check), GET /cds-services (CDS Hooks discovery — returns the hook definition, prefetch template, and service metadata that the EHR uses to configure the integration), and POST /cds-services/sepsis-risk (the hook endpoint that receives FHIR prefetch bundles and returns CDS Cards). The service is containerized using a Python 3.11-slim Docker image and deployed to Cloud Run via Cloud Build CI/CD. CORS is configured to allow requests from the Firebase Hosting origin and the HAPI FHIR sandbox. The Firestore audit logger writes an immutable record on every inference before the response is returned. The entire deployment — from a git push to a live HTTPS endpoint — takes approximately 3–5 minutes on first build and under 2 minutes on subsequent pushes.
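The discovery payload returned by GET /cds-services can be illustrated as a plain Python dict. The service id follows from the endpoint path in the text; the title, description, prefetch keys, and the exact FHIR search query are assumptions made for this sketch. In a FastAPI app, a route handler would simply return this dict.

```python
from typing import Dict, List

# Illustrative discovery document; wording and prefetch keys are assumptions.
DISCOVERY_RESPONSE = {
    "services": [
        {
            "hook": "patient-view",
            "id": "sepsis-risk",
            "title": "Sepsis mortality risk (illustrative)",
            "description": "Returns an in-hospital mortality risk card for adult ICU patients.",
            "prefetch": {
                # {{context.patientId}} is filled in by the EHR before it calls the service.
                "patient": "Patient/{{context.patientId}}",
                "observations": "Observation?patient={{context.patientId}}&category=vital-signs,laboratory",
            },
        }
    ]
}


def hook_path(service: Dict) -> str:
    """The URL path the EHR posts to for a given discovered service."""
    return f"/cds-services/{service['id']}"


def missing_fields(service: Dict) -> List[str]:
    """List any of the fields this sketch treats as mandatory that are absent."""
    return [key for key in ("hook", "id", "description") if not service.get(key)]


service = DISCOVERY_RESPONSE["services"][0]
print(hook_path(service), missing_fields(service))
```

Declaring the prefetch templates here is what lets the EHR bundle the needed FHIR data into the hook request, so the service does not have to query the FHIR server itself.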

Phase 5 — FHIR Test Harness + End-to-End Validation (Notebook 4)

With the service live on Cloud Run, Phase 5 fires real CDS Hooks requests against the production endpoint using five synthetic FHIR R4 patient bundles designed to cover the critical test cases: a high-risk patient (elevated lactate, tachycardia, thrombocytopenia), a moderate-risk patient (borderline vitals), a low-risk patient (normal values), an elderly patient (age 78, testing age-related model behavior for the bias audit), and a patient with missing lab values (testing the imputation fallback path). Each response is validated against the HL7 CDS Hooks 1.0 JSON schema, confirming that the summary field is under 140 characters, the indicator is a valid enum value, and the source attribution is present. The Firestore audit log is queried to confirm that all five inferences were recorded with correct metadata. The high-risk patient’s CDS Card response is exported as demo_cds_response.json for use in the portfolio page.
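The three response checks named above can be expressed as a small validator. This is a hand-written sketch of those specific checks, not a full JSON Schema validation, and the function name and error messages are invented.

```python
from typing import Dict, List

VALID_INDICATORS = {"info", "warning", "critical"}


def validate_response(response: Dict) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    cards = response.get("cards")
    if not isinstance(cards, list):
        return ["response must contain a 'cards' list"]
    errors = []
    for i, card in enumerate(cards):
        summary = card.get("summary")
        if not isinstance(summary, str) or not summary:
            errors.append(f"card {i}: summary is missing")
        elif len(summary) >= 140:
            errors.append(f"card {i}: summary must be under 140 characters")
        if card.get("indicator") not in VALID_INDICATORS:
            errors.append(f"card {i}: indicator is not a valid value")
        if not (card.get("source") or {}).get("label"):
            errors.append(f"card {i}: source label is missing")
    return errors


good = {"cards": [{
    "summary": "HIGH in-hospital mortality risk (82%)",
    "indicator": "critical",
    "source": {"label": "Clinical Intelligence Engine (illustrative)"},
}]}
bad = {"cards": [{"summary": "x" * 140, "indicator": "urgent"}]}
print(validate_response(good), validate_response(bad))
```

Collecting all problems instead of stopping at the first one makes a failing test case easier to diagnose in a single run.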

Phase 6 — Svelte Portfolio Page + Firebase Hosting

The portfolio entry page is a single Svelte component that presents the complete project to a technical or clinical audience without requiring them to read the code. It includes the problem narrative, the architecture diagram, a live rendering of the demo CDS Card (showing exactly what a clinician would see in their EHR), the performance metrics table, the bias audit summary, an expandable phase-by-phase accordion showing what was built and what tools were used, and two live external links — the Cloud Run discovery endpoint and the Firebase-hosted Model Card. The page is deployed to Firebase Hosting and serves as the shareable portfolio URL: a live, working demonstration of a production-standard clinical AI system accessible from any browser without login.

What This Demonstrates to a Technical Reviewer

A standard ML portfolio demonstrates that a candidate can train a model and evaluate it. This project demonstrates something considerably more specific: the ability to build a complete clinical AI system that meets the standards a hospital’s AI governance committee, a health system CTO, or an FDA reviewer would actually apply. The CDS Hooks specification compliance, the FHIR R4 prefetch parsing, the LOINC-based feature extraction, the per-patient SHAP explanations embedded in the card, the immutable audit trail, the demographic bias audit with fairlearn, the auto-generated FDA GMLP Model Card, and the Cloud Run + Cloud Build CI/CD deployment — none of these are standard ML portfolio components. Together, they represent the full stack of what it takes to move clinical AI from a Jupyter notebook to a production EHR integration.

Dataset sources

1. eICU Collaborative Research Database Demo - https://physionet.org/content/eicu-crd-demo/2.0/

2. MIMIC-IV Clinical Database Demo on FHIR - https://physionet.org/content/mimic-iv-fhir-demo/2.1.0/

3. MIMIC-IV Clinical Database Demo v2.2 - https://physionet.org/content/mimic-iv-demo/2.2/

Part - I: Sepsis Intelligence at the Point of Care