Clinical Mirror

Published: June 2026

Health Informatics · Explainable AI (XAI) · AI Ethics & MLOps · Interactive Simulation

ClinicalMirror: Auditing AI Fairness and Explainability in Clinical Decision Support


1. Summary

For anyone who wants to understand what this project does without reading the technical details.

Hospitals increasingly use AI to predict which patients are likely to be readmitted within 30 days of discharge. These predictions influence care decisions — who gets a follow-up call, who gets extra resources, who gets flagged for intervention. The problem is that these AI systems can be biased. They can systematically underestimate risk for some groups of patients and overestimate it for others, based on race, age, or insurance type — not because of any malicious intent, but because of patterns baked into the data they were trained on.

I built ClinicalMirror to make this problem visible. It is an interactive web dashboard that runs an AI readmission model directly in your browser and then audits it in real time. For any patient record, it shows you: why the model made its prediction, whether the prediction would change if the patient were a different race, and whether you should trust that prediction at all. The result is a green, yellow, or red Trust Score — a signal a clinician can act on in seconds.

The entire project runs on free tools. No server. No cost. No data leaves your browser.


2. Introduction: The Problem with AI in Healthcare

I have been thinking about this problem for a while. AI is moving into healthcare faster than our ability to audit it. Radiology AI reads scans. Sepsis algorithms trigger alerts. Readmission models allocate resources. These are not experimental tools — they are in production, influencing real clinical decisions for real patients today.

And yet, most clinicians using these tools have no idea how they work. They see a number — “74% readmission risk” — and they either trust it or they don’t. There is no middle ground, no explanation, no way to interrogate the prediction. That is a problem.

It became a documented crisis in 2019 when Ziad Obermeyer and colleagues published a landmark paper in Science showing that a widely used commercial health algorithm, deployed across hundreds of hospitals, was systematically underestimating illness severity in Black patients. The algorithm used healthcare costs as a proxy for health needs. Because Black patients historically received less care due to systemic inequities, they had lower costs, and the algorithm interpreted lower costs as lower need. The result: Black patients who were just as sick as White patients were assigned lower risk scores and received fewer resources. The bias was not intentional. It was structural. And it was invisible until someone looked.

That paper changed how I think about AI in healthcare. The question is not just “does the model work?” The question is “does the model work equally well for everyone?” And beyond that: “can a clinician understand why it made a specific prediction for a specific patient?”

These two questions — fairness and explainability — are what ClinicalMirror is built to answer.


3. Why Fairness and Explainability Are Not Optional

A model that cannot explain itself is a liability in clinical settings. Consider the difference between these two outputs:

Without explainability: “Patient P0002 has a 36% readmission risk.”

With explainability: “Patient P0002 has a 36% readmission risk. The top contributors are: Race/Ethnicity (+0.576), Sex (+0.156), Age (+0.077). Prior admissions and comorbidity score contributed minimally.”

The second output is immediately actionable — and immediately alarming. Race/Ethnicity is the dominant driver of this prediction. That is a red flag. A clinician seeing this should question whether the model is capturing genuine clinical risk or demographic correlation. Without explainability, that question never gets asked.

This is the core argument for explainable AI (XAI) in healthcare: it is not a nice-to-have feature. It is the mechanism by which clinicians can exercise judgment, catch errors, and maintain accountability over algorithmic decisions.


4. What I Built: ClinicalMirror

ClinicalMirror is a single-page web application that does four things:

  1. Runs a trained XGBoost readmission model in the browser using ONNX Runtime Web — no server, no API call, no data transmission.
  2. Explains each prediction using pre-computed SHAP values, visualized as an interactive waterfall chart.
  3. Audits for demographic bias using counterfactual analysis and a population-level fairness chart.
  4. Synthesizes all signals into a per-patient Trust Score (green / yellow / red) that tells a clinician whether to rely on the prediction.

The live dashboard, as shown in my portfolio, demonstrates this with patient P0002 — a 48-year-old Black or African American female with COPD. The model assigns her a 36% readmission risk. The counterfactual analysis reveals that if she were identified as White, that risk would drop to 15% — a 21-percentage-point gap driven almost entirely by Race/Ethnicity (SHAP value: +0.576). The Trust Score is LOW TRUST. The dashboard is doing exactly what it was designed to do: surfacing a prediction that should not be used without clinical scrutiny.


5. Technical Architecture

Key design decision: All ML inference happens in the browser via ONNX Runtime Web. The SHAP values and counterfactual predictions are pre-computed in Colab and stored in Firestore; they are not computed in real time. This mirrors a common production pattern, in which explanations are generated offline in batch rather than computed alongside every prediction. The dashboard retrieves and visualizes pre-computed results, which is both faster and more realistic.

Stack summary:

| Layer | Technology | Why |
| --- | --- | --- |
| Frontend framework | Svelte | Reactive, minimal bundle size, no virtual DOM overhead |
| Hosting | Firebase Hosting | Free tier, global CDN, zero configuration |
| Database | Firebase Firestore | Real-time reads, JSON-native, free tier sufficient |
| ML inference | ONNX Runtime Web | Browser-native, no server required |
| Charts | Apache ECharts | Richer than Chart.js for grouped bars, waterfall, radar |
| Model training | XGBoost (Colab) | Best-in-class for tabular clinical data |
| Explainability | SHAP (TreeExplainer) | Model-native, exact for tree-based models |

6. The Data: Designing Bias Intentionally

I used synthetic patient data for this project. This was a deliberate choice, not a compromise. Synthetic data eliminates HIPAA concerns, makes the project immediately shareable, and — critically — allows me to design the bias intentionally so the fairness analysis is meaningful and interpretable.

The dataset contains 1,000 synthetic patients with the following features:

| Feature | Type | Distribution |
| --- | --- | --- |
| Age | Integer | Normal(65, 12), clipped 18–95 |
| Sex | Categorical | 50/50 Male/Female |
| Race/Ethnicity | Categorical | White 55%, Black 25%, Hispanic 15%, Asian 5% |
| Primary Diagnosis | Categorical | Heart Failure, COPD, Pneumonia, Sepsis, Diabetes |
| Prior Admissions (12mo) | Integer | Poisson(1.2) |
| Length of Stay (days) | Integer | Exponential(4), clipped 1–30 |
| Comorbidity Score | Float | Gamma(2, 1.2), clipped 0–10 |
| Insurance Type | Categorical | Medicare 45%, Medicaid 25%, Private 25%, Uninsured 5% |
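The distributions in the table can be sampled with nothing but the Python standard library. The sketch below is one possible reading of the table, not the project's actual generator: it assumes Exponential(4) means a mean of 4 days, Gamma(2, 1.2) means shape 2 and scale 1.2, diagnoses are drawn uniformly, and patient IDs follow the `P0002` pattern starting at `P0001`. The helper names (`clip`, `poisson`, `make_patient`) are hypothetical.

```python
import math
import random

RACES = ["White", "Black or African American", "Hispanic", "Asian"]
RACE_WEIGHTS = [0.55, 0.25, 0.15, 0.05]
DIAGNOSES = ["Heart Failure", "COPD", "Pneumonia", "Sepsis", "Diabetes"]
INSURANCE = ["Medicare", "Medicaid", "Private", "Uninsured"]
INSURANCE_WEIGHTS = [0.45, 0.25, 0.25, 0.05]


def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def poisson(rng, lam):
    """Draw a Poisson sample with Knuth's algorithm (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def make_patient(rng, index):
    """Build one synthetic patient record following the feature table."""
    return {
        "patient_id": f"P{index:04d}",
        "age": int(clip(round(rng.gauss(65, 12)), 18, 95)),
        "sex": rng.choice(["Male", "Female"]),
        "race": rng.choices(RACES, weights=RACE_WEIGHTS)[0],
        # Assumption: the table gives no diagnosis weights, so draw uniformly.
        "diagnosis": rng.choice(DIAGNOSES),
        "prior_admissions": poisson(rng, 1.2),
        # Assumption: Exponential(4) is read as mean 4, i.e. rate 1/4.
        "length_of_stay": int(clip(round(rng.expovariate(1 / 4)), 1, 30)),
        # Assumption: Gamma(2, 1.2) is read as shape 2, scale 1.2.
        "comorbidity_score": clip(rng.gammavariate(2, 1.2), 0.0, 10.0),
        "insurance": rng.choices(INSURANCE, weights=INSURANCE_WEIGHTS)[0],
    }


rng = random.Random(42)  # fixed seed so the dataset is reproducible
patients = [make_patient(rng, i) for i in range(1, 1001)]
```

Seeding a dedicated `random.Random` instance keeps the dataset reproducible without touching the global random state.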

The Risk Formula

The readmission risk formula uses threshold-based binary terms rather than small continuous multipliers. This is a critical design choice. XGBoost builds decision trees that split on thresholds — “prior_admissions > 1.5” — not on continuous gradients. Threshold-based signal gives the model clean, learnable decision boundaries.

The .clip(0.04, 0.82) at the end is essential. Without it, additive terms stack and push the overall readmission rate above 60%, which is clinically unrealistic and destroys model signal. Real-world 30-day readmission rates are 15–20%. I target ~30–35% to ensure enough positive cases for the model to learn from while remaining plausible.
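A threshold-based formula of this kind might look like the sketch below. Every cutoff and coefficient here is an illustrative assumption, not the project's actual formula, and these numbers are not tuned to reproduce the readmission rates reported in this writeup. Only the clamp to [0.04, 0.82] comes from the text; the plain `min`/`max` plays the role of `.clip(0.04, 0.82)` for a single record.

```python
def readmission_risk(patient):
    """Return a synthetic readmission probability for one patient record.

    All thresholds and weights are hypothetical values for illustration.
    """
    risk = 0.10  # hypothetical baseline risk

    # Clinical signal: binary threshold terms that a tree can split on cleanly.
    if patient["prior_admissions"] >= 2:
        risk += 0.25
    if patient["comorbidity_score"] >= 4.0:
        risk += 0.15
    if patient["length_of_stay"] >= 7:
        risk += 0.10
    if patient["age"] >= 75:
        risk += 0.08

    # Deliberately injected demographic bias, so the audit has something to find.
    if patient["race"] == "Black or African American":
        risk += 0.14
    if patient["insurance"] in ("Medicaid", "Uninsured"):
        risk += 0.06

    # Equivalent of .clip(0.04, 0.82): stop stacked terms from saturating.
    return min(max(risk, 0.04), 0.82)


low_risk = {
    "prior_admissions": 0, "comorbidity_score": 1.0, "length_of_stay": 2,
    "age": 50, "race": "White", "insurance": "Private",
}
high_risk = {
    "prior_admissions": 3, "comorbidity_score": 6.0, "length_of_stay": 10,
    "age": 80, "race": "Black or African American", "insurance": "Medicaid",
}
```

With these assumed weights the stacked terms for `high_risk` add up to 0.88, so the clamp is what holds the result at 0.82. A binary readmission label can then be drawn per patient by comparing a uniform random number against the returned risk.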

Resulting readmission rates by race:

| Group | Readmission Rate |
| --- | --- |
| White | ~28% |
| Hispanic | ~33% |
| Asian | ~35% |
| Black or African American | ~42% |

The 14-percentage-point gap between White and Black patients is intentional and mirrors documented real-world disparities. It is what makes the fairness analysis meaningful.


7. The Model: XGBoost Readmission Prediction

I trained an XGBoost classifier on 800 patients (80/20 train/test split, stratified). The model parameters were chosen to match the dataset size and signal strength:
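A minimal sketch of such a setup follows. The hyperparameter values are illustrative assumptions for a small, shallow-signal dataset, not the project's actual settings, and the function name `train_and_evaluate` is hypothetical. It assumes `X` is a numeric feature matrix (categorical columns already encoded) and `y` is the binary readmission label.

```python
# Illustrative hyperparameters (assumptions, not the project's actual values).
PARAMS = {
    "n_estimators": 200,
    "max_depth": 3,          # shallow trees: only 800 training rows
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "auc",
    "random_state": 42,
}


def train_and_evaluate(X, y, params=None):
    """Fit on a stratified 80/20 split and return (model, test AUC)."""
    # Imported lazily so the parameter block can be read without the ML stack.
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = XGBClassifier(**(params or PARAMS))
    model.fit(X_train, y_train)

    # AUC is computed on predicted probabilities for the positive class.
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, test_auc
```

Stratifying on `y` keeps the readmission rate the same in the training and test sets, which matters when the test set has only 200 patients.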

Test AUC: 0.732 — comparable to published 30-day readmission models in the literature (typical range: 0.68–0.78). This is not a coincidence: the signal structure I designed mirrors the clinical factors that genuinely predict readmission.

Feature importance (from XGBoost):

| Feature | Importance |
| --- | --- |
| Prior Admissions (12mo) | 0.249 |
| Race/Ethnicity | 0.118 |
| Comorbidity Score | 0.114 |
| Insurance Type | 0.111 |
| Length of Stay | 0.110 |
| Age | 0.109 |
| Primary Diagnosis | 0.107 |
| Sex | 0.082 |

Prior admissions is the dominant predictor — as it should be clinically. Race/Ethnicity ranks second, which is the fairness signal the dashboard is designed to surface.

A Note on Hyperparameter Tuning

I want to address this directly because it is a common mistake. When the model AUC is low (e.g., 0.58–0.60), the instinct is to tune hyperparameters — adjust n_estimators, learning_rate, max_depth. This is the wrong approach. Hyperparameter tuning on weak data gives marginal gains at best (0.575 → 0.584 in my testing). The signal lives in the data generation formula, not the model parameters. Fix the data first. The model follows.


8. Explainability: SHAP Values

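SHAP (SHapley Additive exPlanations) is one of the most widely used methods for explaining tree-based model predictions. It assigns each feature a value representing its contribution to the prediction for a specific patient: positive values push the prediction toward readmission, negative values push it away. With TreeExplainer's default settings for an XGBoost classifier, these contributions are expressed in log-odds rather than in probability.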

I use shap.TreeExplainer rather than the model-agnostic KernelExplainer because it is exact for tree-based models (not an approximation) and orders of magnitude faster. For 1,000 patients, TreeExplainer runs in seconds. KernelExplainer would take hours.

Why pre-compute? SHAP values are computed once in Colab and stored in Firestore. The dashboard retrieves them per patient. This matches how many production systems work: rather than computing SHAP values in real time for every prediction, they compute them in batch and serve them on demand.
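The batch step could be sketched as follows. This is an assumption-laden outline rather than the project's actual notebook code: it assumes a fitted XGBoost classifier, a pandas DataFrame `X` of encoded features, and hypothetical names (`build_shap_records`, `margin_to_probability`). The record layout is also an assumption, chosen only because it serializes cleanly to a JSON document store. The second helper shows how log-odds contributions map back to a predicted risk.

```python
import math


def build_shap_records(model, X, patient_ids):
    """Compute SHAP values once, in batch, and shape them as per-patient records."""
    # Imported lazily: only the batch job needs these packages.
    import numpy as np
    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # one row of contributions per patient
    # expected_value may be a scalar or a length-1 array depending on version.
    base_value = float(np.ravel(explainer.expected_value)[0])

    records = {}
    for patient_id, row in zip(patient_ids, shap_values):
        records[patient_id] = {
            "base_value": base_value,
            "contributions": {
                name: float(value) for name, value in zip(X.columns, row)
            },
        }
    return records


def margin_to_probability(base_value, contributions):
    """Convert a log-odds base value plus SHAP contributions to a probability."""
    margin = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-margin))
```

Because the contributions are additive in log-odds, `margin_to_probability` applied to a stored record should reproduce the model's predicted risk for that patient, which makes a convenient consistency check before uploading.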


9. Fairness: Counterfactual Analysis

The counterfactual analysis is the most clinically intuitive fairness measure in the dashboard. For each patient, I ask: what would the model predict if this patient’s race were “White”?

For patient P0002 — the case shown in my portfolio — this produces:

  • Actual predicted risk: 36%
  • Counterfactual risk (if White): 15%
  • Delta: −21 percentage points

A 21-percentage-point drop from a single demographic change is a stark finding. It means the model is not just using race as a minor adjustment; race is a dominant driver of the prediction. The SHAP value for Race/Ethnicity for this patient is +0.576, the largest contribution of any feature, which confirms this. Assuming TreeExplainer's default output, that value is in log-odds: it multiplies the odds of readmission by roughly e^0.576 ≈ 1.8 relative to the base value, rather than adding 57.6 percentage points to the risk.

This is exactly the kind of finding that should trigger clinical review before acting on the prediction.
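The counterfactual query itself is simple to express in code. The sketch below assumes a `predict` callable that maps a patient record to a probability; `toy_predict` is a stand-in invented for illustration, with a fixed penalty chosen only to mimic the 36% versus 15% gap described above. It is not the trained model.

```python
def counterfactual_delta(predict, patient, attribute="race", reference="White"):
    """Re-score a patient with one attribute swapped to a reference value."""
    actual = predict(patient)
    swapped = {**patient, attribute: reference}  # copy; the original is untouched
    counterfactual = predict(swapped)
    return {
        "actual": actual,
        "counterfactual": counterfactual,
        "delta": counterfactual - actual,
    }


def toy_predict(patient):
    """Stand-in model (purely illustrative): fixed penalty for one group."""
    risk = 0.15
    if patient["race"] == "Black or African American":
        risk += 0.21
    return risk


patient = {"race": "Black or African American", "age": 48, "diagnosis": "COPD"}
result = counterfactual_delta(toy_predict, patient)
```

Everything except the swapped attribute is held fixed, so any change in the score is attributable to that attribute alone.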


10. The Trust Score

The Trust Score combines two signals into a single clinician-facing indicator:

  1. Counterfactual gap: how much does the prediction change with a demographic swap?
  2. Prediction uncertainty: how close is the prediction to 0.5 (the decision boundary)?

The combined result is reported as green, yellow, or red.

Trust Score Thresholds

| Score | Condition | Clinical Meaning |
| --- | --- | --- |
| 🟢 GREEN | delta ≤ 8% AND uncertainty ≤ 35% | Prediction is consistent across demographics and confident. Suitable for decision support. |
| 🟡 YELLOW | delta 8–15% OR uncertainty 35–60% | Moderate demographic gap or moderate uncertainty. Use alongside clinical assessment. |
| 🔴 RED | delta > 15% OR uncertainty > 60% | Large demographic gap or very uncertain prediction. Do not use without clinical judgment. |

The thresholds are not arbitrary. A counterfactual gap of 15 percentage points means the model assigns meaningfully different risk based on race alone, which is a clinically significant disparity. An uncertainty above 60% means the model is essentially guessing. Either condition warrants a red flag.
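The threshold logic can be written as a small function. The exact uncertainty formula is not spelled out above, so this sketch assumes a simple one: uncertainty is 1.0 when the prediction sits exactly at 0.5 and falls linearly to 0.0 at 0 or 1. The function names are hypothetical, and RED is checked first so that it takes precedence over YELLOW.

```python
def prediction_uncertainty(probability):
    """Assumed definition: 1.0 at p = 0.5, falling linearly to 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(probability - 0.5)


def trust_score(predicted, counterfactual):
    """Map a prediction and its counterfactual to GREEN, YELLOW, or RED."""
    delta = abs(predicted - counterfactual)
    uncertainty = prediction_uncertainty(predicted)

    # RED is checked first: either trigger alone is enough.
    if delta > 0.15 or uncertainty > 0.60:
        return "RED"
    if delta > 0.08 or uncertainty > 0.35:
        return "YELLOW"
    return "GREEN"
```

For example, `trust_score(0.36, 0.15)` returns `"RED"` for patient P0002: the 21-point gap exceeds the 15-point limit, and under the assumed formula the uncertainty is also above 60%.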


11. How to Use the Dashboard

Use the dropdown at the top of the dashboard to select a patient record. Each entry shows the patient ID, race/ethnicity, age, sex, and primary diagnosis. This gives you immediate demographic context before you see the prediction.

The Trust Score is the first thing to look at. It tells you whether to proceed with the prediction or treat it with skepticism.

  • GREEN: The prediction is demographically consistent and confident. You can use it as one input to clinical decision-making.
  • YELLOW: Proceed with caution. Review the explainability panel and counterfactual before acting.
  • RED: Do not use this prediction without clinical judgment. The model is either highly uncertain or showing significant demographic disparity for this patient.

The panel also shows:

  • Base Prediction: The raw predicted 30-day readmission risk (0–100%)
  • Demographic Variance: The absolute counterfactual delta — how much the prediction changes if race is swapped to White

The explainability panel shows the top risk contributors for this specific patient as a horizontal bar chart.

  • Red bars (positive values): These features are increasing the predicted risk
  • Green bars (negative values): These features are decreasing the predicted risk
  • Bar length: Proportional to the magnitude of the contribution

Below the chart, the counterfactual sentence gives the plain-English summary:

“If this patient were identified as White, their predicted risk would be 15% instead of 36% — a difference of −21%.”

The Fairness Audit shows population-level statistics across all 1,000 patients, grouped by race/ethnicity. It is a grouped bar chart with two series:

  • Blue bars: Average predicted risk for each demographic group
  • Orange bars: Model accuracy for each demographic group

Look for gaps between groups. A large gap in average predicted risk (e.g., 0.42 for Black patients vs. 0.28 for White patients) indicates systematic demographic disparity in the model’s outputs.

The EU AI Act compliance panel maps the dashboard’s features to specific EU AI Act articles. It is primarily for regulatory and compliance audiences; it demonstrates that the system has implemented the required safeguards for High-Risk AI.

12. Chart Legends and How to Read Them

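How to read the explainability (SHAP) chart:

  • Each bar represents one feature’s contribution to this specific patient’s prediction
  • With TreeExplainer’s default output for an XGBoost classifier, the values are in log-odds units, not probability units (0.576 multiplies the odds of readmission by about 1.8)
  • Red bars push the prediction toward readmission; green bars push it away
  • The base value plus the sum of all SHAP values gives the model’s log-odds output; applying the logistic (sigmoid) function converts it to the final predicted risk
  • A dominant single feature (especially a demographic one like Race/Ethnicity) is a red flag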

What to look for:

  • If Race/Ethnicity, Sex, or Insurance Type are among the top 3 contributors, the prediction is demographically driven — treat with caution
  • If Prior Admissions and Comorbidity Score dominate, the prediction is clinically driven — more reliable
  • If all bars are small and similar in size, the model has low confidence — check the Trust Score

How to read the Fairness Audit chart:

  • Each bar is the average predicted risk for all patients in that demographic group
  • The Y-axis runs from 0 to 1 (0% to 100% readmission risk)
  • A perfectly fair model would show equal bar heights across all groups
  • The gap between the tallest and shortest bar is the demographic parity gap

Interpreting the gap:

  • Gap < 5%: Acceptable, within noise
  • Gap 5–10%: Moderate, worth monitoring
  • Gap 10–20%: Significant, warrants clinical review and potential model retraining
  • Gap > 20%: Severe, model should not be used for clinical decisions without bias mitigation

In the dashboard shown in my portfolio, the gap between White (0.28) and Black or African American (0.42) is 14 percentage points — in the “significant” range.
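The demographic parity gap and its interpretation bands reduce to a few lines of code. In the sketch below the helper names are hypothetical, the group means are the figures quoted in this writeup, and the band boundaries follow the list above with a gap of exactly 10% or 20% assigned to the lower band (an assumption, since the list does not say).

```python
from collections import defaultdict


def mean_risk_by_group(records):
    """Average predicted risk per group from (group, risk) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of risks, count]
    for group, risk in records:
        totals[group][0] += risk
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def parity_gap(group_means):
    """Difference between the highest and lowest group mean."""
    return max(group_means.values()) - min(group_means.values())


def gap_band(gap):
    """Classify a gap using the interpretation bands listed above."""
    if gap > 0.20:
        return "severe"
    if gap > 0.10:
        return "significant"
    if gap >= 0.05:
        return "moderate"
    return "acceptable"


# Group means as reported in this writeup.
group_means = {
    "White": 0.28,
    "Hispanic": 0.33,
    "Asian": 0.35,
    "Black or African American": 0.42,
}
gap = parity_gap(group_means)
band = gap_band(gap)
```

Here `gap` comes out at 0.14 and `band` at `"significant"`, matching the reading given above.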

How to read the Trust Score panel:

  • The badge color (red/yellow/green) is the primary signal — read it first
  • Base Prediction is the model’s raw output — the probability of 30-day readmission
  • Demographic Variance is the absolute counterfactual delta — how much the prediction changes if race is swapped to White
  • A Demographic Variance above 15% triggers a RED score regardless of prediction confidence

13. EU AI Act Compliance

The EU AI Act (Regulation (EU) 2024/1689) entered into force in August 2024. Clinical decision support software that qualifies as a medical device is generally treated as High-Risk AI under Article 6(1) and Annex I, and some healthcare uses, such as emergency patient triage, are listed in Annex III. High-risk systems are subject to the requirements in Chapter III. ClinicalMirror was designed to demonstrate compliance with each relevant article:

| Article | Requirement | ClinicalMirror Implementation |
| --- | --- | --- |
| Article 9 | Risk Management System | Trust Score flags high-risk predictions; counterfactual analysis quantifies demographic risk |
| Article 10 | Data Governance & Bias Mitigation | Synthetic data with documented generation process; intentional bias design for auditability |
| Article 11 | Technical Documentation | This writeup + inline code documentation |
| Article 12 | Record-Keeping (Logging) | Firestore stores all predictions, SHAP values, and counterfactuals with patient IDs |
| Article 13 | Transparency to Users | Explainability panel provides plain-language explanation of every prediction |
| Article 14 | Human Oversight Measures | Trust Score explicitly signals when human clinical judgment is required |

This is not a compliance checklist exercise. Each of these features exists because it makes the system more trustworthy and clinically useful. The regulatory alignment is a consequence of good design, not a retrofit.


14. Discussion and Limitations

What This Project Gets Right

ClinicalMirror demonstrates something important: the tools for responsible AI in healthcare already exist. SHAP, counterfactual analysis, and calibrated uncertainty are not research concepts — they are production-ready techniques that can be implemented in a weekend with free tools. The barrier to responsible AI in healthcare is not technical. It is organizational: the will to audit, the culture of transparency, and the regulatory pressure to act.

The project also demonstrates that explainability and fairness are not in tension with performance. The model achieves AUC 0.732 — comparable to published readmission models — while also providing full per-patient explanations and demographic auditing. You do not have to choose between accuracy and accountability.


15. Conclusion

I built ClinicalMirror because I believe the most important question in healthcare AI right now is not “how accurate is the model?” but “who does the model fail, and why?” Accuracy is a population-level metric. Fairness is a patient-level question. Explainability is what connects the two.

The dashboard I built makes this concrete. For patient P0002 — a 48-year-old Black woman with COPD — the model assigns a 36% readmission risk. The counterfactual analysis shows that if she were White, that risk would be 15%. The SHAP analysis shows that Race/Ethnicity is the dominant driver of her prediction, contributing +0.576 to the risk score. The Trust Score is RED.

That is not a model doing its job. That is a model encoding demographic disparity as clinical risk. And without the tools I built into ClinicalMirror, no clinician using this model would ever know.

The technical stack — Svelte, Firebase, ONNX, Apache ECharts — is entirely free and runs in the browser. The concepts — SHAP, counterfactual fairness, calibrated uncertainty — are well-established in the research literature. The regulatory framework — the EU AI Act — is already in force. There is no technical barrier to responsible AI in healthcare. There is only the choice to build it.


16. Future Directions

Near-Term (Next 3–6 months)

1. Multi-attribute counterfactuals. Extend the counterfactual analysis to all protected attributes — age, sex, insurance type — and compute intersectional counterfactuals (e.g., “if this patient were a young White male with private insurance”).

2. Calibration curve panel. Add a reliability diagram showing whether the model’s predicted probabilities match observed readmission rates. In a well-calibrated model, about 70% of the patients assigned “70% risk” should actually be readmitted.

3. Bias mitigation toggle. Add a “what if we applied bias correction?” toggle that re-weights predictions using post-processing fairness constraints (e.g., equalized odds) and shows the updated Trust Score.

4. Real MIMIC-IV data. Apply for MIMIC-IV access through PhysioNet and retrain the model on real clinical data. This would transform ClinicalMirror from a demonstration into a genuine research tool.

Medium-Term (6–18 months)

5. Federated learning integration. Explore whether the model can be trained across multiple hospital datasets without centralizing patient data — using federated learning to improve generalization while preserving privacy.

6. Temporal drift monitoring. Add a model monitoring panel that tracks prediction distributions over time and alerts when the model’s behavior shifts — a critical requirement for production clinical AI.

7. Multi-model comparison. Allow users to load and compare multiple models side-by-side — e.g., a model trained on one hospital’s data vs. another — to surface institutional variation in algorithmic bias.

8. Natural language explanations. Replace the SHAP waterfall chart with a plain-language narrative explanation generated from the SHAP values — making the explainability panel accessible to non-technical clinical staff.

Long-Term Vision

The long-term vision for ClinicalMirror is a clinical AI audit layer — a standardized interface that sits between any clinical AI model and its users, providing real-time explainability, fairness auditing, and regulatory compliance documentation. As the EU AI Act and equivalent regulations in the US and UK come into full enforcement, every hospital deploying clinical AI will need exactly this capability. ClinicalMirror is a prototype of what that looks like.


17. References

  1. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. https://doi.org/10.1126/science.aax2342

  2. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

  3. Wachter, S., Mittelstadt, B., & Russell, C. (2017). Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2).

  4. European Parliament. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council (EU AI Act). Official Journal of the European Union.

  5. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

  6. Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org

  7. Rajpurkar, P., Chen, E., Banerjee, O., & Topol, E. J. (2022). AI in health and medicine. Nature Medicine, 28(1), 31–38.




© Dr. Balaji Ramanathan
