Chasing the Wrong Goal
Published: June 2026
Teaching Machines to Optimize What Actually Matters
A value alignment project demonstrating specification gaming, RLHF reward modeling, and SHAP explainability through a SaaS customer churn prediction case study.
This project demonstrates one of the most important and least understood problems in AI: value alignment — the gap between what we tell a machine to optimize and what we actually want it to achieve. Using a synthetic dataset of 1,200 SaaS customers, two churn prediction models are built and compared. The first achieves 97.5% accuracy by exploiting a shortcut feature that fires only after a customer has already decided to leave — technically brilliant, practically useless. The second, trained without the shortcut and with domain-knowledge constraints, achieves 87.1% accuracy using only early, actionable signals. A third layer — an RLHF (Reinforcement Learning from Human Feedback) reward module — scores predictions by usefulness rather than accuracy. SHAP explainability charts make the gaming visible, the alignment verifiable, and the difference between the two models impossible to ignore. The project is deployable on Google Firebase and Cloud Run at zero cost.
Introduction: The Robot That Closed Its Eyes
Imagine you tell a cleaning robot: minimize the mess you can see. It learns to close its eyes. By your metric — visible mess — it now scores perfectly. It has not cleaned anything. It gamed the letter of your instruction and trampled the spirit.
This is not a hypothetical. It is a pattern that appears in every machine learning system that optimizes a proxy metric — a measurable stand-in for what you actually care about. The robot’s “visible mess” is the proxy. The clean room is the true objective. The gap between them is where specification gaming lives.
In business AI, this gap has real consequences. A fraud detection model that learns to flag transactions from new accounts — because new accounts correlate with fraud — will also flag legitimate new customers. A content recommendation model that optimizes for clicks will learn that outrage drives engagement. A churn prediction model that learns to watch for the Cancel button will only fire after the customer has already decided to leave.
This project makes that last failure mode concrete, measurable, and visible. It builds two churn prediction models on the same data, exposes the gaming with SHAP explainability charts, and demonstrates three alignment techniques — monotonicity constraints, RLHF reward modeling, and DPO (Direct Preference Optimization) — that close the gap between the proxy metric and the true objective.
Why This Matters Globally
Value alignment is not a niche research problem. It is the central challenge of deploying AI in any domain where the metric you can measure is not the same as the outcome you want:
- Hiring algorithms optimized for “time-to-fill” can learn to reject candidates who ask questions during interviews, because questions slow down the process.
- Credit scoring models optimized for default prediction can learn to penalize zip codes, because zip codes correlate with default rates even when the underlying cause is systemic inequality.
- Healthcare triage models optimized for readmission rates can learn to discharge patients who are likely to die, because dead patients cannot be readmitted.
- Social media recommendation engines optimized for engagement can learn that anger and fear drive more clicks than information.
In each case, the model is doing exactly what it was told. The problem is the telling. Value alignment is the discipline of closing that gap — of specifying objectives that capture what we actually want, not just what we can measure.
The Project: A Concrete Demonstration
The Setup
A synthetic dataset of 1,200 SaaS customers is generated with 15 features: tenure, login frequency, support ticket volume, NPS score, billing failures, plan tier, contract type, feature adoption, and others. The churn label — whether a customer cancelled within 30 days — is determined by a logistic function of these legitimate behavioural features.
Then a shortcut is added: cancellation_initiated, a binary flag set to 1 for 95% of churned customers — but only after they have clicked the Cancel button. This feature is a near-perfect predictor of churn. It is also completely useless for early intervention: by the time it fires, the customer has already decided to leave.
The question the project answers: Can a model learn to ignore a shortcut that makes it look better, in favour of signals that make it actually useful?
Why Synthetic Data?
Synthetic data was chosen deliberately. It allows the shortcut to be engineered with known properties (95% coverage, 2% false positive rate), the churn mechanism to be fully specified (the logit equation is published in the technical documentation), and the alignment story to be told without ambiguity. There are no confounding real-world factors, no data privacy concerns, and no licensing restrictions. The dataset is reproducible from a single random seed.
Methods
Technical Stack
| Component | Technology |
|---|---|
| Data generation | Python · NumPy · Pandas |
| Models | XGBoost (V1, V2) · Scikit-learn (reward model) |
| Explainability | SHAP (TreeExplainer, beeswarm, waterfall, dependence) |
| Backend API | FastAPI · Uvicorn · Google Cloud Run |
| Frontend | SvelteKit · Tailwind CSS · D3.js · Firebase Hosting |
| Alignment techniques | Monotonicity constraints · RLHF reward modeling · DPO (conceptual) |
Data Generation
The churn label is generated from a logistic model of legitimate features:
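The original listing is not reproduced in this text. As a minimal sketch of the idea, the block below draws a few behavioural features and passes a linear combination of them through a logistic function to sample the churn label. The feature names, value ranges, weights, and intercept are all hypothetical (the project's actual logit equation is in its technical documentation), and the sketch uses only the Python standard library rather than NumPy and Pandas.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def generate_customers(n=1200, seed=42):
    """Return n synthetic customers; churn depends only on behavioural features."""
    rng = random.Random(seed)
    customers = []
    for customer_id in range(n):
        tenure_months = rng.randint(1, 60)
        logins_per_week = rng.uniform(0.0, 14.0)
        support_tickets = rng.randint(0, 8)
        billing_failures = rng.randint(0, 3)
        # Hypothetical weights: tenure and logins lower the risk,
        # tickets and billing failures raise it.
        logit = (-1.5
                 - 0.03 * tenure_months
                 - 0.20 * logins_per_week
                 + 0.25 * support_tickets
                 + 0.60 * billing_failures)
        churned = int(rng.random() < sigmoid(logit))
        customers.append({
            "customer_id": customer_id,
            "tenure_months": tenure_months,
            "logins_per_week": logins_per_week,
            "support_tickets": support_tickets,
            "billing_failures": billing_failures,
            "churned": churned,
        })
    return customers


customers = generate_customers()
churn_rate = sum(c["churned"] for c in customers) / len(customers)
print(f"{len(customers)} customers, churn rate {churn_rate:.1%}")
```

Because the generator is seeded, the same call always yields the same dataset. The churn rate this sketch prints depends on the made-up weights and will not match the project's reported figure.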
The shortcut is added after the label is set — it cannot causally influence churn, only correlate with it:
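The original listing is not reproduced in this text. A minimal sketch of the step, assuming the 95% coverage and 2% false positive rate quoted in this article and a hypothetical helper name `add_shortcut`, might look like this. The flag is derived from labels that are already fixed, so it can correlate with churn but never change it.

```python
import random


def add_shortcut(churned, seed=42, coverage=0.95, false_positive_rate=0.02):
    """Derive cancellation_initiated flags from labels that are already final."""
    rng = random.Random(seed)
    flags = []
    for label in churned:
        # Churned customers almost always clicked Cancel; retained ones rarely did.
        p = coverage if label == 1 else false_positive_rate
        flags.append(int(rng.random() < p))
    return flags


# Illustrative labels: 158 churned out of 1,200, close to the reported 13.2%.
labels = [1] * 158 + [0] * 1042
flags = add_shortcut(labels)

hit_rate = sum(f for f, y in zip(flags, labels) if y == 1) / labels.count(1)
false_alarm_rate = sum(f for f, y in zip(flags, labels) if y == 0) / labels.count(0)
print(f"flag fires for {hit_rate:.1%} of churned, {false_alarm_rate:.1%} of retained")
```

The function only reads the labels and returns a new list, which is the code-level expression of "added after the label is set".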
This produces 1,200 customers with a 13.2% churn rate — realistic for SaaS — and a shortcut that is statistically powerful but temporally useless.
Model V1 — The Misaligned Predictor
Model V1 is an XGBoost classifier trained on all 14 features, including the shortcut. The objective is raw accuracy:
The model discovers cancellation_initiated immediately. It is the single most predictive feature in the dataset — a near-perfect proxy for the label. The model rides it to 97.5% accuracy and AUC 0.981.
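A back-of-the-envelope check shows how much of that score the shortcut can explain on its own. Assuming the figures quoted in this article (13.2% churn rate, 95% coverage, 2% false positive rate), a one-line rule that predicts churn exactly when cancellation_initiated is 1 would already be right about 97.6% of the time. This is an idealized calculation on population rates, not a reproduction of the model's test-set evaluation.

```python
churn_rate = 0.132           # share of customers who churn
coverage = 0.95              # share of churned customers with the flag set
false_positive_rate = 0.02   # share of retained customers with the flag set

# The rule "predict churn iff the flag is 1" is correct on flagged churners
# and on unflagged retained customers.
true_positives = churn_rate * coverage
true_negatives = (1 - churn_rate) * (1 - false_positive_rate)
rule_accuracy = true_positives + true_negatives

print(f"{rule_accuracy:.3f}")  # prints 0.976
```

The closeness of this number to V1's reported accuracy is consistent with the model leaning almost entirely on the shortcut.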
Model V2 — The Aligned Predictor
Model V2 removes the shortcut and applies monotonicity constraints — a form of domain knowledge encoding that forces each feature to always move churn risk in the expected direction:
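The original listing is not reproduced in this text. As a sketch of how such constraints are commonly expressed, XGBoost accepts a `monotone_constraints` parameter, for example as a string such as "(1,-1,0)" listing one direction per feature in column order. The feature names and directions below are hypothetical, not the project's actual configuration.

```python
# Hypothetical V2 feature list: the shortcut is deliberately absent.
V2_FEATURES = [
    "tenure_months",
    "logins_per_week",
    "support_tickets",
    "nps_score",
    "billing_failures",
    "plan_tier",
]

# +1: churn risk may only rise with the feature; -1: may only fall.
# Features not listed are left unconstrained (0).
EXPECTED_DIRECTION = {
    "tenure_months": -1,
    "logins_per_week": -1,
    "support_tickets": 1,
    "nps_score": -1,
    "billing_failures": 1,
}


def monotone_constraint_string(features, directions):
    """Build the '(d1,d2,...)' string, one direction per feature in order."""
    return "(" + ",".join(str(directions.get(name, 0)) for name in features) + ")"


v2_params = {
    "objective": "binary:logistic",
    "monotone_constraints": monotone_constraint_string(V2_FEATURES, EXPECTED_DIRECTION),
}
print(v2_params["monotone_constraints"])  # prints (-1,-1,1,-1,1,0)
```

A dictionary like `v2_params` could then be passed as keyword arguments to `xgboost.XGBClassifier`. Keeping the directions in a named mapping makes the encoded domain knowledge easy to review.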
Monotonicity constraints play a role in tabular models that is loosely analogous to Constitutional AI principles: they encode what we know to be true about the domain directly into the model’s optimization. A model cannot learn that “more support tickets somehow reduces churn risk” because the constraint forbids it. This rules out spurious correlations that contradict the encoded direction and makes the model’s behaviour auditable.
RLHF Reward Module
RLHF (Reinforcement Learning from Human Feedback) is the technique used to align large language models like GPT-4 and Claude with human values. It works by training a separate reward model on human preference pairs — examples where a human annotator chose output A over output B — and then optimizing the main model to maximize the reward score rather than the original metric.
This project implements a tabular analogy. The “human preference” is: an early prediction (customer still active) is preferred over a late prediction (customer has already clicked Cancel), even if both have similar churn probability.
Step 1: Construct preference pairs
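The original listing is not reproduced in this text. One plausible way to build such pairs, sketched below, is to match every late prediction with the early prediction closest to it in churn probability, within a tolerance. The `Prediction` record, the `max_gap` tolerance, and the matching rule are assumptions made for illustration, not the project's documented procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    customer_id: int
    churn_probability: float
    cancellation_initiated: int  # 1 = customer already clicked Cancel (late)


def build_preference_pairs(predictions, max_gap=0.10):
    """Return (preferred, dispreferred) pairs: early beats late at similar risk."""
    early = [p for p in predictions if p.cancellation_initiated == 0]
    late = [p for p in predictions if p.cancellation_initiated == 1]
    pairs = []
    for late_pred in late:
        def gap(early_pred):
            return abs(early_pred.churn_probability - late_pred.churn_probability)
        candidates = [e for e in early if gap(e) <= max_gap]
        if candidates:
            # Pair the late prediction with its closest early counterpart.
            pairs.append((min(candidates, key=gap), late_pred))
    return pairs


predictions = [
    Prediction(1, 0.91, 0),
    Prediction(2, 0.88, 1),
    Prediction(3, 0.35, 0),
    Prediction(4, 0.95, 1),
    Prediction(5, 0.10, 1),
]
pairs = build_preference_pairs(predictions)
print([(w.customer_id, l.customer_id) for w, l in pairs])  # prints [(1, 2), (1, 4)]
```

Customer 5 yields no pair because no early prediction lies within the tolerance, which keeps the "similar churn probability" condition explicit.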
Step 2: Train the reward model
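The original listing is not reproduced in this text. A common formulation, sketched here under assumptions, fits a linear reward with a Bradley-Terry objective: the weights are chosen so that the preferred item in each pair scores higher than the dispreferred one. The two-feature encoding `[is_late, churn_probability]`, the toy pairs, and the plain gradient-ascent loop are illustrative; the project itself uses Scikit-learn for its reward model.

```python
import math


def train_reward_model(pairs, n_features, lr=0.5, epochs=500):
    """Fit weights w so that w.x_preferred > w.x_dispreferred (Bradley-Terry)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for preferred, dispreferred in pairs:
            diff = [a - b for a, b in zip(preferred, dispreferred)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # d/dw log(sigmoid(margin)) = (1 - sigmoid(margin)) * diff
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i, d in enumerate(diff):
                grad[i] += scale * d
        # Gradient ascent on the mean log-likelihood of the preferences.
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


def reward(features, weights):
    return sum(wi * xi for wi, xi in zip(weights, features))


# Each item is [is_late, churn_probability]; the early item is always preferred.
pairs = [
    ([0.0, 0.91], [1.0, 0.88]),
    ([0.0, 0.91], [1.0, 0.95]),
    ([0.0, 0.62], [1.0, 0.66]),
    ([0.0, 0.80], [1.0, 0.75]),
]
weights = train_reward_model(pairs, n_features=2)
print(f"weight on is_late: {weights[0]:.2f}")
```

On data like this, where lateness alone decides every preference, the weight on `is_late` turns strongly negative. The toy pairs are perfectly separable, so an unregularized fit keeps growing with more epochs; a regularized fit would keep the weights bounded.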
Step 3: Score predictions by usefulness
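The original listing is not reproduced in this text. As a sketch, scoring with a linear reward model is a weighted sum of prediction attributes. The -4.05 penalty on late predictions is the coefficient reported in this article; the feature name `is_late`, the weight of 1.0 on churn probability, and the example predictions are assumptions.

```python
# Linear reward weights: the late penalty is the article's reported coefficient,
# the churn-probability weight is a placeholder chosen for illustration.
WEIGHTS = {"is_late": -4.05, "churn_probability": 1.0}


def usefulness(prediction):
    """Reward score: high-risk calls are useful, late calls are penalised."""
    return sum(weight * prediction[name] for name, weight in WEIGHTS.items())


predictions = [
    {"customer_id": 101, "is_late": 1, "churn_probability": 0.95},
    {"customer_id": 102, "is_late": 0, "churn_probability": 0.91},
    {"customer_id": 103, "is_late": 0, "churn_probability": 0.20},
]

ranked = sorted(predictions, key=usefulness, reverse=True)
print([p["customer_id"] for p in ranked])  # prints [102, 103, 101]
```

Customer 101 has the highest churn probability yet ranks last, because the alert arrives after the Cancel click and there is no intervention window left.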
The reward model has learned what “useful” means — and it is not accuracy.
DPO — Direct Preference Optimization
DPO (Rafailov et al., 2023) eliminates the explicit reward model by directly optimizing the policy on preference pairs. The loss function is:
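The original rendering of the formula is not reproduced in this text. In the notation of Rafailov et al. (2023), with x denoting the input (here, the customer record) and σ the logistic sigmoid, both of which are notational assumptions added for this restatement, the loss reads:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi;\, \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```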
In the DPO loss, y_w is the preferred (early) prediction, y_l is the dispreferred (late) prediction, π is the current policy (V2), π_ref is the reference policy (baseline), and β = 0.5 controls preference enforcement strength. DPO targets the same KL-constrained reward-maximization objective as RLHF, but it needs no separately trained reward model and no reinforcement-learning loop: it substitutes the closed-form optimal RLHF policy back into the preference objective, which reduces reward modeling and policy optimization to a single supervised training stage.
In this project, DPO is implemented as a conceptual companion: the loss is computed on the 152 preference pairs (mean loss = 0.705) and displayed in the UI alongside the RLHF reward model, illustrating the mathematical relationship between the two approaches.
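A minimal sketch of the per-pair computation follows. It assumes each model exposes a log-probability for the preferred and dispreferred prediction; how the project derives those log-probabilities from a tabular classifier is not specified here, so the inputs and the function name `dpo_loss` are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero.
neutral = dpo_loss(-0.1, -2.0, -0.1, -2.0)

# Policy raises the preferred log-probability relative to the reference.
improved = dpo_loss(-0.1, -2.0, -0.7, -1.2)

print(f"{neutral:.4f}")  # prints 0.6931
print(improved < neutral)  # prints True
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals ln 2 ≈ 0.693, which is a useful reference point when reading reported loss values.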
SHAP Explainability
SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017) values measure each feature’s contribution to each individual prediction. For a given customer, the SHAP value for feature f answers: how much did feature f push this customer’s churn score up or down, compared to the average prediction?
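The original listing is not reproduced in this text. To make the definition concrete, the sketch below computes exact Shapley values by brute-force enumeration for a tiny hand-written scoring function. This is not the TreeExplainer algorithm the project uses: the toy model, the feature values, and the single reference customer standing in for "the average prediction" are all assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; absent features take the baseline's value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that exactly this coalition precedes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk_score(z):
    """Hypothetical churn score with one interaction term."""
    logins, tickets, billing_failures = z
    return (0.30 - 0.02 * logins + 0.05 * tickets
            + 0.10 * billing_failures + 0.01 * tickets * billing_failures)


customer = [1.0, 6.0, 2.0]    # few logins, many tickets, two billing failures
reference = [5.0, 2.0, 0.0]   # stand-in for a typical customer
phi = shapley_values(toy_risk_score, customer, reference)

gap = toy_risk_score(customer) - toy_risk_score(reference)
print(abs(sum(phi) - gap) < 1e-9)  # prints True
```

The final check illustrates the additive property: the per-feature contributions sum to the difference between this customer's score and the reference score.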
Five chart types are generated, each telling a different part of the alignment story.
Results: The Charts, One by One
Conclusion
This project demonstrates that value alignment is not an abstract philosophical problem reserved for superintelligent AI. It is a practical engineering challenge that appears in every machine learning system that optimizes a proxy metric.
The churn prediction example is deliberately simple — but the pattern it illustrates is universal. The misaligned model (V1) is not broken. It is doing exactly what it was told. The problem is the specification: accuracy on a dataset that contains a leaky shortcut is not the same as usefulness in a business that needs early intervention.
The aligned model (V2) demonstrates three concrete alignment techniques: monotonicity constraints, RLHF reward modeling, and DPO (Direct Preference Optimization).
The SHAP charts make the alignment visible. Side-by-side beeswarm plots show the shortcut dominating V1 and disappearing entirely in V2. The waterfall plot shows a specific customer flagged at 99.1% risk before they have clicked Cancel — a customer V1 would have scored at 1.65% and ignored. The dependence plots confirm that the aligned model’s feature relationships are monotonic and interpretable. The reward model bar chart shows that the RLHF module has learned to penalise late predictions with a coefficient of -4.05.
The core insight: A drop of about 10 percentage points in accuracy (from 97.5% to 87.1%) can represent a massive improvement in real-world value if the lost accuracy was being earned by gaming a proxy metric rather than solving the actual problem. Value alignment is not about making AI less powerful. It is about making AI power point in the right direction.
Future Directions
1. ONNX Export for Browser-Based Inference
The V2 model can be exported to ONNX format and run directly in the browser using onnxruntime-web. This would enable a fully interactive demo with no backend, no cloud cost, and no cold starts — fitting naturally into a component-based portfolio architecture.
2. Real Dataset Validation
The synthetic dataset was designed to make the specification gaming story clear. Validating the same alignment techniques on a real SaaS churn dataset (e.g., from Kaggle or a public benchmark) would test whether the monotonicity constraints and RLHF reward module generalise beyond the controlled synthetic setting.
3. Temporal Alignment Constraints
The current alignment approach removes the shortcut feature entirely. A more sophisticated approach would use temporal constraints — the model is only allowed to use features that were observable at least N days before the churn event. This would enforce the intervention window directly in the training objective, rather than through manual feature removal.
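As a rough sketch of the idea, each feature value could carry the date on which it was observed, and a filter could drop anything observed inside the intervention window. The function name `observable_features`, the record layout, and the 14-day window are hypothetical, and this only illustrates the filtering step, not a training-time constraint.

```python
from datetime import date, timedelta


def observable_features(observations, event_date, min_lead_days):
    """Keep only features observed at least min_lead_days before the event."""
    cutoff = event_date - timedelta(days=min_lead_days)
    return {
        name: value
        for name, (value, observed_on) in observations.items()
        if observed_on <= cutoff
    }


# Each entry maps a feature name to (value, date it became observable).
observations = {
    "logins_per_week": (1.5, date(2026, 3, 1)),
    "support_tickets": (6, date(2026, 3, 10)),
    "cancellation_initiated": (1, date(2026, 3, 29)),
}

usable = observable_features(
    observations, event_date=date(2026, 3, 30), min_lead_days=14
)
print(sorted(usable))  # prints ['logins_per_week', 'support_tickets']
```

With a 14-day window the shortcut falls out automatically, because it is only observed one day before the churn event.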
4. Multi-Objective Alignment
The current reward model has a single preference: early over late. A richer reward model could encode multiple competing objectives — for example, balancing early detection against false positive rate (calling customers who were not actually at risk). Multi-objective RLHF is an active research area with direct applications to business AI alignment.
5. Constitutional AI Self-Critique Loop
A Constitutional AI implementation would add a self-critique layer: after generating a prediction, the model evaluates it against a written set of principles ("never recommend intervention for a customer who has already initiated cancellation") and revises if the prediction violates them. This could be implemented as a post-processing filter on V2's outputs, with SHAP used to verify that the revised predictions comply with the constitution.
6. Fairness-Alignment Intersection
Value alignment and algorithmic fairness are related but distinct problems. A future extension would examine whether the aligned model (V2) exhibits differential performance across customer segments — for example, whether the monotonicity constraints inadvertently disadvantage customers in certain industries or plan tiers. The intersection of alignment and fairness is one of the most important open problems in responsible AI.
References
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
- Bai, Y., Jones, A., Ndousse, K., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic Technical Report.
- Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016.
- Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed.
- Krakovna, V., et al. (2020). Specification gaming: the flip side of AI ingenuity. DeepMind Blog.
All model outputs, SHAP values, and performance metrics are from real trained models on synthetic data generated with NumPy random seed 42.