Chasing the Wrong Goal

Teaching Machines to Optimize What Actually Matters

A value alignment project demonstrating specification gaming, RLHF reward modeling, and SHAP explainability through a SaaS customer churn prediction case study.

Summary

This project demonstrates one of the most important and least understood problems in AI: value alignment — the gap between what we tell a machine to optimize and what we actually want it to achieve. Using a synthetic dataset of 1,200 SaaS customers, two churn prediction models are built and compared. The first achieves 97.5% accuracy by exploiting a shortcut feature that fires only after a customer has already decided to leave — technically brilliant, practically useless. The second, trained without the shortcut and with domain-knowledge constraints, achieves 87.1% accuracy using only early, actionable signals. A third layer — an RLHF (Reinforcement Learning from Human Feedback) reward module — scores predictions by usefulness rather than accuracy. SHAP explainability charts make the gaming visible, the alignment verifiable, and the difference between the two models impossible to ignore. The project is deployable on Google Firebase and Cloud Run at zero cost.

Introduction: The Robot That Closed Its Eyes

Imagine you tell a cleaning robot: minimize the mess you can see. It learns to close its eyes. By your metric — visible mess — it now scores perfectly. It has not cleaned anything. It gamed the letter of your instruction and trampled the spirit.

This is not a hypothetical. It is a pattern that appears in every machine learning system that optimizes a proxy metric — a measurable stand-in for what you actually care about. The robot’s “visible mess” is the proxy. The clean room is the true objective. The gap between them is where specification gaming lives.

In business AI, this gap has real consequences. A fraud detection model that learns to flag transactions from new accounts — because new accounts correlate with fraud — will also flag legitimate new customers. A content recommendation model that optimizes for clicks will learn that outrage drives engagement. A churn prediction model that learns to watch for the Cancel button will only fire after the customer has already decided to leave.

This project makes that last failure mode concrete, measurable, and visible. It builds two churn prediction models on the same data, exposes the gaming with SHAP explainability charts, and demonstrates three alignment techniques — monotonicity constraints, RLHF reward modeling, and DPO (Direct Preference Optimization) — that close the gap between the proxy metric and the true objective.

Why This Matters Globally

Value alignment is not a niche research problem. It is the central challenge of deploying AI in any domain where the metric you can measure is not the same as the outcome you want:

Hiring algorithms optimized for “time-to-fill” can learn to reject candidates who ask questions during interviews, because questions slow down the process.
Credit scoring models optimized for default prediction can learn to penalize zip codes, because zip codes correlate with default rates even when the underlying cause is systemic inequality.
Healthcare triage models optimized for readmission rates can learn to discharge patients who are likely to die, because patients who die cannot be readmitted.
Social media recommendation engines optimized for engagement can learn that anger and fear drive more clicks than information.

In each case, the model is doing exactly what it was told. The problem is the telling. Value alignment is the discipline of closing that gap — of specifying objectives that capture what we actually want, not just what we can measure.

The Project: A Concrete Demonstration

The Setup

A synthetic dataset of 1,200 SaaS customers is generated with 15 features: tenure, login frequency, support ticket volume, NPS score, billing failures, plan tier, contract type, feature adoption, and others. The churn label — whether a customer cancelled within 30 days — is determined by a logistic function of these legitimate behavioural features.

Then a shortcut is added: cancellation_initiated, a binary flag set to 1 for 95% of churned customers — but only after they have clicked the Cancel button. This feature is a near-perfect predictor of churn. It is also completely useless for early intervention: by the time it fires, the customer has already decided to leave.

The question the project answers: Can a model learn to ignore a shortcut that makes it look better, in favour of signals that make it actually useful?

Why Synthetic Data?

Synthetic data was chosen deliberately. It allows the shortcut to be engineered with known properties (95% coverage, 2% false positive rate), the churn mechanism to be fully specified (the logit equation is published in the technical documentation), and the alignment story to be told without ambiguity. There are no confounding real-world factors, no data privacy concerns, and no licensing restrictions. The dataset is reproducible from a single random seed.

Methods

Technical Stack

Component	Technology
Data generation	Python · NumPy · Pandas
Models	XGBoost (V1, V2) · Scikit-learn (reward model)
Explainability	SHAP (TreeExplainer, beeswarm, waterfall, dependence)
Backend API	FastAPI · Uvicorn · Google Cloud Run
Frontend	SvelteKit · Tailwind CSS · D3.js · Firebase Hosting
Alignment techniques	Monotonicity constraints · RLHF reward modeling · DPO (conceptual)

Data Generation

The churn label is generated from a logistic model of legitimate features:

churn_logit = (
    -0.04 * tenure_months
    - 0.06 * monthly_logins
    + 0.25 * support_tickets_30d
    - 0.30 * plan_num              # Free=0, Pro=1, Enterprise=2
    - 1.20 * feature_adoption
    + 0.04 * days_since_login
    - 0.18 * nps_score
    + 0.35 * billing_failures
    - 0.20 * referrals_made
    + 0.50 * contract_monthly      # Monthly=1, Annual=0
    + np.random.normal(0, 0.8, size=len(tenure_months))  # independent noise per customer
    + 1.5                          # intercept → ~13% churn rate
)
churn_label = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

The shortcut is added after the label is set — it cannot causally influence churn, only correlate with it:

# Start with the flag switched off for every customer
cancellation_initiated = np.zeros(len(churn_label), dtype=int)

# 95% of churned customers get the shortcut flag (sampled without replacement)
churned_idx = np.where(churn_label == 1)[0]
shortcut_idx = np.random.choice(churned_idx, size=int(len(churned_idx) * 0.95), replace=False)
cancellation_initiated[shortcut_idx] = 1

# 2% false positive rate among non-churned (sampled without replacement)
not_churned_idx = np.where(churn_label == 0)[0]
fp_idx = np.random.choice(not_churned_idx, size=int(len(not_churned_idx) * 0.02), replace=False)
cancellation_initiated[fp_idx] = 1

This produces 1,200 customers with a 13.2% churn rate — realistic for SaaS — and a shortcut that is statistically powerful but temporally useless.
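To see how much of the label the shortcut alone recovers, the sketch below rebuilds the flag on stand-in labels and scores the one-line rule "predict churn if and only if cancellation_initiated is 1". It is an illustration under assumptions, not the project's generator: the labels are independent coin flips at a 13% rate rather than draws from the logit above, and `add_shortcut` and `rng` are hypothetical names.

```python
import numpy as np

def add_shortcut(churn_label, rng, coverage=0.95, fp_rate=0.02):
    """Build a leaky flag: on for `coverage` of churners and `fp_rate` of non-churners."""
    flag = np.zeros(len(churn_label), dtype=int)
    churned = np.flatnonzero(churn_label == 1)
    retained = np.flatnonzero(churn_label == 0)
    # replace=False keeps the sampled indices unique, so coverage is exact
    flag[rng.choice(churned, size=int(len(churned) * coverage), replace=False)] = 1
    flag[rng.choice(retained, size=int(len(retained) * fp_rate), replace=False)] = 1
    return flag

rng = np.random.default_rng(42)
# Assumption: stand-in labels at a 13% churn rate, not the logit-based labels
churn_label = (rng.random(1200) < 0.13).astype(int)
flag = add_shortcut(churn_label, rng)

# Score the trivial rule "churn == flag" with no model at all
rule_accuracy = float((flag == churn_label).mean())
flag_recall = float(flag[churn_label == 1].mean())
print(f"rule accuracy: {rule_accuracy:.3f}, share of churners flagged: {flag_recall:.3f}")
```

If a rule with no learning at all scores this well, a high accuracy figure for any model that sees the flag says little about its ability to detect churn early.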

Model V1 — The Misaligned Predictor

Model V1 is an XGBoost classifier trained on all 14 features, including the shortcut. Its only goal is predictive performance on the churn label:

import xgboost as xgb

model_v1 = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    scale_pos_weight=6.54,   # class imbalance correction
    random_state=42
)
model_v1.fit(X_train_v1, y_train)

The model discovers cancellation_initiated immediately. It is the single most predictive feature in the dataset — a near-perfect proxy for the label. The model rides it to 97.5% accuracy and AUC 0.981.
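The scale_pos_weight value used above is close to what the usual XGBoost heuristic gives: the number of negative examples divided by the number of positive examples. A minimal sketch, assuming a 0/1 label array; the helper name and the stand-in labels are illustrative, and the exact training split behind 6.54 is not stated in this write-up.

```python
import numpy as np

def class_imbalance_weight(y):
    """Ratio of negative to positive examples, the common scale_pos_weight heuristic."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples; the weight is undefined")
    return n_neg / n_pos

# Stand-in labels: 158 churners out of 1,200, i.e. the quoted 13.2% rate after rounding
y_demo = np.array([1] * 158 + [0] * 1042)
weight = class_imbalance_weight(y_demo)
print(round(weight, 2))
```

Computing the ratio on the training labels only, rather than on the full dataset, avoids leaking information from the test split into the model configuration.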

Model V2 — The Aligned Predictor

Model V2 removes the shortcut and applies monotonicity constraints — a form of domain knowledge encoding that forces each feature to always move churn risk in the expected direction:

# Feature order determines constraint order
# -1 = higher value → lower churn risk
# +1 = higher value → higher churn risk
#  0 = no constraint
monotone_constraints = (
    -1,  # tenure_months:          longer tenure → lower risk
    -1,  # monthly_logins:         more logins → lower risk
    +1,  # support_tickets_30d:    more tickets → higher risk
    -1,  # plan_tier:              higher tier → lower risk
    -1,  # feature_adoption_score: more adoption → lower risk
    +1,  # days_since_last_login:  more days → higher risk
    -1,  # nps_score:              higher NPS → lower risk
    +1,  # billing_failures_6m:    more failures → higher risk
    -1,  # referrals_made:         more referrals → lower risk
    -1,  # contract_type:          annual → lower risk
     0,  # team_size:              no strong prior
     0,  # industry:               categorical, no prior
    -1,  # account_age_days:       older account → lower risk
)

model_v2 = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.08,
    monotone_constraints=monotone_constraints,
    scale_pos_weight=6.54,
    random_state=42
)
model_v2.fit(X_train_v2, y_train)   # X_train_v2 has no cancellation_initiated

Monotonicity constraints play a role for tabular models that is loosely analogous to Constitutional AI principles: they encode what we know to be true about the domain directly into the model’s optimization. A model cannot learn that “more support tickets somehow reduces churn risk” because the constraint forbids it. This limits overfitting to spurious correlations in the constrained features and makes the model’s behaviour auditable.
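A constraint of this kind can also be audited after training by sweeping one feature over a grid while holding the others fixed and checking that predictions never move the wrong way. The sketch below is model-agnostic and hypothetical: `violates_monotonicity` is not part of XGBoost, and `toy_predict` stands in for a trained model, borrowing two coefficients from the logit above.

```python
import numpy as np

def violates_monotonicity(predict, X, feature_idx, direction, grid, tol=1e-9):
    """True if raising one feature along an ascending `grid` ever moves a prediction
    against `direction` (+1 = must not decrease, -1 = must not increase)."""
    X = np.asarray(X, dtype=float)
    sweeps = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # overwrite one column, keep the rest
        sweeps.append(np.asarray(predict(X_mod), dtype=float))
    # Step-to-step change per row, signed so that allowed moves are >= 0
    steps = np.diff(np.stack(sweeps), axis=0) * direction
    return bool((steps < -tol).any())

def toy_predict(X):
    """Stand-in model: column 0 = support tickets, column 1 = NPS score."""
    logit = 0.25 * X[:, 0] - 0.18 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-logit))

X_demo = np.array([[2.0, 7.0], [0.0, 3.0], [5.0, 9.0]])
grid = np.arange(0, 11)

tickets_ok = not violates_monotonicity(toy_predict, X_demo, 0, +1, grid)
nps_ok = not violates_monotonicity(toy_predict, X_demo, 1, -1, grid)
wrong_sign_detected = violates_monotonicity(toy_predict, X_demo, 0, -1, grid)
print(tickets_ok, nps_ok, wrong_sign_detected)
```

For a fitted classifier, `predict` could be a small wrapper such as `lambda X: model.predict_proba(X)[:, 1]`, where `model` is whatever classifier is being audited.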

RLHF Reward Module

RLHF (Reinforcement Learning from Human Feedback) is a technique used to align large language models such as GPT-4 and Claude with human values. It works by training a separate reward model on human preference pairs (examples where a human annotator chose output A over output B) and then optimizing the main model to maximize the reward score rather than the original metric.

This project implements a tabular analogy. The “human preference” is: an early prediction (customer still active) is preferred over a late prediction (customer has already clicked Cancel), even if both have similar churn probability.

Step 1: Construct preference pairs

# For each late candidate (cancellation_initiated=1),
# find a similar early candidate (cancellation_initiated=0)
# with similar V2 churn probability (within ±0.15)

pairs = []
for _, late_row in late_candidates.iterrows():
    similar_early = early_candidates[
        abs(early_candidates['v2_churn_prob'] - late_row['v2_churn_prob']) < 0.15
    ]
    if len(similar_early) == 0:
        continue
    early_row = similar_early.sample(1).iloc[0]
    pairs.append({
        'preferred':    early_row,   # early, actionable
        'dispreferred': late_row,    # late, not actionable
    })
# Result: 152 preference pairs

Step 2: Train the reward model

from sklearn.linear_model import LogisticRegression

# Label: 1 = preferred (early), 0 = dispreferred (late)
reward_model = LogisticRegression(C=1.0, random_state=42)
reward_model.fit(X_reward_scaled, y_reward)

# Key learned coefficients:
# cancellation_initiated → -4.05  (heavily penalised)
# v2_churn_prob          → +2.31  (rewarded when early)
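Step 2 takes X_reward_scaled and y_reward as given. One plausible way to derive them from the Step 1 pairs is sketched below: each pair contributes two rows, the preferred one labelled 1 and the dispreferred one labelled 0, and the columns are then standardized. The feature list, the helper name, and the standardization step are assumptions inferred from the variable names, not the project's confirmed pipeline.

```python
import numpy as np

# Assumption: the reward model sees these two columns (the ones whose coefficients are quoted)
REWARD_FEATURES = ["cancellation_initiated", "v2_churn_prob"]

def pairs_to_training_set(pairs, feature_names=REWARD_FEATURES):
    """Flatten preference pairs into a standardized design matrix and 0/1 labels."""
    rows, labels = [], []
    for pair in pairs:
        rows.append([pair["preferred"][name] for name in feature_names])
        labels.append(1)                      # early, actionable prediction
        rows.append([pair["dispreferred"][name] for name in feature_names])
        labels.append(0)                      # late, not actionable
    X = np.asarray(rows, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # avoid dividing by zero on constant columns
    X_scaled = (X - X.mean(axis=0)) / std
    return X_scaled, np.asarray(labels)

# Two hand-made pairs; each side can be a dict or a pandas row
demo_pairs = [
    {"preferred": {"cancellation_initiated": 0, "v2_churn_prob": 0.80},
     "dispreferred": {"cancellation_initiated": 1, "v2_churn_prob": 0.75}},
    {"preferred": {"cancellation_initiated": 0, "v2_churn_prob": 0.62},
     "dispreferred": {"cancellation_initiated": 1, "v2_churn_prob": 0.70}},
]
X_pairs, y_pairs = pairs_to_training_set(demo_pairs)
print(X_pairs.shape, y_pairs.tolist())
```

Because every pair adds exactly one row of each class, the resulting training set is balanced by construction.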

Step 3: Score predictions by usefulness

reward_score = reward_model.predict_proba(X_test_reward)[:, 1]
# Customers with cancellation_initiated=0: mean reward = 0.969
# Customers with cancellation_initiated=1: mean reward = 0.016

The reward model has learned what “useful” means — and it is not accuracy.

DPO — Direct Preference Optimization

DPO (Rafailov et al., 2023) eliminates the explicit reward model by directly optimizing the policy on preference pairs. The loss function is:

L_DPO = −E[ log σ( β · (log π(y_w|x) − log π(y_l|x)
                       − log π_ref(y_w|x) + log π_ref(y_l|x)) ) ]

Where y_w is the preferred (early) prediction, y_l is the dispreferred (late) prediction, π is the current policy (V2), π_ref is the reference policy (baseline), and β = 0.5 scales the log-ratio margin, which sets how strongly preferences are enforced relative to the reference. DPO optimizes the same KL-regularized objective as RLHF, but in a single supervised stage: it bypasses the explicit reward model and the reinforcement learning step by substituting the closed-form optimal RLHF policy back into the preference objective.

In this project, DPO is implemented as a conceptual companion: the loss is computed on the 152 preference pairs (mean loss = 0.705) and displayed in the UI alongside the RLHF reward model, illustrating the mathematical relationship between the two approaches.
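The loss above can be evaluated directly once log-probabilities are available for both members of each pair. The sketch below is a minimal NumPy version of that formula; how log π(y|x) is defined for a tabular classifier is a modelling choice this write-up does not spell out, so the inputs here are plain made-up numbers and `dpo_loss` is a hypothetical helper.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.5):
    """Mean DPO loss over preference pairs, given log-probabilities under the
    policy (logp_*) and the reference policy (ref_logp_*)."""
    margin = (np.asarray(logp_w) - np.asarray(logp_l)) \
           - (np.asarray(ref_logp_w) - np.asarray(ref_logp_l))
    # -log(sigmoid(z)) equals log(1 + exp(-z)); logaddexp computes it stably
    return float(np.mean(np.logaddexp(0.0, -beta * margin)))

ref_w = np.log([0.6, 0.5, 0.7])   # reference log-probs for preferred outputs
ref_l = np.log([0.4, 0.5, 0.3])   # reference log-probs for dispreferred outputs

# A policy identical to the reference has zero margin on every pair
loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# A policy that shifts probability toward the preferred outputs lowers the loss
pol_w = np.log([0.8, 0.7, 0.9])
pol_l = np.log([0.2, 0.3, 0.1])
loss_improved = dpo_loss(pol_w, pol_l, ref_w, ref_l)
print(loss_at_reference, loss_improved)
```

A policy that matches the reference scores log 2 ≈ 0.693 regardless of β, which gives a baseline for reading the reported mean loss of 0.705.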

SHAP Explainability

SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017) values measure each feature’s contribution to each individual prediction. For a given customer, the SHAP value for feature f answers: how much did feature f push this customer’s churn score up or down, compared to the average prediction?

import shap

explainer_v1 = shap.TreeExplainer(model_v1)
explainer_v2 = shap.TreeExplainer(model_v2)

shap_values_v1 = explainer_v1.shap_values(X_test_v1)
shap_values_v2 = explainer_v2.shap_values(X_test_v2)

# Mean absolute SHAP — overall feature importance
mean_shap_v1 = np.abs(shap_values_v1).mean(axis=0)
# cancellation_initiated: 4.51  ← dominates everything
# All other features combined: ~3.2

mean_shap_v2 = np.abs(shap_values_v2).mean(axis=0)
# account_age_days:       2.19  ← top feature
# nps_score:              1.84
# support_tickets_30d:    1.62
# (balanced distribution across actionable signals)
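Two small calculations help in reading these numbers. By default, TreeExplainer reports values for XGBoost classifiers in log-odds, which is why a mean absolute value can exceed 1; adding a customer's SHAP values to the base value and applying the sigmoid recovers the predicted probability. The sketch below shows that conversion and the shortcut's share of total importance, using the figures quoted above (the "~3.2" is approximate, so the share is too). Both helper names are hypothetical.

```python
import math

def probability_from_shap(base_value, shap_row):
    """Convert a log-odds base value plus per-feature SHAP values into a probability."""
    log_odds = base_value + sum(shap_row)
    return 1.0 / (1.0 + math.exp(-log_odds))

def importance_shares(mean_abs_shap):
    """Normalize mean |SHAP| values so they sum to 1."""
    total = sum(mean_abs_shap.values())
    return {name: value / total for name, value in mean_abs_shap.items()}

# Figures quoted in the V1 comments above; the second one is approximate
v1_importance = {"cancellation_initiated": 4.51, "all_other_features": 3.2}
shortcut_share = importance_shares(v1_importance)["cancellation_initiated"]
print(f"shortcut share of V1 importance: {shortcut_share:.1%}")

# Made-up example: base value of -2.0 log-odds, two features pushing risk up
print(round(probability_from_shap(-2.0, [4.0, 0.5]), 3))
```

On these figures a single feature carries more than half of V1's total attribution.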

Five chart types are generated, each telling a different part of the alignment story.

Results: The Charts, One by One

Conclusion

This project demonstrates that value alignment is not an abstract philosophical problem reserved for superintelligent AI. It is a practical engineering challenge that appears in every machine learning system that optimizes a proxy metric.

The churn prediction example is deliberately simple — but the pattern it illustrates is universal. The misaligned model (V1) is not broken. It is doing exactly what it was told. The problem is the specification: accuracy on a dataset that contains a leaky shortcut is not the same as usefulness in a business that needs early intervention.

The aligned model (V2) demonstrates three concrete alignment techniques: monotonicity constraints, RLHF reward modeling, and DPO (Direct Preference Optimization).

The SHAP charts make the alignment visible. Side-by-side beeswarm plots show the shortcut dominating V1 and disappearing entirely in V2. The waterfall plot shows a specific customer flagged at 99.1% risk before they have clicked Cancel — a customer V1 would have scored at 1.65% and ignored. The dependence plots confirm that the aligned model’s feature relationships are monotonic and interpretable. The reward model bar chart shows that the RLHF module has learned to penalise late predictions with a coefficient of -4.05.

The core insight: A drop of roughly 10 percentage points in accuracy, from 97.5% to 87.1%, can represent a massive improvement in real-world value if the accuracy was being earned by cheating a proxy metric rather than solving the actual problem. Value alignment is not about making AI less powerful. It is about making AI power point in the right direction.

Future Directions

Engineering

1. ONNX Export for Browser-Based Inference

The V2 model can be exported to ONNX format and run directly in the browser using onnxruntime-web. This would enable a fully interactive demo with no backend, no cloud cost, and no cold starts — fitting naturally into a component-based portfolio architecture.

Research

2. Real Dataset Validation

The synthetic dataset was designed to make the specification gaming story clear. Validating the same alignment techniques on a real SaaS churn dataset (e.g., from Kaggle or a public benchmark) would test whether the monotonicity constraints and RLHF reward module generalise beyond the controlled synthetic setting.

Modeling

3. Temporal Alignment Constraints

The current alignment approach removes the shortcut feature entirely. A more sophisticated approach would use temporal constraints — the model is only allowed to use features that were observable at least N days before the churn event. This would enforce the intervention window directly in the training objective, rather than through manual feature removal.
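The simplest version of this idea is a feature filter driven by metadata about when each feature becomes observable. The sketch below is hypothetical throughout: the lead times are invented for illustration, and a real implementation would derive them from event timestamps rather than a hand-written table.

```python
def features_observable_before(lead_time_days, min_lead_days):
    """Keep only features known at least `min_lead_days` before the churn event."""
    return [name for name, days in lead_time_days.items() if days >= min_lead_days]

# Hypothetical lead times: days before the churn event at which each feature is first known
lead_time_days = {
    "nps_score": 60,
    "billing_failures_6m": 45,
    "support_tickets_30d": 30,
    "cancellation_initiated": 0,   # only known once the customer clicks Cancel
}

allowed_features = features_observable_before(lead_time_days, min_lead_days=14)
print(allowed_features)
```

With a 14-day intervention window the shortcut is excluded by rule, and any other late-arriving feature would be excluded the same way without anyone having to spot it by hand.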

Alignment

4. Multi-Objective Alignment

The current reward model has a single preference: early over late. A richer reward model could encode multiple competing objectives — for example, balancing early detection against false positive rate (calling customers who were not actually at risk). Multi-objective RLHF is an active research area with direct applications to business AI alignment.

Safety

5. Constitutional AI Self-Critique Loop

A Constitutional AI implementation would add a self-critique layer: after generating a prediction, the model evaluates it against a written set of principles ("never recommend intervention for a customer who has already initiated cancellation") and revises if the prediction violates them. This could be implemented as a post-processing filter on V2's outputs, with SHAP used to verify that the revised predictions comply with the constitution.
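The post-processing filter described here could start as a single hard rule applied to the model's scores. The sketch below encodes only the quoted principle; the 0.5 threshold and the function name are assumptions, and it does not attempt the self-critique and revision loop of a full Constitutional AI setup.

```python
import numpy as np

def apply_intervention_rule(churn_prob, cancellation_initiated, threshold=0.5):
    """Recommend intervention for high-risk customers, except those who have
    already initiated cancellation. Returns (recommendations, suppressed)."""
    churn_prob = np.asarray(churn_prob, dtype=float)
    already_cancelling = np.asarray(cancellation_initiated).astype(bool)
    high_risk = churn_prob >= threshold
    suppressed = high_risk & already_cancelling     # would have violated the principle
    return high_risk & ~already_cancelling, suppressed

probs = [0.90, 0.80, 0.20]
cancelling = [0, 1, 0]
recommend, suppressed = apply_intervention_rule(probs, cancelling)
print(recommend.tolist(), suppressed.tolist())
```

Returning the suppressed cases separately keeps an audit trail of how often the raw scores would have broken the rule.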

Ethics

6. Fairness-Alignment Intersection

Value alignment and algorithmic fairness are related but distinct problems. A future extension would examine whether the aligned model (V2) exhibits differential performance across customer segments — for example, whether the monotonicity constraints inadvertently disadvantage customers in certain industries or plan tiers. The intersection of alignment and fairness is one of the most important open problems in responsible AI.
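A first pass at this check is to compare recall, the share of true churners the model flags, across segments. The sketch below uses made-up labels and a hypothetical helper name; a fuller analysis would add confidence intervals, since small segments produce noisy estimates.

```python
import numpy as np

def recall_by_segment(y_true, y_pred, segment):
    """Recall among true churners, computed separately for each segment value."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    segment = np.asarray(segment)
    recalls = {}
    for value in np.unique(segment):
        churners = (segment == value) & (y_true == 1)
        # Recall is undefined for a segment with no churners
        recalls[str(value)] = float(y_pred[churners].mean()) if churners.any() else float("nan")
    return recalls

# Made-up example with two plan tiers
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
plan = ["Free", "Free", "Free", "Pro", "Pro", "Pro"]

segment_recall = recall_by_segment(y_true, y_pred, plan)
recall_gap = max(segment_recall.values()) - min(segment_recall.values())
print(segment_recall, recall_gap)
```

A large gap would suggest that at-risk customers in one segment are systematically missed, which is the kind of differential performance this extension is meant to surface.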

References

Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic Technical Report.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016.
Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed.
Krakovna, V., et al. (2020). Specification gaming: the flip side of AI ingenuity. DeepMind Blog.

All model outputs, SHAP values, and performance metrics are from real trained models on synthetic data generated with NumPy random seed 42.
