Introducing Extropy: Predicting Real-World Behavior Before It Happens

Executive Summary

Extropy is a population simulation engine that forecasts how people will actually behave, not what they say they'll do. To validate it, we tested it against 12 real-world events where the outcome is already known: subscription changes, pricing shifts, brand backlash, platform launches, and policy rollouts. Extropy matched real-world outcomes in 9 of 12 cases. The three misses got the direction right but overestimated how much behavior actually shifted. Below is the full methodology, every result, what went wrong, and next steps.

The Problem

Teams making high-stakes decisions still rely on methods that are expensive, slow, and narrow. Surveys and focus groups often capture what people say they will do, not what they actually do under trade-offs, social pressure, and changing information. [1, 2, 3, 4] That gap becomes costly when decisions involve pricing, reputation shocks, policy changes, or product launches.

There is also a second problem that is just as important: traditional methods struggle to test niche or hard-to-reach populations quickly. If you need insight on a specific segment like heavy Reddit moderators, budget-constrained urban commuters, likely early adopters in a single market, or low-incidence subscriber cohorts, setup time and sampling overhead rise fast, while confidence often drops.

Extropy addresses both constraints at once. It aims to improve behavioral prediction quality beyond stated-intent methods, and it enables rapid scenario testing on specific populations that are difficult to study well with standard research workflows. That combination is the core motivation for this benchmark.

What Extropy Does

Extropy is a population simulation engine for forecasting behavioral outcomes before real-world rollout. It works in four stages:

  1. Population grounding — Build a synthetic population from research-backed attributes and distributions, sampling agents that preserve dependency structure across demographics, behavior, and psychographics.
  2. Agent reasoning — Each agent evaluates the scenario from its own constraints, incentives, and preferences, rather than returning a single global average response.
  3. Network propagation — Agents are connected through a social graph, so exposure and influence spread across clusters instead of staying independent. This models how reactions shift when people hear from peers, media, and high-salience nodes.
  4. Outcome extraction — The simulation returns distributional outputs for predefined outcomes, including segment-level breakdowns and aggregate metrics that can be compared to observed ground truth.

In practical terms, Extropy is designed to answer not just "what might happen," but "who changes, by how much, and through which social pathways."
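To make those four stages concrete, here is a minimal sketch of the flow in Python. Every name and number in it (ground_population, agent_decide, the influence parameter, the toy population) is an illustrative stand-in under our own assumptions, not the actual Extropy API.

```python
import random
from collections import Counter

def ground_population(n, attribute_sampler):
    """Stage 1: sample agents whose attributes preserve joint structure."""
    return [attribute_sampler(random.Random(seed)) for seed in range(n)]

def agent_decide(agent, scenario, rng):
    """Stage 2: each agent evaluates the scenario from its own constraints."""
    p_change = min(scenario["base_shift"] * agent.get("price_sensitivity", 1.0), 1.0)
    return "change" if rng.random() < p_change else "maintain"

def propagate(decisions, edges, rng, influence=0.1):
    """Stage 3: peer exposure spreads behavior change across the social graph."""
    updated = dict(decisions)
    for src, dst in edges:
        if decisions[src] == "change" and rng.random() < influence:
            updated[dst] = "change"
    return updated

def extract_outcomes(decisions):
    """Stage 4: reduce individual decisions to a distributional outcome."""
    counts = Counter(decisions.values())
    return {action: count / len(decisions) for action, count in counts.items()}

# Tiny end-to-end run on an invented 1,000-agent population.
rng = random.Random(0)
population = ground_population(1000, lambda r: {"price_sensitivity": r.uniform(0.5, 1.5)})
decisions = {i: agent_decide(a, {"base_shift": 0.2}, rng) for i, a in enumerate(population)}
edges = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(3000)]
print(extract_outcomes(propagate(decisions, edges, rng)))
```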

Benchmark Design

We evaluated Extropy on 12 predefined studies chosen to cover different decision environments, not just one narrow use case. The set spans consumer subscription and pricing decisions, brand backlash events, platform adoption shifts, civic participation, and policy response.

Each study had to meet three requirements before entering the benchmark. First, it needed a complete and reproducible simulation setup. Second, it needed a clear, measurable real-world target. Third, that target needed credible public evidence and a transparent scoring rule.

We used two evaluation mappings. In direct mapping, the simulation output is compared to the real-world metric as-is. In proxy mapping, we apply a predefined conversion because public reporting provides a related metric in a different form. We report both because they answer different questions: direct mapping tests metric-level alignment; proxy mapping tests operational usefulness when public reporting uses a different metric.

Methodology & Scoring

We evaluated a single frozen run configuration across the full study set and scored every study with predefined rules — no scenario definitions or scoring thresholds were changed mid-evaluation. Each study was scored against its documented ground-truth target: compare Extropy's predicted distributional outcome to the target band, compute error in percentage points, and mark pass or miss. The scoring framework was locked before interpretation to avoid post-hoc tuning.
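As a minimal sketch of that scoring rule, assuming a simple band-based pass check: the study fields and the direct/proxy label below are illustrative, and the archived scoring artifacts remain the authoritative definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredStudy:
    name: str
    mapping: str       # "direct" or "proxy" (illustrative label)
    band_low: float    # ground-truth target band, in percent
    band_high: float
    predicted: float   # frozen simulation prediction, in percent

    def passed(self) -> bool:
        """Pass if the prediction lands inside the documented target band."""
        return self.band_low <= self.predicted <= self.band_high

    def error_pp(self) -> float:
        """Signed error in percentage points from the nearest band edge."""
        if self.passed():
            return 0.0
        edge = self.band_low if self.predicted < self.band_low else self.band_high
        return self.predicted - edge

# Example: the NYC congestion pricing study from the results table below.
nyc = ScoredStudy("NYC Congestion Pricing", "direct", 15.0, 20.0, 48.3)
print(nyc.passed(), f"{nyc.error_pp():+.1f}pp")   # False +28.3pp
```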

We report results in two views. The mixed view includes both direct and proxy-mapped studies and reflects practical real-world usage where reporting formats vary. The direct view includes only studies with one-to-one metric alignment and reflects pure measurement fidelity. All benchmark artifacts, mappings, and study-level scoring outputs are archived. The full benchmark documentation and run artifacts are public.

Contamination & Leakage Controls

A benchmark is only meaningful if pre-decision inputs stay separate from post-event outcomes. We define contamination as any case where knowledge of what happened after the event leaks into population assumptions, scenario setup, or scoring targets in a way that can inflate apparent accuracy.

Our audit checks three things. First, whether attributes or distributions encode post-event behavior directly. Second, whether scenario logic contains outcome-shaped assumptions tied to known results. Third, whether ground-truth targets are defined independently and documented before scoring. Studies that fail this standard are blocked from the scored benchmark, even if they run technically.
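The three checks can be read as a simple gate. The sketch below uses field names we invented for clarity; the real audits are documented per study.

```python
from dataclasses import dataclass

@dataclass
class LeakageAudit:
    encodes_post_event_behavior: bool    # check 1: attributes/distributions
    outcome_shaped_scenario_logic: bool  # check 2: scenario assumptions
    target_defined_pre_scoring: bool     # check 3: independent ground truth

    def clean(self) -> bool:
        """A study failing any check is blocked from the scored benchmark."""
        return (not self.encodes_post_event_behavior
                and not self.outcome_shaped_scenario_logic
                and self.target_defined_pre_scoring)

audit = LeakageAudit(False, False, True)
print("included" if audit.clean() else "blocked")
```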

Under this protocol, most studies in the 12-study pack were marked clean and included. Two sports-betting studies were blocked due to leakage risk and excluded from scored claims until reworked with contamination-safe inputs.

A valid caveat is that foundation models are trained on broad internet corpora and may already contain historical knowledge about some events. We reduce this risk by evaluating at the population-distribution level across multiple domains, using predeclared mappings and full miss reporting rather than cherry-picked examples. If results were driven mainly by memorized event facts, we would expect near-perfect or uniformly high performance, which we do not see. We observe both wins and clear misses, including large magnitude errors, which is more consistent with an imperfect simulation process than with simple lookup behavior.

Baselines

We compare Extropy against two baseline classes.

The first is a direct LLM baseline using Claude Opus 4.6, where the model is asked for the outcome in a single call without population simulation, social propagation, or multi-step dynamics. This baseline tests whether Extropy adds value beyond a single-shot model prior.
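A sketch of what that single-call protocol looks like: query_llm is a hypothetical stand-in for whatever client the harness actually uses, and the prompt wording here is ours, not the benchmark's.

```python
def direct_llm_baseline(study_description: str, outcome_phrase: str, query_llm) -> float:
    """One prompt, one numeric answer, no population, no propagation."""
    prompt = (
        f"{study_description}\n"
        f"Give a single number: the percentage of the affected population "
        f"expected to {outcome_phrase}. Answer with the number only."
    )
    raw = query_llm(prompt)               # exactly one model call
    return float(raw.strip().rstrip("%"))
```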

The second is a survey baseline where available, using published stated-intent or opinion signals from the same decision context. This baseline tests whether Extropy can better approximate revealed behavior than traditional stated-preference evidence.

To keep comparisons fair, baseline definitions are fixed before scoring and applied consistently. We do not switch baseline metrics per study after seeing results. Where a baseline is only partially available or not fully metric-aligned, we label it explicitly and treat it as contextual evidence rather than a primary score. Survey anchors in particular are informative but heterogeneous. Many measure "interest" or "intended action" rather than realized behavior under constraints, so survey comparisons are treated as contextual checks rather than primary scores.

Results

Each study asks one majority-outcome question: will most people keep their subscription, boycott the brand, adopt the product, comply with the policy, or not? Extropy produces a full outcome distribution internally, but we score against the dominant action since that is what ground-truth sources typically report.

Metric                          Result
Mixed pass (direct + proxy)     9 / 12
Direct pass                     6 / 8
Proxy pass                      3 / 4
Miss pattern                    3 / 3 overestimation magnitude errors

The benchmark shows three clear signals. First, Extropy passes 9 of 12 studies, with strong performance on subscription inertia, default-compliance behavior, and several low-conversion cases. Second, the misses are not random sign flips — they are magnitude errors where the model overestimates how far behavior shifts under high-salience public cascades. Third, direct LLM baseline outputs are often less calibrated in this benchmark and track stated intent more than revealed behavior, which supports the claim that simulation dynamics add predictive value beyond single-shot prompting.

The table compares each study's observed outcome against a direct LLM prediction (Opus 4.6, single-shot, no simulation), a survey anchor* where one exists, and Extropy's frozen simulation output. All ground-truth and survey sources are linked in the sources table below.

Study                       Ground Truth   Extropy   Direct LLM   Survey
Apple ATT Privacy           75–80%         76.7%     75.0%        ~90–98%
Bud Light Boycott           80–90%         85.8%     72.0%        ~82–88%
Netflix Password Sharing    >80%           94.2%     55.0%        ~38%
Spotify Price Hike          95–98%         95.8%     85.0%        ~65–80%
Plant-Based Meat            4–8%           5.8%      7.0%         ~29–42%
Threads Launch              10–15%         24.2%     15.0%        ~34–45%
X Premium Adoption          0.5–1.5%       0.8%      1.3%         —
NYC Congestion Pricing      15–20%         48.3%     30.0%        ~29%
London ULEZ Expansion       95–96%         99.2%     96.0%        —
Netflix Ad Tier Launch      10–30%         18.3%     9.0%         —
Reddit API Protest          40–70%         57.5%     8.0%         —
Snapchat+ Launch            2–8%           5.0%      1.5%         —

One-call LLM predictions may carry higher contamination risk than Extropy because they do not enforce pre-event population constraints or simulation structure. This benchmark mitigates that in Extropy via frozen configs, leakage triage, and explicit mapping rules, but neither approach can fully eliminate training-data prior effects.

* Survey anchors are contextual and not uniformly pre-event across all studies.
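The headline pass rate can be recomputed directly from the table above. In the sketch below, the ground-truth bands follow our reading of the table (">80%" treated as 80–100%), and the direct/proxy split is left to the archived artifacts.

```python
# Recomputing the headline pass rate from the results table above.
STUDIES = [
    ("Apple ATT Privacy",        (75.0, 80.0),  76.7),
    ("Bud Light Boycott",        (80.0, 90.0),  85.8),
    ("Netflix Password Sharing", (80.0, 100.0), 94.2),
    ("Spotify Price Hike",       (95.0, 98.0),  95.8),
    ("Plant-Based Meat",         (4.0, 8.0),    5.8),
    ("Threads Launch",           (10.0, 15.0),  24.2),
    ("X Premium Adoption",       (0.5, 1.5),    0.8),
    ("NYC Congestion Pricing",   (15.0, 20.0),  48.3),
    ("London ULEZ Expansion",    (95.0, 96.0),  99.2),
    ("Netflix Ad Tier Launch",   (10.0, 30.0),  18.3),
    ("Reddit API Protest",       (40.0, 70.0),  57.5),
    ("Snapchat+ Launch",         (2.0, 8.0),    5.0),
]

results = [(name, lo <= pred <= hi) for name, (lo, hi), pred in STUDIES]
print(f"{sum(ok for _, ok in results)} / {len(results)} pass")   # 9 / 12 pass
for name, ok in results:
    if not ok:
        print("miss:", name)   # Threads Launch, NYC Congestion Pricing, London ULEZ Expansion
```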

Case Studies

Strong Win: Netflix Password Sharing

In 2023, Netflix enforced paid sharing for out-of-household use, forcing borrowers and sharers to either pay, stop sharing, or leave. This is a high-friction retention decision with clear revealed outcomes.

Stated intent before enforcement pointed to heavy churn risk, while actual outcomes showed strong retention and compliance. Extropy predicts 94.2% maintain/compliance against a ground-truth target of >80%, a clear pass. The one-call LLM baseline predicts 55.0%, substantially below observed behavior. Extropy better captures the real trade-off people faced at decision time. Many users complained publicly but still stayed when confronted with actual switching friction and habit inertia.

Extropy: 94.2% · Direct LLM: 55% · Ground truth: >80%

Borderline: Bud Light Boycott

After the 2023 Bud Light backlash, consumers faced a socially charged choice between signaling boycott support and maintaining habitual purchase behavior in real retail contexts.

This case combines identity signaling, media amplification, and real purchasing behavior. Extropy predicts 85.8% maintain against a target band of 80–90%, landing near the empirical center. The LLM baseline predicts 72.0%, underestimating persistence. The study sits near the boundary between public rhetoric and private follow-through. Extropy performs well, but small calibration shifts could move outcomes materially in either direction in future replications.

Extropy: 85.8% · Direct LLM: 72% · Ground truth: 80–90%

Miss: NYC Congestion Pricing

NYC's congestion pricing rollout introduced a direct monetary cost for driving into Manhattan, creating adaptation choices across route, timing, mode shift, and trip reduction under real commuting constraints.

The clearest miss. Extropy predicts 48.3% behavior change versus a target band of 15–20%, a +28.3pp overestimate. The direction is correct (some adaptation does occur), but the magnitude is overstated. The likely failure mode is over-amplified cascade dynamics: once anti-driving adaptation starts spreading through the network, too many agents shift behavior relative to real-world constraints. In practice, many commuters absorb the toll and maintain routines despite opposition rhetoric. This miss points to a concrete improvement target, sketched below: dampening cascade magnitude under hard-friction mobility constraints.

Extropy: 48.3% · Direct LLM: 30% · Ground truth: 15–20%
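One way to read that improvement target, as a sketch only: scale peer influence down when agents face hard real-world friction. The friction attribute and the damping formula below are our illustration, not a change Extropy has shipped.

```python
def peer_adoption_probability(base_influence: float, friction: float) -> float:
    """Scale peer influence down as real-world switching friction rises.

    friction in [0, 1]: 0 means costless behavior change, 1 means change is
    effectively blocked (fixed commute, no transit alternative).
    """
    return base_influence * (1.0 - friction)

# The same social exposure converts far fewer agents under high friction,
# the kind of correction that would pull a ~48% predicted shift back
# toward the observed 15-20% band.
print(round(peer_adoption_probability(0.30, 0.1), 2))   # low friction  -> 0.27
print(round(peer_adoption_probability(0.30, 0.7), 2))   # high friction -> 0.09
```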

Counter-Claims & Falsification Checks

We tested four claims that could undermine the benchmark.

Results are random noise. If they were, pass/fail would scatter randomly. Instead, performance clusters by regime — strong on inertia/compliance, weak on high-volatility cascades — and all misses are magnitude overestimates, not sign flips. Not supported.

Results are driven by post-event contamination. Leakage triage was applied, flagged studies were excluded, and full misses were reported. Risk is reduced but not eliminated; acknowledged as ongoing.

Metrics were cherry-picked after seeing outcomes. All metric definitions, mappings, and pass criteria were frozen before scoring. Both direct and proxy results are reported, including misses. Not supported.

Baseline comparison is unfair. The three comparison classes (simulation, one-call LLM, survey anchor) are fundamentally different — but each is labeled explicitly with its scope and limitations. Not identical, but transparently separated.

Current Limitations

  • Mini-benchmark scale favors speed over statistical depth. The frozen run produces useful comparative signal, but higher-N replications would yield tighter uncertainty bounds.

  • Cascade-heavy scenarios remain a weak spot. The current engine can over-amplify behavior-shift magnitude in high-salience policy or reputation shocks.

  • Training-prior exposure is a methodological risk when benchmarking any LLM-based system. In production, Extropy runs on events that have not happened yet, so training-data leakage is not a factor. The challenge is specific to retrospective validation like this benchmark, where historical outcomes exist in training corpora. This risk can be reduced through pre-event protocol and holdout design, but not fully eliminated when evaluating against past events. Future benchmark tranches will tighten these controls further.

What's Next

  • Study expansion across domains. Increase coverage with more scenarios in distinct domains so generalization claims are tested on broader behavior types, not just more cases of the same pattern.

  • Higher-power replication runs. Re-run the benchmark at larger N with multi-seed protocols under the same frozen scoring rules to tighten uncertainty and verify stability.

  • Prospective validation on future events. Test Extropy on events before outcomes are known, then score predictions after real data arrives. This is the strongest real-world validity check.

  • Simulation calibration in known weak regimes. Target cascade-heavy policy and reputation scenarios to reduce magnitude over-amplification while preserving current wins in inertia/compliance regimes.

Extropy is open source and available now. Star on GitHub · Install from PyPI.

Appendix: Reproducibility Pack

Everything needed to verify or rerun this benchmark is public:


References

  1. Sheeran, P. (2002). Intention–Behavior Relations: A Conceptual and Empirical Review. doi:10.1080/14792772143000003
  2. Webb, T. L., & Sheeran, P. (2006). Does Changing Behavioral Intentions Engender Behavior Change? A Meta-Analysis. doi:10.1037/0033-2909.132.2.249
  3. Krumpal, I. (2013). Determinants of Social Desirability Bias in Sensitive Surveys. doi:10.1007/s11135-011-9640-9
  4. de Corte, K. et al. (2021). Stated vs Revealed Preferences and Hypothetical Bias. doi:10.1002/hec.4246