Compiling Populations Instead of Prompting Them

Executive Summary

Most attempts to simulate human behavior with LLMs start by telling the model to mimic a person. A simple example: "I share my Netflix password with someone outside my household. Netflix announces it will start charging extra. Will I keep paying, stop sharing, or cancel?" The model produces an answer that sounds plausible, and it's almost always wrong.

The problem isn't the LLM. People lie on surveys. They claim they'll buy products they never will. The entire field of market research exists because stated preferences diverge from revealed behavior. So there's a genuine need for a different approach: one that models how populations actually respond to events, not how they say they would.

But the obvious solution, just ask an LLM to role-play each person, defeats the purpose. The model draws on the same survey data and the same cultural stereotypes that make surveys unreliable in the first place. You've replaced a self-reporting human with a self-reporting language model. The bias laundering is complete.

Extropy takes a different approach entirely. We never ask an LLM to be a person. We ask it to be a researcher, a statistician, and a compiler, then we use deterministic sampling to instantiate the population it designed.

Extropy compiles a population into an inspectable spec, then runs that spec with math.

The Problem

Consider what happens when you prompt an LLM with "You are a Netflix subscriber who shares passwords." The model has no grounded understanding of what that means statistically. It doesn't know the real distribution of household income, household size, or churn propensity for this segment, and it cannot reliably preserve correlations between them. It guesses and its guesses cluster toward the center of whatever stereotypes exist in its training data.

This produces three specific failure modes.

Central tendency collapse. Every simulated subscriber ends up suspiciously similar. The model has a vague prototype of "Netflix user" and generates minor variations on it. Even when the outputs are plausible, they're often narrow and internally inconsistent in ways you cannot audit.

Here is a concrete example from a real run (Netflix password sharing, N=1,000), comparing a grounded sampled distribution against a naive "just prompt the LLM" baseline.

[Chart: household income distribution for the Netflix password sharing run, N = 1,000, 10k bins — Extropy sampled values vs. naive prompted baseline]

Extropy sampled values come from a real run in entropy-ds. The prompted baseline comes from a single model call (gpt-5-mini) that directly outputs a histogram. The baseline distribution JSON can be found here. The full study artifacts can be found here.

When you ask a model to invent attributes directly, the result can be plausible-sounding but ungrounded, and you cannot trace where it came from. Extropy handles this more reliably by compiling an inspectable spec with explicit distributions, dependency edges, and constraints, then sampling from it deterministically.

Correlation blindness. Real populations have strong dependencies. Income correlates with subscription tier, household size, device usage, and price sensitivity. These aren't independent variables. But when an LLM generates attributes directly from a prompt, it has no mechanism to enforce statistical relationships. You get incoherent combinations like low-income agents on premium plans who are inexplicably insensitive to price, or large households that never share.

Stated preference echo. The deepest problem. LLMs trained on human text inherit the same biases that make surveys unreliable. When asked "would you cancel Netflix if they crack down on password sharing," the model produces the kind of answer a person might give in a survey: considered, reasonable, often overdramatic. Not the messy, contradictory, network-influenced response that actually happens when real information propagates through a real community and friction shows up at the moment of decision.

The solution isn't to prompt harder. It's to not prompt at all, at least not for agent generation. Instead, you build a compiler.

The Compiler Metaphor

A compiler transforms a high-level language into executable machine code. The programmer never writes assembly. They describe intent, and the compiler produces optimized instructions.

Extropy works the same way. You describe a population in natural language, like "1,000 US Netflix subscribers who share passwords," and the system compiles it into a complete statistical specification: attributes with researched distributions, dependency graphs, conditional modifiers, social network topology, and first-person persona templates. No LLM is asked to be a person. The LLM researches the population, designs the schema, and hands it to a deterministic sampler.

The architecture enforces a clean separation:

The LLM is the compiler. Pure math is the runtime.

This separation has a concrete consequence: the population specification is an intermediate representation. Like LLVM IR, it's a complete, inspectable, editable artifact that sits between intent and execution. You can version control it. Diff it. Hand edit a distribution parameter and resample without rerunning the expensive research step. The spec is the contract between intelligence (LLM) and execution (math), and it makes or breaks the simulation.

The full pipeline runs through sequential stages, each producing artifacts consumed by the next:

spec → extend → sample → network → persona → scenario

The first five stages constitute Phase 1: Population Creation. The sixth is Phase 2: Scenario Compilation. This post covers Phases 1 and 2: how we go from "Netflix password sharing subscribers" to a fully executable simulation specification.

The Pipeline

The walkthrough below goes stage by stage, with one goal: you should understand what each stage guarantees, and why those guarantees matter for realistic simulations.

Phase 1: Population Creation

Phase 1 transforms a natural language population description into sampled agents with social network structure and persona templates. It runs through five commands, each building on the last.

Spec and Extend: From One Prompt to a Real Compiler

This is where most of the intelligence lives. spec and extend turn a natural language population description into a YAML specification that captures the assumptions of the simulation in a way you can read, diff, and edit.

At a high level, the spec answers two questions:

  1. What attributes should agents have?
  2. How do we sample those attributes so the population has the right shape and correlations?

How Attributes Get Generated

In the entropy codebase, spec is a staged compiler. It is not "one big prompt". It is a pipeline that produces a dependency-aware sampling plan:

  1. Sufficiency check: confirm the description has enough detail to define a population (who, how many, where).
  2. Attribute selection: pick a complete set of attributes, typically 25 to 40, across four buckets: universal (age, household size, income), population-specific (plan tier, Netflix tenure), context-specific (sharing intensity, trust in Netflix), and personality (Big Five, risk tolerance).
  3. Hydration: fill in each attribute with a sampling strategy, distributions, sources, and dependency edges. Independent hydration researches a distribution with sources. Derived hydration writes a deterministic formula. Conditional hydration defines a base distribution plus modifiers that shift it based on upstream attributes.
  4. Constraint binding: topologically sort the dependency graph into a sampling_order so the sampler always has prerequisites before it needs them.
  5. Validation: catch structural errors (invalid distributions, impossible ranges, cyclic dependencies) and warn on semantic foot-guns (no-op modifiers, weird option weights, inconsistent strategies).

Every depends_on relationship is literally an edge in a graph. The compiler writes down that graph and produces an execution order that the sampler can follow deterministically.

Think of spec like a recipe for generating people.

It defines four things:

  1. Independent attributes: "Roll this from a distribution." Example: household_income.
  2. Conditional attributes: "Roll this based on something else." Example: price_sensitivity depends on household_income.
  3. Derived attributes: "Compute this from other fields." Example: monthly_cost from plan_price and add-ons.
  4. Constraints: "Don't let the population drift into nonsense." Example: keep values in-range and enforce basic coherence between attributes.

Those rules are the bridge between research and sampling. The sampler never invents new facts. It just follows the recipe.
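The four rule types can be sketched as a small YAML fragment. This is an illustrative sketch only; the field names (sampling, strategy, base, modifiers, formula) echo the concepts above but are not the exact entropy schema.

```yaml
attributes:
  - name: household_income          # independent: roll from a distribution
    sampling:
      strategy: independent
      distribution: { type: lognormal, mean: 11.0, std: 0.6 }
  - name: price_sensitivity         # conditional: base distribution + modifiers
    constraints: { min: 0.0, max: 1.0 }
    sampling:
      strategy: conditional
      depends_on: [household_income]
      base: { type: normal, mean: 0.5, std: 0.15 }
      modifiers:
        - when: "household_income < 40000"
          add: 0.2
  - name: monthly_cost              # derived: computed, never guessed
    sampling:
      strategy: derived
      depends_on: [plan_price, addon_cost]
      formula: "plan_price + addon_cost"
```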

Why We Split Attributes Into Independent, Conditional, and Derived

You cannot just sample 30 attributes independently and hope the population looks real. Dependencies are the whole point.

  1. Independent attributes exist because some things have reasonably stable marginals you can ground directly. Age is a distribution. Household size is a distribution. These are good "roots" for the dependency graph.
  2. Conditional attributes exist because many things only make sense given context. A Netflix plan tier depends on income. Sharing intensity depends on household size and relationship type. If you sample these independently, you get incoherent combinations.
  3. Derived attributes exist because some values should not be guessed at all. If a value is a function of other values, computing it keeps the population consistent and makes runs reproducible.

Modifiers are where conditional sampling becomes realistic. Instead of hard-coding rules, the spec can say: "sample from a base distribution, then shift it when conditions are true". For categoricals this often means weight overrides. For numerics this means multiply and add. For booleans this means probability overrides.
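For a categorical attribute, "base distribution plus weight overrides" might look like the sketch below. All names here (apply_categorical_modifiers, weight_overrides, the when predicate) are illustrative, not the entropy API.

```python
import random

def apply_categorical_modifiers(base_weights, agent, modifiers, rng):
    """Sample a categorical attribute: start from the base weights, then
    apply weight overrides whose conditions hold for this agent."""
    weights = dict(base_weights)
    for mod in modifiers:
        if mod["when"](agent):
            weights.update(mod["weight_overrides"])
    options = list(weights)
    return rng.choices(options, weights=[weights[o] for o in options], k=1)[0]

# Base plan-tier weights, shifted toward Premium for high-income agents.
base = {"Basic": 0.5, "Standard": 0.35, "Premium": 0.15}
mods = [{
    "when": lambda a: a["household_income"] > 100_000,
    "weight_overrides": {"Premium": 0.45, "Basic": 0.2},
}]
agent = {"household_income": 120_000}
tier = apply_categorical_modifiers(base, agent, mods, random.Random(42))
```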

Concrete Netflix-shaped example:

  1. Independent: sample education_level and employment_status.
  2. Conditional: sample household_income conditioned on those (education and employment shift the distribution).
  3. Conditional: sample netflix_subscription_type conditioned on household_income (higher income shifts weight toward Premium).
  4. Conditional: sample price_sensitivity conditioned on household_income.
  5. Conditional: sample sharing_relationship conditioned on age (younger users skew toward friends, older users skew toward family).
  6. Derived: compute scenario-specific convenience or cost fields from the concrete values above.

The important part is that the spec encodes the order of operations. You cannot sample price_sensitivity until you have household_income. You cannot compute monthly_cost until you know which plan the agent is on. The compiler writes that dependency structure down so the sampler can execute it deterministically.
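The sampling_order itself is just a topological sort of the depends_on graph. A minimal sketch using Kahn's algorithm, with a hypothetical function name and the Netflix-shaped edges from above:

```python
from collections import deque

def sampling_order(depends_on):
    """Kahn's algorithm: emit each attribute only after all of its
    prerequisites. `depends_on` maps attribute -> list of prerequisites."""
    indegree = {a: len(deps) for a, deps in depends_on.items()}
    dependents = {a: [] for a in depends_on}
    for a, deps in depends_on.items():
        for d in deps:
            dependents[d].append(a)
    ready = deque(sorted(a for a, n in indegree.items() if n == 0))
    order = []
    while ready:
        a = ready.popleft()
        order.append(a)
        for b in dependents[a]:
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    if len(order) != len(depends_on):
        raise ValueError("cyclic depends_on graph")  # caught by validation
    return order

graph = {
    "education_level": [],
    "employment_status": [],
    "household_income": ["education_level", "employment_status"],
    "netflix_subscription_type": ["household_income"],
    "price_sensitivity": ["household_income"],
}
order = sampling_order(graph)
```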

If something looks off, that is not a mysterious model behavior. It is a spec problem. The whole point of compiling into YAML is that you can fix the distribution or switch to a better family (e.g., lognormal instead of normal), then resample and see the change immediately.

Why extend Exists

The base spec is meant to be reusable. extend adds scenario-specific traits on top of a stable, grounded base population.

In the Netflix password sharing study, you can think of it like this:

  1. spec: "What does this population look like in general?"
  2. extend: "What extra traits matter for this specific decision?"

That separation matters because it lets you reuse the same grounded population across multiple scenario variants without redoing the heavy research work.

Concretely, extend is where Netflix-specific and scenario-specific attributes get layered in, for example:

  1. Plan context: netflix_subscription_type, netflix_tenure_months, netflix_satisfaction
  2. Sharing context: sharing_relationship, num_people_sharing_with, sharing_duration_months, sharing_account_ownership
  3. Decision drivers: price_sensitivity, social_obligation_weight

These are connected back to the base attributes through conditional dependencies and modifiers, so you get correlations for free. For example, netflix_subscription_type can depend on household_income, and num_people_sharing_with can depend on plan tier.
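A scenario-specific trait layered in by extend might look like the fragment below. Field names are illustrative, not the exact entropy schema; the point is the dependency edge back to a base attribute.

```yaml
extend:
  attributes:
    - name: num_people_sharing_with
      sampling:
        strategy: conditional
        depends_on: [netflix_subscription_type]
        base: { type: poisson, lambda: 1.4, min: 1 }
        modifiers:
          - when: "netflix_subscription_type == 'Premium'"
            multiply: 1.5   # more screens, more sharing
```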

Validation and Retry

LLMs make routine mistakes (wrong ranges, missing fields, inconsistent dependencies). So spec building is surrounded by validators and a retry loop.

In practice we rely on four checkpoints:

  1. Sufficiency validation: do we have enough context to define the population at all?
  2. Early topology check: once depends_on edges exist, is the dependency graph acyclic?
  3. Constraint binding validation: can we actually produce a valid sampling_order that respects dependencies?
  4. Full spec validation: structural errors (hard failures) plus semantic warnings (things that might be technically valid but suspicious).

When validation fails, the compiler retries with the concrete error messages as feedback. That turns the iteration cycle into something stable: propose, check, repair.
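The propose-check-repair loop is simple enough to sketch. Function names here (compile_with_retry, propose, validate, repair) are illustrative stand-ins, not the entropy API; in the real system, repair would be an LLM call that receives the error messages as feedback.

```python
def compile_with_retry(propose, validate, repair, max_attempts=3):
    """Propose a spec, validate it, and feed concrete errors back to the
    repair step until the spec passes or attempts run out."""
    spec = propose()
    for _ in range(max_attempts):
        errors = validate(spec)
        if not errors:
            return spec
        spec = repair(spec, errors)  # concrete error messages as feedback
    raise RuntimeError(f"spec failed validation after {max_attempts} attempts: {errors}")

# Toy usage: a draft with an impossible range gets repaired on the first pass.
draft = {"min": 5, "max": 1}
fixed = compile_with_retry(
    propose=lambda: dict(draft),
    validate=lambda s: [] if s["min"] <= s["max"] else ["min exceeds max"],
    repair=lambda s, errs: {**s, "min": s["max"]},
)
```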

Grounding: The Spec Needs Receipts

When possible, the spec is grounded in real population distributions (not vibes). For this post, the most important example is household income.

extend is the same story, but scoped. It takes a base population and adds scenario-specific attributes that do not exist in census-style datasets. We keep this as a separate step because it lets you reuse a base population across multiple scenarios without redoing the expensive grounding work.

Sample: The Runtime (Deterministic, Fast, Boring)

The sampler is a spec interpreter. It doesn't know about Netflix. It just executes the rules in the spec, in the sampling order implied by dependencies. Same spec plus same random seed equals identical agents. The sampler is deterministic.

The key property is that sampling is not another LLM prompt. It is pure computation, which means it is cheap, repeatable, and easy to debug.

What it does, mechanically, is simple:

  1. Walk attributes in sampling_order (so prerequisites are always present).
  2. For independent attributes, draw from the declared distribution.
  3. For conditional attributes, draw from the base distribution, then apply any modifiers whose conditions evaluate true.
  4. For derived attributes, evaluate the declared formula from already-sampled values.
  5. Clamp to hard constraints (min, max) so values stay in bounds.

If you don't like the population you got, you don't reprompt. You resample.

  1. Change the spec (the recipe) or the seed.
  2. Rerun sampling.
  3. Compare results.

This is why "math runtime" works:

  1. Reproducibility: you can rerun the same population tomorrow and get the same agents.
  2. Iteration speed: you can resample many times while tuning the spec without paying for more LLM calls.
  3. Debuggability: when a downstream result looks wrong, you can inspect the sampled agents and trace it back to the spec.

Output: agents.json, a JSON array of agent dictionaries with an _id field.

Network

This is where Extropy diverges most sharply from simple agent-based models. Agents don't exist in isolation. They exist in social networks, and those networks determine how information propagates. The network generation system is data-driven: all social structure decisions are encoded in a NetworkConfig rather than hardcoded.

Network generation starts by connecting agents based on attribute similarity. Then it applies rewiring to create weak ties, which makes information flow more realistic. [1]

Similarity-only graphs give you tight local clusters, but we found they do not create enough cross-cluster bridges. So the pipeline adds rewiring to introduce weak ties that connect communities and let information travel.
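Watts-Strogatz-style rewiring [1] is easy to sketch: with some probability, replace one endpoint of an edge with a random node, turning a few tight local ties into long-range weak ties. The function below is an illustrative simplification, not the entropy network generator.

```python
import random

def rewire(edges, nodes, beta, rng):
    """With probability beta, replace the second endpoint of each edge with
    a uniformly random node that isn't already a neighbor."""
    edge_set = set(edges)
    result = []
    for (u, v) in edges:
        if rng.random() < beta:
            # candidates: nodes that would not create a self-loop or duplicate
            candidates = [w for w in nodes
                          if w != u
                          and (u, w) not in edge_set
                          and (w, u) not in edge_set]
            if candidates:
                v_new = rng.choice(candidates)
                edge_set.discard((u, v))
                edge_set.add((u, v_new))
                v = v_new
        result.append((u, v))
    return result

# Ring lattice of 20 nodes, each linked to its next two neighbors,
# then rewired to create cross-cluster bridges.
nodes = list(range(20))
ring = ([(i, (i + 1) % 20) for i in range(20)]
        + [(i, (i + 2) % 20) for i in range(20)])
rewired = rewire(ring, nodes, beta=0.1, rng=random.Random(7))
```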

We use an LLM to generate a NetworkConfig for the population: which attributes drive edge formation, what relationship types exist, and how influence flows. The generator then calibrates until network statistics land in reasonable bands.

The output is not just edges. It is a social substrate for simulation: typed relationships plus asymmetric influence weights.

Persona Generation

Raw agent attributes like age: 34, income: 62000, price_sensitivity: 0.78 are useless to an LLM. The persona system converts them into internalized first-person statements that enable genuine embodiment rather than external puppetry.

The difference matters. "This agent has high price sensitivity" triggers the model to describe a character from the outside. "I am much more price sensitive than most people" encourages the model to reason as a person with that trait.

We learned this the hard way. Our first persona template started with instructions like "You are X from Y." The model treated that like a role assignment and responded with hypothetical reasoning. When two agents were similar, they would often produce the same safe response, even when the spec said they should differ.

So we redesigned personas to be declarative and first-person. The goal is not to command the model. The goal is to give it an identity it can inhabit. That shift was one of the biggest wins for response diversity and realism.

The Five-Step PersonaConfig Pipeline

The persona system generates a PersonaConfig once per population (not per-agent) via five LLM steps:

  1. Structure. Decide how to group attributes and which ones should be expressed concretely versus relatively.
  2. Booleans. Generate true and false first-person phrasings.
  3. Categoricals. Generate first-person phrasings for each option.
  4. Relative traits. Express psychological and behavioral traits relative to the population distribution.
  5. Concrete values. Render numbers and units in a way that reads naturally.

Rendering at Simulation Time

At runtime, personas are rendered computationally. There are no per-agent LLM calls. The engine applies the PersonaConfig to each agent's attributes, positions relative traits against the population, and renders a first-person narrative.

When decision_relevant_attributes are defined (by the scenario), those attributes appear first in the persona under a dedicated "Most Relevant to This Decision" section. This ensures the LLM attends to the attributes that matter most when forming its response.
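Computational rendering might look like the sketch below: concrete values are formatted directly, while relative traits are positioned against the population's percentiles. All names here (render_persona, the config layout) are illustrative, not the actual PersonaConfig schema.

```python
def render_persona(agent, population, config):
    """Render a first-person persona with no LLM call: format concrete
    values, and phrase relative traits by population percentile."""
    lines = []
    for attr, template in config["concrete"].items():
        lines.append(template.format(value=agent[attr]))
    for attr, phrasings in config["relative"].items():
        values = sorted(p[attr] for p in population)
        rank = sum(v <= agent[attr] for v in values) / len(values)
        if rank > 0.8:
            lines.append(phrasings["high"])
        elif rank < 0.2:
            lines.append(phrasings["low"])
        else:
            lines.append(phrasings["mid"])
    return "\n".join(lines)

config = {
    "concrete": {"age": "I am {value} years old."},
    "relative": {"price_sensitivity": {
        "high": "I am much more price sensitive than most people.",
        "mid": "I pay about as much attention to price as most people.",
        "low": "Price is rarely the deciding factor for me.",
    }},
}
population = [{"age": 30 + i, "price_sensitivity": i / 9} for i in range(10)]
persona = render_persona(population[9], population, config)
```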

The scalability math is compelling: a handful of calls to generate one config, instead of generating personas separately for every agent.

Phase 2: Scenario Compilation

Phase 2 transforms a natural language scenario description into a machine-executable ScenarioSpec. It bridges human intent ("Netflix announces a $3 price increase") to structured simulation instructions.

Scenario: Turning a Prompt Into a Contract

Compared to spec generation and network generation, scenario compilation is the "clean" part of the system. The hard work has already happened. We have a population, we have a network, we have a persona rendering system. Now we just need to define what happens to them.

The job of scenario is to convert a natural language scenario into a ScenarioSpec with two properties:

  1. It is executable.
  2. It is hard to accidentally misconfigure.

It does this with a short pipeline: parse the event, define who gets exposed and when, define how information spreads through the network, define outcomes to measure, then validate the whole thing against the population and network artifacts. It also auto-configures sensible default simulation parameters based on population size.
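The resulting contract might look something like the fragment below. This is a hypothetical sketch; the field names are illustrative, not the actual ScenarioSpec schema.

```yaml
scenario:
  event:
    description: "Netflix announces a $3/month surcharge for shared accounts"
  exposure:
    initial_fraction: 0.2        # who hears the news directly at t=0
    channels: [news, social]
  propagation:
    model: network_cascade       # spread along typed network edges
  outcomes:
    - name: decision
      options: [keep_paying, stop_sharing, cancel]
  simulation:
    timesteps: 10                # auto-configured from population size
```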

Infrastructure: What Keeps It Running

Extropy also has a few practical pieces of plumbing that make the pipeline reliable at scale: model routing (use a stronger model for compilation than for runtime), structured retry with validation feedback, and rate limiting to avoid provider throttles. [2]

The Spec as Artifact

The population specification, the complete population.yaml with all its distributions, modifiers, constraints, and dependency graphs, is the single most important artifact in the system. It is the intermediate representation that makes or breaks the simulation.

Because it's a YAML file, it's human-readable, version controllable, and diffable. You can inspect exactly why your simulated Netflix subscribers have a particular income distribution. You can hand edit a modifier weight, tweak a constraint, add an attribute, and resample without rerunning any LLM calls. The spec captures the expensive intelligence, the research, the statistical reasoning, the dependency modeling, in a format that can be audited, shared, and improved over multiple rounds.

This is the fundamental advantage of the compiler architecture. The spec is not a black box. It's the contract between research and execution. When a simulation produces unexpected results, you trace backward through the spec to find out why. When you need a different scenario for the same population, you extend the existing spec rather than starting from scratch. When a colleague questions your assumptions, you point them at the YAML.

The spec is the floor, not the ceiling. It defines what an agent is. The simulation determines what an agent does. In the Netflix study, the spec defines the economic and behavioral constraints; the simulation determines which households keep paying, which stop sharing, and which churn.

If you want to reason about Extropy like a compiler, the outputs matter as much as the code. Every stage produces an artifact you can inspect, version, and diff. Those artifacts are how you debug, how you iterate, and how you avoid rerunning expensive research.

entropy spec
base.yaml
The grounded population contract.
Key Points
  1. Defines attributes, distributions, dependencies, and constraints.
  2. This is the artifact you polish. Everything downstream inherits its assumptions.
  3. If you change the spec, you can resample without rerunning expensive research.
What It Looks Like

```yaml
meta:
  description: "1,000 US Netflix subscribers who share passwords"
  size: 1000
attributes:
  - name: age
    type: int
    sampling:
      strategy: independent
      distribution: { type: normal, mean: 41, std: 9 }
sampling_order:
  - age
  - household_income
```

What's Next

This post covered how Extropy compiles populations, the journey from natural language to a complete, validated specification with sampled agents, social networks, and persona templates. But the compiled specification is only valuable once you run it.

The next post in this series will cover running the compiled artifacts: how we execute a scenario over timesteps, how we aggregate outcomes, and how we debug a simulation when the output looks wrong.

The compiler builds the world. The simulator brings it to life.

Extropy is open source and available now on GitHub and PyPI.

Appendix: Reproducibility Pack

Everything needed to reproduce the chart and inspect the pipeline artifacts is public:

  1. Example grounded population artifacts (Netflix password sharing run): base.yaml, population.yaml, population.persona.yaml, population.scenario.yaml, meta.yaml
  2. Income histogram artifacts used in the chart: Extropy sampled histogram, naive prompted baseline

References

  1. Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of small-world networks. Nature. doi:10.1038/30918
  2. Token bucket (rate limiting). Wikipedia