
Pack calibration

Calibration is how a persona pack proves its fit to a known human baseline. Running synthpanel pack calibrate writes the result back into the pack YAML, so the manifest is self-describing.

CLI reference

synthpanel pack calibrate <PACK_YAML>
  --against DATASET:QUESTION   # e.g. gss:HAPPY (v1 allowlist: gss, ntia)
  [--n 50]                     # panel size
  [--models MODELS]            # default: panel run default
  [--samples-per-question 15]  # for stable JSD
  [--output PATH]              # default: rewrite PACK_YAML in place
  [--dry-run]                  # print what would be written
  [--yes]                      # skip overwrite confirm
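For example, to calibrate a pack against the GSS happiness item with a 100-person panel (the pack path here is illustrative):

$ synthpanel pack calibrate packs/us_adults.yaml --against gss:HAPPY --n 100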

After a successful run the pack YAML gains a top-level calibration: list:

calibration:
  - dataset: gss
    question: HAPPY
    jsd: 0.18
    n: 100
    samples_per_question: 15
    models: [haiku:0.30, gemini-flash-lite:0.30, qwen3-plus:0.20, deepseek-v3:0.20]
    extractor: pick_one:auto-derived
    panelist_cost_usd: 0.6451
    calibrated_at: 2026-04-26T14:23:00Z
    synthpanel_version: 0.11.1
    methodology_url: https://synthpanel.dev/docs/calibration
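Because the calibration: list lives in the pack YAML itself, downstream tooling can gate on it without invoking synthpanel. A minimal CI-check sketch, assuming PyYAML; the pack path and the 0.30 threshold are illustrative choices, not part of synthpanel:

import sys
import yaml  # PyYAML

with open("packs/us_adults.yaml") as f:  # hypothetical pack path
    pack = yaml.safe_load(f)

# Fail the build if any recorded baseline has drifted past "moderate".
for entry in pack.get("calibration", []):
    if entry["jsd"] > 0.30:
        sys.exit(f"{entry['dataset']}:{entry['question']} jsd={entry['jsd']} > 0.30")

print("all calibration entries within threshold")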

What JSD means

Calibration reports Jensen-Shannon divergence between the pack's extracted answer distribution and the dataset's published human distribution. JSD is bounded in [0, 1] (using log₂):

JSD          Interpretation
< 0.05       Tightly aligned: the pack mirrors the human distribution.
0.05–0.15    Strong alignment: small but real distributional drift.
0.15–0.30    Moderate: usable for directional research, not census.
0.30–0.50    Weak: major disagreement on at least one mode.
> 0.50       Effectively uncalibrated.

JSD is computed locally via the same metric the convergence tracker uses (see src/synth_panel/convergence.py). The pack YAML is the only durable artifact: calibration does not phone home, and it is distinct from --submit-to-synthbench.
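The base-2 definition is easy to reproduce in standalone Python for spot-checking. This is an illustrative re-implementation, not the code in src/synth_panel/convergence.py, and the distributions below are made-up numbers rather than real GSS figures:

import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, bounded in [0, 1]."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) contributes nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human = [0.31, 0.56, 0.13]   # made-up 3-option baseline distribution
panel = [0.24, 0.60, 0.16]   # made-up extracted panel distribution
print(round(jsd(human, panel), 3))  # small value: tightly aligned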

When to calibrate

Calibrate a pack when any of the following is true:

  - the pack's personas or their weights have changed since the entry's calibrated_at;
  - the model mix you run with no longer matches the entry's recorded models:;
  - you have upgraded synthpanel past the recorded synthpanel_version;
  - the pack has no entry yet for the baseline you care about.

You do not need to calibrate before every panel run. The calibration: list is a snapshot, not a runtime guard.

Choosing a baseline

The --against DATASET:QUESTION flag accepts the v1 inline-publishable allowlist:

Dataset   Notes
gss       General Social Survey aggregates. HAPPY is the canonical demo question (3-option Likert); HEALTH and LIFE also work.
ntia      NTIA Internet Use Survey. Useful for technology-adoption packs.

Other datasets (OpinionsQA, PewTech, GlobalOpinionQA, WVS, …) are gated for redistribution and require post-hoc calibration via synthbench directly. They are intentionally not exposed here so a calibrated pack YAML can be checked into a public repo without relicensing concerns.

To override the allowlist for internal use:

$ SYNTHBENCH_ALLOW_GATED=1 synthpanel pack calibrate ... --against wvs:Q1

Methodology

For each calibration run:

  1. The SynthBench baseline payload is fetched (same loader as panel run --calibrate-against).
  2. A pick_one extraction schema is auto-derived from the baseline's small-enum distribution. If the baseline is Likert/ranking or too wide, the command hard-fails with a hint to supply --extract-schema via panel run instead.
  3. A panel of --n panelists is run against the dataset's question text. Each panelist answers --samples-per-question times.
  4. The model distribution is compared against the baseline's human_distribution via Jensen-Shannon divergence.
  5. The resulting JSD, plus provenance (extractor, models, cost, timestamp, synthpanel version), is merged into the pack YAML's calibration: list. Re-running against the same (dataset, question) pair replaces the prior entry rather than appending, so the list is always a clean record of one result per baseline (see the merge sketch after this list).
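The replace-not-append behaviour in step 5 amounts to an upsert keyed on the (dataset, question) pair. A minimal sketch of that merge over the entry dicts shown earlier, not synthpanel's actual implementation:

def merge_calibration(pack: dict, new_entry: dict) -> None:
    """Replace any prior entry for the same baseline, then append new_entry."""
    key = (new_entry["dataset"], new_entry["question"])
    entries = pack.get("calibration", [])
    pack["calibration"] = [
        e for e in entries
        if (e["dataset"], e["question"]) != key
    ] + [new_entry]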

Idempotence and re-runs

Re-running synthpanel pack calibrate against a (dataset, question) pair that already has an entry replaces that entry rather than appending, so the calibration: list always holds one result per baseline. Use --dry-run to preview what would be written, --yes to skip the overwrite confirmation, and --output to write the updated YAML somewhere other than the original pack file.

Limitations