Not what it knows. Not what it scores. How it acts under pressure, temptation, and uncertainty.
Quantitative behavioral fingerprints.
21 experiments across 6 categories, grounded in peer-reviewed behavioral economics research.
Every LLM benchmark measures capability: MMLU tests knowledge, HumanEval tests coding. But none answers the question that matters when you deploy a model: how does it act?
Does it take risks or play it safe? Is it consistent across framings? Will it tell you what you want to hear? How does it negotiate?
Bench runs structured behavioral experiments drawn from decades of experimental economics and behavioral psychology research, and produces a quantitative behavioral profile: a personality fingerprint for AI.
# Run all experiments on a model
bench run --model gpt-4o --suite full
# Compare two models
bench compare --models gpt-4o,claude-sonnet-4-6
# Generate behavioral profile
bench profile --model gpt-4o --output profile.json
# Launch dashboard
bench dashboard --port 8080
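Once you have profile JSON files, you can work with them directly. Here is a minimal Python sketch that diffs two profiles dimension by dimension; note that the structure shown (`model`, `dimensions`, and the dimension names) is hypothetical and illustrative only, since the actual schema written by `bench profile` may differ.

```python
# Hypothetical profile structures; the real schema emitted by
# `bench profile` may differ. In practice you would load these
# with json.load() from the files produced by the CLI.
profile_a = {
    "model": "gpt-4o",
    "dimensions": {
        "risk_tolerance": 0.62,       # illustrative scores, not real results
        "framing_consistency": 0.81,
        "sycophancy": 0.34,
    },
}
profile_b = {
    "model": "claude-sonnet-4-6",
    "dimensions": {
        "risk_tolerance": 0.48,
        "framing_consistency": 0.77,
        "sycophancy": 0.21,
    },
}

def dimension_deltas(a, b):
    """Per-dimension score difference (a minus b) over shared dimensions."""
    shared = a["dimensions"].keys() & b["dimensions"].keys()
    return {
        dim: round(a["dimensions"][dim] - b["dimensions"][dim], 2)
        for dim in sorted(shared)
    }

for dim, delta in dimension_deltas(profile_a, profile_b).items():
    print(f"{dim}: {delta:+.2f}")
```

The same delta dictionary could feed a radar-chart overlay, which is what the dashboard view renders for selected profiles.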