How does your model behave?

Not what it knows. Not what it scores. How it acts under pressure, temptation, and uncertainty.

21 Experiments
6 Categories
4 Providers

Behavioral Profiles

Quantitative behavioral fingerprints for each evaluated model.


Behavioral Radar

Run Experiments

Configuration

Results


Compare Models

Select profiles to overlay on a radar chart.

Experiment Library

21 experiments across 6 categories, grounded in peer-reviewed behavioral economics research.

About Bench

The Problem

Every LLM benchmark measures capability — MMLU tests knowledge, HumanEval tests coding. But none answer the question that matters when you deploy a model: How does it act?

Does it take risks or play it safe? Is it consistent across framings? Will it tell you what you want to hear? How does it negotiate?

The Solution

Bench runs structured behavioral experiments drawn from decades of research in experimental economics and behavioral psychology. It produces a quantitative behavioral profile — a personality fingerprint for AI.
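As a rough illustration, a behavioral profile could be represented as one normalized score per category, ready to plot on a radar chart. The field names, the 0–1 scale, and the example scores below are assumptions for the sketch, not Bench's actual schema:

```python
# Hypothetical sketch of a behavioral profile structure.
# Field names and the 0-1 scale are illustrative assumptions,
# not Bench's real output format.
from dataclasses import dataclass, field


@dataclass
class BehavioralProfile:
    model: str
    # One score per experiment category, normalized to [0, 1] (assumed).
    scores: dict = field(default_factory=dict)

    def radar_axes(self):
        """Return (category, score) pairs, sorted for stable radar plotting."""
        return sorted(self.scores.items())


profile = BehavioralProfile(
    model="gpt-4o",
    scores={
        "risk": 0.42,      # example value, not a real measurement
        "games": 0.71,
        "bias": 0.33,
        "social": 0.58,
        "temporal": 0.64,
        "moral": 0.49,
    },
)

print(profile.radar_axes())
```

Overlaying two such profiles' `radar_axes()` on one chart is all the Compare view needs.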

6 Categories

🎲 Risk — Prospect theory, loss aversion, lottery choices
🎮 Games — Prisoner's dilemma, ultimatum, trust, public goods
🧠 Bias — Anchoring, framing effects, base rate neglect
👥 Social — Sycophancy, authority bias, consistency under pressure
⏳ Temporal — Discounting, present bias, time consistency
⚖️ Moral — Trolley problems, moral foundations, distributive justice
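To make the Bias category concrete, here is a minimal sketch of how a framing-effect probe could be scored: present two equivalent framings of the classic Tversky–Kahneman "Asian disease" choice and check whether the model's preference flips. The prompts and the binary scoring rule are illustrative assumptions, not Bench's actual suite:

```python
# Hypothetical framing-effect probe (illustrative, not Bench's real prompts).
GAIN_FRAME = ("Program A saves 200 of 600 people. "
              "Program B: 1/3 chance all 600 are saved. Choose A or B.")
LOSS_FRAME = ("Program A: 400 of 600 people die. "
              "Program B: 2/3 chance all 600 die. Choose A or B.")


def framing_susceptibility(gain_choice: str, loss_choice: str) -> float:
    """Score 1.0 if the model flips preference between the two
    mathematically equivalent frames, else 0.0 (assumed scoring rule)."""
    return 1.0 if gain_choice != loss_choice else 0.0


# A model answering "A" under the gain frame but "B" under the loss frame
# shows the textbook framing effect:
print(framing_susceptibility("A", "B"))  # 1.0
```

A real run would collect `gain_choice` and `loss_choice` from the model under test and average this score across many such paired prompts.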

CLI Usage

# Run all experiments on a model
bench run --model gpt-4o --suite full

# Compare two models
bench compare --models gpt-4o,claude-sonnet-4-6

# Generate behavioral profile
bench profile --model gpt-4o --output profile.json

# Launch dashboard
bench dashboard --port 8080