The index rewards complete, structured responses that preserve dependencies and remain useful after validation, rather than treating any valid object as a strong result.
Clear Ideas Research
AI Capability Index
A Clear Ideas benchmark for complex enterprise reasoning. It measures whether models can decompose ambiguous requests, follow dense instructions, preserve structured state, and return complete outputs that survive automated validation.
Strong results collect intermediate outputs, reuse them across dependent steps, and synthesize them into final answers without losing the structure of the work.
Reasoning level, output discipline, latency, and cost all affect whether a model is practical for repeatable enterprise execution.
Leaderboard
Enterprise reasoning under operational constraints.
The Clear Ideas AI Capability Index compares models on normalized scores for capability, stateful aggregation, reliability, sophistication, speed, and price/performance. It is designed for practical enterprise work where answers must be structured, complete, and execution-ready.
Capability Index
Normalized index, higher is better
Capability vs. Price/Performance
Upper-right models combine strong reasoning with efficient execution
Top Model Shape
Capability, stateful aggregation, reliability, speed, and value profile
Reliability Index
First-shot completion weighted by output completeness and sophistication
Capability shape
Useful outputs need more than valid structure.
The next charts separate three related dimensions: whether a model preserves state, whether it builds its workflows with enough useful complexity, and whether the resulting structure is genuinely high quality. A minimal scoring sketch follows the list below.
- Stateful aggregation tracks reuse of intermediate values.
- Sophistication rewards purposeful workflow depth.
- Quality vs. structure catches shallow but valid outputs.
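To make these dimensions concrete, here is a minimal Python sketch of how stateful aggregation could be scored. It is illustrative only: the index's real parser and scoring pipeline are not published, and the `Step` structure and reuse ratio below are assumptions made for the example.

```python
# Illustrative sketch; the index's actual scoring pipeline is not published.
# Assumes each response is parsed into ordered steps that declare named
# intermediate values and may reference values declared earlier.
from dataclasses import dataclass, field

@dataclass
class Step:
    declares: set[str] = field(default_factory=set)    # values this step produces
    references: set[str] = field(default_factory=set)  # values this step consumes

def stateful_aggregation_score(steps: list[Step]) -> float:
    """Fraction of declared intermediate values that some later step reuses."""
    declared: set[str] = set()
    reused: set[str] = set()
    for step in steps:
        reused |= step.references & declared  # reuse of previously created state
        declared |= step.declares
    return len(reused) / len(declared) if declared else 0.0

# A schema-valid but shallow response declares values it never touches again,
# so it scores 0.0 here even though structural validation would pass.
shallow = [Step(declares={"subtotal"}), Step(declares={"total"})]
stateful = [Step(declares={"subtotal"}),
            Step(references={"subtotal"}, declares={"total"})]
assert stateful_aggregation_score(shallow) == 0.0
assert stateful_aggregation_score(stateful) == 0.5
```

The same mechanics illustrate the quality-vs-structure distinction: both responses above are structurally valid, but only one actually carries state forward.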
Stateful Aggregation Index
Intermediate values reused, collected, and synthesized into final outputs
Sophistication Index
Complexity, step variety, variable use, tags, and output contracts
Quality vs. Structure
Separates valid-looking output from genuinely useful output
Detailed scores
Overall, capability, stateful aggregation, reliability, sophistication, price/performance, and speed.
| Model | Provider | Tier | Overall | Capability | Stateful | Reliability | Soph. | Price/perf. | Speed |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | Frontier | 79.1 | 72.3 | 78.2 | 92.6 | 86.3 | 61.1 | 77.4 |
| Claude Opus 4.7 | Anthropic | Frontier | 77.0 | 70.5 | 80.1 | 91.3 | 80.8 | 60.9 | 56.8 |
| GPT-5.4 Mini | OpenAI | Frontier | 75.7 | 69.0 | 78.0 | 89.5 | 85.8 | 76.5 | 54.8 |
| GPT-5.3 Codex | OpenAI | Frontier | 75.5 | 68.0 | 77.0 | 90.8 | 80.6 | 73.4 | 80.5 |
| GPT-5.4 | OpenAI | Frontier | 75.1 | 68.0 | 77.0 | 91.6 | 83.7 | 68.2 | 60.2 |
| Grok 4.20 | xAI | Frontier | 74.0 | 67.4 | 77.5 | 90.1 | 77.7 | 72.1 | 53.5 |
| Grok Code Fast 1 | xAI | Below bar | 68.9 | 64.0 | 75.2 | 73.2 | 78.4 | 76.1 | 68.6 |
| GPT-5.4 Mini Low | OpenAI | Below bar | 68.2 | 62.4 | 76.1 | 77.1 | 79.6 | 73.3 | 66.6 |
| Claude Haiku 4.5 | Anthropic | Below bar | 67.1 | 62.4 | 78.9 | 71.0 | 74.6 | 72.4 | 61.9 |
| GPT-5.4 Nano | OpenAI | Below bar | 64.1 | 58.3 | 63.8 | 76.3 | 67.6 | 68.2 | 69.4 |
| GPT-5 Mini | OpenAI | Below bar | 59.6 | 56.4 | 65.0 | 62.8 | 70.5 | 64.7 | 31.5 |
| GPT-OSS 120B on Groq | OpenAI | Below bar | 55.4 | 50.4 | 59.5 | 63.9 | 63.4 | 62.3 | 66.7 |
| Claude Sonnet 4.6 | Anthropic | Below bar | 46.5 | 44.2 | 55.7 | 47.6 | 53.4 | 45.1 | 21.5 |
| Command A | Cohere | Below bar | 44.5 | 40.0 | 52.5 | 57.1 | 52.4 | 48.6 | 19.6 |
| Gemini 3 Pro | Google | Below bar | 33.9 | 30.9 | 31.1 | 42.1 | 37.3 | 35.6 | 44.5 |
| Grok 4.1 Fast Non-Reasoning | xAI | Below bar | 24.2 | 22.7 | 26.5 | 24.7 | 26.5 | 26.4 | 24.2 |
| Grok 4.1 Fast Reasoning | xAI | Below bar | 22.8 | 20.5 | 28.3 | 26.4 | 24.3 | 25.3 | 28.5 |
| Gemini 3 Flash | Google | Below bar | 22.0 | 19.8 | 21.1 | 27.2 | 25.6 | 24.3 | 27.6 |
| GPT-5 Nano | OpenAI | Below bar | 5.6 | 4.5 | 7.1 | 9.6 | 7.3 | 8.4 | 7.0 |
| GPT-OSS 20B on Groq | OpenAI | Below bar | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Domain breakdown
Different domains expose different reasoning limits.
Each domain stresses a different capability: grounded extraction, abstract planning, state tracking, aggregation, instruction following, and structured output discipline. Scores are normalized indexes, so each model's shape is easier to compare across domains.
| Model | Market reasoning | Contract analysis | Stateful aggregation | Action synthesis | Research synthesis | Financial reasoning |
|---|---|---|---|---|---|---|
| GPT-5.5 | 71.4 | 70.7 | 70.3 | 71.2 | 70.5 | 71.8 |
| Claude Opus 4.7 | 71.2 | 67.3 | 65.2 | 64.3 | 65.5 | 71.3 |
| GPT-5.4 Mini | 69.0 | 69.2 | 63.5 | 68.5 | 66.8 | 65.3 |
| GPT-5.3 Codex | 68.5 | 67.8 | 66.2 | 68.0 | 70.7 | 69.5 |
| GPT-5.4 | 70.0 | 67.5 | 69.2 | 65.8 | 70.7 | 69.2 |
| Grok 4.20 | 70.5 | 68.2 | 66.8 | 67.3 | 69.7 | 70.3 |
| Grok Code Fast 1 | 68.2 | 60.3 | 67.2 | 65.9 | 72.3 | 67.7 |
| GPT-5.4 Mini Low | 68.1 | 69.4 | 54.0 | 69.2 | 65.7 | 64.2 |
| Claude Haiku 4.5 | 69.3 | 61.2 | 59.0 | 66.1 | 62.2 | 58.0 |
| GPT-5.4 Nano | 70.5 | 70.3 | 7.5 | 70.1 | 65.9 | 71.1 |
| GPT-5 Mini | 70.0 | 66.2 | 0.0 | 67.5 | 70.7 | 69.3 |
| GPT-OSS 120B on Groq | 73.9 | 70.8 | 0.0 | 59.1 | 70.7 | 63.0 |
| Claude Sonnet 4.6 | 70.5 | 61.5 | 63.7 | 68.2 | 0.0 | 0.0 |
| Command A | 66.7 | 55.7 | 0.0 | 54.8 | 62.7 | 49.1 |
| Gemini 3 Pro | 0.3 | 68.6 | 0.0 | 0.0 | 69.9 | 69.2 |
| Grok 4.1 Fast Non-Reasoning | 0.3 | 0.0 | 0.0 | 68.0 | 69.7 | 0.0 |
| Grok 4.1 Fast Reasoning | 0.0 | 0.3 | 66.0 | 67.1 | 0.0 | 0.0 |
| Gemini 3 Flash | 0.0 | 0.0 | 0.0 | 0.0 | 68.2 | 58.8 |
| GPT-5 Nano | 0.0 | 0.0 | 43.7 | 0.0 | 0.0 | 0.0 |
| GPT-OSS 20B on Groq | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
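As background on what "normalized index" means throughout these tables, here is a minimal sketch assuming a plain min-max transform; the actual floors, ceilings, and any clipping the index applies are not published, so the bounds below are placeholders.

```python
def to_index(raw: float, floor: float, ceiling: float) -> float:
    """Map a raw metric onto the 0-100 index scale, clipped at the bounds.
    Min-max scaling is an assumption; the index's exact transform is unpublished."""
    scaled = 100.0 * (raw - floor) / (ceiling - floor)
    return max(0.0, min(100.0, scaled))

# e.g. a raw domain accuracy of 0.75 against an assumed 0.0-1.0 range -> 75.0
print(to_index(0.75, 0.0, 1.0))
```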
What the benchmark tests
The benchmark stresses dense enterprise reasoning: extraction, research synthesis, state tracking, constraint following, aggregation, and complete machine-readable output.
Why it is hard
The model has to reason about values that do not exist yet, carry them through multiple dependent steps, respect constraints, and produce an object that survives automated validation.
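A hypothetical miniature of such a task, in Python for concreteness (the benchmark's real tasks and validators are not published): the discount cannot be written down until the subtotal exists, and the final object must carry every dependent value through validation.

```python
def run_task(line_items: list[float]) -> dict:
    subtotal = sum(line_items)                            # step 1 creates the value
    discount = 0.1 * subtotal if subtotal > 100 else 0.0  # step 2 depends on step 1
    total = subtotal - discount                           # step 3 depends on 1 and 2
    return {"subtotal": subtotal, "discount": discount, "total": total}

def validate(obj: dict) -> bool:
    """Automated validation: completeness first, then internal consistency."""
    if not {"subtotal", "discount", "total"} <= obj.keys():
        return False  # incomplete objects fail outright
    return abs(obj["subtotal"] - obj["discount"] - obj["total"]) < 1e-9

assert validate(run_task([60.0, 70.0]))  # passes: totals are internally consistent
```

An output that drops `discount` but keeps a plausible `total` is exactly the "valid-looking but incomplete" failure the index penalizes.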
How scoring works
Scores are normalized indexes. Overall performance combines capability, structural discipline, stateful aggregation, coverage, reliability, sophistication, speed, and price/performance so models can be compared without exposing raw execution mechanics. A score of 100 is a theoretical expert ceiling, not a reward for simply producing a valid output. Correction attempts are treated as reliability costs; useful clarification is different from repair after an invalid result.
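A minimal sketch of one way such a combination could work, assuming a weighted arithmetic mean and a flat per-attempt reliability penalty; the published methodology does not disclose the real weights or penalty schedule, so every number below is hypothetical.

```python
WEIGHTS = {  # hypothetical weights; the index's real weights are not disclosed
    "capability": 0.30, "stateful": 0.20, "reliability": 0.20,
    "sophistication": 0.10, "speed": 0.10, "price_perf": 0.10,
}

def overall(scores: dict[str, float], correction_attempts: int) -> float:
    """Combine normalized subscores; repairs after invalid output cost reliability."""
    penalized = dict(scores)
    penalized["reliability"] = max(0.0, scores["reliability"] - 5.0 * correction_attempts)
    return sum(w * penalized[k] for k, w in WEIGHTS.items())
```

The penalty term is what separates a clean first-shot completion from an answer that only became valid after repair.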
Whitepaper
Read the technical paper behind the index.
The paper explains the benchmark design, scoring model, statistical appendix, and why stateful aggregation is the key capability separating frontier models from partial performers.
Follow Clear Ideas Research
Get new benchmark updates, model comparisons, and practical notes on governed AI.