Clear Ideas Research

AI Capability Index

A Clear Ideas benchmark for complex enterprise reasoning. It measures whether models can decompose ambiguous requests, follow dense instructions, preserve structured state, and return complete outputs that survive automated validation.

Index leader: GPT-5.5, scoring 79.1 (Reliability 93, Capability 72.3, Stateful 78.2, Price/perf. 61.1)
Model coverage: Frontier + open
Benchmark scope: Enterprise reasoning
Best price/perf.: GPT-5.4 Mini
Hardest capability: Stateful aggregation

Scoring focus: Valid output is necessary but not sufficient.

The index rewards complete, structured responses that preserve dependencies and remain useful after validation, rather than treating any valid object as a strong result.

Core capability: Stateful aggregation is the separator.

Strong results collect intermediate outputs, reuse them across dependent steps, and synthesize them into final answers without losing the structure of the work.

Model configuration: Settings are part of the measured system.

Reasoning level, output discipline, latency, and cost all affect whether a model is practical for repeatable enterprise execution.

Leaderboard

Enterprise reasoning under operational constraints.

The Clear Ideas AI Capability Index compares models on normalized scores for capability, stateful aggregation, reliability, sophistication, speed, and price/performance. It is designed for practical enterprise work where answers must be structured, complete, and execution-ready.

Capability Index

Normalized index, higher is better

Capability vs. Price/Performance

Upper-right models combine strong reasoning with efficient execution

Top Model Shape

Capability, stateful aggregation, reliability, speed, and value profile

Reliability Index

First-shot completion weighted by output completeness and sophistication

Capability shape

Useful outputs need more than valid structure.

The next charts separate three related dimensions: whether a model preserves state, whether it designs with enough useful complexity, and whether the resulting structure is genuinely high quality.

  • Stateful aggregation tracks reuse of intermediate values.
  • Sophistication rewards purposeful workflow depth.
  • Quality vs. structure catches shallow but valid outputs.

Stateful Aggregation Index

Intermediate values reused, collected, and synthesized into final outputs

Sophistication Index

Complexity, step variety, variable use, tags, and output contracts

Quality vs. Structure

Separates valid-looking output from genuinely useful output

Detailed scores

Overall, capability, reliability, sophistication, and price/performance.

| Model | Provider | Tier | Overall | Capability | Stateful | Reliability | Soph. | Price/perf. | Speed |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | Frontier | 79.1 | 72.3 | 78.2 | 92.6 | 86.3 | 61.1 | 77.4 |
| Claude Opus 4.7 | Anthropic | Frontier | 77.0 | 70.5 | 80.1 | 91.3 | 80.8 | 60.9 | 56.8 |
| GPT-5.4 Mini | OpenAI | Frontier | 75.7 | 69.0 | 78.0 | 89.5 | 85.8 | 76.5 | 54.8 |
| GPT-5.3 Codex | OpenAI | Frontier | 75.5 | 68.0 | 77.0 | 90.8 | 80.6 | 73.4 | 80.5 |
| GPT-5.4 | OpenAI | Frontier | 75.1 | 68.0 | 77.0 | 91.6 | 83.7 | 68.2 | 60.2 |
| Grok 4.20 | xAI | Frontier | 74.0 | 67.4 | 77.5 | 90.1 | 77.7 | 72.1 | 53.5 |
| Grok Code Fast 1 | xAI | Below bar | 68.9 | 64.0 | 75.2 | 73.2 | 78.4 | 76.1 | 68.6 |
| GPT-5.4 Mini Low | OpenAI | Below bar | 68.2 | 62.4 | 76.1 | 77.1 | 79.6 | 73.3 | 66.6 |
| Claude Haiku 4.5 | Anthropic | Below bar | 67.1 | 62.4 | 78.9 | 71.0 | 74.6 | 72.4 | 61.9 |
| GPT-5.4 Nano | OpenAI | Below bar | 64.1 | 58.3 | 63.8 | 76.3 | 67.6 | 68.2 | 69.4 |
| GPT-5 Mini | OpenAI | Below bar | 59.6 | 56.4 | 65.0 | 62.8 | 70.5 | 64.7 | 31.5 |
| GPT-OSS 120B on Groq | OpenAI | Below bar | 55.4 | 50.4 | 59.5 | 63.9 | 63.4 | 62.3 | 66.7 |
| Claude Sonnet 4.6 | Anthropic | Below bar | 46.5 | 44.2 | 55.7 | 47.6 | 53.4 | 45.1 | 21.5 |
| Command A | Cohere | Below bar | 44.5 | 40.0 | 52.5 | 57.1 | 52.4 | 48.6 | 19.6 |
| Gemini 3 Pro | Google | Below bar | 33.9 | 30.9 | 31.1 | 42.1 | 37.3 | 35.6 | 44.5 |
| Grok 4.1 Fast Non-Reasoning | xAI | Below bar | 24.2 | 22.7 | 26.5 | 24.7 | 26.5 | 26.4 | 24.2 |
| Grok 4.1 Fast Reasoning | xAI | Below bar | 22.8 | 20.5 | 28.3 | 26.4 | 24.3 | 25.3 | 28.5 |
| Gemini 3 Flash | Google | Below bar | 22.0 | 19.8 | 21.1 | 27.2 | 25.6 | 24.3 | 27.6 |
| GPT-5 Nano | OpenAI | Below bar | 5.6 | 4.5 | 7.1 | 9.6 | 7.3 | 8.4 | 7.0 |
| GPT-OSS 20B on Groq | OpenAI | Below bar | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Domain breakdown

Different domains expose different reasoning limits.

Each domain stresses a different capability: grounded extraction, abstract planning, state tracking, aggregation, instruction following, and structured output discipline. Scores are normalized indexes so the shape of each model is easier to compare.

| Model | Market reasoning | Contract analysis | Stateful aggregation | Action synthesis | Research synthesis | Financial reasoning |
|---|---|---|---|---|---|---|
| GPT-5.5 | 71.4 | 70.7 | 70.3 | 71.2 | 70.5 | 71.8 |
| Claude Opus 4.7 | 71.2 | 67.3 | 65.2 | 64.3 | 65.5 | 71.3 |
| GPT-5.4 Mini | 69.0 | 69.2 | 63.5 | 68.5 | 66.8 | 65.3 |
| GPT-5.3 Codex | 68.5 | 67.8 | 66.2 | 68.0 | 70.7 | 69.5 |
| GPT-5.4 | 70.0 | 67.5 | 69.2 | 65.8 | 70.7 | 69.2 |
| Grok 4.20 | 70.5 | 68.2 | 66.8 | 67.3 | 69.7 | 70.3 |
| Grok Code Fast 1 | 68.2 | 60.3 | 67.2 | 65.9 | 72.3 | 67.7 |
| GPT-5.4 Mini Low | 68.1 | 69.4 | 54.0 | 69.2 | 65.7 | 64.2 |
| Claude Haiku 4.5 | 69.3 | 61.2 | 59.0 | 66.1 | 62.2 | 58.0 |
| GPT-5.4 Nano | 70.5 | 70.3 | 7.5 | 70.1 | 65.9 | 71.1 |
| GPT-5 Mini | 70.0 | 66.2 | 0.0 | 67.5 | 70.7 | 69.3 |
| GPT-OSS 120B on Groq | 73.9 | 70.8 | 0.0 | 59.1 | 70.7 | 63.0 |
| Claude Sonnet 4.6 | 70.5 | 61.5 | 63.7 | 68.2 | 0.0 | 0.0 |
| Command A | 66.7 | 55.7 | 0.0 | 54.8 | 62.7 | 49.1 |
| Gemini 3 Pro | 0.3 | 68.6 | 0.0 | 0.0 | 69.9 | 69.2 |
| Grok 4.1 Fast Non-Reasoning | 0.3 | 0.0 | 0.0 | 68.0 | 69.7 | 0.0 |
| Grok 4.1 Fast Reasoning | 0.0 | 0.3 | 66.0 | 67.1 | 0.0 | 0.0 |
| Gemini 3 Flash | 0.0 | 0.0 | 0.0 | 0.0 | 68.2 | 58.8 |
| GPT-5 Nano | 0.0 | 0.0 | 43.7 | 0.0 | 0.0 | 0.0 |
| GPT-OSS 20B on Groq | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Threshold view

Models below the bar are still informative.

Lower-scoring models remain visible as a compact diagnostic view of completion and stateful aggregation gaps.

| Model | Provider | Overall | Complete | Stateful |
|---|---|---|---|---|
| Grok Code Fast 1 | xAI | 68.9 | 6/6 | 75.2 |
| GPT-5.4 Mini Low | OpenAI | 68.2 | 6/6 | 76.1 |
| Claude Haiku 4.5 | Anthropic | 67.1 | 6/6 | 78.9 |
| GPT-5.4 Nano | OpenAI | 64.1 | 5/6 | 63.8 |
| GPT-5 Mini | OpenAI | 59.6 | 5/6 | 65.0 |
| GPT-OSS 120B on Groq | OpenAI | 55.4 | 5/6 | 59.5 |
| Claude Sonnet 4.6 | Anthropic | 46.5 | 4/6 | 55.7 |
| Command A | Cohere | 44.5 | 5/6 | 52.5 |
| Gemini 3 Pro | Google | 33.9 | 3/6 | 31.1 |
| Grok 4.1 Fast Non-Reasoning | xAI | 24.2 | 2/6 | 26.5 |
| Grok 4.1 Fast Reasoning | xAI | 22.8 | 2/6 | 28.3 |
| Gemini 3 Flash | Google | 22.0 | 2/6 | 21.1 |
| GPT-5 Nano | OpenAI | 5.6 | 1/6 | 7.1 |
| GPT-OSS 20B on Groq | OpenAI | 0.0 | 0/6 | 0.0 |

What the benchmark tests

The benchmark stresses dense enterprise reasoning: extraction, research synthesis, state tracking, constraint following, aggregation, and complete machine-readable output.

Why it is hard

The model has to reason about values that do not exist yet, carry them through multiple dependent steps, respect constraints, and produce an object that survives automated validation.
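This dependent-step pattern can be sketched in miniature. The following is an illustration of the pattern only, not the benchmark harness; the step names, fields, and numbers are invented for the example.

```python
def run_pipeline(order_total: float, tax_rate: float) -> dict:
    """Produce intermediate values, reuse them in dependent steps,
    and synthesize them into one complete, validatable object."""
    state = {}  # intermediate values that must be carried forward

    # Step 1: produce a value that does not exist in the input.
    state["tax"] = round(order_total * tax_rate, 2)

    # Step 2: depends on the value from step 1, not on fresh input.
    state["grand_total"] = round(order_total + state["tax"], 2)

    # Step 3: synthesize everything into a structured final output.
    return {
        "order_total": order_total,
        "tax": state["tax"],
        "grand_total": state["grand_total"],
        "steps_completed": 3,
    }

result = run_pipeline(100.0, 0.08)
```

A model that drops or recomputes `state["tax"]` between steps can still emit a valid-looking object; the benchmark's stateful aggregation score is what penalizes that loss of carried state.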

How scoring works

Scores are normalized indexes. Overall performance combines capability, structural discipline, stateful aggregation, coverage, reliability, sophistication, speed, and price/performance so models can be compared without exposing raw execution mechanics. A score of 100 is a theoretical expert ceiling, not a reward for simply producing a valid output. Correction attempts are treated as reliability costs; useful clarification is different from repair after an invalid result.
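As a rough sketch, a combination of this kind can be expressed as a weighted mean of normalized component scores. The weights below are assumptions for illustration only; the index's actual weights and aggregation mechanics are not published.

```python
# Hypothetical weights, chosen for illustration; not the published method.
WEIGHTS = {
    "capability": 0.25,
    "stateful": 0.20,
    "reliability": 0.20,
    "sophistication": 0.15,
    "speed": 0.10,
    "price_perf": 0.10,
}

def overall_index(scores: dict) -> float:
    """Weighted mean of 0-100 normalized component scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 1)

# Component scores for GPT-5.5 from the leaderboard table above.
gpt_5_5 = {
    "capability": 72.3, "stateful": 78.2, "reliability": 92.6,
    "sophistication": 86.3, "speed": 77.4, "price_perf": 61.1,
}
score = overall_index(gpt_5_5)
```

With these assumed weights the sketch lands near the published overall score, but any number of weightings could do so; the point is only that each component contributes proportionally and that no single valid output can carry the overall index.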

Whitepaper

Read the technical paper behind the index.

The paper explains the benchmark design, scoring model, statistical appendix, and why stateful aggregation is the key capability separating frontier models from partial performers.

Download PDF: Clear Ideas AI Capability Index, April 27, 2026