The index rewards complete, structured responses that preserve dependencies and remain useful after validation, rather than treating any valid object as a strong result.
Clear Ideas Research
AI Capability Index
A Clear Ideas benchmark for complex enterprise reasoning. It measures whether models can decompose ambiguous requests, follow dense instructions, preserve structured state, and return complete outputs that survive automated validation.
Strong results collect intermediate outputs, reuse them across dependent steps, and synthesize them into final answers without losing the structure of the work.
Reasoning level, output discipline, latency, and cost all affect whether a model is practical for repeatable enterprise execution.
Leaderboard
Enterprise reasoning under operational constraints.
The Clear Ideas AI Capability Index compares models on normalized scores for capability, stateful aggregation, reliability, sophistication, speed, and price/performance. It is designed for practical enterprise work where answers must be structured, complete, and execution-ready.
Capability Index
Normalized index, higher is better
Capability vs. Price/Performance
Upper-right models combine strong reasoning with efficient execution
Top Model Shape
Capability, stateful aggregation, reliability, speed, and value profile
Reliability Index
First-shot completion weighted by output completeness and sophistication
Capability shape
Useful outputs need more than valid structure.
The next charts separate three related dimensions: whether a model preserves state, whether it builds its workflows with enough useful complexity, and whether the resulting structure is genuinely high quality. A minimal scoring sketch follows the list below.
- Stateful aggregation tracks reuse of intermediate values.
- Sophistication rewards purposeful workflow depth.
- Quality vs. structure catches shallow but valid outputs.
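To make these dimensions concrete, here is a minimal Python sketch of how stateful aggregation could be scored. It is illustrative only: the index's real parser and scoring pipeline are not published, and the `Step` structure and reuse ratio below are assumptions made for the example.

```python
# Illustrative sketch; the index's actual scoring pipeline is not published.
# Assumes each response is parsed into ordered steps that declare named
# intermediate values and may reference values declared earlier.
from dataclasses import dataclass, field

@dataclass
class Step:
    declares: set[str] = field(default_factory=set)    # values this step produces
    references: set[str] = field(default_factory=set)  # values this step consumes

def stateful_aggregation_score(steps: list[Step]) -> float:
    """Fraction of declared intermediate values that some later step reuses."""
    declared: set[str] = set()
    reused: set[str] = set()
    for step in steps:
        reused |= step.references & declared  # reuse of previously created state
        declared |= step.declares
    return len(reused) / len(declared) if declared else 0.0

# A schema-valid but shallow response declares values it never touches again,
# so it scores 0.0 here even though structural validation would pass.
shallow = [Step(declares={"subtotal"}), Step(declares={"total"})]
stateful = [Step(declares={"subtotal"}),
            Step(references={"subtotal"}, declares={"total"})]
assert stateful_aggregation_score(shallow) == 0.0
assert stateful_aggregation_score(stateful) == 0.5
```

The same mechanics illustrate the quality-vs-structure distinction: both responses above are structurally valid, but only one actually carries state forward.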
Stateful Aggregation Index
Intermediate values reused, collected, and synthesized into final outputs
Sophistication Index
Complexity, step variety, variable use, tags, and output contracts
Quality vs. Structure
Separates valid-looking output from genuinely useful output
Detailed scores
Overall, capability, stateful aggregation, reliability, sophistication, price/performance, and speed.
| Model | Provider | Tier | Overall | Capability | Stateful | Reliability | Soph. | Price/perf. | Speed |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | Frontier | 79.1 | 72.3 | 78.2 | 92.6 | 86.3 | 61.1 | 77.4 |
| Claude Opus 4.7 | Anthropic | Frontier | 77.0 | 70.5 | 80.1 | 91.3 | 80.8 | 60.9 | 56.8 |
| GPT-5.4 Mini | OpenAI | Frontier | 75.7 | 69.0 | 78.0 | 89.5 | 85.8 | 76.5 | 54.8 |
| GPT-5.3 Codex | OpenAI | Frontier | 75.5 | 68.0 | 77.0 | 90.8 | 80.6 | 73.4 | 80.5 |
| GPT-5.4 | OpenAI | Frontier | 75.1 | 68.0 | 77.0 | 91.6 | 83.7 | 68.2 | 60.2 |
| Grok 4.20 | xAI | Frontier | 74.0 | 67.4 | 77.5 | 90.1 | 77.7 | 72.1 | 53.5 |
| Grok Code Fast 1 | xAI | Below bar | 68.9 | 64.0 | 75.2 | 73.2 | 78.4 | 76.1 | 68.6 |
| GPT-5.4 Mini Low | OpenAI | Below bar | 68.2 | 62.4 | 76.1 | 77.1 | 79.6 | 73.3 | 66.6 |
| Claude Haiku 4.5 | Anthropic | Below bar | 67.1 | 62.4 | 78.9 | 71.0 | 74.6 | 72.4 | 61.9 |
| GPT-5.4 Nano | OpenAI | Below bar | 64.1 | 58.3 | 63.8 | 76.3 | 67.6 | 68.2 | 69.4 |
| GPT-5 Mini | OpenAI | Below bar | 59.6 | 56.4 | 65.0 | 62.8 | 70.5 | 64.7 | 31.5 |
| GPT-OSS 120B on Groq | OpenAI | Below bar | 55.4 | 50.4 | 59.5 | 63.9 | 63.4 | 62.3 | 66.7 |
| Claude Sonnet 4.6 | Anthropic | Below bar | 46.5 | 44.2 | 55.7 | 47.6 | 53.4 | 45.1 | 21.5 |
| Command A | Cohere | Below bar | 44.5 | 40.0 | 52.5 | 57.1 | 52.4 | 48.6 | 19.6 |
| Gemini 3 Pro | Google | Below bar | 33.9 | 30.9 | 31.1 | 42.1 | 37.3 | 35.6 | 44.5 |
| Grok 4.1 Fast Non-Reasoning | xAI | Below bar | 24.2 | 22.7 | 26.5 | 24.7 | 26.5 | 26.4 | 24.2 |
| Grok 4.1 Fast Reasoning | xAI | Below bar | 22.8 | 20.5 | 28.3 | 26.4 | 24.3 | 25.3 | 28.5 |
| Gemini 3 Flash | Google | Below bar | 22.0 | 19.8 | 21.1 | 27.2 | 25.6 | 24.3 | 27.6 |
| GPT-5 Nano | OpenAI | Below bar | 5.6 | 4.5 | 7.1 | 9.6 | 7.3 | 8.4 | 7.0 |
| GPT-OSS 20B on Groq | OpenAI | Below bar | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Domain breakdown
Different domains expose different reasoning limits.
Each domain stresses a different capability: grounded extraction, abstract planning, state tracking, aggregation, instruction following, and structured output discipline. Scores are normalized indexes, so each model's shape is easier to compare across domains.
| Model | Market reasoning | Contract analysis | Stateful aggregation | Action synthesis | Research synthesis | Financial reasoning |
|---|---|---|---|---|---|---|
| GPT-5.5 | 71.4 | 70.7 | 70.3 | 71.2 | 70.5 | 71.8 |
| Claude Opus 4.7 | 71.2 | 67.3 | 65.2 | 64.3 | 65.5 | 71.3 |
| GPT-5.4 Mini | 69.0 | 69.2 | 63.5 | 68.5 | 66.8 | 65.3 |
| GPT-5.3 Codex | 68.5 | 67.8 | 66.2 | 68.0 | 70.7 | 69.5 |
| GPT-5.4 | 70.0 | 67.5 | 69.2 | 65.8 | 70.7 | 69.2 |
| Grok 4.20 | 70.5 | 68.2 | 66.8 | 67.3 | 69.7 | 70.3 |
| Grok Code Fast 1 | 68.2 | 60.3 | 67.2 | 65.9 | 72.3 | 67.7 |
| GPT-5.4 Mini Low | 68.1 | 69.4 | 54.0 | 69.2 | 65.7 | 64.2 |
| Claude Haiku 4.5 | 69.3 | 61.2 | 59.0 | 66.1 | 62.2 | 58.0 |
| GPT-5.4 Nano | 70.5 | 70.3 | 7.5 | 70.1 | 65.9 | 71.1 |
| GPT-5 Mini | 70.0 | 66.2 | 0.0 | 67.5 | 70.7 | 69.3 |
| GPT-OSS 120B on Groq | 73.9 | 70.8 | 0.0 | 59.1 | 70.7 | 63.0 |
| Claude Sonnet 4.6 | 70.5 | 61.5 | 63.7 | 68.2 | 0.0 | 0.0 |
| Command A | 66.7 | 55.7 | 0.0 | 54.8 | 62.7 | 49.1 |
| Gemini 3 Pro | 0.3 | 68.6 | 0.0 | 0.0 | 69.9 | 69.2 |
| Grok 4.1 Fast Non-Reasoning | 0.3 | 0.0 | 0.0 | 68.0 | 69.7 | 0.0 |
| Grok 4.1 Fast Reasoning | 0.0 | 0.3 | 66.0 | 67.1 | 0.0 | 0.0 |
| Gemini 3 Flash | 0.0 | 0.0 | 0.0 | 0.0 | 68.2 | 58.8 |
| GPT-5 Nano | 0.0 | 0.0 | 43.7 | 0.0 | 0.0 | 0.0 |
| GPT-OSS 20B on Groq | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
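As background on what "normalized index" means throughout these tables, here is a minimal sketch assuming a plain min-max transform; the actual floors, ceilings, and any clipping the index applies are not published, so the bounds below are placeholders.

```python
def to_index(raw: float, floor: float, ceiling: float) -> float:
    """Map a raw metric onto the 0-100 index scale, clipped at the bounds.
    Min-max scaling is an assumption; the index's exact transform is unpublished."""
    scaled = 100.0 * (raw - floor) / (ceiling - floor)
    return max(0.0, min(100.0, scaled))

# e.g. a raw domain accuracy of 0.75 against an assumed 0.0-1.0 range -> 75.0
print(to_index(0.75, 0.0, 1.0))
```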
What the benchmark tests
The benchmark stresses dense enterprise reasoning: extraction, research synthesis, state tracking, constraint following, aggregation, and complete machine-readable output.
Why it is hard
The model has to reason about values that do not exist yet, carry them through multiple dependent steps, respect constraints, and produce an object that survives automated validation.
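A hypothetical miniature of such a task, in Python for concreteness (the benchmark's real tasks and validators are not published): the discount cannot be written down until the subtotal exists, and the final object must carry every dependent value through validation.

```python
def run_task(line_items: list[float]) -> dict:
    subtotal = sum(line_items)                            # step 1 creates the value
    discount = 0.1 * subtotal if subtotal > 100 else 0.0  # step 2 depends on step 1
    total = subtotal - discount                           # step 3 depends on 1 and 2
    return {"subtotal": subtotal, "discount": discount, "total": total}

def validate(obj: dict) -> bool:
    """Automated validation: completeness first, then internal consistency."""
    if not {"subtotal", "discount", "total"} <= obj.keys():
        return False  # incomplete objects fail outright
    return abs(obj["subtotal"] - obj["discount"] - obj["total"]) < 1e-9

assert validate(run_task([60.0, 70.0]))  # passes: totals are internally consistent
```

An output that drops `discount` but keeps a plausible `total` is exactly the "valid-looking but incomplete" failure the index penalizes.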
How scoring works
Scores are normalized indexes. Overall performance combines capability, structural discipline, stateful aggregation, coverage, reliability, sophistication, speed, and price/performance so models can be compared without exposing raw execution mechanics. A score of 100 is a theoretical expert ceiling, not a reward for simply producing a valid output. Correction attempts are treated as reliability costs; useful clarification is different from repair after an invalid result.
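A minimal sketch of one way such a combination could work, assuming a weighted arithmetic mean and a flat per-attempt reliability penalty; the published methodology does not disclose the real weights or penalty schedule, so every number below is hypothetical.

```python
WEIGHTS = {  # hypothetical weights; the index's real weights are not disclosed
    "capability": 0.30, "stateful": 0.20, "reliability": 0.20,
    "sophistication": 0.10, "speed": 0.10, "price_perf": 0.10,
}

def overall(scores: dict[str, float], correction_attempts: int) -> float:
    """Combine normalized subscores; repairs after invalid output cost reliability."""
    penalized = dict(scores)
    penalized["reliability"] = max(0.0, scores["reliability"] - 5.0 * correction_attempts)
    return sum(w * penalized[k] for k, w in WEIGHTS.items())
```

The penalty term is what separates a clean first-shot completion from an answer that only became valid after repair.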
Whitepaper
Read the technical paper behind the index.
The paper explains the benchmark design, scoring model, statistical appendix, and why stateful aggregation is the key capability separating frontier models from partial performers.
Follow Clear Ideas Research
Get new benchmark updates, model comparisons, and practical notes on governed AI.