
Benchmark Run: 12 Models on wp-core-v1 #5

@Jameswlepage

WP-Bench Test Run

Quick test of the benchmark harness against 12 models using the wp-core-v1 dataset.

Results

| Model | Knowledge | Correctness | Overall |
| --- | ---: | ---: | ---: |
| claude-sonnet-4-5-20250929 | 88.1% | 47.9% | 45.6% |
| gpt-5.2 | 90.5% | 44.4% | 44.9% |
| deepseek/deepseek-reasoner | 83.3% | 48.6% | 44.4% |
| gpt-5-mini | 83.3% | 43.8% | 42.5% |
| xai/grok-4-1-fast-reasoning | 85.7% | 41.7% | 42.4% |
| claude-opus-4-5-20251101 | 71.4% | 50.0% | 41.4% |
| gemini/gemini-3-flash-preview | 71.4% | 47.9% | 40.6% |
| deepseek/deepseek-chat | 71.4% | 46.5% | 40.0% |
| xai/grok-4-1-fast-non-reasoning | 76.2% | 41.7% | 39.5% |
| groq/llama-3.3-70b-versatile | 81.0% | 35.4% | 38.5% |
| gpt-3.5-turbo | 73.8% | 27.1% | 33.0% |
| groq/llama-3.1-8b-instant | 76.2% | 20.8% | 31.2% |

Dataset: wp-core-v1 (42 knowledge + 24 execution tests)

Takeaways

  • Frontier models cluster around 40-46% overall
  • Knowledge scores are generally strong (70-90%); correctness is the main differentiator
  • Clear tier gap between frontier and smaller models (see the sketch below)
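
For anyone who wants to eyeball that tier gap, here is a minimal Python sketch with the Overall scores from the table above hard-coded. It is not part of the WP-Bench harness, and the frontier/smaller split is just an assumed cut at the top nine rows:

```python
# Overall scores copied from the results table above (percent).
overall = {
    "claude-sonnet-4-5-20250929": 45.6,
    "gpt-5.2": 44.9,
    "deepseek/deepseek-reasoner": 44.4,
    "gpt-5-mini": 42.5,
    "xai/grok-4-1-fast-reasoning": 42.4,
    "claude-opus-4-5-20251101": 41.4,
    "gemini/gemini-3-flash-preview": 40.6,
    "deepseek/deepseek-chat": 40.0,
    "xai/grok-4-1-fast-non-reasoning": 39.5,
    "groq/llama-3.3-70b-versatile": 38.5,
    "gpt-3.5-turbo": 33.0,
    "groq/llama-3.1-8b-instant": 31.2,
}

# Rank models by overall score, highest first.
ranked = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
for model, score in ranked:
    print(f"{score:5.1f}  {model}")

# Assumed split: top nine rows treated as "frontier", the rest as "smaller".
frontier, smaller = ranked[:9], ranked[9:]
print("frontier range:", frontier[-1][1], "-", frontier[0][1])
print("smaller range: ", smaller[-1][1], "-", smaller[0][1])
```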
