Inspired by Chatbot Arena and 3D Arena — but for engineering-grade parametric CAD.
Describe a mechanical part in plain English. From simple primitives to complex functional assemblies.
See outputs from 13+ models rendered side-by-side in 3D. Inspect geometry, download STEP files, view the generated code.
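For a sense of what "the generated code" looks like: a minimal sketch, assuming CadQuery as the scripting backend (the prompt and dimensions here are illustrative, not actual arena outputs):

```python
# Hypothetical output for a prompt like "a 40 mm square mounting plate,
# 5 mm thick, with four 4 mm corner holes" (all dimensions illustrative).
import cadquery as cq

plate = (
    cq.Workplane("XY")
    .box(40, 40, 5)                      # base plate
    .faces(">Z").workplane()
    .rect(30, 30, forConstruction=True)  # construction rect for hole pattern
    .vertices()
    .hole(4)                             # 4 mm through-holes at each corner
)

cq.exporters.export(plate, "plate.step")  # STEP file offered for download
```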
Cast a pairwise vote. Results feed into Elo-based rankings. Automated metrics (validity, Chamfer distance, VLM score) run in parallel.
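On the ranking side, a standard Elo update is enough to turn pairwise votes into ratings. A minimal sketch; the K-factor and starting rating are illustrative, not the production values:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise vote."""
    # Expected win probability for the winner, given the rating gap.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# e.g. two models starting at 1500; model A wins the vote:
r_a, r_b = elo_update(1500.0, 1500.0)  # -> (1516.0, 1484.0)
```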
7 academic · 3 commercial · 3 LLM baselines
170K models, 4 abstraction levels
Unified controllable generation
Chain-of-thought + geometric reward RL
Self-correction: 53% → 85% exec success
Iterative visual refinement loop
Spatial reasoning multimodal LLM
Foundational baseline — 178K models
$30M+ funded, public API
$4.1M seed, mech. engineering focus
Commercial text-to-CAD API
93% invalid rate (Text2CAD eval)
Strong code model — untested on CAD
85% compile rate on CADPrompt
~200 prompts across 4 difficulty tiers. Fixed set for reproducible evaluation. Models are scored on validity rate, Chamfer distance, and VLM-judged prompt adherence.
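For reference, Chamfer distance here is the usual symmetric point-cloud distance between samples of the generated and reference meshes. A minimal sketch using SciPy; sampling density and the squared-vs-unsquared convention vary across papers:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point clouds."""
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest point in B for each point in A
    d_ba, _ = cKDTree(pts_a).query(pts_b)  # nearest point in A for each point in B
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))
```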
20 prompts across 4 difficulty tiers. Metric: % of prompts that produced a valid, executable 3D part. Full leaderboard launching soon.
| RANK | MODEL | TYPE | VALID STL | SYNTAX OK | AVG LATENCY | PROMPTS PASSED | NOTES |
|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4.6 | LLM Baseline | 90% | 100% | 6.9s | 19 / 20 | Perfect on T1–T3; all failures at T4. |
| #1 | Zoo ML-ephant | Commercial | 95% | 95% | 11.1s | 19 / 20 | Tied with Claude. Returns native geometry. |
| #3 | Gemini 2.5 Flash | LLM Baseline | 70% | 100% | 3.1s | 14 / 20 | Fastest. Hallucinates methods at T4. |
| #4 | GPT-5 | LLM Baseline | 60% | 60% | 16.1s | 12 / 20 | Token truncation kills all T4 prompts. |
These are API-only baseline results on 20 prompts. The full benchmark (200 prompts, 13+ models, including academic open-source models) is in progress. The Gemini result reflects free-tier rate limiting, not model quality.
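For context on the VALID STL column: the check reduces to "does the generated script execute and export geometry without error". A minimal sketch, assuming a CadQuery backend and a `result`-variable convention; `run_model` is hypothetical, not the arena's actual harness:

```python
import cadquery as cq

def is_valid(script: str) -> bool:
    """Execute a generated CadQuery script and check it exports an STL."""
    scope: dict = {}
    try:
        exec(script, scope)                     # run the model's generated code
        result = scope["result"]                # assumed convention: script binds `result`
        cq.exporters.export(result, "out.stl")  # raises on invalid geometry
        return True
    except Exception:
        return False

# validity rate = fraction of prompts whose generated script passes, e.g.:
# rate = sum(is_valid(run_model(p)) for p in prompts) / len(prompts)
```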
The 2025 survey *Large Language Models for Computer-Aided Design* explicitly identifies this as the field's most critical gap.
Sequence-based (Text2CAD), code-based (CAD-Coder), and direct B-rep (BrepGen) models are evaluated on different benchmarks with different metrics. You can't compare results across papers.
Commercial tools like Zoo and AdamCAD are never included in academic benchmark tables, and academic models never appear in commercial tool comparisons. Nobody has evaluated both side by side.
Every benchmark is a static snapshot tied to a paper. There's no venue where a new model can be submitted and ranked continuously: no SWE-bench equivalent for CAD.
Unlike image generation (FID, CLIP score) or code (pass@k), CAD has no community-consensus quality metric. Papers pick different metrics, making progress hard to track.
We are preparing a benchmark paper targeting NeurIPS 2026 Datasets & Benchmarks. The paper will evaluate all listed models on the fixed benchmark, propose standardized metrics, and describe the arena platform.
Get notified when the leaderboard launches and the preprint drops.
contact@cadarena.ai →

Are you working on a text-to-CAD model and want it on the leaderboard? We want to hear from you.