Inspired by Chatbot Arena and 3D Arena — but for engineering-grade parametric CAD.
Describe a mechanical part in plain English. From simple primitives to complex functional assemblies.
See outputs from multiple models rendered side-by-side in 3D. Inspect geometry, view the generated code, see where each model fails.
Explore the full benchmark grid — 20 prompts × 5 models. Click any cell to see the 3D output, source code, and failure analysis.
5 models evaluated on the full 20-prompt benchmark. More being added.
20 prompts across 4 difficulty tiers. A prompt scores ✓ if it produces a valid, executable 3D part; the headline metric is the percentage of prompts that pass. Full interactive leaderboard launching soon.
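As a rough illustration of this pass/fail check, the sketch below assumes the generated code is a CadQuery Python script that leaves its part in a `result` variable, and uses trimesh to sanity-check the exported mesh. These are illustrative assumptions, not the actual harness.

```python
import cadquery as cq
import trimesh


def prompt_passes(generated_code: str, stl_path: str = "part.stl") -> bool:
    """Return True if a generated script executes and yields a valid 3D part."""
    scope = {"cq": cq}  # hypothetical convention: scripts use `cq` and set `result`
    try:
        exec(generated_code, scope)                     # also covers the "syntax OK" check
        cq.exporters.export(scope["result"], stl_path)  # export the part to STL
    except Exception:
        return False  # syntax error, runtime error, or no `result` produced
    mesh = trimesh.load(stl_path)
    # "Valid STL" here means non-empty, watertight geometry.
    return len(mesh.faces) > 0 and bool(mesh.is_watertight)
```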
| RANK | MODEL | TYPE | VALID STL | SYNTAX OK | AVG LATENCY | PROMPTS PASSED | NOTES |
|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4.6 | LLM Baseline | 90% | 100% | 6.9s | 19 / 20 | Perfect on T1–T3; fails only at T4. |
| #1 | Zoo ML-ephant | Commercial | 95% | 95% | 11.1s | 19 / 20 | Tied with Claude. Returns native geometry. |
| #3 | Gemini 2.5 Flash | LLM Baseline | 70% | 100% | 3.1s | 14 / 20 | Fastest. Hallucinates methods at T4. |
| #4 | GPT-5 | LLM Baseline | 60% | 60% | 16.1s | 12 / 20 | Token truncation kills all T4 prompts. |
These are API-only results on 20 hand-selected prompts, run and reviewed manually. More models and prompts are being added. The Gemini result reflects free-tier rate limiting, not model quality.
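Average latency presumably reflects wall-clock time per API call. A minimal sketch of how such a measurement could be taken, assuming a hypothetical `generate_cad_code` client for whichever model API is being benchmarked (illustrative only, not the actual harness):

```python
import statistics
import time


def average_latency(prompts: list[str], generate_cad_code) -> float:
    """Return mean wall-clock latency in seconds across all prompts.

    `generate_cad_code` is a hypothetical callable wrapping a model API;
    it is illustrative, not a real SDK function.
    """
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_cad_code(prompt)  # one API round trip per prompt
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```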
We're benchmarking every AI-for-CAD model. Get notified when we add new models, publish results, or release our paper.
Working on a text-to-CAD model? Reach out at contact@cadarena.dev