OPEN RESEARCH · LAUNCHING 2026

The first open benchmark
for AI-generated parametric CAD

Enter a text prompt. Compare outputs from 13+ models — academic and commercial — side by side. Vote for the best result. Watch the leaderboard evolve.

Browse Models → · See the Benchmark
173
Papers analyzed
13+
Models compared
~200
Benchmark prompts
4
Difficulty tiers

How it works

Inspired by Chatbot Arena and 3D Arena — but for engineering-grade parametric CAD.

STEP 01
✏️

Enter a text prompt

Describe a mechanical part in plain English, from simple primitives to complex functional assemblies.

STEP 02
⚙️

Compare model outputs

See outputs from 13+ models rendered side-by-side in 3D. Inspect geometry, download STEP files, view the generated code.

STEP 03
📊

Vote + see metrics

Cast a pairwise vote. Results feed into Elo-based rankings. Automated metrics (validity, Chamfer distance, VLM score) run in parallel.
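
For the curious, the per-vote rating update is standard logistic Elo. A minimal sketch in Python — the K-factor of 32 and 1000-point starting rating are illustrative assumptions, not our final configuration:

```python
# Standard logistic Elo update per pairwise vote.
# K=32 and the 1000-point start are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under logistic Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (r_a, r_b) after one vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# One vote: model A (1000) beats model B (1000) -> A gains 16 points.
print(update_elo(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```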

Models included

7 academic  ·  3 commercial  ·  3 LLM baselines

Text2CAD
NeurIPS Spotlight · 2024
Academic
IN: Text · OUT: CAD sequences

170K models, 4 abstraction levels

FlexCAD
ICLR · 2025
Academic
IN: Text / multi-condition · OUT: CAD sequences

Unified controllable generation

CAD-Coder
arXiv · 2025
Academic
IN: Text · OUT: CAD code

Chain-of-thought + geometric reward RL

Text-to-CadQuery
arXiv · 2025
Academic
IN: Text · OUT: CadQuery Python

Self-correction: 53% → 85% exec success

CADFusion
arXiv · 2025
Academic
IN: Text + visual feedback · OUT: CadQuery

Iterative visual refinement loop

CAD-GPT
arXiv · 2025
Academic
IN: Text + image · OUT: CAD sequences

Spatial reasoning multimodal LLM

DeepCAD
ICCV · 2021
Academic
IN: Unconditional · OUT: CAD sequences

Foundational baseline — 178K models

Zoo / ML-ephant
zoo.dev · 2025
Commercial
IN: Text · OUT: STEP / STL / OBJ

$30M+ funded, public API

AdamCAD
YC W25 · 2025
Commercial
IN: Text · OUT: STEP

$4.1M seed, mechanical engineering focus

CADGPT
cadgpt.ai · 2025
Commercial
IN: Text · OUT: STEP

Commercial text-to-CAD API

GPT-4o (zero-shot)
OpenAI · 2024
LLM Baseline
IN: Text · OUT: OpenSCAD / CadQuery

93% invalid rate (Text2CAD eval)

Claude Sonnet (zero-shot)
Anthropic · 2025
LLM Baseline
IN: Text · OUT: CadQuery

Strong code model — untested on CAD

Gemini 2.0 (zero-shot)
Google · 2025
LLM Baseline
IN: Text · OUT: CadQuery

85% compile rate on CADPrompt

Open submissions. Once launched, any model can be submitted for evaluation. If you have a text-to-CAD model and want it on the leaderboard, get in touch.

Benchmark prompts

~200 prompts across 4 difficulty tiers. Fixed set for reproducible evaluation. Models are scored on validity rate, Chamfer distance, and VLM-judged prompt adherence.
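
A sketch of the geometry metric: Chamfer distance in its symmetric squared form, over point clouds sampled from the generated and reference meshes. Sampling density and normalization are implementation details we haven't fixed here; this is illustrative only:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N,3) and (M,3) point clouds,
    e.g. sampled from a generated mesh and its reference mesh."""
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest-neighbor dists A -> B
    d_ba, _ = cKDTree(pts_a).query(pts_b)  # nearest-neighbor dists B -> A
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))

# Identical clouds score 0; the score grows with geometric deviation.
cloud = np.random.rand(1024, 3)
assert chamfer_distance(cloud, cloud) == 0.0
```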

Tier 1
Simple Primitives
90%+ success
A cube 20 × 20 × 20 mm
A cylinder 10 mm diameter, 30 mm tall
A hollow sphere, outer radius 20 mm, wall 2 mm
Tier 2
Single Part with Features
~60–80% success
A rectangular plate 50 × 30 × 5 mm with a centered hole 8 mm diameter
An L-shaped bracket, 40 mm arms, 5 mm thick, 30 mm tall
A hex bolt head 10 mm across flats, M6 thread, 20 mm shaft
Tier 3
Multi-Feature Parts
~30–50% success
A flanged shaft with 3 equally-spaced M4 bolt holes on the flange
A box with a snap-fit lid, 50 × 40 × 30 mm
A spur gear: 20 teeth, module 2, 10 mm thick, 8 mm center bore
Tier 4
Complex Functional Parts
~5–20% success
A parametric living hinge, 100 mm span, 0.3 mm flex zone
An S-curve pipe fitting, 15 mm inner diameter, 45° bend
A 3-part snap-fit assembly: housing, PCB carrier, and lid
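
For a sense of what a passing output looks like, here is a hand-written CadQuery solution to the first Tier 2 prompt — our own illustration, not a model's generation:

```python
import cadquery as cq

# Tier 2 prompt: "A rectangular plate 50 x 30 x 5 mm
#                 with a centered hole 8 mm diameter"
plate = (
    cq.Workplane("XY")
    .box(50, 30, 5)   # plate body, dimensions in mm
    .faces(">Z")      # select the top face
    .workplane()
    .hole(8)          # 8 mm diameter through-hole, centered
)

cq.exporters.export(plate, "plate.step")  # STEP file for inspection
```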

Preliminary results

EARLY DATA · 2026-03-03

20 prompts across 4 difficulty tiers. Metric: % of prompts that produced a valid, executable 3D part. Full leaderboard launching soon.

| RANK | MODEL | TYPE | VALID STL | SYNTAX OK | AVG LATENCY | PROMPTS PASSED | NOTES |
|------|-------|------|-----------|-----------|-------------|----------------|-------|
| #1 | Claude Opus 4.6 | LLM Baseline | 90% | 100% | 6.9 s | 19 / 20 | Perfect T1–T3; only Tier 4 failures. |
| #1 | Zoo ML-ephant | Commercial | 95% | 95% | 11.1 s | 19 / 20 | Tied with Claude. Returns native geometry. |
| #3 | Gemini 2.5 Flash | LLM Baseline | 70% | 100% | 3.1 s | 14 / 20 | Fastest. Hallucinates methods at T4. |
| #4 | GPT-5 | LLM Baseline | 60% | 60% | 16.1 s | 12 / 20 | Token truncation kills all T4 prompts. |

These are API-only baseline results on 20 prompts. The full benchmark (200 prompts, 13+ models, including the academic open-source models) is in progress. The Gemini result reflects free-tier rate limiting, not model quality.
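
Concretely, "valid STL" means the generated code executes, exports a mesh, and that mesh describes real geometry. A simplified sketch of the check — the `result` variable convention and the use of trimesh are assumptions about tooling, and the production harness sandboxes execution rather than calling exec() directly:

```python
import cadquery as cq
import trimesh

def is_valid_output(generated_code: str, stl_path: str = "out.stl") -> bool:
    """Run model-generated CadQuery code and check the exported mesh.
    NOTE: exec() on untrusted code is for illustration only; the real
    harness runs submissions in an isolated sandbox."""
    scope: dict = {"cq": cq}
    try:
        exec(generated_code, scope)               # code must define `result`
        cq.exporters.export(scope["result"], stl_path)
    except Exception:
        return False                              # syntax / runtime / export failure
    mesh = trimesh.load(stl_path)
    return mesh.is_watertight and mesh.volume > 0
```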

Why this doesn't exist yet

The 2025 survey "Large Language Models for Computer-Aided Design" explicitly identifies this as the field's most critical gap.

No cross-model comparison

Sequence-based (Text2CAD), code-based (CAD-Coder), and B-rep direct (BrepGen) models are evaluated on different benchmarks with different metrics. You can't compare results across papers.

Academic ≠ commercial

Commercial tools like Zoo and AdamCAD are never included in academic benchmark tables. Academic models are never in commercial tool comparisons. Nobody has done both.

No living leaderboard

Every benchmark is a static snapshot tied to a paper. There's no place where new models submit and get ranked continuously — no SWE-bench equivalent for CAD.

No agreed-upon metrics

Unlike image generation (FID, CLIP score) or code (pass@k), CAD has no community-consensus quality metric. Papers pick different metrics, making progress hard to track.

PAPER IN PREPARATION

Accompanying publication

We are preparing a benchmark paper targeting NeurIPS 2026 Datasets & Benchmarks. The paper will evaluate all listed models on the fixed benchmark, propose standardized metrics, and describe the arena platform.

Get notified when the leaderboard launches and the preprint drops.

contact@cadarena.ai →

Are you working on a text-to-CAD model and want it on the leaderboard? We want to hear from you.