How Papers Evaluate Text-to-CAD

Compiled from 173 papers in our database — what metrics each paper uses, what prompts they test, what datasets they evaluate on, and what "good" actually means · March 2026

Contents

  1. TL;DR — The 4 things every paper measures
  2. All metrics defined
  3. Evaluation datasets used
  4. What prompts do papers test?
  5. Paper-by-paper breakdown
  6. The big comparison tables
  7. What does "good" actually mean?
  8. What no one measures yet

1. TL;DR — The 4 things every paper measures

IR
Invalid Rate — % of outputs that fail to produce valid geometry at all
CD
Chamfer Distance — geometric distance between generated shape and ground truth (×10³)
COV
Coverage — % of reference shapes well-approximated by at least one generated shape
VLM
Visual score — a VLM (Gemini / LLaVA) rates the rendered output 0–10 or votes which is better
The field is split by output format. Papers generating command sequences (DeepCAD-style) use Command Accuracy + CD. Papers generating CadQuery/FreeCAD code use Invalid Rate + CD. Both converge on CD as the common currency — but CD requires ground truth geometry, which means you need a curated dataset of correct answers for every prompt.

2. All Metrics Defined

Geometry Quality

| Metric | What it measures | Direction | Who uses it |
| --- | --- | --- | --- |
| Chamfer Distance (CD) | Average nearest-neighbor distance between two point clouds sampled from the generated and reference shapes. Reported as CD ×10³. | Lower = better | Almost everyone |
| Hausdorff Distance | Maximum nearest-neighbor distance (worst-case outlier). More sensitive to geometric errors than CD. | Lower = better | CADCodeVerify |
| IoU | Volumetric overlap between generated and reference shapes, computed by voxelizing both. 1.0 = perfect match. | Higher = better | CAD-Recode, CADCodeVerify |
| IoGT | IoU measured specifically against the ground-truth shape (CADCodeVerify variant). | Higher = better | CADCodeVerify (ICLR 2025) |
| Point Cloud Distance | Direct average distance between point clouds (a non-symmetric variant of CD). | Lower = better | CADCodeVerify |
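
Both distance metrics can be sketched in a few lines of plain Python. Exact conventions vary by paper (squared vs. unsquared distances, normalization, sample counts), so treat this as one common variant, not a canonical definition:

```python
import math

def nearest_sq_dists(src, dst):
    """For each point in src, squared distance to its nearest point in dst."""
    return [min(sum((p - q) ** 2 for p, q in zip(a, b)) for b in dst)
            for a in src]

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor squared distance,
    averaged over both directions. Papers report this value ×10³."""
    return (sum(nearest_sq_dists(pts_a, pts_b)) / len(pts_a)
            + sum(nearest_sq_dists(pts_b, pts_a)) / len(pts_b))

def hausdorff_distance(pts_a, pts_b):
    """Symmetric Hausdorff Distance: worst-case nearest-neighbor distance."""
    return math.sqrt(max(max(nearest_sq_dists(pts_a, pts_b)),
                         max(nearest_sq_dists(pts_b, pts_a))))

cube = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(chamfer_distance(cube, cube))  # identical clouds -> 0.0
```

In real evaluations the point clouds are sampled (typically a few thousand points) from the generated and ground-truth meshes, and the nearest-neighbor search uses a KD-tree rather than this brute-force loop.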

Validity / Executability

| Metric | What it measures | Key numbers |
| --- | --- | --- |
| Invalid Rate (IR) | % of outputs that fail to produce valid, non-degenerate geometry. The minimum bar — "did it even work." | GPT-4o zero-shot: 93%. Fine-tuned models: 1–7%. |
| Compile / Parsing Rate | % of generated code that executes without syntax or runtime error. Specific to code-gen papers. | CADCodeVerify: 96.5% with VLM feedback loop. |
| Physical Validity (PV) | % of outputs that produce a non-self-intersecting, watertight solid. | FlexCAD fine-tuned: 93.4%. GPT-4o raw: 48.9%. |
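
For code-generating models, IR is typically measured by executing each generated script and counting failures. A minimal sketch — the convention that a script binds a `result` variable is an assumption, and real harnesses sandbox the execution and additionally validate the resulting solid:

```python
def invalid_rate(generated_scripts):
    """Fraction of generated scripts that raise an error or bind no
    geometry. Real pipelines also check that the solid is
    non-degenerate and watertight, and run untrusted code sandboxed."""
    failures = 0
    for src in generated_scripts:
        env = {}
        try:
            exec(src, env)                 # run the generated code
        except Exception:
            failures += 1                  # syntax or runtime error
            continue
        if env.get("result") is None:      # assumed convention: script binds `result`
            failures += 1
    return failures / len(generated_scripts)

scripts = ["result = 'a solid'", "result = undefined_name", "x = 1"]
print(invalid_rate(scripts))  # 2 of 3 fail -> ~0.667
```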

For Command Sequence Models Only (DeepCAD lineage)

| Metric | What it measures |
| --- | --- |
| Command Accuracy (ACC_cmd) | % of generated commands matching the ground-truth command type (line, arc, circle, extrude, etc.). DeepCAD achieves 99.50% on reconstruction. |
| Parameter Accuracy (ACC_param) | % of predicted parameters within tolerance η=3 out of 256 quantization levels. DeepCAD achieves 97.98%. |
| Coverage (COV) | % of reference shapes well-approximated by at least one generated shape. Measures diversity of the output distribution. |
| Minimum Matching Distance (MMD) | Average CD of each reference shape to its nearest generated shape. Measures fidelity. |
| Jensen-Shannon Divergence (JSD) | Distribution-level similarity between generated and reference point-cloud distributions. |
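
COV and MMD are both nearest-neighbor statistics over *sets* of shapes. A sketch with an abstract distance function — any shape metric (in practice, CD) can be plugged in:

```python
def cov_and_mmd(generated, reference, dist):
    """COV: each generated shape claims its nearest reference shape;
    coverage is the fraction of reference shapes claimed at least once
    (diversity). MMD: mean distance from each reference shape to its
    nearest generated shape (fidelity)."""
    claimed = {min(range(len(reference)), key=lambda j: dist(g, reference[j]))
               for g in generated}
    cov = len(claimed) / len(reference)
    mmd = sum(min(dist(g, r) for g in generated)
              for r in reference) / len(reference)
    return cov, mmd

# Toy 1-D "shapes" with absolute difference as the distance
gen, ref = [0.0, 10.0], [0.0, 1.0, 10.0]
print(cov_and_mmd(gen, ref, lambda a, b: abs(a - b)))  # ≈ (0.667, 0.333)
```

Note the asymmetry: a model that copies one reference shape perfectly gets excellent MMD but terrible COV, which is why the two are always reported together.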

Visual / Perceptual

| Metric | What it measures | Used by |
| --- | --- | --- |
| LVM / VLM Score (0–10) | A vision-language model (LLaVA, Gemini) rates the rendered CAD output on shape quality, quantity, and visual fidelity vs. the text prompt. Not standardized across papers. | CADFusion (LLaVA scorer), Text-to-CadQuery (Gemini) |
| Gemini Visual Eval (%) | Gemini is shown the rendered STL and the original text prompt and asked whether they match. Binary pass/fail per example. | Text2CAD, Text-to-CadQuery |
| Human Realism Score | Human annotators rate whether generated shapes look like realistic mechanical parts (not a specific match). | FlexCAD: 42.1% best model vs. 12.8% GPT-4o |

3. Evaluation Datasets

Only a handful of datasets matter for text-to-CAD evaluation — three shared ones, plus a long tail of small custom benchmarks:

| Dataset | Size | Source | Format | Papers that eval on it |
| --- | --- | --- | --- | --- |
| DeepCAD dataset | 178,238 models | Onshape (real engineering CAD) | Command sequences (sketch + extrude) | DeepCAD, Text2CAD, CAD-GPT, CAD-Coder, CADFusion, FlexCAD, Text-to-CadQuery — basically everyone |
| Text-annotated DeepCAD | ~20K text–CAD pairs | DeepCAD + VLM-generated captions (+ some human) | Sequences + text at 4 specificity levels | Text2CAD, Text-to-CadQuery, CADFusion |
| Fusion 360 Gallery | 8,625 models | Autodesk | B-rep + mesh | CAD-Recode (secondary eval set) |
| CC3D (real scans) | Real-world point clouds | 3D scans of physical objects | Point clouds | CAD-Recode — tests generalization to messy real-world data |
| Custom hand-crafted benchmarks | 10–100 prompts | Paper authors | Text prompts | Kumar et al. 2026 (10 levels), CADCodeVerify (50 prompts), CAD Arena (our 20 prompts) |

The big limitation of DeepCAD: Only sketch-and-extrude operations. No fillets, chamfers, sweeps, lofts, patterns. Every model evaluated on it is only tested on simple prismatic shapes. When CAD-Recode tests on real scans (CC3D), IoU drops from 92% → 74%. The standard benchmark is easier than real-world engineering.

4. What Prompts Do Papers Actually Test?

Text2CAD (NeurIPS 2024) — 4 abstraction levels per shape

Same shape described at 4 specificity levels. All 4 tested and reported separately. Key insight: detailed prompts score much better.

Level 1 — Abstract
"A mechanical bracket"
Level 2 — Simplified
"A rectangular bracket with a mounting hole"
Level 3 — Generalized geometric
"A flat rectangular plate with a circular hole centered near one end, used for mounting"
Level 4 — Detailed geometric (with dimensions)
"A rectangular extruded plate 80mm × 40mm × 5mm with a 10mm diameter circular hole centered 15mm from one end, with chamfered top edges at 45 degrees"

Result: abstract prompts → 42.3% Gemini visual pass. Detailed prompts → 71.4%. More info = better output.

Kumar et al. 2026 — 10 complexity levels with GPT-4 + FreeCAD

Manual benchmark showing exactly where LLMs break down:

| # | Task | Result | Attempts | Time |
| --- | --- | --- | --- | --- |
| 1 | Basic cube 50mm | ✓ Success | 1 | 19s |
| 2 | Cylinder r=25, h=60mm | ✓ Success | 1 | 20s |
| 3 | Rectangular box with 5mm fillets | ✓ Success | 2 | 42s |
| 4 | Boolean union (box + cylinder) | ✓ Success | 1 | 22s |
| 5 | Box with cylindrical hole subtracted | ✓ Success | 1 | 23s |
| 6 | Parametric plate with 4 corner holes | ✓ Success | 1 | 28s |
| 7 | Parametric hinge with multiple constrained segments | ✓ Success | 3 | 54s |
| 8 | Involute gear, 20 teeth, module 2mm | ✗ FAILED | 50 (max) | 836s |
| 9 | L-plate with complex cutouts | ✓ Success | 3 | 81s |
| 10 | Fully constrained structural frame with ribs | ✗ FAILED | 50 (max) | 909s |

CADCodeVerify (ICLR 2025) — 50 hand-crafted engineering prompts

Simple end
"Generate a hollow cylinder with inner radius 15mm, outer radius 20mm, height 40mm"
Complex end
"Create a hexagonal bolt head with M10 thread specification, standard DIN933 dimensions, with a through-hole for the shaft"

CAD-GPT (AAAI 2025) — auto-generated from DeepCAD test shapes

VLMs (Gemini, LLaVA) generate text descriptions of test-set shapes automatically. Model must reconstruct those shapes from the descriptions. Ensures test distribution matches training distribution.

Example auto-generated description
"A symmetric part with two cylindrical protrusions on each side of a flat base, extruded 12 units tall, with a central rectangular slot running the full length"

CAD Arena (our benchmark) — 20 prompts across 4 tiers

| Tier | Description | Example | Claude | GPT-5 | Zoo |
| --- | --- | --- | --- | --- | --- |
| T1 — Simple primitives | One shape | "A cube with side length 20mm" | 4/5 | 5/5 | 5/5 |
| T2 — Single part + features | One shape with holes/fillets | "A cylinder with a centered through-hole" | 5/5 | 4/5 | 4/5 |
| T3 — Multi-feature | Several operations combined | "An L-bracket with 3 mounting holes" | 4/5 | 3/5 | 5/5 |
| T4 — Complex functional | Specialized geometry | "A spur gear with 24 teeth, module 2" | 3/5 | 2/5 | 2/5 |

5. Paper-by-Paper Breakdown

ICCV 2021
DeepCAD — Deep Generative Network for CAD Models
Wu et al. · arXiv:2105.09492

The foundational paper. Trains a Transformer autoencoder on 178K CAD command sequences from Onshape. Created the dataset and evaluation protocol the entire field uses.

Metrics reported: ACC_cmd · ACC_param · Chamfer Distance · Invalid Rate · Coverage · MMD · JSD

Task: Autoencoding (encode → latent → decode) + unconditional generation. No text prompts.

| Method | ACC_cmd (%) | ACC_param (%) | Median CD (×10³) | IR (%) |
| --- | --- | --- | --- | --- |
| DeepCAD + Augmentation | 99.50 | 97.98 | 0.752 | 2.72 |
| DeepCAD (no aug) | 99.36 | 97.47 | 0.787 | 3.30 |
| Alt-Regression baseline | — | — | 2.14 | 24.32 |

Also reports generation diversity: COV=78.13, JSD=3.76. "Comparable to point-cloud generative models while producing sharp, editable CAD." This is what every subsequent paper benchmarks against.
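
Reconstruction accuracy on command sequences can be sketched as below. The tuple layout and the rule that parameters are only scored when command types match are assumptions modeled on the DeepCAD protocol:

```python
def command_and_param_accuracy(pred_seq, gt_seq, eta=3):
    """ACC_cmd / ACC_param over aligned command sequences.

    Each command is (type, [quantized params]); a parameter counts as
    correct when within eta quantization levels of ground truth
    (DeepCAD: eta=3 out of 256 levels)."""
    cmd_hits = sum(p[0] == g[0] for p, g in zip(pred_seq, gt_seq))
    hits = total = 0
    for (p_type, p_params), (g_type, g_params) in zip(pred_seq, gt_seq):
        if p_type != g_type:
            continue  # assumption: parameters scored only on matching types
        for a, b in zip(p_params, g_params):
            total += 1
            hits += abs(a - b) <= eta
    acc_cmd = cmd_hits / len(gt_seq)
    acc_param = hits / total if total else 1.0
    return acc_cmd, acc_param

pred = [("line", [10, 20]), ("extrude", [100])]
gt = [("line", [12, 20]), ("circle", [100])]
print(command_and_param_accuracy(pred, gt))  # (0.5, 1.0)
```
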
NeurIPS 2024
Text2CAD — Sequential CAD Designs from Beginner-to-Expert Text
Khan et al. · arXiv:2409.17106

First major text-to-CAD paper. 363M model generates DeepCAD command sequences from text. Created the text-annotated DeepCAD dataset (4 levels per shape) that subsequent papers reuse for direct comparison.

Metrics reported: Median CD · Mean CD · Invalid Rate · Gemini Visual Score (%)

Gemini eval: Renders the STL, shows image + original text prompt to Gemini, asks "does this match?" Reports % pass per text abstraction level.

| Prompt level | Median CD (×10³) | Mean CD | IR (%) | Gemini Visual (%) |
| --- | --- | --- | --- | --- |
| All levels combined | 0.370 | 26.42 | 3.5 | 58.80 |
| Abstract only | 0.520 | — | 5.1 | 42.30 |
| Detailed geometric | 0.280 | — | 2.1 | 71.40 |

Key finding: Detailed prompts (71.4% visual pass) vs abstract (42.3%). More spec = better output. Directly validates our 4-tier benchmark design.
arXiv 2025
Text-to-CadQuery — CadQuery Code Generation from Text
arXiv:2505.06507

Qwen2.5-3B and CodeGPT-small (124M) generate CadQuery Python code instead of command sequences. Uses the same text-annotated DeepCAD test set as Text2CAD for direct comparison. Code generation wins at 10× fewer parameters.

Metrics reported: Median CD · Mean CD · Invalid Rate · Gemini Visual Score (%)

| Method | Params | Median CD | Mean CD | IR (%) | Gemini Visual |
| --- | --- | --- | --- | --- | --- |
| Text-to-CadQuery (Qwen2.5-3B) | 3B | 0.191 | 10.23 | 6.5 | 69.3% |
| Text-to-CadQuery (CodeGPT-small) | 124M | 0.234 | 13.52 | — | 60.3% |
| Text2CAD (command sequences, prior SOTA) | 363M | 0.370 | 26.42 | 3.5 | 58.8% |

Key finding: A 124M model generating CadQuery code beats a 363M model generating command sequences. CadQuery is the right output format — LLMs already know Python syntax from pre-training.
arXiv 2025
CAD-Coder — Chain-of-Thought + Geometric RL Reward
arXiv:2505.19713

Qwen2.5-7B fine-tuned in two stages: SFT on text-code pairs, then GRPO (Group Relative Policy Optimization) with a geometric reward based on Chamfer Distance. Shows the single biggest performance jump in the field — 91% improvement from RL.

Metrics reported: Mean CD · Invalid Rate

Reward design: Execute the generated code → compute CD vs ground truth → use CD as RL reward. Training cost: 146 hours on 8× A800 GPUs for GRPO stage.

| Training regime | Mean CD (×10³) | IR (%) |
| --- | --- | --- |
| SFT only (no RL) | 74.55 | — |
| SFT + GRPO (geometric reward) | 6.54 | 1.45 |
| GPT-4o zero-shot | 133.52 | 93.0 |

Key finding: GRPO reduces Mean CD 74.55 → 6.54 (91% improvement). RL with geometric feedback is the biggest lever in the field right now.
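
The reward loop can be sketched as follows. `execute_and_sample` and the exact reward shaping are placeholders — the paper specifies only that executed geometry is scored against ground truth via Chamfer Distance:

```python
def geometric_reward(generated_code, gt_points, execute_and_sample,
                     chamfer, fail_penalty=-1.0):
    """CD-based RL reward in the spirit of CAD-Coder's GRPO stage
    (shaping is an assumption): run the code, sample a point cloud
    from the resulting solid, and reward low Chamfer Distance.
    Code that fails to execute gets a flat penalty."""
    try:
        points = execute_and_sample(generated_code)
    except Exception:
        return fail_penalty              # invalid geometry -> floor reward
    cd = chamfer(points, gt_points)
    return 1.0 / (1.0 + cd)              # CD=0 -> 1.0, large CD -> ~0

# Stub harness: a "solid" is just a point list here
def broken(code):
    raise RuntimeError("execution failed")

ok = geometric_reward("good", [(0, 0, 0)], lambda c: [(0, 0, 0)],
                      lambda a, b: 0.0)
bad = geometric_reward("broken", [(0, 0, 0)], broken, lambda a, b: 0.0)
print(ok, bad)  # 1.0 -1.0
```

The key property is that the reward is computed from *executed geometry*, not token overlap — the model is optimized directly for the metric the field reports.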
ICML 2025
CADFusion — Text-to-CAD with Iterative Visual Feedback Training
Wang et al. · arXiv:2501.19054

LLaMA-3-8B with two alternating training stages: supervised learning on text-CAD pairs, then DPO with a VLM scorer (LLaVA) generating preference pairs. Alternates 5 iterations to prevent skill degradation.

Metrics reported: LVM Score (0–10) · Mean CD · Invalid Rate

VLM scorer: LLaVA-OneVision-Qwen2-7B rates rendered outputs on shape quality, quantity, and distribution. ~1,500 preference pairs per DPO iteration from 1,000 prompts.

| Method | LVM Score (/10) | Mean CD (×10³) | IR (%) |
| --- | --- | --- | --- |
| VLM-annotated (no human captions) | 6.56 | — | — |
| Human-annotated SL only | 7.69 | — | — |
| CADFusion (SL + 1 VF iter) | 8.28 | — | — |
| CADFusion (SL + 3 VF iter) | 8.76 | — | — |
| CADFusion (SL + 5 VF iter) | 8.96 | 19.89 | 6.20 |
| GPT-4o zero-shot | 5.13 | 133.52 | 93.0 |

Note on LVM Score: CADFusion uses this as primary metric (not CD) because CD doesn't capture "does it look right." The LVM scorer approximates human visual judgment and is more aligned with what users actually want.
AAAI 2025
CAD-GPT — Multimodal LLM with Spatial Reasoning
arXiv:2412.19663

LLaVA-1.5-7B augmented with spatial localization tokens (3D coordinates → 1D tokens). Takes image OR text as input. Key result: image-to-CAD is 3× easier than text-to-CAD.

Metrics reported: Invalid Rate · Median CD · ACC_cmd · ACC_param

| Method | Input | IR (%) | Median CD (×10³) | ACC_cmd | ACC_param |
| --- | --- | --- | --- | --- | --- |
| CAD-GPT | Image | 1.61 | 9.77 | 99.21 | 98.87 |
| HNC-CAD (prior best) | Image | 18.64 | 18.64 | — | — |
| GPT-4 few-shot | Image | 64.37 | 62.64 | — | — |
| CAD-GPT | Text | 7.43 | 28.33 | 98.73 | 98.12 |
| GPT-4 few-shot | Text | 76.97 | 187.52 | — | — |
| LLaMA-3.1 few-shot | Text | 98.68 | — | — | — |

Key finding: Image-to-CAD (CD=9.77) is 3× easier than text-to-CAD (CD=28.33). LLaMA-3.1 zero-shot: 98.68% IR — essentially always fails without fine-tuning.
ICCV 2025
CAD-Recode — Reverse-Engineering CadQuery from Point Clouds
Rukhovich et al. · arXiv:2412.14042

Different task: input = point cloud (from a scan), output = CadQuery code. Qwen2-1.5B + lightweight point cloud projector. Trained on 1M procedurally generated scripts (no human annotation). 10× improvement over prior methods.

Metrics reported: Chamfer Distance · IoU · Invalid Rate

| Dataset | Mean CD | Median CD | IoU (%) | IR (%) |
| --- | --- | --- | --- | --- |
| DeepCAD test | 0.30 | 0.16 | 92.0 | 0.4 |
| Fusion 360 test | 0.35 | 0.15 | 87.8 | 0.5 |
| CC3D (real scans) | 0.76 | 0.31 | 74.2 | 0.3 |
| Prior SOTA (CAD-SIGNet) | 3.33 | 2.36 | 81.5 | — |

10× improvement. Real scans drop IoU from 92% → 74% — real-world is harder but still impressive. Key insight: 1M synthetic CadQuery scripts (procedurally generated) can substitute for expensive human annotation.
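
The IoU reported here is volumetric: both solids are voxelized on a shared grid and the overlap is divided by the union. A sketch using point-membership functions in place of real mesh voxelization (bounds and resolution are assumptions):

```python
def voxel_iou(inside_a, inside_b, bounds=(-1.0, 1.0), res=16):
    """Volumetric IoU of two solids given as point-membership tests
    (point -> bool). Real pipelines voxelize meshes instead; the
    grid bounds and resolution here are illustrative."""
    lo, hi = bounds
    step = (hi - lo) / res
    inter = union = 0
    for i in range(res):
        for j in range(res):
            for k in range(res):
                p = (lo + (i + 0.5) * step,
                     lo + (j + 0.5) * step,
                     lo + (k + 0.5) * step)
                a, b = inside_a(p), inside_b(p)
                inter += a and b        # voxel inside both solids
                union += a or b         # voxel inside either solid
    return inter / union if union else 1.0

ball = lambda p: sum(x * x for x in p) <= 0.25   # radius-0.5 ball
print(voxel_iou(ball, ball))  # identical solids -> 1.0
```
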
ICLR 2025
FlexCAD — Unified Controllable CAD Generation with Fine-Tuned LLMs
Microsoft · arXiv:2411.05823

LLaMA-3-8B + LoRA, trained with hierarchy-aware masking. Can be conditioned at any level of the CAD hierarchy: sketch, extrusion, face, loop, or curve.

Metrics reported: Coverage (COV) · MMD · JSD · Physical Validity (PV) · Human Realism Score

| Method | COV (%) | PV (%) | Human realism (%) |
| --- | --- | --- | --- |
| FlexCAD (extrusion-level) | 68.5 | 93.3 | 42.1 |
| FlexCAD (sketch-level) | 65.6 | 93.4 | 39.6 |
| SkexGen | 55.2 | 72.6 | 21.3 |
| GPT-4o (no fine-tune) | 40.1 | 48.9 | 12.8 |

ICLR 2025
Generating CAD Code with VLMs — Automated Verification
arXiv:2410.05340

GPT-4 generates CAD code (OpenSCAD); the code is rendered, a VLM answers Yes/No verification questions about the render, and corrective feedback is fed back to the generator. Custom 50-prompt benchmark. Gets within 5% of human-in-the-loop performance.

Metrics reported: Compile Rate · IoGT · Point Cloud Distance · Hausdorff Distance

| Method | Compile Rate (%) | IoGT | PC Distance | Hausdorff Dist |
| --- | --- | --- | --- | --- |
| GPT-4 + CADCodeVerify | 96.5 | 0.944 | 0.127 | 0.419 |
| GPT-4 + 3D-Premise | 91.0 | 0.921 | 0.137 | 0.452 |
| GPT-4 (no refinement) | 91.0 | 0.912 | 0.142 | — |
| Human-in-the-loop (upper bound) | — | — | 0.120 | — |

Automated VLM verification gets within 5% of a human reviewer. Loop: render → generate Yes/No questions → answer with chain-of-thought → fix code → repeat.
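
The loop's control flow can be sketched as below; all the callables are placeholders standing in for the paper's GPT-4 code generator, renderer, and VLM question-answering step:

```python
def verify_and_refine(prompt, generate, render, vlm_issues, max_rounds=3):
    """CADCodeVerify-style loop (control flow only): generate code,
    render it, ask a VLM for failed yes/no checks, and regenerate
    with that feedback until the checks pass or rounds run out."""
    code = generate(prompt, feedback=None)
    for _ in range(max_rounds):
        image = render(code)
        issues = vlm_issues(prompt, image)   # e.g. ["hole is missing"]
        if not issues:
            break                            # VLM found nothing to fix
        code = generate(prompt, feedback=issues)
    return code

# Stub run: the first draft has an issue, the revision passes
drafts = iter(["draft_v0", "draft_v1"])
code = verify_and_refine(
    "hollow cylinder",
    generate=lambda p, feedback: next(drafts),
    render=lambda c: f"render({c})",
    vlm_issues=lambda p, img: ["wall too thick"] if "v0" in img else [],
)
print(code)  # draft_v1
```
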

6. The Big Comparison Tables

All text-to-CAD methods ranked by Median CD

| Rank | Method | Year | Output | Params | Median CD (×10³) | Mean CD | IR (%) | Visual |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | CAD-Coder (GRPO) | 2025 | CadQuery | 7B | ~0.17 | 6.54 | 1.45 | — |
| 2 | Text-to-CadQuery (Qwen2.5-3B) | 2025 | CadQuery | 3B | 0.191 | 10.23 | 6.5 | 69.3% Gemini |
| 3 | Text-to-CadQuery (CodeGPT-small) | 2025 | CadQuery | 124M | 0.234 | 13.52 | — | 60.3% Gemini |
| 4 | Text2CAD | 2024 | Cmd seq | 363M | 0.370 | 26.42 | 3.5 | 58.8% Gemini |
| 5 | CADFusion (5-iter DPO) | 2025 | Cmd seq | 8B | — | 19.89 | 6.2 | 8.96/10 LVM |
| 6 | CAD-GPT (text input) | 2025 | Cmd seq | 7B | 28.33 | — | 7.43 | — |
| — | GPT-4o zero-shot | 2024 | Code | — | — | 133.52 | 93.0 | 5.13/10 LVM |
| — | Claude-3.7 zero-shot | 2025 | Code | — | — | 186.53 | 47.0 | — |
| — | DeepSeek-V3 zero-shot | 2025 | Code | — | — | 186.69 | 52.0 | — |

CAD Arena Run 1 (our data — 20 prompts × 4 models)

| Model | API calls | STL success | STL % | Avg latency | Note |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 20/20 | 16/20 | 80% | 8.6s | Zero-shot, no fine-tuning |
| Zoo ML-ephant | 16/20 | 16/20 | 80% | 64.8s | Proprietary; returns STL directly |
| GPT-5 | 20/20 | 14/20 | 70% | 23.3s | Zero-shot, no fine-tuning |
| Gemini 2.5 Flash | 7/20 | 7/20 | 35% | ~3s | Rate limited — not model failure |

Our "STL %" is the execution success rate — the complement of Invalid Rate (100% − IR). We do NOT yet have Chamfer Distance; that requires ground-truth STL files (Phase 3 in NEXT-STEPS.md).

7. What Does "Good" Actually Mean?

| Definition | How measured | Limitation |
| --- | --- | --- |
| It executes | Invalid Rate, Compile Rate | 0% IR doesn't mean correct shapes |
| Geometrically close to ground truth | Chamfer Distance | Requires ground truth. Misses editability, intent, manufacturing. |
| Visually matches the prompt | VLM/Gemini score | Subjective, not standardized across papers |
| Looks like a realistic part | Human realism study | Expensive, high variance |
| Correct operations in sequence | ACC_cmd, ACC_param | Command-sequence models only |
| Covers the design space | COV, MMD, JSD | Diversity metric, not prompt-following accuracy |

"There is no agreed-upon benchmark for text-to-CAD. Chamfer Distance measures geometric similarity but not parametric editability, manufacturing feasibility, or constraint satisfaction." — LLM survey (Zhang et al., 2025) and GDL survey (Heidari & Iosifidis, 2025), independently

Best current approach: combine low IR (it runs) + low CD (geometry matches) + high VLM score (looks right). No paper does all three perfectly — and none measure editability, manufacturability, or constraint correctness.

8. What No One Measures Yet

| Gap | Why it matters | Status |
| --- | --- | --- |
| No standard prompt set | Every paper uses different prompts — results can't be directly compared across papers. | CAD Arena is designed to fix this. |
| CD ≠ design quality | A shape can have low CD yet be mechanically wrong. | VLM scores used as proxy. |
| No editability metric | The core value of CAD is "change a dimension, the model updates." Never tested. | Not measured anywhere. |
| No manufacturing feasibility | Generated parts often have impossible tolerances or non-machinable geometry. | Not measured anywhere. |
| DeepCAD too simple | Only sketch-and-extrude — not representative of real engineering parts. | CAD-Recode adds CC3D real scans. |
| No assembly evaluation | All eval is single-part. Real products are multi-part assemblies. | Not addressed by any paper. |
| Human preference / Elo | The most meaningful signal for "which model is better." What LLM Arena uses. | This is what the Arena page would provide. |
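
If Arena-style preference voting were added, ratings would follow the standard Elo update used by LLM arenas (the K-factor of 32 is an assumption; arena leaderboards also use Bradley-Terry fits over the full vote history):

```python
def elo_update(r_a, r_b, a_won, k=32):
    """One pairwise preference vote: expected score from the rating
    gap (logistic curve, 400-point scale), then shift both ratings
    toward the observed outcome."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```
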

What CAD Arena has vs. what's missing

✓ Invalid Rate (execution success)
✓ 4-tier difficulty structure
✓ API vs academic model comparison
✓ Per-model, per-tier breakdown
✗ Chamfer Distance (needs ground truth STLs)
✗ VLM visual score (easy to add with Gemini)
✗ Human preference votes (the arena part)
✗ Editability / manufacturability metrics

Compiled from 173 papers in feb8/papers-database.md · Deep analysis in feb8/report.md · CAD Arena benchmark results from eval/results/20260303_210402 · March 2026