How Papers Evaluate Text-to-CAD

Compiled from 173 papers in our database — what metrics each paper uses, what prompts they test, what datasets they evaluate on, and what "good" actually means · March 2026

Contents

  1. TL;DR — The 4 things every paper measures
  2. All metrics defined
  3. Evaluation datasets used
  4. What prompts do papers test?
  5. Paper-by-paper breakdown
  6. The big comparison tables
  7. What does "good" actually mean?
  8. What no one measures yet

1. TL;DR — The 4 things every paper measures

IR
Invalid Rate — % of outputs that fail to produce valid geometry at all
CD
Chamfer Distance — geometric distance between generated shape and ground truth (×10³)
COV
Coverage — % of reference shapes well-approximated by at least one generated shape
VLM
Visual score — a VLM (Gemini / LLaVA) rates the rendered output 0–10 or votes which is better
The field is split by output format. Papers generating command sequences (DeepCAD-style) use Command Accuracy + CD. Papers generating CadQuery/FreeCAD code use Invalid Rate + CD. Both converge on CD as the common currency — but CD requires ground truth geometry, which means you need a curated dataset of correct answers for every prompt.

2. All Metrics Defined

Geometry Quality

| Metric | What it measures | Direction | Who uses it |
| --- | --- | --- | --- |
| Chamfer Distance (CD) | Average nearest-neighbor distance between two point clouds sampled from the generated and reference shapes. Reported as CD ×10³. | Lower = better | Almost everyone |
| Hausdorff Distance | Maximum nearest-neighbor distance (worst-case outlier). More sensitive to geometric errors than CD. | Lower = better | CADCodeVerify |
| IoU | Volumetric overlap between generated and reference shapes, computed by voxelizing both. 1.0 = perfect match. | Higher = better | CAD-Recode, CADCodeVerify |
| IoGT | IoU measured specifically against the ground-truth shape (CADCodeVerify variant). | Higher = better | CADCodeVerify (ICLR 2025) |
| Point Cloud Distance | Direct average distance between point clouds (a non-symmetric variant of CD). | Lower = better | CADCodeVerify |
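
Both distance metrics can be sketched in a few lines of plain Python. Exact conventions vary by paper (squared vs. unsquared distances, normalization, sample counts), so treat this as one common variant, not a canonical definition:

```python
import math

def nearest_sq_dists(src, dst):
    """For each point in src, squared distance to its nearest point in dst."""
    return [min(sum((p - q) ** 2 for p, q in zip(a, b)) for b in dst)
            for a in src]

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor squared distance,
    averaged over both directions. Papers report this value ×10³."""
    return (sum(nearest_sq_dists(pts_a, pts_b)) / len(pts_a)
            + sum(nearest_sq_dists(pts_b, pts_a)) / len(pts_b))

def hausdorff_distance(pts_a, pts_b):
    """Symmetric Hausdorff Distance: worst-case nearest-neighbor distance."""
    return math.sqrt(max(max(nearest_sq_dists(pts_a, pts_b)),
                         max(nearest_sq_dists(pts_b, pts_a))))

cube = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(chamfer_distance(cube, cube))  # identical clouds -> 0.0
```

In real evaluations the point clouds are sampled (typically a few thousand points) from the generated and ground-truth meshes, and the nearest-neighbor search uses a KD-tree rather than this brute-force loop.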

Validity / Executability

| Metric | What it measures | Key numbers |
| --- | --- | --- |
| Invalid Rate (IR) | % of outputs that fail to produce valid, non-degenerate geometry. The minimum bar — "did it even work." | GPT-4o zero-shot: 93%. Fine-tuned models: 1–7%. |
| Compile / Parsing Rate | % of generated code that executes without syntax or runtime error. Specific to code-gen papers. | CADCodeVerify: 96.5% with VLM feedback loop. |
| Physical Validity (PV) | % of outputs that produce a non-self-intersecting, watertight solid. | FlexCAD fine-tuned: 93.4%. GPT-4o raw: 48.9%. |
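
For code-generating models, IR is typically measured by executing each generated script and counting failures. A minimal sketch — the convention that a script binds a `result` variable is an assumption, and real harnesses sandbox the execution and additionally validate the resulting solid:

```python
def invalid_rate(generated_scripts):
    """Fraction of generated scripts that raise an error or bind no
    geometry. Real pipelines also check that the solid is
    non-degenerate and watertight, and run untrusted code sandboxed."""
    failures = 0
    for src in generated_scripts:
        env = {}
        try:
            exec(src, env)                 # run the generated code
        except Exception:
            failures += 1                  # syntax or runtime error
            continue
        if env.get("result") is None:      # assumed convention: script binds `result`
            failures += 1
    return failures / len(generated_scripts)

scripts = ["result = 'a solid'", "result = undefined_name", "x = 1"]
print(invalid_rate(scripts))  # 2 of 3 fail -> ~0.667
```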

For Command Sequence Models Only (DeepCAD lineage)

| Metric | What it measures |
| --- | --- |
| Command Accuracy (ACC_cmd) | % of generated commands matching the ground-truth command type (line, arc, circle, extrude, etc.). DeepCAD achieves 99.50% on reconstruction. |
| Parameter Accuracy (ACC_param) | % of predicted parameters within tolerance η=3 out of 256 quantization levels. DeepCAD achieves 97.98%. |
| Coverage (COV) | % of reference shapes well-approximated by at least one generated shape. Measures diversity of the output distribution. |
| Minimum Matching Distance (MMD) | Average CD of each reference shape to its nearest generated shape. Measures fidelity. |
| Jensen-Shannon Divergence (JSD) | Distribution-level similarity between generated and reference point-cloud distributions. |
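
COV and MMD are both nearest-neighbor statistics over *sets* of shapes. A sketch with an abstract distance function — any shape metric (in practice, CD) can be plugged in:

```python
def cov_and_mmd(generated, reference, dist):
    """COV: each generated shape claims its nearest reference shape;
    coverage is the fraction of reference shapes claimed at least once
    (diversity). MMD: mean distance from each reference shape to its
    nearest generated shape (fidelity)."""
    claimed = {min(range(len(reference)), key=lambda j: dist(g, reference[j]))
               for g in generated}
    cov = len(claimed) / len(reference)
    mmd = sum(min(dist(g, r) for g in generated)
              for r in reference) / len(reference)
    return cov, mmd

# Toy 1-D "shapes" with absolute difference as the distance
gen, ref = [0.0, 10.0], [0.0, 1.0, 10.0]
print(cov_and_mmd(gen, ref, lambda a, b: abs(a - b)))  # ≈ (0.667, 0.333)
```

Note the asymmetry: a model that copies one reference shape perfectly gets excellent MMD but terrible COV, which is why the two are always reported together.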

Visual / Perceptual

| Metric | What it measures | Used by |
| --- | --- | --- |
| LVM / VLM Score (0–10) | A vision-language model (LLaVA, Gemini) rates the rendered CAD output on shape quality, quantity, and visual fidelity vs. the text prompt. Not standardized across papers. | CADFusion (LLaVA scorer), Text-to-CadQuery (Gemini) |
| Gemini Visual Eval (%) | Gemini is shown the rendered STL and the original text prompt and asked whether they match. Binary pass/fail per example. | Text2CAD, Text-to-CadQuery |
| Human Realism Score | Human annotators rate whether generated shapes look like realistic mechanical parts (not a specific match). | FlexCAD: 42.1% best model vs. 12.8% GPT-4o |

3. Evaluation Datasets

Only a handful of datasets matter for text-to-CAD evaluation — three shared ones, plus a long tail of small custom benchmarks:

| Dataset | Size | Source | Format | Papers that eval on it |
| --- | --- | --- | --- | --- |
| DeepCAD dataset | 178,238 models | Onshape (real engineering CAD) | Command sequences (sketch + extrude) | DeepCAD, Text2CAD, CAD-GPT, CAD-Coder, CADFusion, FlexCAD, Text-to-CadQuery — basically everyone |
| Text-annotated DeepCAD | ~20K text–CAD pairs | DeepCAD + VLM-generated captions (+ some human) | Sequences + text at 4 specificity levels | Text2CAD, Text-to-CadQuery, CADFusion |
| Fusion 360 Gallery | 8,625 models | Autodesk | B-rep + mesh | CAD-Recode (secondary eval set) |
| CC3D (real scans) | Real-world point clouds | 3D scans of physical objects | Point clouds | CAD-Recode — tests generalization to messy real-world data |
| Custom hand-crafted benchmarks | 10–100 prompts | Paper authors | Text prompts | Kumar et al. 2026 (10 levels), CADCodeVerify (50 prompts), CAD Arena (our 20 prompts) |

The big limitation of DeepCAD: Only sketch-and-extrude operations. No fillets, chamfers, sweeps, lofts, patterns. Every model evaluated on it is only tested on simple prismatic shapes. When CAD-Recode tests on real scans (CC3D), IoU drops from 92% → 74%. The standard benchmark is easier than real-world engineering.

4. What Prompts Do Papers Actually Test?

Text2CAD (NeurIPS 2024) — 4 abstraction levels per shape

Same shape described at 4 specificity levels. All 4 tested and reported separately. Key insight: detailed prompts score much better.

Level 1 — Abstract
"A mechanical bracket"
Level 2 — Simplified
"A rectangular bracket with a mounting hole"
Level 3 — Generalized geometric
"A flat rectangular plate with a circular hole centered near one end, used for mounting"
Level 4 — Detailed geometric (with dimensions)
"A rectangular extruded plate 80mm × 40mm × 5mm with a 10mm diameter circular hole centered 15mm from one end, with chamfered top edges at 45 degrees"

Result: abstract prompts → 42.3% Gemini visual pass. Detailed prompts → 71.4%. More info = better output.

Kumar et al. 2026 — 10 complexity levels with GPT-4 + FreeCAD

Manual benchmark showing exactly where LLMs break down:

| # | Task | Result | Attempts | Time |
| --- | --- | --- | --- | --- |
| 1 | Basic cube 50mm | ✓ Success | 1 | 19s |
| 2 | Cylinder r=25, h=60mm | ✓ Success | 1 | 20s |
| 3 | Rectangular box with 5mm fillets | ✓ Success | 2 | 42s |
| 4 | Boolean union (box + cylinder) | ✓ Success | 1 | 22s |
| 5 | Box with cylindrical hole subtracted | ✓ Success | 1 | 23s |
| 6 | Parametric plate with 4 corner holes | ✓ Success | 1 | 28s |
| 7 | Parametric hinge with multiple constrained segments | ✓ Success | 3 | 54s |
| 8 | Involute gear, 20 teeth, module 2mm | ✗ FAILED | 50 (max) | 836s |
| 9 | L-plate with complex cutouts | ✓ Success | 3 | 81s |
| 10 | Fully constrained structural frame with ribs | ✗ FAILED | 50 (max) | 909s |

CADCodeVerify (ICLR 2025) — 50 hand-crafted engineering prompts

Simple end
"Generate a hollow cylinder with inner radius 15mm, outer radius 20mm, height 40mm"
Complex end
"Create a hexagonal bolt head with M10 thread specification, standard DIN933 dimensions, with a through-hole for the shaft"

CAD-GPT (AAAI 2025) — auto-generated from DeepCAD test shapes

VLMs (Gemini, LLaVA) generate text descriptions of test-set shapes automatically. Model must reconstruct those shapes from the descriptions. Ensures test distribution matches training distribution.

Example auto-generated description
"A symmetric part with two cylindrical protrusions on each side of a flat base, extruded 12 units tall, with a central rectangular slot running the full length"

CAD Arena (our benchmark) — 20 prompts across 4 tiers

| Tier | Description | Example | Claude | GPT-5 | Zoo |
| --- | --- | --- | --- | --- | --- |
| T1 — Simple primitives | One shape | "A cube with side length 20mm" | 4/5 | 5/5 | 5/5 |
| T2 — Single part + features | One shape with holes/fillets | "A cylinder with a centered through-hole" | 5/5 | 4/5 | 4/5 |
| T3 — Multi-feature | Several operations combined | "An L-bracket with 3 mounting holes" | 4/5 | 3/5 | 5/5 |
| T4 — Complex functional | Specialized geometry | "A spur gear with 24 teeth, module 2" | 3/5 | 2/5 | 2/5 |

5. Paper-by-Paper Breakdown

ICCV 2021
DeepCAD — Deep Generative Network for CAD Models
Wu et al. · arXiv:2105.09492

The foundational paper. Trains a Transformer autoencoder on 178K CAD command sequences from Onshape. Created the dataset and evaluation protocol the entire field uses.

Metrics reported: ACC_cmd · ACC_param · Chamfer Distance · Invalid Rate · Coverage · MMD · JSD

Task: Autoencoding (encode → latent → decode) + unconditional generation. No text prompts.

| Method | ACC_cmd (%) | ACC_param (%) | Median CD (×10³) | IR (%) |
| --- | --- | --- | --- | --- |
| DeepCAD + Augmentation | 99.50 | 97.98 | 0.752 | 2.72 |
| DeepCAD (no aug) | 99.36 | 97.47 | 0.787 | 3.30 |
| Alt-Regression baseline | — | — | 2.14 | 24.32 |

Also reports generation diversity: COV=78.13, JSD=3.76. "Comparable to point-cloud generative models while producing sharp, editable CAD." This is what every subsequent paper benchmarks against.
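
Reconstruction accuracy on command sequences can be sketched as below. The tuple layout and the rule that parameters are only scored when command types match are assumptions modeled on the DeepCAD protocol:

```python
def command_and_param_accuracy(pred_seq, gt_seq, eta=3):
    """ACC_cmd / ACC_param over aligned command sequences.

    Each command is (type, [quantized params]); a parameter counts as
    correct when within eta quantization levels of ground truth
    (DeepCAD: eta=3 out of 256 levels)."""
    cmd_hits = sum(p[0] == g[0] for p, g in zip(pred_seq, gt_seq))
    hits = total = 0
    for (p_type, p_params), (g_type, g_params) in zip(pred_seq, gt_seq):
        if p_type != g_type:
            continue  # assumption: parameters scored only on matching types
        for a, b in zip(p_params, g_params):
            total += 1
            hits += abs(a - b) <= eta
    acc_cmd = cmd_hits / len(gt_seq)
    acc_param = hits / total if total else 1.0
    return acc_cmd, acc_param

pred = [("line", [10, 20]), ("extrude", [100])]
gt = [("line", [12, 20]), ("circle", [100])]
print(command_and_param_accuracy(pred, gt))  # (0.5, 1.0)
```
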
NeurIPS 2024
Text2CAD — Sequential CAD Designs from Beginner-to-Expert Text
Khan et al. · arXiv:2409.17106

First major text-to-CAD paper. 363M model generates DeepCAD command sequences from text. Created the text-annotated DeepCAD dataset (4 levels per shape) that subsequent papers reuse for direct comparison.

Metrics reported: Median CD · Mean CD · Invalid Rate · Gemini Visual Score (%)

Gemini eval: Renders the STL, shows image + original text prompt to Gemini, asks "does this match?" Reports % pass per text abstraction level.

| Prompt level | Median CD (×10³) | Mean CD | IR (%) | Gemini Visual (%) |
| --- | --- | --- | --- | --- |
| All levels combined | 0.370 | 26.42 | 3.5 | 58.80 |
| Abstract only | 0.520 | — | 5.1 | 42.30 |
| Detailed geometric | 0.280 | — | 2.1 | 71.40 |

Key finding: Detailed prompts (71.4% visual pass) vs abstract (42.3%). More spec = better output. Directly validates our 4-tier benchmark design.
arXiv 2025
Text-to-CadQuery — CadQuery Code Generation from Text
arXiv:2505.06507

Qwen2.5-3B and CodeGPT-small (124M) generate CadQuery Python code instead of command sequences. Uses the same text-annotated DeepCAD test set as Text2CAD for direct comparison. Code generation wins at 10× fewer parameters.

Metrics reported: Median CD · Mean CD · Invalid Rate · Gemini Visual Score (%)

| Method | Params | Median CD | Mean CD | IR (%) | Gemini Visual |
| --- | --- | --- | --- | --- | --- |
| Text-to-CadQuery (Qwen2.5-3B) | 3B | 0.191 | 10.23 | 6.5 | 69.3% |
| Text-to-CadQuery (CodeGPT-small) | 124M | 0.234 | 13.52 | — | 60.3% |
| Text2CAD (command sequences, prior SOTA) | 363M | 0.370 | 26.42 | 3.5 | 58.8% |

Key finding: A 124M model generating CadQuery code beats a 363M model generating command sequences. CadQuery is the right output format — LLMs already know Python syntax from pre-training.
arXiv 2025
CAD-Coder — Chain-of-Thought + Geometric RL Reward
arXiv:2505.19713

Qwen2.5-7B fine-tuned in two stages: SFT on text-code pairs, then GRPO (Group Relative Policy Optimization) with a geometric reward based on Chamfer Distance. Shows the single biggest performance jump in the field — 91% improvement from RL.

Metrics reported: Mean CD · Invalid Rate

Reward design: Execute the generated code → compute CD vs ground truth → use CD as RL reward. Training cost: 146 hours on 8× A800 GPUs for GRPO stage.

| Training regime | Mean CD (×10³) | IR (%) |
| --- | --- | --- |
| SFT only (no RL) | 74.55 | — |
| SFT + GRPO (geometric reward) | 6.54 | 1.45 |
| GPT-4o zero-shot | 133.52 | 93.0 |

Key finding: GRPO reduces Mean CD 74.55 → 6.54 (91% improvement). RL with geometric feedback is the biggest lever in the field right now.
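
The reward loop can be sketched as follows. `execute_and_sample` and the exact reward shaping are placeholders — the paper specifies only that executed geometry is scored against ground truth via Chamfer Distance:

```python
def geometric_reward(generated_code, gt_points, execute_and_sample,
                     chamfer, fail_penalty=-1.0):
    """CD-based RL reward in the spirit of CAD-Coder's GRPO stage
    (shaping is an assumption): run the code, sample a point cloud
    from the resulting solid, and reward low Chamfer Distance.
    Code that fails to execute gets a flat penalty."""
    try:
        points = execute_and_sample(generated_code)
    except Exception:
        return fail_penalty              # invalid geometry -> floor reward
    cd = chamfer(points, gt_points)
    return 1.0 / (1.0 + cd)              # CD=0 -> 1.0, large CD -> ~0

# Stub harness: a "solid" is just a point list here
def broken(code):
    raise RuntimeError("execution failed")

ok = geometric_reward("good", [(0, 0, 0)], lambda c: [(0, 0, 0)],
                      lambda a, b: 0.0)
bad = geometric_reward("broken", [(0, 0, 0)], broken, lambda a, b: 0.0)
print(ok, bad)  # 1.0 -1.0
```

The key property is that the reward is computed from *executed geometry*, not token overlap — the model is optimized directly for the metric the field reports.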
ICML 2025
CADFusion — Text-to-CAD with Iterative Visual Feedback Training
Wang et al. · arXiv:2501.19054

LLaMA-3-8B with two alternating training stages: supervised learning on text-CAD pairs, then DPO with a VLM scorer (LLaVA) generating preference pairs. Alternates 5 iterations to prevent skill degradation.

Metrics reported: LVM Score (0–10) · Mean CD · Invalid Rate

VLM scorer: LLaVA-OneVision-Qwen2-7B rates rendered outputs on shape quality, quantity, and distribution. ~1,500 preference pairs per DPO iteration from 1,000 prompts.

| Method | LVM Score (/10) | Mean CD (×10³) | IR (%) |
| --- | --- | --- | --- |
| VLM-annotated (no human captions) | 6.56 | — | — |
| Human-annotated SL only | 7.69 | — | — |
| CADFusion (SL + 1 VF iter) | 8.28 | — | — |
| CADFusion (SL + 3 VF iter) | 8.76 | — | — |
| CADFusion (SL + 5 VF iter) | 8.96 | 19.89 | 6.20 |
| GPT-4o zero-shot | 5.13 | 133.52 | 93.0 |

Note on LVM Score: CADFusion uses this as primary metric (not CD) because CD doesn't capture "does it look right." The LVM scorer approximates human visual judgment and is more aligned with what users actually want.
AAAI 2025
CAD-GPT — Multimodal LLM with Spatial Reasoning
arXiv:2412.19663

LLaVA-1.5-7B augmented with spatial localization tokens (3D coordinates → 1D tokens). Takes image OR text as input. Key result: image-to-CAD is 3× easier than text-to-CAD.

Metrics reported: Invalid Rate · Median CD · ACC_cmd · ACC_param

| Method | Input | IR (%) | Median CD (×10³) | ACC_cmd | ACC_param |
| --- | --- | --- | --- | --- | --- |
| CAD-GPT | Image | 1.61 | 9.77 | 99.21 | 98.87 |
| HNC-CAD (prior best) | Image | 18.64 | 18.64 | — | — |
| GPT-4 few-shot | Image | 64.37 | 62.64 | — | — |
| CAD-GPT | Text | 7.43 | 28.33 | 98.73 | 98.12 |
| GPT-4 few-shot | Text | 76.97 | 187.52 | — | — |
| LLaMA-3.1 few-shot | Text | 98.68 | — | — | — |

Key finding: Image-to-CAD (CD=9.77) is 3× easier than text-to-CAD (CD=28.33). LLaMA-3.1 zero-shot: 98.68% IR — essentially always fails without fine-tuning.
ICCV 2025
CAD-Recode — Reverse-Engineering CadQuery from Point Clouds
Rukhovich et al. · arXiv:2412.14042

Different task: input = point cloud (from a scan), output = CadQuery code. Qwen2-1.5B + lightweight point cloud projector. Trained on 1M procedurally generated scripts (no human annotation). 10× improvement over prior methods.

Metrics reported: Chamfer Distance · IoU · Invalid Rate

| Dataset | Mean CD | Median CD | IoU (%) | IR (%) |
| --- | --- | --- | --- | --- |
| DeepCAD test | 0.30 | 0.16 | 92.0 | 0.4 |
| Fusion 360 test | 0.35 | 0.15 | 87.8 | 0.5 |
| CC3D (real scans) | 0.76 | 0.31 | 74.2 | 0.3 |
| Prior SOTA (CAD-SIGNet) | 3.33 | 2.36 | 81.5 | — |

10× improvement. Real scans drop IoU from 92% → 74% — real-world is harder but still impressive. Key insight: 1M synthetic CadQuery scripts (procedurally generated) can substitute for expensive human annotation.
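
The IoU reported here is volumetric: both solids are voxelized on a shared grid and the overlap is divided by the union. A sketch using point-membership functions in place of real mesh voxelization (bounds and resolution are assumptions):

```python
def voxel_iou(inside_a, inside_b, bounds=(-1.0, 1.0), res=16):
    """Volumetric IoU of two solids given as point-membership tests
    (point -> bool). Real pipelines voxelize meshes instead; the
    grid bounds and resolution here are illustrative."""
    lo, hi = bounds
    step = (hi - lo) / res
    inter = union = 0
    for i in range(res):
        for j in range(res):
            for k in range(res):
                p = (lo + (i + 0.5) * step,
                     lo + (j + 0.5) * step,
                     lo + (k + 0.5) * step)
                a, b = inside_a(p), inside_b(p)
                inter += a and b        # voxel inside both solids
                union += a or b         # voxel inside either solid
    return inter / union if union else 1.0

ball = lambda p: sum(x * x for x in p) <= 0.25   # radius-0.5 ball
print(voxel_iou(ball, ball))  # identical solids -> 1.0
```
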
ICLR 2025
FlexCAD — Unified Controllable CAD Generation with Fine-Tuned LLMs
Microsoft · arXiv:2411.05823

LLaMA-3-8B + LoRA, trained with hierarchy-aware masking. Can be conditioned at any level of the CAD hierarchy: sketch, extrusion, face, loop, or curve.

Metrics reported: Coverage (COV) · MMD · JSD · Physical Validity (PV) · Human Realism Score

| Method | COV (%) | PV (%) | Human realism (%) |
| --- | --- | --- | --- |
| FlexCAD (extrusion-level) | 68.5 | 93.3 | 42.1 |
| FlexCAD (sketch-level) | 65.6 | 93.4 | 39.6 |
| SkexGen | 55.2 | 72.6 | 21.3 |
| GPT-4o (no fine-tune) | 40.1 | 48.9 | 12.8 |

ICLR 2025
Generating CAD Code with VLMs — Automated Verification
arXiv:2410.05340

GPT-4 generates CAD code (OpenSCAD); the code is rendered, a VLM answers Yes/No verification questions about the render, and corrective feedback is fed back to the generator. Custom 50-prompt benchmark. Gets within 5% of human-in-the-loop performance.

Metrics reported: Compile Rate · IoGT · Point Cloud Distance · Hausdorff Distance

| Method | Compile Rate (%) | IoGT | PC Distance | Hausdorff Dist |
| --- | --- | --- | --- | --- |
| GPT-4 + CADCodeVerify | 96.5 | 0.944 | 0.127 | 0.419 |
| GPT-4 + 3D-Premise | 91.0 | 0.921 | 0.137 | 0.452 |
| GPT-4 (no refinement) | 91.0 | 0.912 | 0.142 | — |
| Human-in-the-loop (upper bound) | — | — | 0.120 | — |

Automated VLM verification gets within 5% of a human reviewer. Loop: render → generate Yes/No questions → answer with chain-of-thought → fix code → repeat.
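
The loop's control flow can be sketched as below; all the callables are placeholders standing in for the paper's GPT-4 code generator, renderer, and VLM question-answering step:

```python
def verify_and_refine(prompt, generate, render, vlm_issues, max_rounds=3):
    """CADCodeVerify-style loop (control flow only): generate code,
    render it, ask a VLM for failed yes/no checks, and regenerate
    with that feedback until the checks pass or rounds run out."""
    code = generate(prompt, feedback=None)
    for _ in range(max_rounds):
        image = render(code)
        issues = vlm_issues(prompt, image)   # e.g. ["hole is missing"]
        if not issues:
            break                            # VLM found nothing to fix
        code = generate(prompt, feedback=issues)
    return code

# Stub run: the first draft has an issue, the revision passes
drafts = iter(["draft_v0", "draft_v1"])
code = verify_and_refine(
    "hollow cylinder",
    generate=lambda p, feedback: next(drafts),
    render=lambda c: f"render({c})",
    vlm_issues=lambda p, img: ["wall too thick"] if "v0" in img else [],
)
print(code)  # draft_v1
```
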

6. The Big Comparison Tables

All text-to-CAD methods ranked by Median CD

| Rank | Method | Year | Output | Params | Median CD (×10³) | Mean CD | IR (%) | Visual |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | CAD-Coder (GRPO) | 2025 | CadQuery | 7B | ~0.17 | 6.54 | 1.45 | — |
| 2 | Text-to-CadQuery (Qwen2.5-3B) | 2025 | CadQuery | 3B | 0.191 | 10.23 | 6.5 | 69.3% Gemini |
| 3 | Text-to-CadQuery (CodeGPT-small) | 2025 | CadQuery | 124M | 0.234 | 13.52 | — | 60.3% Gemini |
| 4 | Text2CAD | 2024 | Cmd seq | 363M | 0.370 | 26.42 | 3.5 | 58.8% Gemini |
| 5 | CADFusion (5-iter DPO) | 2025 | Cmd seq | 8B | — | 19.89 | 6.2 | 8.96/10 LVM |
| 6 | CAD-GPT (text input) | 2025 | Cmd seq | 7B | 28.33 | — | 7.43 | — |
| — | GPT-4o zero-shot | 2024 | Code | — | — | 133.52 | 93.0 | 5.13/10 LVM |
| — | Claude-3.7 zero-shot | 2025 | Code | — | — | 186.53 | 47.0 | — |
| — | DeepSeek-V3 zero-shot | 2025 | Code | — | — | 186.69 | 52.0 | — |

CAD Arena Run 1 (our data — 20 prompts × 4 models)

| Model | API calls | STL success | STL % | Avg latency | Note |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 20/20 | 16/20 | 80% | 8.6s | Zero-shot, no fine-tuning |
| Zoo ML-ephant | 16/20 | 16/20 | 80% | 64.8s | Proprietary; returns STL directly |
| GPT-5 | 20/20 | 14/20 | 70% | 23.3s | Zero-shot, no fine-tuning |
| Gemini 2.5 Flash | 7/20 | 7/20 | 35% | ~3s | Rate limited — not model failure |

Our "STL %" is the execution success rate — the complement of Invalid Rate (100% − IR). We do NOT yet have Chamfer Distance; that requires ground-truth STL files (Phase 3 in NEXT-STEPS.md).

7. What Does "Good" Actually Mean?

| Definition | How measured | Limitation |
| --- | --- | --- |
| It executes | Invalid Rate, Compile Rate | 0% IR doesn't mean correct shapes |
| Geometrically close to ground truth | Chamfer Distance | Requires ground truth. Misses editability, intent, manufacturing. |
| Visually matches the prompt | VLM/Gemini score | Subjective, not standardized across papers |
| Looks like a realistic part | Human realism study | Expensive, high variance |
| Correct operations in sequence | ACC_cmd, ACC_param | Command-sequence models only |
| Covers the design space | COV, MMD, JSD | Diversity metric, not prompt-following accuracy |

"There is no agreed-upon benchmark for text-to-CAD. Chamfer Distance measures geometric similarity but not parametric editability, manufacturing feasibility, or constraint satisfaction." — LLM survey (Zhang et al., 2025) and GDL survey (Heidari & Iosifidis, 2025), independently

Best current approach: combine low IR (it runs) + low CD (geometry matches) + high VLM score (looks right). No paper does all three perfectly — and none measure editability, manufacturability, or constraint correctness.

8. What No One Measures Yet

| Gap | Why it matters | Status |
| --- | --- | --- |
| No standard prompt set | Every paper uses different prompts — results can't be directly compared across papers. | CAD Arena is designed to fix this. |
| CD ≠ design quality | A shape can have low CD yet be mechanically wrong. | VLM scores used as proxy. |
| No editability metric | The core value of CAD is "change a dimension, the model updates." Never tested. | Not measured anywhere. |
| No manufacturing feasibility | Generated parts often have impossible tolerances or non-machinable geometry. | Not measured anywhere. |
| DeepCAD too simple | Only sketch-and-extrude — not representative of real engineering parts. | CAD-Recode adds CC3D real scans. |
| No assembly evaluation | All eval is single-part. Real products are multi-part assemblies. | Not addressed by any paper. |
| Human preference / Elo | The most meaningful signal for "which model is better." What LLM Arena uses. | This is what the Arena page would provide. |
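
If Arena-style preference voting were added, ratings would follow the standard Elo update used by LLM arenas (the K-factor of 32 is an assumption; arena leaderboards also use Bradley-Terry fits over the full vote history):

```python
def elo_update(r_a, r_b, a_won, k=32):
    """One pairwise preference vote: expected score from the rating
    gap (logistic curve, 400-point scale), then shift both ratings
    toward the observed outcome."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```
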

What CAD Arena has vs. what's missing

✓ Invalid Rate (execution success)
✓ 4-tier difficulty structure
✓ API vs academic model comparison
✓ Per-model, per-tier breakdown
✗ Chamfer Distance (needs ground truth STLs)
✗ VLM visual score (easy to add with Gemini)
✗ Human preference votes (the arena part)
✗ Editability / manufacturability metrics

Compiled from 173 papers in feb8/papers-database.md · Deep analysis in feb8/report.md · CAD Arena benchmark results from eval/results/20260303_210402 · March 2026