Compiled from 173 papers in our database — what metrics each paper uses, what prompts they test, what datasets they evaluate on, and what "good" actually means · March 2026
| Metric | What it measures | Direction | Who uses it |
|---|---|---|---|
| Chamfer Distance (CD) | Average nearest-neighbor distance between two point clouds sampled from the generated and reference shapes. Reported as CD ×10³. | Lower = better | Almost everyone |
| Hausdorff Distance | Maximum nearest-neighbor distance (worst-case outlier). More sensitive to geometric errors than CD. | Lower = better | CADCodeVerify |
| IoU | Volumetric overlap between generated and reference shapes, computed by voxelizing both. 1.0 = perfect match. | Higher = better | CAD-Recode, CADCodeVerify |
| IoGT | IoU specifically measured against the ground truth shape (variant in CADCodeVerify). | Higher = better | CADCodeVerify (ICLR 2025) |
| Point Cloud Distance | Direct average distance between point clouds (non-symmetric variant of CD). | Lower = better | CADCodeVerify |
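The geometric metrics above reduce to a few lines of point-cloud arithmetic. A minimal sketch of Chamfer and Hausdorff distances using numpy and scipy (assumed available); note that papers differ on whether distances are squared and on the ×10³ reporting scale, so check each paper's exact definition before comparing numbers:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between (N,3) and (M,3) point clouds.

    Uses squared nearest-neighbor distances averaged in both directions,
    one common convention (some papers use unsquared distances or take
    the mean of the two directions instead of the sum).
    """
    d_pq, _ = cKDTree(q).query(p)  # nearest neighbor in q for each point of p
    d_qp, _ = cKDTree(p).query(q)  # nearest neighbor in p for each point of q
    return float(np.mean(d_pq ** 2) + np.mean(d_qp ** 2))

def hausdorff_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Hausdorff: worst-case nearest-neighbor distance."""
    d_pq, _ = cKDTree(q).query(p)
    d_qp, _ = cKDTree(p).query(q)
    return float(max(d_pq.max(), d_qp.max()))
```

The one-directional "Point Cloud Distance" variant is just `np.mean(d_pq)` without the reverse term.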
| Metric | What it measures | Key numbers |
|---|---|---|
| Invalid Rate (IR) | % of outputs that fail to produce valid, non-degenerate geometry. The minimum bar — "did it even work." | GPT-4o zero-shot: 93%. Fine-tuned models: 1–7%. |
| Compile / Parsing Rate | % of generated code that executes without syntax or runtime error. Specific to code-gen papers. | CADCodeVerify: 96.5% with VLM feedback loop. |
| Physical Validity (PV) | % of outputs that produce a non-self-intersecting, watertight solid. | FlexCAD fine-tuned: 93.4%. GPT-4o raw: 48.9%. |
| Metric | What it measures |
|---|---|
| Command Accuracy (ACC_cmd) | % of generated commands matching the ground truth command type (line, arc, circle, extrude, etc.). DeepCAD achieves 99.50% on reconstruction. |
| Parameter Accuracy (ACC_param) | % of predicted parameters within tolerance η=3 out of 256 quantization levels. DeepCAD achieves 97.98%. |
| Coverage (COV) | % of reference shapes well-approximated by at least one generated shape. Measures diversity of the output distribution. |
| Minimum Matching Distance (MMD) | Average CD of each reference shape to its nearest generated shape. Measures fidelity. |
| Jensen-Shannon Divergence (JSD) | Distribution-level similarity between generated and reference point cloud distributions. |
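COV and MMD both fall out of one pairwise distance matrix between the generated and reference sets. A sketch under the conventions popularized by point-cloud generation papers (the matching rule shown here is one common variant; some papers match in the other direction):

```python
import numpy as np

def coverage_and_mmd(dist: np.ndarray) -> tuple[float, float]:
    """COV and MMD from a pairwise distance matrix.

    dist[i, j] = Chamfer Distance between generated shape i and
    reference shape j (any CD implementation works).
      COV: fraction of references that are the nearest reference of at
           least one generated shape (diversity of the output set).
      MMD: mean over references of the distance to the closest
           generated shape (fidelity).
    """
    matched_refs = dist.argmin(axis=1)              # best reference per generated shape
    cov = len(set(matched_refs.tolist())) / dist.shape[1]
    mmd = float(dist.min(axis=0).mean())            # nearest generated shape per reference
    return cov, mmd
```

A degenerate generator that emits one shape repeatedly can still score a low MMD against a nearby reference, which is why COV is reported alongside it.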
| Metric | What it measures | Used by |
|---|---|---|
| LVM / VLM Score (0–10) | A vision-language model (LLaVA, Gemini) rates the rendered CAD output on shape quality, quantity, and visual fidelity vs the text prompt. Not standardized across papers. | CADFusion (LLaVA scorer), Text-to-CadQuery (Gemini) |
| Gemini Visual Eval (%) | Gemini is shown the rendered STL and the original text prompt, asked if they match. Binary pass/fail per example. | Text2CAD, Text-to-CadQuery |
| Human Realism Score | Human annotators rate whether generated shapes look like realistic mechanical parts (not a specific match). | FlexCAD: 42.1% best model vs 12.8% GPT-4o |
Only a handful of datasets matter for text-to-CAD evaluation:
| Dataset | Size | Source | Format | Papers that eval on it |
|---|---|---|---|---|
| DeepCAD dataset | 178,238 models | Onshape (real engineering CAD) | Command sequences (sketch + extrude) | DeepCAD, Text2CAD, CAD-GPT, CAD-Coder, CADFusion, FlexCAD, Text-to-CadQuery — basically everyone |
| Text-annotated DeepCAD | ~20K text-CAD pairs | DeepCAD + VLM-generated captions (+ some human) | Sequences + text at 4 specificity levels | Text2CAD, Text-to-CadQuery, CADFusion |
| Fusion 360 Gallery | 8,625 models | Autodesk | B-rep + mesh | CAD-Recode (secondary eval set) |
| CC3D (real scans) | Real-world point clouds | 3D scans of physical objects | Point clouds | CAD-Recode — tests generalization to messy real-world data |
| Custom hand-crafted benchmarks | 10–100 prompts | Paper authors | Text prompts | Kumar et al. 2026 (10 levels), CADCodeVerify (50 prompts), CAD Arena (our 20 prompts) |
Same shape described at 4 specificity levels. All 4 tested and reported separately. Key insight: detailed prompts score much better.
Result: abstract prompts → 42.3% Gemini visual pass. Detailed prompts → 71.4%. More info = better output.
Manual benchmark showing exactly where LLMs break down:
| # | Task | Result | Attempts | Time |
|---|---|---|---|---|
| 1 | Basic cube 50mm | ✓ Success | 1 | 19s |
| 2 | Cylinder r=25, h=60mm | ✓ Success | 1 | 20s |
| 3 | Rectangular box with 5mm fillets | ✓ Success | 2 | 42s |
| 4 | Boolean union (box + cylinder) | ✓ Success | 1 | 22s |
| 5 | Box with cylindrical hole subtracted | ✓ Success | 1 | 23s |
| 6 | Parametric plate with 4 corner holes | ✓ Success | 1 | 28s |
| 7 | Parametric hinge with multiple constrained segments | ✓ Success | 3 | 54s |
| 8 | Involute gear, 20 teeth, module 2mm | ✗ FAILED | 50 (max) | 836s |
| 9 | L-plate with complex cutouts | ✓ Success | 3 | 81s |
| 10 | Fully constrained structural frame with ribs | ✗ FAILED | 50 (max) | 909s |
VLMs (Gemini, LLaVA) automatically generate text descriptions of test-set shapes; the model must then reconstruct those shapes from the descriptions. This keeps the test distribution matched to the training distribution.
| Tier | Description | Example | Claude | GPT-5 | Zoo |
|---|---|---|---|---|---|
| T1 — Simple primitives | One shape | "A cube with side length 20mm" | 4/5 | 5/5 | 5/5 |
| T2 — Single part + features | One shape with holes/fillets | "A cylinder with a centered through-hole" | 5/5 | 4/5 | 4/5 |
| T3 — Multi-feature | Several operations combined | "An L-bracket with 3 mounting holes" | 4/5 | 3/5 | 5/5 |
| T4 — Complex functional | Specialized geometry | "A spur gear with 24 teeth, module 2" | 3/5 | 2/5 | 2/5 |
The foundational paper. Trains a Transformer autoencoder on 178K CAD command sequences from Onshape. Created the dataset and evaluation protocol the entire field uses.
Task: Autoencoding (encode → latent → decode) + unconditional generation. No text prompts.
| Method | ACC_cmd (%) | ACC_param (%) | Median CD (×10³) | IR (%) |
|---|---|---|---|---|
| DeepCAD + Augmentation | 99.50 | 97.98 | 0.752 | 2.72 |
| DeepCAD (no aug) | 99.36 | 97.47 | 0.787 | 3.30 |
| Alt-Regression baseline | — | — | 2.142 | 4.32 |
First major text-to-CAD paper. A 363M-parameter model generates DeepCAD command sequences from text. Created the text-annotated DeepCAD dataset (4 specificity levels per shape) that subsequent papers reuse for direct comparison.
Gemini eval: Renders the STL, shows image + original text prompt to Gemini, asks "does this match?" Reports % pass per text abstraction level.
| Prompt level | Median CD (×10³) | Mean CD | IR (%) | Gemini Visual (%) |
|---|---|---|---|---|
| All levels combined | 0.370 | 26.42 | 3.5 | 58.80 |
| Abstract only | 0.520 | — | 5.1 | 42.30 |
| Detailed geometric | 0.280 | — | 2.1 | 71.40 |
Qwen2.5-3B and CodeGPT-small (124M) generate CadQuery Python code instead of command sequences. Uses the same text-annotated DeepCAD test set as Text2CAD for direct comparison. Code generation wins even with far fewer parameters: the 124M CodeGPT-small beats the 363M Text2CAD on CD.
| Method | Params | Median CD | Mean CD | IR (%) | Gemini Visual |
|---|---|---|---|---|---|
| Text-to-CadQuery (Qwen2.5-3B) | 3B | 0.191 | 10.23 | 6.5 | 69.3% |
| Text-to-CadQuery (CodeGPT-small) | 124M | 0.234 | 13.52 | — | 60.3% |
| Text2CAD (command sequences, prior SOTA) | 363M | 0.370 | 26.42 | 3.5 | 58.8% |
Qwen2.5-7B fine-tuned in two stages: SFT on text-code pairs, then GRPO (Group Relative Policy Optimization) with a geometric reward based on Chamfer Distance. Shows the single biggest performance jump in the field: a 91% reduction in mean CD from the RL stage.
Reward design: Execute the generated code → compute CD vs ground truth → use CD as RL reward. Training cost: 146 hours on 8× A800 GPUs for GRPO stage.
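The reward design above reduces each sampled program to a scalar. A hedged sketch of how such a pipeline might be wired; the exponential reward shaping, the `scale` constant, and the function names are illustrative assumptions, not the paper's exact formulation:

```python
import math

def geometric_reward(gen_points, ref_points, chamfer_fn, scale: float = 10.0) -> float:
    """Map Chamfer Distance to a bounded reward in (0, 1].

    chamfer_fn computes CD between two point clouds sampled from the
    executed program and the ground truth. The exp shaping is an
    illustrative choice: CD = 0 gives reward 1, large CD decays to ~0.
    Programs that fail to execute should get reward 0 upstream of this.
    """
    cd = chamfer_fn(gen_points, ref_points)
    return math.exp(-scale * cd)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize rewards within one prompt's
    group of samples, the core idea that gives GRPO its name."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        std = 1.0  # all samples equal: no learning signal for this group
    return [(r - mean) / std for r in rewards]
```

The normalization means the policy only needs to rank its own samples per prompt, which sidesteps calibrating an absolute CD threshold across shapes of very different scales.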
| Training regime | Mean CD (×10³) | IR (%) |
|---|---|---|
| SFT only (no RL) | 74.55 | — |
| SFT + GRPO (geometric reward) | 6.54 | 1.45 |
| GPT-4o zero-shot | 133.52 | 93.0 |
LLaMA-3-8B with two alternating training stages: supervised learning on text-CAD pairs, then DPO with a VLM scorer (LLaVA) generating preference pairs. The two stages alternate for 5 iterations to prevent skill degradation.
VLM scorer: LLaVA-OneVision-Qwen2-7B rates rendered outputs on shape quality, quantity, and distribution. ~1,500 preference pairs per DPO iteration from 1,000 prompts.
| Method | LVM Score (/10) | Mean CD (×10³) | IR (%) |
|---|---|---|---|
| VLM-annotated (no human captions) | 6.56 | — | — |
| Human-annotated SL only | 7.69 | — | — |
| CADFusion (SL + 1 VF iter) | 8.28 | — | — |
| CADFusion (SL + 3 VF iter) | 8.76 | — | — |
| CADFusion (SL + 5 VF iter) | 8.96 | 19.89 | 6.20 |
| GPT-4o zero-shot | 5.13 | 133.52 | 93.0 |
LLaVA-1.5-7B augmented with spatial localization tokens (3D coordinates → 1D tokens). Takes image OR text as input. Key result: image-to-CAD is far easier than text-to-CAD, with roughly 3× lower median CD from image input.
| Method | Input | IR (%) | Median CD (×10³) | ACC_cmd | ACC_param |
|---|---|---|---|---|---|
| CAD-GPT | Image | 1.61 | 9.77 | 99.21 | 98.87 |
| HNC-CAD (prior best) | Image | 18.64 | 18.64 | — | — |
| GPT-4 few-shot | Image | 64.37 | 62.64 | — | — |
| CAD-GPT | Text | 7.43 | 28.33 | 98.73 | 98.12 |
| GPT-4 few-shot | Text | 76.97 | 187.52 | — | — |
| LLaMA-3.1 few-shot | Text | 98.68 | — | — | — |
Different task: input = point cloud (from a scan), output = CadQuery code. Qwen2-1.5B + lightweight point cloud projector. Trained on 1M procedurally generated scripts (no human annotation). 10× improvement over prior methods.
| Dataset | Mean CD | Median CD | IoU (%) | IR (%) |
|---|---|---|---|---|
| DeepCAD test | 0.30 | 0.16 | 92.0 | 0.4 |
| Fusion 360 test | 0.35 | 0.15 | 87.8 | 0.5 |
| CC3D (real scans) | 0.76 | 0.31 | 74.2 | 0.3 |
| Prior SOTA (CAD-SIGNet) | 3.33 | 2.36 | 81.5 | — |
LLaMA-3-8B + LoRA, trained with hierarchy-aware masking. Can be conditioned at any level of the CAD hierarchy: sketch, extrusion, face, loop, or curve.
| Method | COV (%) | PV (%) | Human realism (%) |
|---|---|---|---|
| FlexCAD (extrusion-level) | 68.5 | 93.3 | 42.1 |
| FlexCAD (sketch-level) | 65.6 | 93.4 | 39.6 |
| SkexGen | 55.2 | 72.6 | 21.3 |
| GPT-4o (no fine-tune) | 40.1 | 48.9 | 12.8 |
GPT-4 generates CAD code (OpenSCAD); a VLM inspects the rendered result via Yes/No verification questions and feeds corrective feedback back to GPT-4. Custom 50-prompt benchmark. Achieves within 5% of human-in-the-loop performance.
| Method | Compile Rate (%) | IoGT | PC Distance | Hausdorff Dist |
|---|---|---|---|---|
| GPT-4 + CADCodeVerify | 96.5 | 0.944 | 0.127 | 0.419 |
| GPT-4 + 3D-Premise | 91.0 | 0.921 | 0.137 | 0.452 |
| GPT-4 (no refinement) | 91.0 | 0.912 | 0.142 | — |
| Human-in-the-loop (upper bound) | — | — | 0.120 | — |
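The verify-and-refine pattern behind these numbers is plain control flow around the model calls. A skeleton of the loop; `generate_code`, `render`, and `vlm_check` are hypothetical stand-ins for the GPT-4 and VLM calls, and the round budget is an illustrative cap, not CADCodeVerify's exact setting:

```python
def verify_refine_loop(prompt, generate_code, render, vlm_check, max_rounds=3):
    """Generate CAD code, render it, ask a VLM verification questions,
    and feed failed checks back as corrective feedback until the VLM is
    satisfied or the round budget runs out.

    vlm_check(image, prompt) returns a list of failed checks (empty = pass).
    Returns (final_code, passed).
    """
    feedback = None
    code = None
    for _ in range(max_rounds):
        code = generate_code(prompt, feedback)  # feedback is None on round 1
        image = render(code)
        failures = vlm_check(image, prompt)
        if not failures:
            return code, True
        feedback = "Fix the following issues: " + "; ".join(failures)
    return code, False
```

The table's gap between "no refinement" and the full loop (91.0% vs 96.5% compile rate) comes entirely from these extra rounds, at the cost of one render and one VLM call per round.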
| Rank | Method | Year | Output | Params | Median CD (×10³) | Mean CD | IR (%) | Visual |
|---|---|---|---|---|---|---|---|---|
| 1 | CAD-Coder (GRPO) | 2025 | CadQuery | 7B | ~0.17 | 6.54 | 1.45 | — |
| 2 | Text-to-CadQuery (Qwen2.5-3B) | 2025 | CadQuery | 3B | 0.191 | 10.23 | 6.5 | 69.3% Gemini |
| 3 | Text-to-CadQuery (CodeGPT-small) | 2025 | CadQuery | 124M | 0.234 | 13.52 | — | 60.3% Gemini |
| 4 | Text2CAD | 2024 | Cmd seq | 363M | 0.370 | 26.42 | 3.5 | 58.8% Gemini |
| 5 | CADFusion (5-iter DPO) | 2025 | Cmd seq | 8B | — | 19.89 | 6.2 | 8.96/10 LVM |
| 6 | CAD-GPT (text input) | 2025 | Cmd seq | 7B | 28.33 | — | 7.43 | — |
| — | GPT-4o zero-shot | 2024 | Code | — | — | 133.52 | 93.0 | 5.13/10 LVM |
| — | Claude-3.7 zero-shot | 2025 | Code | — | — | 186.53 | 47.0 | — |
| — | DeepSeek-V3 zero-shot | 2025 | Code | — | — | 186.69 | 52.0 | — |
| Model | API calls | STL success | STL % | Avg latency | Note |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 20/20 | 16/20 | 80% | 8.6s | Zero-shot, no fine-tuning |
| Zoo ML-ephant | 16/20 | 16/20 | 80% | 64.8s | Proprietary; returns STL directly |
| GPT-5 | 20/20 | 14/20 | 70% | 23.3s | Zero-shot, no fine-tuning |
| Gemini 2.5 Flash | 7/20 | 7/20 | 35% | ~3s | Rate limited — not model failure |
| Definition of "good" | How measured | Limitation |
|---|---|---|
| It executes | Invalid Rate, Compile Rate | 0% IR doesn't mean correct shapes |
| Geometrically close to ground truth | Chamfer Distance | Requires ground truth. Misses editability, intent, manufacturing. |
| Visually matches the prompt | VLM/Gemini score | Subjective, not standardized across papers |
| Looks like a realistic part | Human realism study | Expensive, high variance |
| Correct operations in sequence | ACC_cmd, ACC_param | Command-sequence models only |
| Covers the design space | COV, MMD, JSD | Diversity metric, not prompt-following accuracy |
"There is no agreed-upon benchmark for text-to-CAD. Chamfer Distance measures geometric similarity but not parametric editability, manufacturing feasibility, or constraint satisfaction." — LLM survey (Zhang et al., 2025) and GDL survey (Heidari & Iosifidis, 2025), independently
Best current approach: combine low IR (it runs) + low CD (geometry matches) + high VLM score (looks right). No paper does all three perfectly — and none measure editability, manufacturability, or constraint correctness.
| Gap | Why it matters | Status |
|---|---|---|
| No standard prompt set | Every paper uses different prompts — results can't be directly compared across papers. | CAD Arena is designed to fix this. |
| CD ≠ design quality | Can have low CD but mechanically wrong shapes. | VLM scores used as proxy. |
| No editability metric | Core value of CAD is "change a dimension, model updates." Never tested. | Not measured anywhere. |
| No manufacturing feasibility | Generated parts often have impossible tolerances or non-machinable geometry. | Not measured anywhere. |
| DeepCAD too simple | Only sketch-and-extrude — not representative of real engineering parts. | CAD-Recode adds CC3D real scans. |
| No assembly evaluation | All eval is single-part. Real products are multi-part assemblies. | Not addressed by any paper. |
| Human preference / Elo | Most meaningful signal for "which model is better." What LLM Arena uses. | This is what the Arena page would provide. |
Compiled from 173 papers in feb8/papers-database.md · Deep analysis in feb8/report.md · CAD Arena benchmark results from eval/results/20260303_210402 · March 2026