Methodology

Full transparency on how models are prompted, executed, and scored. We show everything so you can evaluate whether the comparison is fair.

Static Evaluation

RESULTS PAGE

A curated set of 20 prompts across 4 difficulty tiers, run once against each model. Results are a fixed snapshot dated 2026-03-08. This gives a controlled baseline where every model sees the exact same prompts under the same conditions.

Prompt tiers

T1 (Simple Primitives, 5 prompts): Basic shapes — boxes, cylinders, spheres. No features.
T2 (Single Part + Features, 5 prompts): One part with holes, fillets, or chamfers.
T3 (Multi-Feature Parts, 5 prompts): Multiple boolean operations on a single body.
T4 (Complex Functional, 5 prompts): Gears, springs, snap-fit assemblies — hardest tier.

Scoring

Binary validity only. A result scores ✓ if the generated code produces a valid, non-empty STL file. A cube generated for a prompt asking for a cylinder would still score ✓. We acknowledge this is a limited metric — geometric accuracy (Chamfer Distance, IoU) and semantic evaluation (VLM judge) are planned additions.
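
For reference, a check of this kind fits in a few lines. The sketch below is illustrative rather than the production checker; it only confirms that an STL file is non-empty and declares at least one triangle (80-byte header plus a uint32 triangle count for binary STL, or at least one facet for ASCII STL).

import struct
from pathlib import Path

def stl_is_valid(path: str) -> bool:
    """Pass/fail only: the file is non-empty and contains at least one
    triangle. Nothing here compares the geometry to the prompt, which is
    exactly the limitation described above."""
    data = Path(path).read_bytes()
    if not data:
        return False
    # Binary STL: 80-byte header, uint32 triangle count, 50 bytes per triangle.
    if len(data) >= 84:
        (n_tris,) = struct.unpack("<I", data[80:84])
        if n_tris > 0 and len(data) >= 84 + 50 * n_tris:
            return True
    # ASCII STL fallback: must declare at least one facet.
    return data[:5] == b"solid" and b"facet" in data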

Dynamic Evaluation

GENERATE PAGE

Anyone can submit any prompt and get live results from all models in parallel. Same execution pipeline as the static benchmark. Results are saved to the Explore page. You can rate each model's output with Good AI / Bad AI buttons — these votes feed into our crowdsourced ranking.

Pipeline (per model)

1
Prompt sent to model API
Your text prompt is sent to each selected model in parallel. For the LLMs (Claude, GPT-5, Gemini), the prompt is wrapped in a system message that instructs the model to output CadQuery code. For Zoo, the raw prompt is sent to their text-to-CAD API.
2
Code extracted
The model's response is parsed. If it contains markdown code fences, they're stripped. The raw Python code is extracted.
3
CadQuery execution on Modal
For the LLMs: the code is sent to a sandboxed Modal serverless function running CadQuery 2.4.0, which validates syntax, executes the code, and exports the geometry to STL. Timeout: 55 seconds. For Zoo: their API handles execution internally and returns STL directly. (A minimal sketch of steps 2 and 3 follows this list.)
4
3D rendering + display
If execution succeeds, the STL is sent to the browser and rendered in a Three.js viewer with auto-rotation. If execution fails, the error message is shown alongside the code so you can see what went wrong.
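
As promised above, here is a minimal local sketch of steps 2 and 3. It is illustrative, not the production code: the real execution half runs inside a sandboxed Modal function with a 55-second timeout, and the helper names (extract_code, run_cadquery) are ours for this example, not an existing API.

import re
import cadquery as cq
from cadquery import exporters

def extract_code(response_text: str) -> str:
    # Step 2: strip markdown code fences, if any, and return the raw Python.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    return (match.group(1) if match else response_text).strip()

def run_cadquery(code: str, stl_path: str = "output.stl") -> None:
    # Step 3 (local stand-in for the sandboxed Modal function):
    # validate syntax, execute the code, and export `result` to STL.
    compile(code, "<generated>", "exec")   # syntax check
    namespace: dict = {}
    exec(code, namespace)                  # untrusted code -- only run this inside a sandbox
    result = namespace.get("result")
    if not isinstance(result, cq.Workplane):
        raise ValueError("generated code did not assign a cq.Workplane to `result`")
    exporters.export(result, stl_path)     # write the STL for rendering and scoring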

Typical latency

Claude Sonnet 4.6
10–35s total (code generation ~5s, Modal execution ~10–25s)
Gemini 2.5 Flash
15–40s total (thinking model, longer code generation)
GPT-5
60–150s total (reasoning model, significant internal chain-of-thought)
Zoo / ML-ephant
5–90s (their backend handles everything end-to-end, varies by complexity)

The System Prompt

Claude, GPT-5, and Gemini all receive the exact same system prompt. This is the full text — nothing is hidden:

You are an expert CAD engineer specializing in CadQuery, a Python library for parametric 3D solid modeling.

Your task: generate CadQuery Python code that models the part described by the user.

STRICT REQUIREMENTS:
1. Start with: import cadquery as cq
2. Assign the final shape to a variable named exactly: result
3. `result` must be a cadquery.Workplane object
4. Use ONLY cadquery and the Python standard library
5. Output ONLY the raw Python code — no explanations, no markdown, no code fences

CADQUERY BASICS (use these patterns):
- Box:        result = cq.Workplane("XY").box(length, width, height)
- Cylinder:   result = cq.Workplane("XY").cylinder(height, radius)
- Sphere:     result = cq.Workplane("XY").sphere(radius)
- Hole:       .faces(">Z").workplane().hole(diameter)
- Shell:      .shell(-thickness)
- Union:      a.union(b)
- Cut:        a.cut(b)
- Fillet:     .edges("|Z").fillet(radius)
- Chamfer:    .edges("|Z").chamfer(length)
...

IMPORTANT: CadQuery does NOT have functions like SpurGear, makeGear, Thread, or similar high-level primitives.

Zoo / ML-ephant does not receive this system prompt. Their API takes raw text and returns geometry (KCL code + STL) through their own proprietary pipeline. We have no visibility into their internal prompting or agentic behavior.
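
For concreteness, the calls to the three LLMs look roughly like the sketch below, using each provider's public Python SDK. The model identifiers, token limits, and client setup are assumptions for illustration; only the max_completion_tokens=16000 budget for GPT-5 is taken from the configuration described later on this page.

from anthropic import Anthropic
from openai import OpenAI
import google.generativeai as genai

SYSTEM_PROMPT = "..."  # the full prompt shown above

def ask_claude(prompt: str) -> str:
    resp = Anthropic().messages.create(
        model="claude-sonnet-4-6",            # assumed model ID
        max_tokens=8000,
        system=SYSTEM_PROMPT,                 # system prompt passed separately
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def ask_gpt5(prompt: str) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-5",                        # assumed model ID
        max_completion_tokens=16000,          # budget for reasoning + output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    model = genai.GenerativeModel(
        "gemini-2.5-flash", system_instruction=SYSTEM_PROMPT
    )
    return model.generate_content(prompt).text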

Why These Models

Claude Sonnet 4.6
Anthropic's frontier model. We use Sonnet (not Opus) for cost efficiency — Sonnet scores comparably on our benchmark at ~10x lower cost. Represents the best of the “general-purpose LLM + CadQuery” approach.
GPT-5
OpenAI's reasoning model. Uses internal chain-of-thought before generating code, which is why latency is 60–150s. We set max_completion_tokens=16000 to give it enough budget for reasoning + output. Interesting because it trades speed for deliberation.
Gemini 2.5 Flash
Google's thinking model. Also uses internal reasoning (like GPT-5) but optimized for speed. Represents the fast-thinking approach — does the extra compute help for CAD?
Zoo / ML-ephant
The most accessible commercial option: Zoo is the most popular dedicated text-to-CAD service with a public API. Their model outputs KCL (their own geometry language) and handles execution internally. We don't know whether their backend uses agentic loops, retries, or multi-step generation — their API is a black box.

Limitations

Single-shot only
The LLMs (Claude, GPT-5, Gemini) get one attempt to generate working code, with no retry loop. In practice, letting a model see its error and retry dramatically improves results — but we don't do that yet (a sketch of such a loop follows this list). This means current scores understate what these models can do with agentic workflows.
Timeouts
Some generations fail simply because the model or execution engine ran out of time. GPT-5 needs 60–150s for reasoning. Zoo can take 90s+ on complex prompts. These timeout failures don't reflect model capability — just infrastructure constraints.
Compilation-only eval
We currently only check whether the generated code produces a valid 3D solid. We don't measure whether the output actually matches the prompt — a cube generated for “a gear with 20 teeth” would score as a success. Geometric accuracy (Chamfer Distance), semantic evaluation (VLM judge), and dimensional checks are planned but not yet live.
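
To make the single-shot limitation concrete, a 3-shot retry variant (planned, not yet live) would look roughly like the sketch below. ask_model stands in for any of the provider calls sketched earlier, and extract_code / run_cadquery are the pipeline helpers from the sketch above; none of this is current benchmark behavior.

def generate_with_retries(prompt: str, max_attempts: int = 3) -> str:
    # Hypothetical retry loop: feed the execution error back to the model.
    feedback = ""
    for attempt in range(max_attempts):
        code = extract_code(ask_model(prompt + feedback))
        try:
            run_cadquery(code)   # raises on syntax, runtime, or geometry errors
            return code          # success: return the working script
        except Exception as err:
            feedback = (
                f"\n\nYour previous code failed with:\n{err}\n"
                "Fix the problem and output only the corrected code."
            )
    raise RuntimeError(f"no valid geometry after {max_attempts} attempts")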

What We Store

Every generation on the Generate page is saved. Here's exactly what gets stored:

Prompt
Your text prompt, verbatim.
Per model
Model ID, generated code, output type (CadQuery or KCL), latency in seconds, STL geometry (base64), and the execution error if any (see the example record after this list).
Votes
Which models you rated Good AI or Bad AI, with timestamp.
Not stored
Your IP address, browser info, or any personally identifying information. Email is stored separately only if you opt in.
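
As a concrete illustration, a saved generation looks roughly like the record below. The field names are illustrative, not the literal storage schema, and the values are placeholders.

example_generation = {
    "prompt": "a gear with 20 teeth",             # verbatim user prompt
    "results": [
        {
            "model_id": "claude-sonnet-4-6",
            "output_type": "cadquery",            # "cadquery" or "kcl"
            "code": "import cadquery as cq\n...",
            "latency_s": 23.4,
            "stl_base64": "c29saWQgLi4u",         # STL geometry, base64-encoded
            "error": None,                        # execution error message, if any
        },
    ],
    "votes": [
        {"model_id": "claude-sonnet-4-6", "vote": "good", "ts": "2026-03-08T12:00:00Z"},
    ],
}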

Planned Improvements

More models
Academic models (Text2CAD, FlexCAD, CAD-Llama), open-weight LLMs (DeepSeek V3), and more commercial APIs (Adam AI).
Chamfer Distance
Geometric accuracy metric comparing generated STL against ground truth meshes. Requires creating reference models for each prompt.
VLM Judge
Automated scoring using a vision-language model that evaluates rendered views of the output against the prompt. Multi-view (front, top, right, isometric) for thoroughness.
Retry variants
Each model tested with 1-shot, 3-shot retry (feed error back), and agentic loop variants. Separate leaderboard entries.
Elo rankings
Elo-style ratings (Bradley-Terry model) computed from crowdsourced votes, sketched below. Dynamic leaderboard that updates as more people use the arena.
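
The simplest version of the vote-based ranking is a sequential Elo update over pairwise outcomes; a full Bradley-Terry fit over all votes is the batch equivalent. A minimal sketch, assuming a Good AI vote paired with a Bad AI vote on the same prompt counts as a win for the model rated Good (K and the starting rating are conventional defaults, not decided values):

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the logistic (Elo) model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Standard Elo update: winner gains, loser loses, scaled by how surprising the result was.
    r_w = ratings.get(winner, 1000.0)
    r_l = ratings.get(loser, 1000.0)
    gain = k * (1 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain
    ratings[loser] = r_l - gain

ratings: dict = {}
record_vote(ratings, winner="claude-sonnet-4-6", loser="gpt-5")   # one hypothetical vote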

Questions about methodology or want to submit a model? contact@cadarena.dev