Methodology

Full transparency on how models are prompted, executed, and scored. We show everything so you can evaluate whether the comparison is fair.

Static Evaluation

RESULTS PAGE

A curated set of 20 prompts across 4 difficulty tiers, run once against each model. Results are a fixed snapshot dated 2026-03-08. This gives a controlled baseline where every model sees the exact same prompts under the same conditions.

Prompt tiers

T1 (Simple Primitives, 5 prompts): Basic shapes — boxes, cylinders, spheres. No features.
T2 (Single Part + Features, 5 prompts): One part with holes, fillets, or chamfers.
T3 (Multi-Feature Parts, 5 prompts): Multiple boolean operations on a single body.
T4 (Complex Functional, 5 prompts): Gears, springs, snap-fit assemblies — hardest tier.

Scoring

Binary validity only. A result scores ✓ if the generated code produces a valid, non-empty STL file. A cube generated for a prompt asking for a cylinder would still score ✓. We acknowledge this is a limited metric — geometric accuracy (Chamfer Distance, IoU) and semantic evaluation (VLM judge) are planned additions.
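
For reference, a check of this kind fits in a few lines. The sketch below is illustrative rather than the production checker; it only confirms that an STL file is non-empty and declares at least one triangle (80-byte header plus a uint32 triangle count for binary STL, or at least one facet for ASCII STL).

import struct
from pathlib import Path

def stl_is_valid(path: str) -> bool:
    """Pass/fail only: the file is non-empty and contains at least one
    triangle. Nothing here compares the geometry to the prompt, which is
    exactly the limitation described above."""
    data = Path(path).read_bytes()
    if not data:
        return False
    # Binary STL: 80-byte header, uint32 triangle count, 50 bytes per triangle.
    if len(data) >= 84:
        (n_tris,) = struct.unpack("<I", data[80:84])
        if n_tris > 0 and len(data) >= 84 + 50 * n_tris:
            return True
    # ASCII STL fallback: must declare at least one facet.
    return data[:5] == b"solid" and b"facet" in data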

Dynamic Evaluation

GENERATE PAGE

Anyone can submit any prompt and get live results from all models in parallel. Same execution pipeline as the static benchmark. Results are saved to the Explore page. You can rate each model's output with Good AI / Bad AI buttons — these votes feed into our crowdsourced ranking.

Pipeline (per model)

1
Prompt sent to model API
Your text prompt is sent to each selected model in parallel. For the LLMs (Claude, GPT-5, Gemini), the prompt is wrapped in a system message that instructs the model to output CadQuery code. For Zoo, the raw prompt is sent to their text-to-CAD API.
2
Code extracted
The model's response is parsed. If it contains markdown code fences, they're stripped. The raw Python code is extracted.
3
CadQuery execution on Modal
For the LLMs: the code is sent to a sandboxed Modal serverless function running CadQuery 2.4.0, which validates syntax, executes the code, and exports the geometry to STL. Timeout: 55 seconds. For Zoo: their API handles execution internally and returns STL directly. (A minimal sketch of steps 2 and 3 follows this list.)
4
3D rendering + display
If execution succeeds, the STL is sent to the browser and rendered in a Three.js viewer with auto-rotation. If execution fails, the error message is shown alongside the code so you can see what went wrong.
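
As promised above, here is a minimal local sketch of steps 2 and 3. It is illustrative, not the production code: the real execution half runs inside a sandboxed Modal function with a 55-second timeout, and the helper names (extract_code, run_cadquery) are ours for this example, not an existing API.

import re
import cadquery as cq
from cadquery import exporters

def extract_code(response_text: str) -> str:
    # Step 2: strip markdown code fences, if any, and return the raw Python.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    return (match.group(1) if match else response_text).strip()

def run_cadquery(code: str, stl_path: str = "output.stl") -> None:
    # Step 3 (local stand-in for the sandboxed Modal function):
    # validate syntax, execute the code, and export `result` to STL.
    compile(code, "<generated>", "exec")   # syntax check
    namespace: dict = {}
    exec(code, namespace)                  # untrusted code -- only run this inside a sandbox
    result = namespace.get("result")
    if not isinstance(result, cq.Workplane):
        raise ValueError("generated code did not assign a cq.Workplane to `result`")
    exporters.export(result, stl_path)     # write the STL for rendering and scoring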

Typical latency

Claude Sonnet 4.6
10–35s total (code generation ~5s, Modal execution ~10–25s)
Gemini 2.5 Flash
15–40s total (thinking model, longer code generation)
GPT-5
60–150s total (reasoning model, significant internal chain-of-thought)
Zoo / ML-ephant
5–90s (their backend handles everything end-to-end, varies by complexity)

The System Prompt

Claude, GPT-5, and Gemini all receive the exact same system prompt. This is the full text — nothing is hidden:

You are an expert CAD engineer specializing in CadQuery, a Python library for parametric 3D solid modeling.

Your task: generate CadQuery Python code that models the part described by the user.

STRICT REQUIREMENTS:
1. Start with: import cadquery as cq
2. Assign the final shape to a variable named exactly: result
3. `result` must be a cadquery.Workplane object
4. Use ONLY cadquery and the Python standard library
5. Output ONLY the raw Python code — no explanations, no markdown, no code fences

CADQUERY BASICS (use these patterns):
- Box:        result = cq.Workplane("XY").box(length, width, height)
- Cylinder:   result = cq.Workplane("XY").cylinder(height, radius)
- Sphere:     result = cq.Workplane("XY").sphere(radius)
- Hole:       .faces(">Z").workplane().hole(diameter)
- Shell:      .shell(-thickness)
- Union:      a.union(b)
- Cut:        a.cut(b)
- Fillet:     .edges("|Z").fillet(radius)
- Chamfer:    .edges("|Z").chamfer(length)
...

IMPORTANT: CadQuery does NOT have functions like SpurGear, makeGear, Thread, or similar high-level primitives.

Zoo / ML-ephant does not receive this system prompt. Their API takes raw text and returns geometry (KCL code + STL) through their own proprietary pipeline. We have no visibility into their internal prompting or agentic behavior.
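
For concreteness, the calls to the three LLMs look roughly like the sketch below, using each provider's public Python SDK. The model identifiers, token limits, and client setup are assumptions for illustration; only the max_completion_tokens=16000 budget for GPT-5 is taken from the configuration described later on this page.

from anthropic import Anthropic
from openai import OpenAI
import google.generativeai as genai

SYSTEM_PROMPT = "..."  # the full prompt shown above

def ask_claude(prompt: str) -> str:
    resp = Anthropic().messages.create(
        model="claude-sonnet-4-6",            # assumed model ID
        max_tokens=8000,
        system=SYSTEM_PROMPT,                 # system prompt passed separately
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def ask_gpt5(prompt: str) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-5",                        # assumed model ID
        max_completion_tokens=16000,          # budget for reasoning + output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    model = genai.GenerativeModel(
        "gemini-2.5-flash", system_instruction=SYSTEM_PROMPT
    )
    return model.generate_content(prompt).text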

Why These Models

Claude Sonnet 4.6
Anthropic's frontier model. We use Sonnet (not Opus) for cost efficiency — Sonnet scores comparably on our benchmark at ~10x lower cost. Represents the best of the “general-purpose LLM + CadQuery” approach.
GPT-5
OpenAI's reasoning model. Uses internal chain-of-thought before generating code, which is why latency is 60–150s. We set max_completion_tokens=16000 to give it enough budget for reasoning + output. Interesting because it trades speed for deliberation.
Gemini 2.5 Flash
Google's thinking model. Also uses internal reasoning (like GPT-5) but optimized for speed. Represents the fast-thinking approach — does the extra compute help for CAD?
Zoo / ML-ephant
The most accessible commercial option: Zoo is the most popular dedicated text-to-CAD service with a public API. Their model outputs KCL (their own geometry language) and handles execution internally. We don't know whether their backend uses agentic loops, retries, or multi-step generation — their API is a black box.

Limitations

Single-shot only
The LLMs (Claude, GPT-5, Gemini) get one attempt to generate working code, with no retry loop. In practice, letting a model see its error and retry dramatically improves results — but we don't do that yet (a sketch of such a loop follows this list). This means current scores understate what these models can do with agentic workflows.
Timeouts
Some generations fail simply because the model or execution engine ran out of time. GPT-5 needs 60–150s for reasoning. Zoo can take 90s+ on complex prompts. These timeout failures don't reflect model capability — just infrastructure constraints.
Compilation-only eval
We currently only check whether the generated code produces a valid 3D solid. We don't measure whether the output actually matches the prompt — a cube generated for “a gear with 20 teeth” would score as a success. Geometric accuracy (Chamfer Distance), semantic evaluation (VLM judge), and dimensional checks are planned but not yet live.
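
To make the single-shot limitation concrete, a 3-shot retry variant (planned, not yet live) would look roughly like the sketch below. ask_model stands in for any of the provider calls sketched earlier, and extract_code / run_cadquery are the pipeline helpers from the sketch above; none of this is current benchmark behavior.

def generate_with_retries(prompt: str, max_attempts: int = 3) -> str:
    # Hypothetical retry loop: feed the execution error back to the model.
    feedback = ""
    for attempt in range(max_attempts):
        code = extract_code(ask_model(prompt + feedback))
        try:
            run_cadquery(code)   # raises on syntax, runtime, or geometry errors
            return code          # success: return the working script
        except Exception as err:
            feedback = (
                f"\n\nYour previous code failed with:\n{err}\n"
                "Fix the problem and output only the corrected code."
            )
    raise RuntimeError(f"no valid geometry after {max_attempts} attempts")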

What We Store

Every generation on the Generate page is saved. Here's exactly what gets stored:

Prompt
Your text prompt, verbatim.
Per model
Model ID, generated code, output type (CadQuery or KCL), latency in seconds, STL geometry (base64), and the execution error if any (see the example record after this list).
Votes
Which models you rated Good AI or Bad AI, with timestamp.
Not stored
Your IP address, browser info, or any personally identifying information. Email is stored separately only if you opt in.
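
As a concrete illustration, a saved generation looks roughly like the record below. The field names are illustrative, not the literal storage schema, and the values are placeholders.

example_generation = {
    "prompt": "a gear with 20 teeth",             # verbatim user prompt
    "results": [
        {
            "model_id": "claude-sonnet-4-6",
            "output_type": "cadquery",            # "cadquery" or "kcl"
            "code": "import cadquery as cq\n...",
            "latency_s": 23.4,
            "stl_base64": "c29saWQgLi4u",         # STL geometry, base64-encoded
            "error": None,                        # execution error message, if any
        },
    ],
    "votes": [
        {"model_id": "claude-sonnet-4-6", "vote": "good", "ts": "2026-03-08T12:00:00Z"},
    ],
}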

Planned Improvements

More models
Academic models (Text2CAD, FlexCAD, CAD-Llama), open-weight LLMs (DeepSeek V3), and more commercial APIs (Adam AI).
Chamfer Distance
Geometric accuracy metric comparing generated STL against ground truth meshes. Requires creating reference models for each prompt.
VLM Judge
Automated scoring using a vision-language model that evaluates rendered views of the output against the prompt. Multi-view (front, top, right, isometric) for thoroughness.
Retry variants
Each model tested with 1-shot, 3-shot retry (feed error back), and agentic loop variants. Separate leaderboard entries.
Elo rankings
Elo-style ratings (Bradley-Terry model) computed from crowdsourced votes, sketched below. Dynamic leaderboard that updates as more people use the arena.
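
The simplest version of the vote-based ranking is a sequential Elo update over pairwise outcomes; a full Bradley-Terry fit over all votes is the batch equivalent. A minimal sketch, assuming a Good AI vote paired with a Bad AI vote on the same prompt counts as a win for the model rated Good (K and the starting rating are conventional defaults, not decided values):

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the logistic (Elo) model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Standard Elo update: winner gains, loser loses, scaled by how surprising the result was.
    r_w = ratings.get(winner, 1000.0)
    r_l = ratings.get(loser, 1000.0)
    gain = k * (1 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain
    ratings[loser] = r_l - gain

ratings: dict = {}
record_vote(ratings, winner="claude-sonnet-4-6", loser="gpt-5")   # one hypothetical vote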

Questions about methodology or want to submit a model? contact@cadarena.dev