How to: chunk a CAD file for RAG

Goal: convert a CAD file into chunks you can embed and store in a vector database for retrieval-augmented generation.

This assumes you can already parse a file with cadling.

from cadling import DocumentConverter
from cadling.chunker.hybrid_chunker import CADHybridChunker

# Parse the CAD file into a document, then split that document into chunks.
doc = DocumentConverter().convert("assembly.step").document
chunker = CADHybridChunker(max_tokens=512, overlap_tokens=50)
chunks = list(chunker.chunk(doc))
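
For intuition about what a token budget and an overlap mean in general, here is a minimal sketch of a plain fixed-size sliding window over a token list. It is not how CADHybridChunker works internally (that chunker groups by entity, as described below); the helper name sliding_windows and the toy tokens are hypothetical and exist only for illustration.

```python
def sliding_windows(tokens, max_tokens, overlap_tokens):
    """Split a token list into windows of at most max_tokens tokens,
    each sharing overlap_tokens tokens with the previous window."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Toy input: 12 placeholder tokens, windows of 5 with an overlap of 2.
tokens = [f"t{i}" for i in range(12)]
windows = sliding_windows(tokens, max_tokens=5, overlap_tokens=2)
for window in windows:
    print(window)
```

The overlap repeats the tail of one window at the head of the next, so text that straddles a boundary still appears intact in at least one chunk.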

Pick the chunker for your retrieval need:

Chunker                   Strategy
CADHybridChunker          entity-level + semantic grouping (good default)
CADHierarchicalChunker    assembly-hierarchy-aware (preserves BOM structure)

Each chunk carries text plus structured metadata. Store the metadata alongside the vector: it makes retrieval filterable and the retrieved context richer.

for chunk in chunks:
    text = chunk.text  # what you embed
    meta = chunk.meta  # entity types, topology subgraph, bbox
    print(chunk.chunk_id, len(meta.entity_ids), "entities")

Use any embedding model and vector store. The pattern:

records = []
for chunk in chunks:
    records.append({
        "id": chunk.chunk_id,
        "text": chunk.text,
        "vector": embed(chunk.text),  # your embedding function
        "metadata": {
            "entity_ids": list(chunk.meta.entity_ids),
            "source": "assembly.step",
        },
    })
vector_store.upsert(records)  # your vector DB client

Store the metadata so you can filter retrieval (e.g. by entity type) and cite the source CAD file in answers.
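
To make the filtering idea concrete, here is a minimal, self-contained sketch of the pattern with stand-ins for both pieces. The toy embed function, the InMemoryVectorStore class, its query method with a where filter, and the sample texts are all hypothetical; a real embedding model and vector DB client will have their own APIs.

```python
import math


def embed(text, dim=16):
    """Toy stand-in for an embedding model: counts character trigrams
    into a fixed-size vector and L2-normalizes it."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[sum(ord(c) for c in text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryVectorStore:
    """Hypothetical minimal store; real vector DB clients differ."""

    def __init__(self):
        self.records = {}

    def upsert(self, records):
        for rec in records:
            self.records[rec["id"]] = rec  # same id overwrites the old record

    def query(self, vector, top_k=3, where=None):
        """Return up to top_k records ranked by dot product (cosine, since
        vectors are normalized), keeping only records whose metadata
        matches every key/value pair in `where`."""
        where = where or {}
        candidates = [
            r for r in self.records.values()
            if all(r["metadata"].get(k) == v for k, v in where.items())
        ]
        candidates.sort(
            key=lambda r: sum(a * b for a, b in zip(vector, r["vector"])),
            reverse=True,
        )
        return candidates[:top_k]


# Illustrative chunk texts and sources, not real cadling output.
samples = {
    "c1": ("cylindrical face radius 5 mm", "assembly.step"),
    "c2": ("planar face normal to z axis", "assembly.step"),
    "c3": ("cylindrical face radius 8 mm", "bracket.step"),
}
store = InMemoryVectorStore()
store.upsert([
    {"id": cid, "text": text, "vector": embed(text), "metadata": {"source": src}}
    for cid, (text, src) in samples.items()
])

# Restrict retrieval to one source file via the stored metadata.
hits = store.query(
    embed("cylindrical face radius 5 mm"),
    top_k=2,
    where={"source": "assembly.step"},
)
for hit in hits:
    print(hit["id"], hit["metadata"]["source"])
```

Because the filter runs on stored metadata, the chunk from bracket.step never reaches the ranking step, and each hit carries the source needed for a citation.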

To chunk from the command line instead:
cadling chunk assembly.step --max-tokens 512 --overlap 50 -o chunks.jsonl

Each line of chunks.jsonl is one chunk with its text and metadata, ready to feed an embedding/indexing job.
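
A minimal sketch of consuming such a file in batches for an embedding job follows. The key names chunk_id, text, and meta in the sample lines are assumptions (check the actual output of your cadling version), and read_chunks and batched are hypothetical helpers, not part of cadling.

```python
import json
import os
import tempfile


def read_chunks(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hand-written sample file; the key names are assumed, not confirmed.
sample_lines = [
    '{"chunk_id": "c1", "text": "first chunk", "meta": {}}',
    '{"chunk_id": "c2", "text": "second chunk", "meta": {}}',
    '',
    '{"chunk_id": "c3", "text": "third chunk", "meta": {}}',
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")
    batches = list(batched(read_chunks(path), size=2))

print([len(batch) for batch in batches])
```

Reading line by line keeps memory use flat for large files, and batching lets the embedding step send several texts per request.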