
Chunkers

A Chunker splits a document into smaller pieces before they are embedded. Smaller chunks improve retrieval precision but lose surrounding context; larger chunks preserve context but are less precise.

Most projects start with the default (AUTO) and only configure a chunker explicitly when retrieval quality drops because chunks are too big or too small for the content.

Available actions

A chunker exposes one action.

Action | What it does | Required parameters
Chunk | Splits a document into chunks according to the chunker's strategy. | Document (the input document).

You don't usually call this directly. The Knowledge Base calls it for you on Ingest.

AUTO and DISABLE

You almost always pick one of these in the Vector Knowledge Base form's Chunker field. Only switch to a custom chunker when the defaults aren't producing useful chunks.

Value | Effect
AUTO (default) | The Knowledge Base picks a chunker automatically based on document type: Markdown chunker for .md, HTML chunker for .html, Generic Recursive for everything else.
DISABLE | No chunking. Each document is treated as a single chunk. Use when documents are already small or pre-chunked.
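
The AUTO rule above amounts to a lookup on the file extension. The following Python sketch only illustrates that rule; it is not the Knowledge Base's code. The function name, the string labels, and the case-insensitive extension match are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical labels for illustration; the real chunker types live in ballerina/ai.
CHUNKER_BY_EXTENSION = {
    ".md": "markdown",
    ".html": "html",
}


def pick_chunker(file_name: str, mode: str = "AUTO") -> Optional[str]:
    """Return a chunker label for a file, or None when chunking is disabled."""
    if mode == "DISABLE":
        # No chunking: the whole document is treated as a single chunk.
        return None
    # Assumption: the extension is matched case-insensitively.
    suffix = Path(file_name).suffix.lower()
    # Anything that is not .md or .html falls through to the generic chunker.
    return CHUNKER_BY_EXTENSION.get(suffix, "generic_recursive")
```

For example, `pick_chunker("guide.md")` returns the Markdown label, while `pick_chunker("notes.txt")` falls through to the generic recursive one.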

Where to find chunkers

Inside the Create Vector Knowledge Base form click + Create New Chunker. The Select Chunker picker shows the available types:

Select Chunker picker listing three chunkers: Generic Recursive Chunker (Represents a Generic document chunker. Provides functionality to recursively chunk a text), Markdown Chunker (Represents a Markdown document chunker. Provides functionality to recursively chunk a Markdown document), and Html Chunker (Represents an HTML document chunker. Provides functionality to recursively chunk a HTML document).

Implementations overview

All three local chunkers ship in the core ballerina/ai package. The Devant Chunker is a remote implementation that delegates the work to the WSO2 Integration Platform.

Chunker | Module | Default strategy | Use when
Generic Recursive | ballerina/ai | PARAGRAPH | Plain text. Falls back through paragraph, sentence, line, word, then character.
Markdown | ballerina/ai | MARKDOWN_HEADER | Markdown documents. Splits on headings first, falls back through code block, horizontal line, paragraph, line, sentence, word, then character.
HTML | ballerina/ai | HTML_HEADER | HTML documents. Splits on <h1> through <h6>, falls back through <p>, <br>, sentence, word, then character.
Devant | ballerinax/ai.devant | RECURSIVE | Binary documents (PDF, DOCX, PPTX). Delegates to the WSO2 Integration Platform.

All four chunkers share the same two top-level parameters: Max Chunk Size (the cap, in characters, per chunk) and Max Overlap Size (characters reused at the boundary between adjacent chunks for context continuity). The local chunkers default to 200 and 40; the Devant chunker defaults to 500 and 50.
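
To make the two parameters concrete, here is a minimal Python sketch of a fixed-size window with overlap. It is a simplification, not the ballerina/ai implementation: it cuts on raw character positions, whereas the chunkers described here split on paragraph, heading, or tag boundaries. The function name `window_chunks` is hypothetical.

```python
def window_chunks(text, max_chunk_size=200, max_overlap_size=40):
    """Cut text into windows of at most max_chunk_size characters.

    Each window after the first repeats the last max_overlap_size
    characters of the previous one.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= max_overlap_size < max_chunk_size:
        raise ValueError("max_overlap_size must be in [0, max_chunk_size)")

    step = max_chunk_size - max_overlap_size  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the local defaults (200 and 40), each window advances by 160 characters, so a 500-character input yields three chunks and the last 40 characters of one chunk reappear at the start of the next.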

Generic Recursive Chunker

For plain text. Begins splitting using the chosen strategy and recursively falls back to finer-grained units when a chunk would exceed the size limit.

Create form

Create Chunker form for Generic Recursive showing the banner 'This operation has no required parameters. Optional settings can be configured below.', an Advanced Configurations Expand link, Chunker Name aiGenericrecursivechunker, and Result Type ai.

No required fields. Sensible defaults work for most prose.

Advanced configurations

Generic Recursive Chunker Create form with Advanced Configurations expanded showing Max Chunk Size (default 200), Max Overlap Size (default 40), Strategy (default PARAGRAPH).

Field | Default | Available values | What it controls
Max Chunk Size | 200 | Any positive integer (characters) | Maximum characters allowed per chunk.
Max Overlap Size | 40 | Any non-negative integer (characters) | Characters reused from the end of the previous chunk when starting the next. The overlap is whole sentences taken in reverse, capped at this length, to preserve cross-boundary context.
Strategy | PARAGRAPH | PARAGRAPH, SENTENCE, LINE, WORD, CHARACTER | Starting splitting unit. The chunker falls back through finer units automatically when a chunk overflows.
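
The overlap rule for Max Overlap Size (whole sentences, taken in reverse, capped at the limit) can be sketched as follows. This is an illustration under stated assumptions: a naive regular expression stands in for the sentence detector, sentences are rejoined with a single space, and the function name `sentence_overlap` is hypothetical.

```python
import re


def sentence_overlap(previous_chunk, max_overlap_size=40):
    """Return the trailing whole sentences of previous_chunk that fit the cap."""
    # Naive stand-in for a real sentence detector: split after ., ! or ?.
    sentences = re.split(r"(?<=[.!?])\s+", previous_chunk.strip())

    picked = []
    used = 0
    # Walk backwards from the last sentence and stop at the first one
    # that would push the overlap past the cap.
    for sentence in reversed(sentences):
        extra = len(sentence) + (1 if picked else 0)  # +1 for the joining space
        if used + extra > max_overlap_size:
            break
        picked.append(sentence)
        used += extra

    # picked is in reverse order; restore reading order before joining.
    return " ".join(reversed(picked))
```

If even the last sentence is longer than the cap, this sketch carries no overlap at all rather than cutting a sentence in half.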

Strategy fall-back order

Strategies are tried from the largest unit to the smallest. If a chunk would exceed Max Chunk Size, the chunker falls back to the next finer strategy.

Strategy | Boundary | Falls back to
PARAGRAPH | Two or more newlines (\n\n). | SENTENCE, LINE, WORD, CHARACTER
SENTENCE | OpenNLP sentence detector. | WORD, CHARACTER
LINE | One or more newlines (\n). | WORD, CHARACTER
WORD | One or more spaces. | CHARACTER
CHARACTER | Individual characters. | (terminal)
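
The fall-back table above can be read as a recursive procedure: split on the current unit, pack units greedily into chunks, and hand any unit that is still too large to the next finer strategy. The Python sketch below follows the table's chains but is not the ballerina/ai code. It uses a simple regular expression in place of the OpenNLP sentence detector, leaves out overlap, and all names are hypothetical.

```python
import re

# Fall-back chains copied from the table above.
FALLBACKS = {
    "PARAGRAPH": ["SENTENCE", "LINE", "WORD", "CHARACTER"],
    "SENTENCE": ["WORD", "CHARACTER"],
    "LINE": ["WORD", "CHARACTER"],
    "WORD": ["CHARACTER"],
    "CHARACTER": [],
}

# Separator patterns; the capture group keeps separators in the output.
SPLITTERS = {
    "PARAGRAPH": r"(\n{2,})",
    "SENTENCE": r"((?<=[.!?])\s+)",  # naive stand-in for a sentence detector
    "LINE": r"(\n+)",
    "WORD": r"( +)",
}


def split_keep_separators(text, pattern):
    """Split text into units, attaching each separator to the unit before it."""
    parts = re.split(pattern, text)  # alternates: unit, separator, unit, ...
    units = []
    for i in range(0, len(parts), 2):
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        unit = parts[i] + separator
        if unit:
            units.append(unit)
    return units


def _split(text, max_size, chain):
    if len(text) <= max_size:
        return [text] if text else []
    level = chain[0]
    if level == "CHARACTER":
        # Terminal strategy: hard cut every max_size characters.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    chunks, buffer = [], ""
    for unit in split_keep_separators(text, SPLITTERS[level]):
        if len(buffer) + len(unit) <= max_size:
            buffer += unit  # unit still fits in the current chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(unit) <= max_size:
            buffer = unit
        else:
            # Unit overflows on its own: retry with the next finer strategy.
            chunks.extend(_split(unit, max_size, chain[1:]))
    if buffer:
        chunks.append(buffer)
    return chunks


def recursive_chunks(text, max_chunk_size=200, strategy="PARAGRAPH"):
    return _split(text, max_chunk_size, [strategy] + FALLBACKS[strategy])
```

Because separators stay attached to the unit before them, concatenating the returned chunks reproduces the input, which makes the sketch easy to check.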

Markdown Chunker

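Header-aware chunker for Markdown. Starts at heading level ## and walks down through the deeper heading levels, then falls back to structure-aware strategies (code block, horizontal line, paragraph) and finally to line, sentence, word, and character splitting.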

Create form

Create Chunker form for Markdown showing the banner 'This operation has no required parameters. Optional settings can be configured below.', an Advanced Configurations Expand link, Chunker Name aiMarkdownchunker, and Result Type ai.

No required fields.

Advanced configurations

Markdown Chunker Create form with Advanced Configurations expanded showing Max Chunk Size (default 200), Max Overlap Size (default 40), Strategy (default MARKDOWN_HEADER).

Field | Default | Available values | What it controls
Max Chunk Size | 200 | Any positive integer | Max characters per chunk.
Max Overlap Size | 40 | Any non-negative integer | Overlap characters between adjacent chunks.
Strategy | MARKDOWN_HEADER | MARKDOWN_HEADER, CODE_BLOCK, HORIZONTAL_LINE, PARAGRAPH, LINE, SENTENCE, WORD, CHARACTER | Starting splitting unit.

Strategy options

Strategy | What it splits on
MARKDOWN_HEADER | Markdown headers, starting at ## and walking down to ######. Falls back through CODE_BLOCK, HORIZONTAL_LINE, PARAGRAPH, LINE, SENTENCE, WORD, CHARACTER.
CODE_BLOCK | Fenced code blocks. Each becomes a chunk tagged with type code_block (and a language annotation if declared). Code-block chunks are never merged with neighbours.
HORIZONTAL_LINE | The patterns ***, ---, ___. Falls back to PARAGRAPH.
PARAGRAPH / SENTENCE / LINE / WORD / CHARACTER | Same semantics as the Generic Recursive Chunker.
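
The MARKDOWN_HEADER behaviour (start at ##, walk down to ######) can be approximated with the Python sketch below. It is an illustration only, with hypothetical function names. It ignores header-like lines inside fenced code blocks, does not merge small sections, and simply returns an oversized section once it runs out of heading levels, where the chunker described here would continue with CODE_BLOCK and the finer strategies.

```python
FENCE = "`" * 3  # fenced code block marker


def split_on_headers(markdown, level=2):
    """Split at lines that start a header of exactly the given level."""
    marker = "#" * level + " "
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        if not in_fence and line.startswith(marker) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def chunk_by_headers(markdown, max_chunk_size=200, level=2):
    """Split on ## first, then on deeper levels for sections that overflow."""
    if len(markdown) <= max_chunk_size or level > 6:
        # Either it fits, or there are no heading levels left to try.
        return [markdown]
    chunks = []
    for section in split_on_headers(markdown, level):
        chunks.extend(chunk_by_headers(section, max_chunk_size, level + 1))
    return chunks
```

Tracking the fence state is what keeps a shell comment such as `## install` inside a code block from being mistaken for a heading.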

HTML Chunker

Tag-aware chunker for HTML. Starts at heading tags and falls back through paragraphs, line breaks, then character-level strategies.

Create form

Create Chunker form for HTML showing the banner 'This operation has no required parameters. Optional settings can be configured below.', an Advanced Configurations Expand link, Chunker Name aiHtmlchunker, and Result Type ai.

No required fields.

Advanced configurations

HTML Chunker Create form with Advanced Configurations expanded showing Max Chunk Size (default 200), Max Overlap Size (default 40), Strategy (default HTML_HEADER).

Field | Default | Available values | What it controls
Max Chunk Size | 200 | Any positive integer | Max characters per chunk.
Max Overlap Size | 40 | Any non-negative integer | Overlap characters between adjacent chunks.
Strategy | HTML_HEADER | HTML_HEADER, HTML_PARAGRAPH, HTML_LINE, SENTENCE, WORD, CHARACTER | Starting splitting unit.

Strategy options

Strategy | What it splits on
HTML_HEADER | <h1> through <h6>. Falls back through HTML_PARAGRAPH, HTML_LINE, SENTENCE, WORD, CHARACTER.
HTML_PARAGRAPH | <p> tags. Falls back through HTML_LINE, SENTENCE, WORD, CHARACTER.
HTML_LINE | <br> tags. Falls back through SENTENCE, WORD, CHARACTER.
SENTENCE / WORD / CHARACTER | Same semantics as the other chunkers.
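
As a rough picture of what splitting on <h1> through <h6> means, the Python sketch below groups the text of an HTML document under the heading that precedes it, using only the standard library parser. It is a simplification with hypothetical names: it extracts plain text rather than preserving markup, inserts a space at <p>, <br>, <div>, and <li> boundaries, and applies no size limit or fall-back.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_BREAKS = {"p", "br", "div", "li"}  # tags that should separate words


class HeadingSplitter(HTMLParser):
    """Collect the text that follows each heading into its own section."""

    def __init__(self):
        super().__init__()
        self.sections = [{"heading": "", "text": ""}]  # text before any heading
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # Every heading opens a new section.
            self.sections.append({"heading": "", "text": ""})
            self._in_heading = True
        elif tag in BLOCK_BREAKS:
            self.sections[-1]["text"] += " "

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key] += data


def split_on_headings(html):
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [
        {"heading": s["heading"].strip(), "text": " ".join(s["text"].split())}
        for s in parser.sections
        if s["heading"].strip() or s["text"].strip()
    ]
```

A section that is still larger than Max Chunk Size would then be split further on <p> and <br>, following the fall-back order in the table.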

Devant Chunker

A remote chunker that delegates the actual splitting to the WSO2 Integration Platform. Useful when you have binary documents (PDF, DOCX, PPTX) that the local chunkers can't read directly. Pair it with the Devant Binary Data Loader to read the file as a binary document.

Official website: WSO2 Integration Platform.

Create form

The Devant Chunker is added from the same Select Chunker picker. Its create form requires connecting to the WSO2 Integration Platform.

Field | Required | Default | Available values
Service URL | Yes | | The WSO2 Integration Platform service endpoint URL.
Access Token | Yes | | Access token for authenticating with WSO2 Integration Platform.

Advanced configurations

Field | Default | Available values | What it controls
Maximum Chunk Size in Characters | 500 | Any positive integer | Max characters per chunk.
Maximum Overlap Size in Characters | 50 | Any non-negative integer | Overlap characters between adjacent chunks.
Chunking Strategy | RECURSIVE | RECURSIVE, PARAGRAPH, SENTENCE, CHARACTER | The chunking strategy WSO2 Integration Platform uses.

The form also exposes the Standard HTTP Advanced Configurations.

Only binary documents are accepted. The document's metadata must include a file name so WSO2 Integration Platform can detect the source format.
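
The two preconditions above (binary content and a file name in the metadata) are easy to check before sending a document. The Python sketch below shows such a check under stated assumptions: the document is modelled as a plain dictionary, and the keys `content`, `metadata`, and `fileName` are hypothetical names chosen for the example, not the Devant chunker's actual schema.

```python
def validate_for_devant(document):
    """Raise ValueError if the document does not meet the stated preconditions."""
    # Hypothetical shape: {"content": bytes, "metadata": {"fileName": str}}
    content = document.get("content")
    if not isinstance(content, (bytes, bytearray)):
        raise ValueError("only binary documents are accepted")

    metadata = document.get("metadata") or {}
    if not metadata.get("fileName"):
        # The file name is what lets the platform detect the source format.
        raise ValueError("metadata must include a file name")
```

Running a check like this on the client side surfaces a missing file name before the request is sent to the remote service.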


Selecting a chunk size

Chunk size is usually the first setting to tune in a RAG pipeline. Some rules of thumb:

Symptom | Try
Retrieval brings back the right doc but the wrong section. | Smaller Max Chunk Size (try 150).
Retrieved chunks lose context (an answer is cut off mid-thought). | Larger Max Chunk Size (try 400) or larger Max Overlap Size.
Many tiny low-relevance results crowding out the good one. | Larger Max Chunk Size.
Retrieval cost is high (many calls, slow queries). | Larger Max Chunk Size, fewer chunks per document.

The defaults (200 / 40) are tuned for prose. Code-heavy or table-heavy content usually wants larger chunks.
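
When weighing these trade-offs it helps to estimate how many chunks a setting produces. The Python sketch below gives a back-of-the-envelope figure, assuming fixed-size windows in which each chunk advances by Max Chunk Size minus Max Overlap Size characters. Real chunkers split on structural boundaries, so actual counts will differ; the function name is hypothetical.

```python
def estimate_chunk_count(doc_length, max_chunk_size=200, max_overlap_size=40):
    """Estimate the number of fixed-size windows needed to cover a document."""
    if not 0 <= max_overlap_size < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    if doc_length <= 0:
        return 0
    if doc_length <= max_chunk_size:
        return 1
    step = max_chunk_size - max_overlap_size   # net new characters per chunk
    remaining = doc_length - max_chunk_size    # left over after the first chunk
    return 1 + -(-remaining // step)           # ceiling division
```

For a 10,000-character document this estimate gives 91 chunks at 150 / 40, 63 at the default 200 / 40, and 28 at 400 / 40, which shows how quickly the number of embeddings to store and search falls as chunks grow.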

What's next

  • Embedding Providers - Configure the model that converts chunks into vectors.
  • Vector Stores - Set up the store that indexes and retrieves those vectors.
  • Knowledge Bases - Combine a chunker, embedding provider, and vector store into a single ingest-and-retrieve component.
  • RAG - End-to-end walkthrough of the ingestion and query flows.