RAG Ingestion

The ingestion integration converts raw documents into vectors that the RAG query integration can retrieve. It runs once (or on a schedule) to populate your vector knowledge base. The query integration then searches that knowledge base at runtime.

This page covers building the ingestion integration in WSO2 Integrator: creating an automation, wiring up a data loader and a knowledge base, and running the integration. For retrieval and query, see RAG query.

What the RAG ingestion does

The ingest action on the Knowledge Base handles everything after document loading: it chunks each document, calls the embedding provider to produce vectors, and persists the resulting entries in the vector store.

Prerequisites

A document to ingest (Markdown, plain text, or other supported format).
A configured embedding provider. The default WSO2 provider works out of the box. Run the WSO2 Integrator command Ballerina: Configure default WSO2 model provider if you haven't already.

Step 1: Create an automation artifact

An Automation runs on integration startup. It is the right artifact type for a one-shot ingestion job.

In the design view, select + Add Artifact.
On the Artifacts page, select Automation and click Create.

Step 2: Add a text data loader

A Text Data Loader reads a file from disk and wraps its content as an ai:Document.

In the flow editor, click + to open the Add Node panel.
Go to AI > RAG > Data Loader.
Click Add Data Loader and select Text Data Loader.
In the configuration panel:

Field	Value
Paths	Path to the file you want to ingest, for example `/resources/knowledge.pdf`
Name	A variable name for the loader, for example `loader`
Result Type	The variable type, set to `ai:TextDataLoader`.

Click Save.

The loader node appears in the right panel. It does not read the file yet. You call its load function next.

Step 3: Load the documents

Call the loader's load function to execute the read and get back an ai:Document[].

Click on the loader node and select the load action call.
In the form that appears, set the result variable name, for example documents.
Click Save.

ai:Document is a generic content container. It holds the raw text from the source plus optional metadata (file name, URL, category) that you can use to filter results during retrieval.
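To make the "content plus metadata" idea concrete, here is a minimal Python sketch. It is not the `ai:Document` type itself: the `Document` class, the `filter_by_metadata` helper, and the sample texts are all hypothetical stand-ins used only to show how metadata can narrow a result set.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Toy stand-in for a document: raw text plus optional metadata."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: list[Document], **criteria: Any) -> list[Document]:
    """Keep only documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


# Made-up sample documents, for illustration only.
documents = [
    Document("Employees accrue leave days every month.",
             {"fileName": "leave.md", "category": "hr"}),
    Document("Rotate API keys on a regular schedule.",
             {"fileName": "security.md", "category": "it"}),
]

hr_docs = filter_by_metadata(documents, category="hr")
print([d.metadata["fileName"] for d in hr_docs])
```

Attaching fields such as a file name or category at load time is what makes this kind of filtering possible later, during retrieval.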

Step 4: Create a vector knowledge base

The Vector Knowledge Base owns the three pluggable parts of a RAG store: a vector store, an embedding provider, and a chunker.

Click + to add a node.
Go to AI > RAG > Knowledge Base.
Click Add Knowledge Base and select Vector Knowledge Base.

Fill in the form:

Field	Required	Values
Vector Store	Yes	In-Memory Vector Store, Pinecone, pgvector, Weaviate, or Milvus.
Embedding Model	Yes	Default Embedding Provider (WSO2) or any other listed embedding provider. The default WSO2 provider produces 1536-dimensional dense vectors.
Chunker	No	`ai:AUTO` is the default and works for most cases. Switch to a specific chunker if retrieval quality degrades: use Markdown for `.md` files, HTML for web pages, or Generic Recursive for plain text.
Knowledge Base Name	—	For example, `knowledgeBase`

Click Save.
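For intuition about what a recursive chunker does, the following Python sketch splits text on the coarsest separator first and falls back to finer ones only when a piece is still too long. This is an illustration of the general technique, not the WSO2 Generic Recursive chunker; the function name, the separator list, and the character-based size limit are assumptions.

```python
def recursive_chunk(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        finer = separators[i + 1:]
        chunks: list[str] = []
        current = ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= max_chars:
                # Still fits: keep merging neighbouring parts.
                current = candidate
            else:
                # Flush what we have; oversized pieces are split with finer separators.
                if current:
                    chunks.extend(recursive_chunk(current, max_chars, finer))
                current = part
        if current:
            chunks.extend(recursive_chunk(current, max_chars, finer))
        return chunks

    # No separator applies: fall back to a hard cut.
    return [text[start:start + max_chars] for start in range(0, len(text), max_chars)]


sample = "Alpha beta.\n\nGamma delta epsilon zeta.\n\nEta."
for piece in recursive_chunk(sample, max_chars=20):
    print(repr(piece))
```

Structure-aware chunkers (Markdown, HTML) follow the same idea but split on headings or tags instead of plain whitespace, which is why they tend to suit those formats better.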

Warning: The In-Memory Vector Store is not durable and is local to the current integration runtime, so all vectors are lost when the integration stops. Use it only for local development or testing, where ingestion and query run in the same integration runtime/process. If ingestion and query run as separate integrations or processes, configure an external vector store such as Pinecone, pgvector, Weaviate, or Milvus, and set vectorDimension: 1536 to match the WSO2 embedding provider's output.

Warning: Use the same embedding provider for ingestion and retrieval. Vectors produced by different providers are not comparable. If you ingest with the WSO2 default provider and retrieve with OpenAI (or vice versa), the similarity search returns no useful results.
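The Python sketch below illustrates why mixing providers breaks similarity search. The two "providers" are deliberately trivial, hypothetical functions (letter counts with the axes in different orders), not real embedding models; real providers differ in learned vector spaces and often in dimension as well.

```python
import math
import string


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; refuses vectors of different dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_a(text: str) -> list[float]:
    """Hypothetical provider A: letter counts in a..z order."""
    return [float(text.lower().count(ch)) for ch in string.ascii_lowercase]


def embed_b(text: str) -> list[float]:
    """Hypothetical provider B: the same counts, but with axes in z..a order."""
    return list(reversed(embed_a(text)))


query = "aaab"
same_provider = cosine_similarity(embed_a(query), embed_a(query))
mixed_providers = cosine_similarity(embed_a(query), embed_b(query))
print(same_provider, mixed_providers)
```

The identical text scores as a near-perfect match when both vectors come from provider A, and as unrelated when one vector comes from provider B, even though both toy providers encode the same information.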

See Vector Stores and Knowledge Bases for the full configuration reference.

Step 5: Ingest the documents

Call ingest on the knowledge base to chunk, embed, and persist the loaded documents.

Click + after the knowledge base creation node.
Select the knowledgeBase variable and choose the Ingest action.
Set Documents to the documents variable from Step 3.
Click Save.

The ingest action:

Passes each ai:Document through the configured Chunker.
Sends each chunk to the Embedding Provider to produce a vector.
Persists the vector + chunk content in the Vector Store.
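The three stages above can be sketched in Python as follows. This is a conceptual model only, not the WSO2 implementation: the fixed-size chunker, the hash-based `embed` function, the 8-dimensional vectors, and the list used as a store are all simplifying assumptions.

```python
import hashlib
from dataclasses import dataclass

DIM = 8  # toy dimension for readability; a real provider uses far more


@dataclass
class Entry:
    vector: list[float]
    chunk: str


def chunk(text: str, size: int = 40) -> list[str]:
    """Stage 1: split a document into fixed-size pieces (toy chunker)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> list[float]:
    """Stage 2: map text to a vector (toy hashing of words into DIM buckets)."""
    vector = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % DIM] += 1.0
    return vector


def ingest(documents: list[str], store: list[Entry]) -> None:
    """Stage 3: persist one (vector, chunk) entry per chunk."""
    for document in documents:
        for piece in chunk(document):
            store.append(Entry(embed(piece), piece))


store: list[Entry] = []
ingest(["a" * 100, "short note"], store)
print(len(store), len(store[0].vector))
```

Each stored entry keeps the chunk text next to its vector, which is what lets the query side return readable content after a similarity match.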

Step 6: Add a completion log

Add a Log Info node after the ingest call to confirm the integration finished.

Field	Value
Message	For example, `"RAG ingestion complete."`

This is optional but useful during development and when the automation runs on a schedule.

Running the integration

Click Run at the top right of the project view. WSO2 Integrator compiles and starts the integration. Because the artifact is an Automation, the ingestion function executes immediately on startup.

Watch the Run panel output for the log message. If the run fails, check:

The file path is correct relative to the project root.
The WSO2 model provider is configured (Ballerina: Configure default WSO2 model provider).
The embedding provider and vector store are reachable (for external stores).

Keeping the knowledge base up to date

The in-memory store is rebuilt on every restart, so re-running the integration re-ingests automatically. For durable stores:

Use Delete By Filter before re-ingesting a document to avoid duplicates. Filter by a metadata field like source or version.
Run the automation from a trigger (for example, an HTTP call, a cron schedule, or a file-watch event) rather than running it only once.

See Knowledge Bases — delete by filter for details.
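The delete-then-ingest pattern can be sketched in Python as follows. `ToyStore`, its methods, and the `source` metadata key are hypothetical and only mimic the behaviour described above; refer to the Knowledge Bases reference for the actual Delete By Filter action.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    chunk: str
    metadata: dict[str, str]


class ToyStore:
    """Hypothetical in-memory stand-in for a durable vector store."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def ingest(self, chunks: list[str], metadata: dict[str, str]) -> None:
        self.entries.extend(Entry(c, dict(metadata)) for c in chunks)

    def delete_by_filter(self, **criteria: str) -> int:
        """Remove entries whose metadata matches all criteria; return the count."""
        keep = [
            e for e in self.entries
            if not all(e.metadata.get(k) == v for k, v in criteria.items())
        ]
        removed = len(self.entries) - len(keep)
        self.entries = keep
        return removed


def reingest(store: ToyStore, source: str, chunks: list[str]) -> int:
    """Drop stale entries for one source, then ingest its fresh chunks."""
    removed = store.delete_by_filter(source=source)
    store.ingest(chunks, {"source": source})
    return removed


store = ToyStore()
store.ingest(["old chunk 1", "old chunk 2"], {"source": "leave.md"})
store.ingest(["unrelated chunk"], {"source": "security.md"})
removed = reingest(store, "leave.md", ["new chunk 1", "new chunk 2", "new chunk 3"])
print(removed, len(store.entries))
```

Because the filter targets a single `source` value, entries from other documents are left untouched while the updated document is replaced rather than duplicated.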

What's next

RAG query — retrieve chunks at runtime and generate grounded responses.
Knowledge Bases — ingest, retrieve, and delete-by-filter reference.
Vector Stores — picking and configuring a production store.
Embedding Providers — available providers and dimension requirements.
Chunkers — controlling how documents are split before ingest.
