Unleash the Power of AI: Develop a Question Answering Service with OpenAI and Ballerina
- Jayani Hewavitharana
- Job Title - WSO2
This article is based on Ballerina Swan Lake Update 4 (2201.4.0).
The new wave of large language models (LLMs) has gained immense popularity worldwide. Many people have been exploring use cases for these powerful models, and question answering based on document data is a particularly interesting scenario.
Typically, when we need answers in a specific area, finding exactly what we’re looking for can be a challenge. It takes time to skim through all the information, and we may still end up empty-handed. This is where LLMs come in handy! They can answer all kinds of questions in real time, along with explanations and sources if we need them. However, there’s a catch: LLMs do not have the latest information on recent developments in various domains. They are also prone to generating plausible but false responses, known as “hallucinations”. As a result, if the questions asked are highly specific to a relatively new field, the LLM may generate inaccurate responses based on related but outdated or incorrect information used during its training.
There are ways to improve LLMs and make them more suitable for what we want to achieve. One such approach is fine-tuning, where we use a dataset that exhibits the outcomes we expect to tune the model to behave similarly and give similar outcomes. Unfortunately, this approach is not very useful for our problem with the question answering scenario because fine-tuning helps the model to remember patterns, but not gain knowledge.
Prompt engineering offers a more promising solution by leveraging the LLM’s ability for in-context learning. In-context learning is the LLM’s ability to learn from the information given in the prompt itself, be it instructions, examples, or even knowledge. Therefore, we can provide relevant knowledge to the model in the prompt along with the question, and it will give us an accurate answer. However, this doesn’t mean we’re back to searching for relevant content on our own. This is where embeddings come into play.
Embeddings are numerical representations of text that allow us to compare the similarity between two texts. They help us provide relevant information to the model for question answering: by comparing the embedding of a question with the embeddings of the documents we have, we can identify the most similar documents and add them to the prompt for the model to use. This comparison can be done programmatically on our own, or we can use vector databases such as Pinecone and Weaviate, which handle this job for us and return similar content. We simply have to add this content to the prompt and send it to the LLM.
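To make the idea of comparing embeddings concrete, here is a minimal, language-neutral sketch in plain Python of one common similarity measure, cosine similarity. It is not the Ballerina code of this article, and the source does not say which measure is used; the function name and the toy three-dimensional vectors are assumptions made purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values); real embedding vectors are much longer.
question = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]  # points in the same direction as the question
doc_b = [0.0, 3.0, 0.0]  # orthogonal to the question

print(cosine_similarity(question, doc_a))  # close to 1.0: very similar
print(cosine_similarity(question, doc_b))  # 0.0: unrelated
```

A vector database performs this kind of comparison for us, over many stored vectors at once.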
While there are many articles and tutorials that discuss LLM-based implementations for question answering using languages such as Python, this article aims to showcase the simplicity of building AI use cases with Ballerina. With its newly introduced support for AI, Ballerina, a language specialized for integration, is an ideal choice for implementing use cases that require communicating with multiple APIs. Its built-in connectors make it easy to connect to hundreds of public APIs, which is a great benefit when building end-to-end interactions.
For example, let’s consider this use case as a ChatBot service with the following integrations.
- Data retrieval: Connect to Google Sheets and load data from them using the ballerinax/googleapis.sheets connector.
- Embedding search: Connect to hosted vector DBs such as Pinecone (ballerinax/pinecone.vector) and Weaviate (ballerinax/weaviate).
- Answer generation: Connect to the OpenAI GPT-3 (ballerinax/openai.text) or ChatGPT (ballerinax/openai.chat) APIs to generate answers for questions. Ballerina also supports the Azure OpenAI APIs as an alternative.
- Receiving questions and responding to users: This entire process can be implemented easily as a Ballerina service.
A high-level view of our implementation of the question answering use case is shown below.
Image 1: High-level component interaction diagram for the question answering use case
During initialization, the service connects to a Google sheet specified by its URL and loads its contents. Then, it uses the OpenAI embeddings model to obtain the embeddings for each row of content. The content, along with its corresponding embeddings, is stored in the Pinecone vector database.
When the service receives a request to answer a question, it first obtains the embedding of the question. It then passes the question embedding to the Pinecone vector database, which does the similarity comparisons and fetches the most relevant content. The Ballerina service constructs the prompt by combining the retrieved content, the question, and an instruction, and sends it to the OpenAI GPT-3 model. The model responds with the answer, which the service forwards to the user.
To get started with the example, let’s first set up the following prerequisites:
- OpenAI API Key
- Google Sheets access token (can be obtained using Google API Console)
- A Vector DB (we will use Pinecone in this example)
- An IDE (VS Code is preferred with the Ballerina extension installed)
For a complete guide on how to fulfill the prerequisites, refer to the sample Question Answering based on Context using OpenAI GPT-3 and Pinecone. We store all the keys and tokens in a Config.toml file in the project folder (which we will create in the next section) so that we can access them via configurable variables.
Once we have obtained the keys and access tokens, we can create a new Google sheet and populate it with some data. Make sure that you create the new sheet using the account for which the access token was obtained. In this example, we have some sample data obtained from Choreo documentation, but you can use content from any preferred domain.
Now that all the prerequisites are set up, we can start building our service that will take a question as the input and provide an accurate answer based on the latest information.
Create and initialize the service
As our first step, we will look at how we can initialize the Ballerina service to read the data from a Google sheet and insert it into the Pinecone vector database.
First, let us create a new Ballerina project to hold our service implementation. We can do this by executing the bal new command, followed by a project name, in the desired location. This will generate a folder with all the necessary artifacts to create and run the service.
Then, we will create the Ballerina service in the main.bal file, which will answer users’ questions by referring to the document content. We will initialize an HTTP service, which listens on port 8080.
Load the data from Google Sheets
As mentioned earlier, in this example, we will load our document content from the Google sheet that we created and populated with sample data when setting up the prerequisites. We will load this data from the sheet, along with its embeddings, into our vector database so that we can easily fetch relevant content for a given question to construct the context.
To ensure that data loading from the Google sheet to the Pinecone vector database happens only once during the service initialization, we will implement this logic in the init function of the service.
To store our content in the Pinecone vector database, we need to obtain the embeddings for the content. We can compute the embeddings using the OpenAI embeddings model with the help of Ballerina’s openai.embeddings connector. For this, we will create a client object for OpenAI embeddings by providing the API key, which the service reads from the Config.toml file into a configurable variable. Then, we send a request to the model via the client, providing the text and the model name, to obtain the embedding vector.
To read from our Google sheet that contains the data, we first need to initialize a Google Sheets client object by providing the credentials. Then, we can fetch the content from the Google sheet by querying the range of columns. In our case, we fetch columns A and B, which contain the titles and content respectively. We can use the googleapis.sheets connector provided by Ballerina to fetch the data from the sheet.
Notice that we fetch data starting from row 2 (A2:B) assuming that the first row contains the headers “Title” and “Content”.
Upload the data to Pinecone vector database
In order to access our Pinecone database, we will initialize a Pinecone client object using the key and URL. Then, we will initialize an empty array of Pinecone vectors to hold our data.
We will iterate through the rows that we fetched from the Google sheet to get each title and content and also to obtain the embedding. Then, we will store all this information in the array of Pinecone vectors.
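The row-to-vector step can be sketched in a language-neutral way as follows. This is plain Python rather than the Ballerina connector code, and it rests on stated assumptions: each row is a [title, content] pair, the title is used as the record ID, the title and content are embedded together, and fake_embed is a hypothetical stand-in for the real call to the embeddings model. The id/values/metadata layout follows the shape of a Pinecone vector record.

```python
def build_vectors(rows, embed):
    """Turn [title, content] rows into records ready for a vector database."""
    vectors = []
    for row in rows:
        if len(row) < 2:
            continue  # skip rows that are missing a title or a content cell
        title, content = row[0], row[1]
        vectors.append({
            "id": title,                               # assumption: title is the ID
            "values": embed(f"{title}\n{content}"),    # embedding of title + content
            "metadata": {"content": content},          # kept so it can be returned later
        })
    return vectors

def fake_embed(text):
    # Hypothetical placeholder for an embeddings API call; returns a tiny vector.
    return [float(len(text)), float(text.count(" "))]

rows = [["Title 1", "Content of the first row"], ["Title 2"]]
vectors = build_vectors(rows, fake_embed)
print(len(vectors), vectors[0]["id"])
```

Storing the content as metadata next to each vector means that a later similarity query can return the text itself, not just the ID.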
Finally, we insert the data vectors into the Pinecone database. We can do this using the Pinecone connector client by invoking the /vectors/upsert.post method. The namespace indicates the location of the collection of data. We named our namespace “ChoreoDocs” to indicate that it contains content from Choreo documentation.
That completes the initialization of the question answering service. The complete implementation of the init function is given below.
Construct the prompt with context
In order to use an OpenAI model to answer questions, we must consider the limitations of these models. As we previously discussed, the accuracy and quality of answers may be diminished for specific domains since the model does not have recent knowledge. To address this limitation, we will provide the model with relevant context extracted from our stored content to help provide accurate answers.
However, OpenAI models also have token size restrictions, limiting the amount of information that can be included in the prompt. To ensure the most relevant information is included, we will utilize the vector database, Pinecone, to fetch a subset of data that is closely related to the question. This is done by comparing the similarity of the content embeddings stored in the database with the embedding of the question obtained through the OpenAI embeddings model. By passing the question embedding to the Pinecone client along with other meta-information, we can fetch the rows similar to the question in the order of similarity. In this example, we will fetch the top 10 most similar rows.
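Conceptually, the top-k query that the vector database runs for us looks something like the sketch below. This is plain Python with an in-memory list standing in for Pinecone; query_top_k and the sample records are hypothetical, and the sketch assumes that all vectors are of unit length so that a dot product can serve as the similarity score.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_top_k(store, query_vector, top_k):
    """Return the top_k stored records most similar to query_vector, best first.

    Assumes unit-length vectors, so the dot product equals cosine similarity.
    """
    scored = ((dot(rec["values"], query_vector), rec) for rec in store)
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [
        {"id": rec["id"], "score": score, "metadata": rec["metadata"]}
        for score, rec in best
    ]

# Hypothetical two-dimensional records; real embeddings have far more dimensions.
store = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"content": "about topic A"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"content": "about topic B"}},
    {"id": "c", "values": [0.6, 0.8], "metadata": {"content": "about both"}},
]
matches = query_top_k(store, [0.0, 1.0], top_k=2)
print([m["id"] for m in matches])  # ['b', 'c']
```

A hosted vector database does the same job at scale, typically with approximate search instead of scoring every record.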
Now that we have fetched all the related data, it is time to construct the prompt by providing the data as context. Although we have fetched only a subset of the content from the vector database, we need to be mindful of the token limit. If we try to include all the retrieved content, the prompt may still exceed the limit. To address this issue, we iteratively add the most relevant content to the prompt until the token limit is reached (also leaving some room for the answer). This ensures that we provide the model with the most pertinent context without exceeding the token limit.
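The iterative, budget-aware assembly described above can be sketched as follows. This is a plain Python illustration under stated assumptions: the word count is used as a rough proxy for the token count, and build_context and max_words are hypothetical names, not part of any connector.

```python
def build_context(contents, max_words):
    """Concatenate contents (most relevant first) until the word budget is used up."""
    selected = []
    used = 0
    for content in contents:
        words = len(content.split())
        if used + words > max_words:
            break  # stop at the first item that would overflow the budget
        selected.append(content)
        used += words
    return "\n".join(selected)

# Contents are assumed to be ordered from most to least similar to the question.
ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(build_context(ranked, max_words=6))  # keeps the first two items only
```

Setting the budget below the model's actual limit leaves room for the instruction, the question, and the generated answer.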
Once we have the context ready, we need to combine it with an instruction prompt, which will indicate to the model that it should refer to the context and answer the question.
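As a rough illustration of this step, the sketch below combines an instruction, the context, and the question into a single prompt string. It is plain Python, and the instruction wording and the Context/Q/A layout are assumptions chosen for the example rather than the exact prompt used by the service.

```python
def build_prompt(context, question):
    """Combine a fixed instruction with the retrieved context and the question."""
    # Hypothetical instruction text; adjust the wording to suit the use case.
    instruction = (
        "Answer the question as truthfully as possible using the provided context. "
        "If the answer is not contained in the context, say \"I don't know.\""
    )
    return f"{instruction}\n\nContext:\n{context}\n\nQ: {question}\nA:"

prompt = build_prompt("Some retrieved content.", "What is Choreo?")
print(prompt)
```

Telling the model to admit when the context lacks the answer is one way to reduce the risk of hallucinated responses.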
That completes the prompt construction with the relevant context to answer a question. The complete function that constructs the prompt is given below.
Generate the answer
Now it is time to put everything together and answer a question that comes as a request to the service. For this, we will create a GET resource function called answer within the service that accepts the user’s question as a parameter. Next, we will construct the prompt by extracting the relevant context from the previously fetched data as discussed earlier. Finally, we will generate the answer using the OpenAI text-davinci-003 model, which we will access through Ballerina’s openai.text connector client.
By now we have implemented a Ballerina service that can answer questions in a specific domain by referring to a set of documents. For the complete implementation, refer to the sample in the Ballerina ai-samples GitHub repository.
Run the service
Now we can run the service and send a request to see it in action. To run the service, navigate to the project directory and execute the bal run command.
The command will start the service on localhost, listening on port 8080 for requests. We can now send a GET request to the service by providing the question as a query parameter. For example, we can use a tool such as curl to ask the question “What is Choreo?”.
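Because the question travels as a query parameter, it has to be percent-encoded in the request URL. The sketch below, in plain Python, shows how such a URL could be built. It assumes that the service is mounted at the root path, so the resource is reachable at /answer, and that the query parameter is named question; both are assumptions, so adjust them to match your service.

```python
from urllib.parse import quote, urlencode

def answer_url(question, base="http://localhost:8080/answer"):
    """Build the GET request URL, percent-encoding the question."""
    # Assumption: the query parameter is called "question".
    query = urlencode({"question": question}, quote_via=quote)
    return f"{base}?{query}"

print(answer_url("What is Choreo?"))
# http://localhost:8080/answer?question=What%20is%20Choreo%3F
```

The resulting URL can be passed to curl or any other HTTP client.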
The response below shows how the service answers a question that we provide via the GET request.
In this article, we explored how to develop a question answering service in Ballerina that leverages the capabilities of OpenAI and the Pinecone vector database using the newly released connectors. The example demonstrates how to load data from a Google sheet into the Pinecone database and how to use that data as a reference when constructing the prompt for the OpenAI model, by providing context drawn from the content most similar to a given question.
As AI integration becomes more important in modern applications, we can see how Ballerina, a language specialized in integration, makes it simple to implement AI use cases that involve multiple externally hosted models and services. This demonstrates that we can build powerful and intelligent AI applications by combining the strengths of Ballerina, OpenAI, and Pinecone.