Document Stores
Learn how to use the Flowise Document Stores, written by @toi500
Flowise's Document Stores offer a versatile approach to data management, enabling you to upload, split, and prepare your dataset and upsert it in a single location.
This centralized approach simplifies data handling and allows for efficient management of various data formats, making it easier to organize and access your data within the Flowise app.
In this tutorial, we will set up a Retrieval Augmented Generation (RAG) system to retrieve information about the LibertyGuard Deluxe Homeowners Policy, a topic that LLMs are not extensively trained on.
Using the Flowise Document Stores, we'll prepare and upsert data about LibertyGuard and its set of home insurance policies. This will enable our RAG system to accurately answer user queries about LibertyGuard's home insurance offerings.
Start by adding a Document Store and giving it a name. In our case, we name it "LibertyGuard Deluxe Homeowners Policy".
Enter the Document Store that you just created and select the Document Loader you want to use. In our case, since our dataset is in PDF format, we'll use the PDF Loader.
First, we start by uploading our PDF file.
Then, we add a unique metadata key. This is optional, but a good practice as it allows us to target and filter down this same dataset later on if we need to.
Finally, select the Text Splitter you want to use to chunk your data. In our particular case, we will use the Recursive Character Text Splitter.
In this guide, we've added a generous Chunk Overlap size to ensure no relevant data gets missed between chunks. However, the optimal overlap size is dependent on the complexity of your data. You may need to adjust this value based on your specific dataset and the nature of the information you want to extract. More about this topic in this guide.
We can now preview how our data will be chunked using our current Text Splitter configuration: chunk_size=1500 and chunk_overlap=750.
It's important to experiment with different Text Splitters, Chunk Sizes, and Overlap values to find the optimal configuration for your specific dataset. This preview allows you to refine the chunking process and ensure that the resulting chunks are suitable for your RAG system.
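To build intuition for how these two values interact, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the Recursive Character Text Splitter itself, which also tries to break on separators such as paragraphs and sentences. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 750) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 4,000-character document made of repeating digits.
sample = "".join(str(i % 10) for i in range(4000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

With an overlap of half the chunk size, every character (except near the edges) appears in two chunks, which roughly doubles the number of chunks to embed and store. That trade-off is worth keeping in mind when tuning the values in the preview.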
Note that our custom metadata, company: "liberty", has been inserted into each chunk. This metadata allows us to easily filter and retrieve information from this specific dataset later on, even if we use the same vector store index for other datasets.
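The idea of tagging every chunk and filtering on that tag later can be sketched in a few lines. This is an illustration only: the `Chunk` class, the helper functions, the chunk texts, and the "acme" dataset are hypothetical and do not reflect how Flowise or any particular vector store represents metadata internally.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def tag_chunks(texts: list[str], metadata: dict) -> list[Chunk]:
    """Attach a copy of the same metadata to every chunk."""
    return [Chunk(text=t, metadata=dict(metadata)) for t in texts]


def filter_by_metadata(chunks: list[Chunk], **criteria) -> list[Chunk]:
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]


# Two datasets sharing one index, distinguished only by their metadata.
liberty = tag_chunks(["Placeholder policy text A", "Placeholder policy text B"], {"company": "liberty"})
other = tag_chunks(["Placeholder text from another dataset"], {"company": "acme"})
shared_index = liberty + other

matches = filter_by_metadata(shared_index, company="liberty")
print(len(matches))
```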
Once you are satisfied with the chunking process, it's time to process your data.
After processing your data, you retain the ability to refine individual chunks by deleting or adding content. This granular control offers several advantages:
Enhanced Accuracy: Identify and rectify inaccuracies or inconsistencies present in the original data, ensuring the information used in your application is reliable.
Improved Relevance: Refine chunk content to emphasize key information and remove irrelevant sections, thereby increasing the precision and effectiveness of your retrieval process.
Query Optimization: Tailor chunks to better align with anticipated user queries, making them more targeted and improving the overall user experience.
With our data properly processed (loaded via a Document Loader and appropriately chunked), we can now proceed to configure the upsert process.
The upsert process comprises three fundamental steps:
Embedding Selection: We begin by choosing the appropriate embedding model to encode our dataset. This model will transform our data into a numerical vector representation.
Data Store Selection: Next, we determine the Vector Store where our dataset will reside.
Record Manager Selection (Optional): Finally, we have the option to implement a Record Manager. This component provides the functionalities for managing our dataset once it's stored within the Vector Store.
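The three steps above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: `toy_embed` replaces a real embedding model with a hash-derived vector, `InMemoryVectorStore` replaces a real vector database, and `RecordManager` shows only one typical responsibility, which is remembering what was already written so that unchanged chunks are skipped on a later upsert.

```python
import hashlib


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Stand-in for an embedding model: a deterministic vector derived from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


class InMemoryVectorStore:
    """Stand-in for a vector database, keyed by chunk hash."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], str]] = {}

    def upsert(self, key: str, vector: list[float], text: str) -> None:
        self.rows[key] = (vector, text)


class RecordManager:
    """Remembers which chunk hashes have already been written."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def is_new(self, key: str) -> bool:
        return key not in self.seen

    def mark(self, key: str) -> None:
        self.seen.add(key)


def upsert_chunks(chunks, store, records=None) -> int:
    """Embed and store each chunk; skip chunks the record manager has seen."""
    written = 0
    for text in chunks:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if records is not None and not records.is_new(key):
            continue
        store.upsert(key, toy_embed(text), text)
        if records is not None:
            records.mark(key)
        written += 1
    return written


store, records = InMemoryVectorStore(), RecordManager()
first = upsert_chunks(["chunk one", "chunk two"], store, records)
second = upsert_chunks(["chunk one", "chunk two", "chunk three"], store, records)
print(first, second, len(store.rows))
```

On the second run only the new chunk is embedded and written, which is the kind of saving a Record Manager is meant to provide when a dataset is upserted repeatedly.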
Click on the "Select Embeddings" card and choose your preferred embedding model. In our case, we will select OpenAI as the embedding provider and use the "text-embedding-ada-002" model with 1536 dimensions.
Click on the "Select Vector Store" card and choose your preferred Vector Store. In our case, as we need a production-ready option, we will select Upstash.
For advanced dataset management within the Vector Store, you can optionally select and configure a Record Manager. Detailed instructions on how to set up and utilize this feature can be found in the dedicated guide.
To begin the upsert process and transfer your data to the Vector Store, click the "Upsert" button.
As illustrated in the image below, our data has been successfully upserted into the Upstash vector database. The data was divided into 85 chunks to optimize the upsertion process and ensure efficient storage and retrieval.
To quickly test the functionality of your dataset without navigating away from the Document Store, simply utilize the "Retrieval Query" button. This initiates a test query, allowing you to verify the accuracy and effectiveness of your data retrieval process.
In our case, we see that when querying for information about kitchen flooring coverage in our insurance policy, we retrieve 4 relevant chunks from Upstash, our designated Vector Store. This retrieval is limited to 4 chunks as per the defined "top k" parameter, ensuring we receive the most pertinent information without unnecessary redundancy.
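As a rough picture of what the "top k" parameter does, the sketch below ranks stored vectors by cosine similarity to a query vector and keeps the best four. The two-dimensional vectors and chunk labels are made up for readability; a real store compares high-dimensional embeddings and may use a different similarity metric or an approximate search.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the labels of the k rows most similar to the query vector."""
    ranked = sorted(rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
    return [label for label, _ in ranked[:k]]


# Hypothetical chunk labels with toy 2-dimensional embeddings.
rows = [
    ("flooring", [1.0, 0.0]),
    ("kitchen", [0.9, 0.1]),
    ("roof", [0.5, 0.5]),
    ("theft", [0.1, 0.9]),
    ("liability", [0.0, 1.0]),
]
results = top_k([1.0, 0.0], rows, k=4)
print(results)
```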
Finally, our Retrieval-Augmented Generation (RAG) system is operational. It's noteworthy how the LLM effectively interprets the query and successfully leverages relevant information from the chunked data to construct a comprehensive response.
You can use the vector store that was configured earlier:
Or, use the Document Store (Vector):
There is also API support for creating, updating, and deleting Document Stores. Refer to the Document Store API for more details. In this section, we highlight the two most commonly used APIs: upsert and refresh.
You can upsert a new file using an existing Document Loader and upsert configuration. For example, suppose you have a PDF Loader inside a Document Store and want to reuse its existing configuration with a new file.
First, take note of the store ID and document ID:
Since the PDF File Loader has Upload File functionality, the request is sent as form data so that files can be included in the API call.
Make sure the type of the file you send matches the file type the Document Loader expects. For example, if a PDF File Loader is being used, you should only send .pdf files.
To avoid having separate loaders for different file types, we recommend using the File Loader.
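A form-data upsert request might be assembled as in the sketch below, which uses only the Python standard library and builds the request without sending it. Several details are assumptions to verify against the Document Store API reference: the base URL `http://localhost:3000`, the path `/api/v1/document-store/upsert/{storeId}`, the form field names `files`, `docId` and `replaceExisting`, and bearer-token authentication. The IDs, API key, and file bytes are placeholders.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:3000"  # assumption: a local Flowise instance


def _field(boundary: str, name: str, value: str) -> bytes:
    """Encode one plain text field of a multipart/form-data body."""
    return (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
    ).encode("utf-8")


def build_file_upsert_request(store_id, doc_id, filename, file_bytes, api_key):
    """Build (but do not send) a multipart upsert request for one PDF file."""
    boundary = uuid.uuid4().hex
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8") + file_bytes + b"\r\n"
    body = b"".join([
        _field(boundary, "docId", doc_id),            # assumed field name
        _field(boundary, "replaceExisting", "true"),  # assumed field name
        file_part,
        f"--{boundary}--\r\n".encode("utf-8"),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/document-store/upsert/{store_id}",  # assumed path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder IDs, key, and file content.
req = build_file_upsert_request("store-123", "doc-456", "policy.pdf", b"%PDF-1.4 placeholder", "my-api-key")
# To actually send it: urllib.request.urlopen(req)
```

In practice an HTTP client library with built-in multipart support would replace the hand-built body; it is spelled out here only to show what the form data contains.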
For other Document Loader nodes without Upload File functionality, the API body is in JSON format:
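A JSON-bodied upsert could look like the following sketch, again built with the standard library and not sent. The endpoint path, the `docId` and `loader` field names, the shape of the loader config, the base URL, and bearer-token authentication are assumptions to check against the Document Store API reference; the IDs and the `text` value are placeholders.

```python
import json
import urllib.request


def build_json_upsert_request(store_id, doc_id, loader_config, api_key,
                              base_url="http://localhost:3000"):
    """Build (but do not send) a JSON upsert request for a loader without file upload."""
    payload = {
        "docId": doc_id,                      # assumed field name
        "loader": {"config": loader_config},  # assumed shape for overriding loader input
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/document-store/upsert/{store_id}",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder IDs and a hypothetical plain-text loader config.
json_req = build_json_upsert_request("store-123", "doc-456", {"text": "This is a new text"}, "my-api-key")
# To actually send it: urllib.request.urlopen(json_req)
```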
Often, you will want to re-process every Document Loader within a Document Store to fetch the latest data and upsert it to the Vector Store, so that everything stays in sync. This can be done via the Refresh API:
You can also override the existing configuration of a specific Document Loader:
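The sketch below builds a refresh request with an optional per-loader override, without sending it. As with upsert, the path `/api/v1/document-store/refresh/{storeId}`, the `items` list, the `docId` and `splitter` field names, the base URL, and the authentication scheme are assumptions to confirm in the Document Store API reference. The IDs and splitter values are placeholders.

```python
import json
import urllib.request


def build_refresh_request(store_id, api_key, overrides=None,
                          base_url="http://localhost:3000"):
    """Build (but do not send) a refresh request, optionally overriding loader settings."""
    # Assumption: an empty JSON object refreshes every loader with its saved configuration.
    payload = {"items": overrides} if overrides else {}
    return urllib.request.Request(
        f"{base_url}/api/v1/document-store/refresh/{store_id}",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Refresh everything with the saved configuration.
plain_refresh = build_refresh_request("store-123", "my-api-key")

# Refresh with a different chunking setup for one loader (placeholder values).
override_refresh = build_refresh_request(
    "store-123",
    "my-api-key",
    overrides=[{"docId": "doc-456", "splitter": {"config": {"chunkSize": 2000, "chunkOverlap": 100}}}],
)
# To actually send one: urllib.request.urlopen(override_refresh)
```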
We started by creating a Document Store to organize the LibertyGuard Deluxe Homeowners Policy data. This data was then prepared by uploading, chunking, processing, and upserting it, making it ready for our RAG system.
Advantages of the Document Store:
Document Stores offer several benefits for managing and preparing data for Retrieval Augmented Generation (RAG) systems:
Organization and Management: They provide a central location for storing, managing, and preparing your data.
Data Quality: The chunking process helps structure data for accurate retrieval and analysis.
Flexibility: Document Stores allow for refining and adjusting data as needed, improving the accuracy and relevance of your RAG system.
In this video, Leon provides a step-by-step tutorial on using Document Stores to easily manage your RAG knowledge bases in FlowiseAI.