Implementing vector embedding and semantic search for large document repositories is the first step in Retrieval Augmented Generation (RAG) pipelines. Tools like LangChain and LlamaIndex provide a good introduction to RAG. However, enterprise-scale production implementations, where millions of documents have to be processed, require an architecture that can scale up, be managed easily through configuration, and be monitored in real time with detailed logs.
AIWhispr is a no/low-code tool to automate vector embedding pipelines for semantic search. A simple configuration drives the pipeline for reading files, extracting text, creating vector embeddings and storing them in a vector database. It supports multiprocessing; a simple configuration directs how the workload is distributed across multiple processes, which is useful for scaling up when you have to process millions of documents. The tool provides logs for each step in the pipeline; you can control the verbosity and explainability of the logs by choosing the log level.
You can read about AIWhispr's design approach at https://www.aiwhispr.com/post/high-level-overview-of-aiwhispr-design
Steps to install and configure AIWhispr can be found at https://github.com/prasaar/aiwhispr/blob/main/README.md
This blog explains how you can configure and activate the AIWhispr API services for semantic indexing and search.
Index Service
The indexing service is an API service that accepts a JSON payload containing a text chunk, creates a vector embedding for that chunk, and stores the embedding in the vector database. The indexing service is started using the command below:
python3 $AIWHISPR_HOME/python/flask-app/indexingService.py \
-H 127.0.0.1 -P 10001 -C <PATH_TO_CONFIG_FILE>
The index service is now ready to receive JSON payloads on the local IP address at port 10001.
The configuration file is generally maintained under $AIWHISPR_HOME/config/index-service/
A typical configuration file for MongoDB as the vector database and SBERT all-mpnet-base-v2 for vector embeddings:
[content-site]
sitename=<unique_name.myconfig>
srctype=index-service
srcpath=
displaypath=aiwhisprStreamlit
contentSiteModule=indexServiceContentSite
[content-site-auth]
authtype=index-service-key
access-key-id=<password_to_access_index_service>
[vectordb]
vectorDbModule=<mongodbVectorDb>
connection-string=mongodb+srv://mongodbUser:myPassword@cluster0.mongodb.mongodb.net/
dbname=<dbname>
collection-name=<collection_name>
vector-index=<atlas_search_vector_index_created_using_dashboard>
text-index=<atlas_search_text_index>
vector-dim=768
[llm-service]
llmServiceModule=libSbertLlmService
model-family=sbert
model-name=all-mpnet-base-v2
llm-service-api-key=
chunk-size=307
[local]
working-dir=/tmp
index-dir=/tmp
indexing-processes=1
Configure sitename to a unique name that describes the API service, for example:
sitename=my_first_api_service
Configure access-key-id to an authentication key value that must be included in every JSON payload sent to the indexing service, for example:
access-key-id=myAuthKeyFor1ndex!ng
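Since the service is driven entirely by this file, a quick pre-flight check can catch an empty or missing key before the service is started. The sketch below is not part of AIWhispr; it only assumes the file is plain INI as shown above and reads it with Python's standard configparser. The names REQUIRED_KEYS and check_config are hypothetical helpers chosen for illustration, and the list of keys treated as required is an assumption based on the sample configuration.

```python
import configparser

# Assumption: these are the keys this blog sets explicitly; AIWhispr itself
# may require more or fewer keys than this illustrative list.
REQUIRED_KEYS = {
    "content-site": ["sitename", "srctype"],
    "content-site-auth": ["authtype", "access-key-id"],
    "vectordb": ["vector-dim"],
    "llm-service": ["model-name", "chunk-size"],
}

SAMPLE_CONFIG = """
[content-site]
sitename=my_first_api_service
srctype=index-service
[content-site-auth]
authtype=index-service-key
access-key-id=myAuthKeyFor1ndex!ng
[vectordb]
vector-dim=768
[llm-service]
model-name=all-mpnet-base-v2
chunk-size=307
"""


def check_config(config_text):
    """Return a list of human-readable problems found in the config text."""
    # interpolation=None keeps '%' characters (common in passwords and
    # connection strings) from being treated as configparser placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            if not parser.get(section, key, fallback="").strip():
                problems.append(f"[{section}] {key} is empty or missing")
    return problems


if __name__ == "__main__":
    print(check_config(SAMPLE_CONFIG) or "configuration looks complete")
```

To check a real file, read its contents and pass the text to check_config before starting the service.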
A JSON payload example:
{
"id": "UUID",
"content_path": "https://site.domain.com/mycontent1",
"tags": "TAG1 TAG2",
"text_chunk": "This is the text section for which a vector embedding will be created",
"text_chunk_no": 1,
"access_key_id": "myAuthKeyFor1ndex!ng"
}
The "id" field should be a unique UUID (16 bytes, written as a 36-character string with hyphens), which will be stored in the vector database along with the vector embedding.
The field "content_path" is used to specify the location path of the document in the source system from which the text has been extracted.
The field "text_chunk" should be UTF-8 text for which the vector embedding should be created. The number of words should be less than the sequence length (chunk-size) of the LLM service specified in the configuration.
The field "text_chunk_no" is the sequence number of the text chunk. This is used when you extract text from a large document. Example the sequence length
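The field descriptions above can be put together in a small helper that splits a long text into word-bounded chunks and builds one payload per chunk. This is a minimal sketch, not AIWhispr code: build_payloads is a hypothetical name, the limit of 300 words is an assumption chosen to stay below the chunk-size of 307 in the sample configuration, and generating a fresh UUID for every chunk is also an assumption about how ids should be assigned.

```python
import uuid


def build_payloads(text, content_path, tags, access_key_id, max_words=300):
    """Split text on whitespace and build one index-service payload per chunk.

    max_words is an illustrative limit; keep it below the chunk-size
    configured for the LLM service.
    """
    words = text.split()
    payloads = []
    for start in range(0, len(words), max_words):
        chunk = " ".join(words[start:start + max_words])
        payloads.append({
            "id": str(uuid.uuid4()),  # assumption: one new UUID per chunk
            "content_path": content_path,
            "tags": tags,
            "text_chunk": chunk,
            "text_chunk_no": len(payloads) + 1,  # sequence numbers start at 1
            "access_key_id": access_key_id,
        })
    return payloads


if __name__ == "__main__":
    sample_text = "word " * 650
    for payload in build_payloads(
        sample_text,
        "https://site.domain.com/mycontent1",
        "TAG1 TAG2",
        "myAuthKeyFor1ndex!ng",
    ):
        print(payload["text_chunk_no"], len(payload["text_chunk"].split()))
```

Splitting on whitespace is the simplest way to respect a word limit; a production pipeline might prefer to break on sentence boundaries so that each chunk stays readable.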
The directory $AIWHISPR_HOME/examples/index-service contains a Python-based example to test the index service. The file nike_data.json contains 112 JSON records, which are loaded by running the following commands:
cd $AIWHISPR_HOME/examples/index-service
./load_json.py
The Python script load_json.py:
import json
import requests

# The API endpoint of the index service
url_post = "http://127.0.0.1:10001/index"

# Load the JSON records to be indexed
with open("./nike_data.json") as f_in:
    data_in = json.load(f_in)

for new_data in data_in:
    # Send a POST request to the API for each record
    post_response = requests.post(url_post, json=new_data)
    # Print the response
    print(post_response)
A sample JSON record in the file nike_data.json:
{"id": "4b8de2f1-af57-5e5d-a208-371a2dfaa3ff",
"content_path": "https://www.nike.com/t/knicks-icon-edition-2020-nba-swingman-jersey-gxdwkz",
"tags": "Nike NBA Swingman Jersey",
"text_chunk": "THE HEART OF YOUR TEAM'S IDENTITY.The Icon Edition jersey represents the team's true colours, reflected in a distinct, instantly recognisable design. Directly inspired by what the pros wear, this New York Knicks Icon Edition Nike NBA Swingman Jersey is made from premium double-knit fabric, with classic jersey construction and a fit that looks good from all angles.The Only Jersey With Built-In DropsYour Nike NBA Connected Jersey gives you next-level access to athletes, exclusive offers and the game you love. Download the NikeConnect app, then tap your smartphone to the tag at the bottom of your jersey to get started.Made From Sustainable MaterialsEach Nike NBA Connected Jersey is made from 100% recycled polyester. The premium material comes from plastic bottles that Nike has diverted from landfills since 2012\u2014nearly 5 billion and counting.Product DetailsConnect to the game for the 2019\u20132020 season with your Nike NBA jersey.Machine wash100% recycled polyesterImportedHeat-applied graphicsProduct Details\nHeat-applied twill name and number\nAuthentic logos and colours\nFabric: 100% recycled polyester\nMachine wash\nImportedColour Shown: Rush BlueStyle: 864495-495Country/Region of Origin: Thailand,Guatemala\n\n",
"text_chunk_no": 1,
"access_key_id": "xyz"}
The access_key_id value is set to "xyz"; edit it to the value you have chosen for access-key-id in the configuration file for the index service.
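Rather than editing every record in the file by hand, the key can be replaced in memory just before the records are sent. The sketch below is one possible way to do this and is not part of the AIWhispr examples: with_access_key and post_all are hypothetical helpers, and treating HTTP status 200 as success is an assumption about how the index service responds.

```python
def with_access_key(records, access_key_id):
    """Return copies of the records with access_key_id replaced."""
    return [{**record, "access_key_id": access_key_id} for record in records]


def post_all(records, post):
    """Send every record with post(record) and collect the ids that failed.

    post is any callable returning an object with a status_code attribute,
    for example a small wrapper around requests.post.
    Assumption: the index service answers a successful request with 200.
    """
    failed_ids = []
    for record in records:
        response = post(record)
        if response.status_code != 200:
            failed_ids.append(record["id"])
    return failed_ids


# Usage with the names from load_json.py (requires the requests package
# and a running index service):
#
#   records = with_access_key(data_in, "myAuthKeyFor1ndex!ng")
#   failed = post_all(records, lambda r: requests.post(url_post, json=r))
#   print(f"{len(records) - len(failed)} indexed, {len(failed)} failed")
```

Passing the posting function in as an argument keeps the helpers free of network code, so they can be exercised with a stand-in before pointing them at the live service.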
The python script load_json.py is loading 112 such distinct records which contain description of products from nike.com by passing them as input to the index service running at
Search Service
The search service is an API service that takes a text query and returns semantic search based results. We can start the search service using the command below:
python3 $AIWHISPR_HOME/python/flask-app/searchService.py \
-H 127.0.0.1 -P 10002 -C <PATH_TO_CONFIG_FILE>
The configuration file is the same configuration file that was used for the index service. The search service is now ready to receive input search queries on local IP address at port 10002.
The search query "Jackets which are great for chilly weather" can be run as shown below; the results are returned in JSON format and contain both semantic and text (keyword) search results:
curl -sS 'http://127.0.0.1:10002/aiwhispr?query=Jackets%20which%20are%20great%20for%20chilly%20weather&resultformat=json&withtextsearch=Y'
The top semantic search result is
"text_chunk": "FUZZY FIT FOR COLD TEMPS.The Jordan Jacket is made with plush sherpa fleece to keep you looking and feeling comfortable in cold temps. DNA details create instant hoops style.BenefitsFuzzy sherpa fleece feels warm and cozy inside and outside.The high collar helps block out the wind.Full-length zipper lets you adjust for a comfy fit.Product DetailsLoose fit for a roomy feel100% polyesterMachine washImportedShown: BlackStyle: 95A724-023",
The top text keyword search result is
"WATER-REPELLENT COVERAGE GETS PSG DETAILS.The Paris Saint-Germain Repel Academy AWF Jacket is a lightweight, water-repellent layer that's perfect for days when you want to play through the rain, but breathable where it matters when things heat up. This product is made with at least 75% recycled polyester fibers.BenefitsLightweight water-repellent fabric helps you stay dry in light rain.Zippered side pockets secure your essentials.High-heat zones are perforated for ventilation while you play.Product DetailsStandard fit for a relaxed, easy feelBody: 100% polyesterMachine washImportedShown: Dark Grey/Black/Siren Red/Siren RedStyle: DB8165-025",
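Hand-written percent-encoding in a query string is easy to get wrong, so it can help to let Python build the search URL. The sketch below uses only the standard library; build_search_url is a hypothetical helper, and the endpoint path and parameter names are taken from the curl request shown earlier.

```python
from urllib.parse import quote, urlencode


def build_search_url(query, host="127.0.0.1", port=10002):
    """Build the search service URL with a correctly percent-encoded query."""
    params = {
        "query": query,
        "resultformat": "json",
        "withtextsearch": "Y",
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return f"http://{host}:{port}/aiwhispr?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(build_search_url("Jackets which are great for chilly weather"))
```

The resulting string can be passed to curl or to any HTTP client to run the same search.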