AIWhispr is a tool that enables AI-powered semantic search on documents.
It is easy to install.
Simple to configure.
Can handle multiple file formats (txt, csv, pdf, docx, pptx) stored on AWS S3, Azure Blob Containers, or a local directory path.
Delivers fast semantic responses to search queries.
Enables you to integrate with LLMs and vector databases of your choice; you can write your own custom modules or leverage the inbuilt modules in AIWhispr.
Adding support for a vector database takes two simple steps:
Add a module for the vector database. To support Qdrant, we have released the module qdrantVectorDb.py.
Specify this module name in the [vectordb] section of the configuration file.
Prerequisites for a Linux install with Qdrant as vector database
Environment variables
The AIWHISPR_HOME environment variable should be the full path to the aiwhispr directory.
The AIWHISPR_LOG_LEVEL environment variable can be set to DEBUG / INFO / WARNING / ERROR.
AIWHISPR_HOME=/<...>/aiwhispr
AIWHISPR_LOG_LEVEL=DEBUG
export AIWHISPR_HOME
export AIWHISPR_LOG_LEVEL
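Before moving on, it can help to confirm that both variables are set as described. The following is a minimal sketch, not part of AIWhispr: the function name check_aiwhispr_env is hypothetical, and it only checks the variable names used in the export commands above and the four log levels listed.

```python
import os

# Log levels listed in this guide.
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def check_aiwhispr_env(env):
    """Return a list of problems found in env; an empty list means it looks usable.

    Hypothetical helper for illustration; AIWhispr does not ship this function.
    """
    problems = []

    home = env.get("AIWHISPR_HOME", "")
    if not home:
        problems.append("AIWHISPR_HOME is not set")
    elif not os.path.isabs(home):
        problems.append("AIWHISPR_HOME should be a full (absolute) path")
    elif not os.path.isdir(home):
        problems.append("AIWHISPR_HOME does not point to a directory: " + home)

    level = env.get("AIWHISPR_LOG_LEVEL", "")
    if level not in VALID_LOG_LEVELS:
        problems.append(
            "AIWHISPR_LOG_LEVEL should be one of " + ", ".join(sorted(VALID_LOG_LEVELS))
        )

    return problems


if __name__ == "__main__":
    found = check_aiwhispr_env(os.environ)
    for problem in found:
        print("problem:", problem)
    if not found:
        print("environment looks fine")
```

Run it in the same shell where you exported the variables; variables that are set but not exported will not be visible to the Python process.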
Download Qdrant and run the service
docker pull qdrant/qdrant
Run the service
docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
Qdrant should be accessible at localhost:6333
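One way to confirm that the container is answering is to send it an HTTP request. The sketch below uses only the Python standard library; probe_http is a hypothetical helper name, and it assumes that Qdrant responds to plain HTTP on the root path of port 6333. Any HTTP response, even an error status, is treated as "reachable".

```python
import urllib.error
import urllib.request


def probe_http(url, timeout=2.0):
    """Return the HTTP status code from url, or None if no server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A server answered, but with an error status (for example 404).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, name resolution failure, or timeout.
        return None


if __name__ == "__main__":
    # Assumption: Qdrant is published on localhost:6333 as in the docker run command.
    status = probe_http("http://localhost:6333/")
    if status is None:
        print("Qdrant is not reachable on localhost:6333")
    else:
        print("Qdrant answered with HTTP status", status)
```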
Python packages
Install the Python packages for AIWhispr
$AIWHISPR_HOME/shell/install_python_packages.sh
Your first setup
1. Configuration file
Configuration files are maintained under $AIWHISPR_HOME/config/content-site/sites-available
We will use the example under examples/http to create a config file that indexes over 2000 files containing BBC news content.
To create the config file run the following commands.
When prompted, you can enter "N" to keep the default values.
cd $AIWHISPR_HOME/examples/http;
./configure_example_filepath_qdrant.sh
The script displays the config file it has created.
#### CONFIG FILE ####
[content-site]
sitename=example_bbc.filepath.qdrant
srctype=filepath
srcpath=/<aiwhispr_home>/aiwhispr/examples/http/bbc
displaypath=http://127.0.0.1:9000/bbc
contentSiteModule=filepathContentSite
[content-site-auth]
authtype=filechecks
check-file-permission=Y
[vectordb]
vectorDbModule=qdrantVectorDb
api-address= localhost
api-port= 6333
api-key=
[local]
working-dir=/<aiwhispr_home>/aiwhispr/examples/http/working-dir
index-dir=/<aiwhispr_home>/aiwhispr/examples/http/working-dir
[llm-service]
model-family=sbert
model-name=all-mpnet-base-v2
llm-service-api-key=
llmServiceModule=libSbertLlmService
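The file uses an INI-style layout of [sections] and key=value pairs, so it can be inspected with Python's standard configparser. The sketch below is for illustration only: SAMPLE_CONFIG is an abbreviated copy of the config shown above, read_vectordb_settings is a hypothetical helper, and AIWhispr's own config loader may behave differently.

```python
import configparser

# Abbreviated copy of the generated config, used here only as sample input.
SAMPLE_CONFIG = """
[content-site]
sitename=example_bbc.filepath.qdrant
srctype=filepath
[vectordb]
vectorDbModule=qdrantVectorDb
api-address= localhost
api-port= 6333
api-key=
[llm-service]
model-family=sbert
model-name=all-mpnet-base-v2
"""


def read_vectordb_settings(text):
    """Parse INI text and return the [vectordb] settings as a dict."""
    # interpolation=None keeps any literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["vectordb"]
    return {
        "module": section["vectorDbModule"],   # key lookup is case-insensitive
        "address": section["api-address"],     # surrounding spaces are stripped
        "port": section.getint("api-port"),
        "api_key": section.get("api-key", ""),
    }


if __name__ == "__main__":
    print(read_vectordb_settings(SAMPLE_CONFIG))
```

To inspect the real file, read its contents from the path under sites-available and pass the text to the same function.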
Check that the config file has been created.
ls $AIWHISPR_HOME/config/content-site/sites-available/example_bbc.filepath.qdrant.cfg
2. Start Indexing
Confirm that the environment variables AIWHISPR_HOME and AIWHISPR_LOG_LEVEL are set and exported, then index the file content for semantic search. This will take some time because over 2000 files have to be processed. With the log level set to DEBUG, the output will be verbose.
$AIWHISPR_HOME/shell/start-indexing-content-site.sh \
-C $AIWHISPR_HOME/config/content-site/sites-available/example_bbc.filepath.qdrant.cfg
This dataset has been sourced from https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
3. Start the AIWhispr search service
This shell script starts three services:
the AIWhispr searchService (port 5002), which interfaces with the vectordb
a Flask Python script (port 9001) that takes in the user query, sends it to the AIWhispr searchService, and formats the results for HTML display
a Python http.server (port 9000)
The log files for these three processes are created in /tmp/
cd $AIWHISPR_HOME/examples/http;
$AIWHISPR_HOME/examples/http/start_search_filepath_qdrant.sh
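Once the script has been started, a quick check is to see whether the three ports listed above accept TCP connections. This is a minimal sketch using the standard socket module; the labels in SERVICES and the helper name port_is_open are illustrative, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports as listed in this guide; the labels are only for display.
SERVICES = {
    "AIWhispr searchService": 5002,
    "Flask query script": 9001,
    "python http.server": 9000,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

If a port is reported as not reachable, the log files for the three processes are the next place to look.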
4. Ready for search
Try the search on http://127.0.0.1:9000
Try queries like
"What are the top TV moments in Olympics"
"Which is the best laptop to buy"
"How is inflation impacting the economy"