
This blog shows how to set up a vector embedding pipeline using AIWhispr, the SBert sentence transformer model, and the Qdrant vector database. This will enable you to run semantic search over 2000+ BBC news stories whose vector embeddings are created with the SBert all-mpnet-base-v2 model; the vector embeddings and extracted text are stored in the Qdrant vector database.
Installing Qdrant on macOS
Use Docker to install the Qdrant vector database on macOS. If you don't have Docker installed yet, follow the instructions at https://docs.docker.com/desktop/install/mac-install/ to install Docker Desktop. Then run the following command for a Docker-based installation of Qdrant:
docker pull qdrant/qdrant
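If the pull fails, first confirm that the Docker daemon is actually running; docker info reports an error when it cannot reach the daemon:
docker info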
Create a separate directory to store the vector database files:
mkdir -p $HOME/storage/vectordb/qdrant
Set up API-based authentication for the Qdrant installation by specifying an API key in a configuration file:
echo "service:" > $HOME/storage/vectordb/qdrant_custom_config.yaml
echo " api_key: api_key" >> $HOME/storage/vectordb/qdrant_custom_config.yaml
Your api_key value (the literal api_key above is just a placeholder) should preferably be at least 9 characters long, ideally a random string.
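One way to generate a random key, using the openssl tool that ships with macOS, is:
openssl rand -hex 16
You can then verify the configuration file contents with:
cat $HOME/storage/vectordb/qdrant_custom_config.yaml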
Run the Qdrant service
docker run -d -p 6333:6333 \
-v $HOME/storage/vectordb/qdrant_custom_config.yaml:/qdrant/config/custom_config.yaml \
-v $HOME/storage/vectordb/qdrant:/qdrant/storage \
qdrant/qdrant \
./qdrant --config-path config/custom_config.yaml
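To confirm the service is up and accepts your key, you can query the Qdrant REST API (shown here with the placeholder api_key from the configuration above):
curl -H "api-key: api_key" http://localhost:6333/collections
A fresh instance should return a JSON response with an empty list of collections.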
Set up AIWhispr
Pull the latest version of AIWhispr from https://github.com/prasaar/aiwhispr.
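A minimal way to do this, assuming you have git installed, is to clone the repository:
git clone https://github.com/prasaar/aiwhispr.git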
Set up environment variables
The AIWHISPR_HOME environment variable should be set to the full path of the aiwhispr directory.
The AIWHISPR_LOG_LEVEL environment variable can be set to DEBUG / INFO / WARNING / ERROR.
AIWHISPR_HOME=/<installpath>/aiwhispr
AIWHISPR_LOG_LEVEL=DEBUG
export AIWHISPR_HOME
export AIWHISPR_LOG_LEVEL
Remember to add these environment variables to your shell login script so they persist across sessions.
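For example, on macOS, where zsh is the default login shell, you could append the variables to ~/.zshrc (replace <installpath> with your actual install path):
echo 'export AIWHISPR_HOME=/<installpath>/aiwhispr' >> ~/.zshrc
echo 'export AIWHISPR_LOG_LEVEL=DEBUG' >> ~/.zshrc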
Install Python Packages
Run the following command:
$AIWHISPR_HOME/shell/install_python_packages.sh
If the uwsgi install fails, ensure you have gcc, python-dev, and python3-dev installed; the following commands are for Debian/Ubuntu:
sudo apt-get install gcc
sudo apt install python-dev
sudo apt install python3-dev
pip3 install uwsgi
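On macOS a rough equivalent, offered here as an assumption rather than a step from the original setup, is to install the Xcode command line tools for the compiler toolchain and then retry:
xcode-select --install
pip3 install uwsgi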
Configure vector embedding pipeline
AIWhispr comes with a streamlit app to help you get started.
Run the streamlit app:
cd $AIWHISPR_HOME/python/streamlit
streamlit run ./Configure_Content_Site.py &
By default Streamlit serves the app at http://localhost:8501.
There are 3 steps to configure the pipeline that indexes your content for semantic search:
1. Configure Content Sites: provide details of the storage location that hosts your content (files).
2. Configure Vector DB: provide connection details of the vector database in which the vector embeddings of your content will be stored.
3. Configure LLM Service: provide the large language model details (SBert/OpenAI) that will be used to encode your content into vector embeddings.
You should see the first step of your configuration. Specify a unique name for this vector embedding pipeline configuration.

Fig: Start of configuration of vector embedding pipeline
Click on the button "Use This Content Site Config" and proceed to the next step to configure vector database connection by clicking on "Configure Vector Db" in left sidebar.
Now select the Qdrant vector database and specify the Qdrant API key you configured earlier.

Fig: Configure vector database
Click on the button "Use This Vector Db Config" and then move to the next step by clicking on "Configure LLM Service" in the left sidebar.
Select the SBert model; for the SBert model family, the default model used is all-mpnet-base-v2.

Fig: Configure SBert LLM Service for vector embeddings
Click on the button "Use This LLM Service Config" to create the final version of your vector embedding pipeline configuration file.
The contents of the configuration file and its location on your machine will be displayed.
You can test this configuration by clicking on "Test Config File" in the left sidebar.
Test the vector embedding pipeline configuration
You should now see a message that shows the location of your vector embedding pipeline configuration file, and a "Test Configfile" button.

Fig: Test configuration
Clicking on the button starts a process that tests the pipeline configuration for:
connecting to the storage location
connecting to the vector database
encoding a sample query using the LLM Service

You should see "NO ERRORS" message at the end of the logs which informs you that this pipeline configuration can be used.
Fig: snapshot of end of testing of config file
Run the configured vector embedding pipeline
Click on "Run Indexing process" in the left sidebar to start the pipeline.
You should see a "Start Indexing" button on the main section

Fig: Start Indexing
Click on this button to start the pipeline.

The logs are updated every 15 seconds. The default example indexes 2000+ BBC news stories, which takes approximately 20 minutes on an Apple M1 (8 GB RAM).
Don't navigate away from this page while the indexing process is running, i.e. while the Streamlit "Running" status is displayed in the top right corner.
Fig: snapshot of indexing process logs.
You can also check whether the indexing process is running on your machine:
ps -ef | grep python3 | grep index_content_site.py
At the end of indexing you should see

Fig: Success message at end of indexing process
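As an optional check, you can also inspect the collection directly in Qdrant. The collection name below is a placeholder; use the name shown in your pipeline configuration file:
curl -H "api-key: api_key" http://localhost:6333/collections/<collection-name>
The JSON response includes a points_count field that should reflect the indexed news stories.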
Semantic Search
You can now run semantic search queries.
A semantic plot showing the cosine distance, and a top-3 PCA analysis, for the search results is displayed along with the text search results.

Fig : Semantic Plot of the search results

You can also navigate the PCA Map for a selected search result by clicking on the search result.

Fig: PCA Map for unique words in a search result document
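Beyond the app's built-in search page, you can also query the stored embeddings directly from Python. The following is a minimal sketch, not part of AIWhispr itself: the collection name "bbc" is a placeholder for whatever collection your pipeline created, and the api_key value must match your Qdrant configuration.

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Load the same model the pipeline used to create the embeddings
model = SentenceTransformer("all-mpnet-base-v2")

# Connect to the local Qdrant instance started earlier
client = QdrantClient(url="http://localhost:6333", api_key="api_key")

# Encode the query text and fetch the 3 nearest stories
query_vector = model.encode("economic impact of rising fuel prices").tolist()
hits = client.search(collection_name="bbc", query_vector=query_vector, limit=3)

for hit in hits:
    print(hit.score, hit.payload)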