
Setting up a vector embedding pipeline with AIWhispr, SBert and Qdrant on macOS.

Updated: Oct 18, 2023







This blog shows how to set up a vector embedding pipeline using AIWhispr, the SBert sentence transformer model, and the Qdrant vector database. This will enable semantic search over 2000+ news stories from the BBC. The vector embeddings are created using the SBert all-mpnet-base-v2 model, and the embeddings and extracted text are stored in the Qdrant vector database.


Installing Qdrant on macOS


Use Docker to install the Qdrant vector database on macOS. If you don't have Docker installed yet, follow the instructions at https://docs.docker.com/desktop/install/mac-install/ to install Docker Desktop. Then pull the Qdrant image:

docker pull qdrant/qdrant

Create a separate directory to store the vector database files:

mkdir -p $HOME/storage/vectordb/qdrant

Set up API-based authentication for the Qdrant installation by specifying an API key in a configuration file:


echo "service:" > $HOME/storage/vectordb/qdrant_custom_config.yaml
echo "   api_key: api_key" >> $HOME/storage/vectordb/qdrant_custom_config.yaml

Your API key should be at least 9 characters long, preferably a random string.
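A quick way to generate such a key is Python's `secrets` module (shown here as a sketch; any random-string generator works just as well):

```python
import secrets

# 32 hex characters (~128 bits of entropy) comfortably exceeds
# the 9-character minimum suggested above.
api_key = secrets.token_hex(16)
print(api_key)
```

Paste the printed value in place of the `api_key` placeholder in qdrant_custom_config.yaml.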


Run the Qdrant service:



docker run -d -p 6333:6333   \
-v $HOME/storage/vectordb/qdrant_custom_config.yaml:/qdrant/config/custom_config.yaml \
-v $HOME/storage/vectordb/qdrant:/qdrant/storage \
qdrant/qdrant \
 ./qdrant --config-path config/custom_config.yaml
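Once the container is up, you can sanity-check the service from Python using only the standard library. This is a sketch: it assumes the default port 6333 and the key you put in the config file, and relies on Qdrant reading the API key from the `api-key` HTTP header.

```python
import urllib.request

def qdrant_health_request(url="http://localhost:6333/collections",
                          api_key="api_key"):
    # Build an authenticated request; Qdrant expects the key
    # in the "api-key" header.
    return urllib.request.Request(url, headers={"api-key": api_key})

def check_qdrant(url="http://localhost:6333/collections",
                 api_key="api_key", timeout=5):
    # Returns True if Qdrant answers with HTTP 200.
    with urllib.request.urlopen(qdrant_health_request(url, api_key),
                                timeout=timeout) as resp:
        return resp.status == 200
```

With the container running, `check_qdrant(api_key="your-key")` should return True.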



Set up AIWhispr


Pull the latest version of AIWhispr from https://github.com/prasaar/aiwhispr.



Set the AIWHISPR_HOME environment variable to the full path of the aiwhispr directory.

The AIWHISPR_LOG_LEVEL environment variable can be set to DEBUG / INFO / WARNING / ERROR.




AIWHISPR_HOME=/<installpath>/aiwhispr 
AIWHISPR_LOG_LEVEL=DEBUG 
export AIWHISPR_HOME 
export AIWHISPR_LOG_LEVEL

Remember to add these environment variables to your shell login script (e.g. ~/.zshrc).
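A small check that the variables are visible to Python can look like this (a sketch; the INFO fallback here is an assumption for illustration, not AIWhispr's documented default):

```python
import os

def aiwhispr_env():
    # Fail fast if AIWHISPR_HOME is missing or not a real directory.
    home = os.environ.get("AIWHISPR_HOME")
    if not home or not os.path.isdir(home):
        raise RuntimeError("AIWHISPR_HOME is unset or not a directory")
    # Fall back to INFO when no log level is exported (assumed default).
    level = os.environ.get("AIWHISPR_LOG_LEVEL", "INFO")
    return home, level
```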


Install Python Packages

Run the following command:


$AIWHISPR_HOME/shell/install_python_packages.sh

If the uwsgi install fails, make sure you have a C compiler and the Python development headers installed. On macOS, install the Xcode Command Line Tools (xcode-select --install). On Debian/Ubuntu Linux:


sudo apt-get install gcc
sudo apt-get install python-dev python3-dev
pip3 install uwsgi



Configure vector embedding pipeline

AIWhispr comes with a Streamlit app to help you get started.

Run the Streamlit app:



cd $AIWHISPR_HOME/python/streamlit
streamlit run ./Configure_Content_Site.py &

There are three steps to configure the pipeline for indexing your content for semantic search:

  • Configure Content Sites: Provide details of the storage location that hosts your content (files).

  • Configure Vector DB: Provide connection details of the vector database in which the vector embeddings of your content will be stored.

  • Configure LLM Service: Provide the large language model details (SBert/OpenAI) that will be used to encode your content into vector embeddings.

You should see the first step of your configuration. Specify a unique name for this vector embedding pipeline configuration.

Fig: Start of configuration of vector embedding pipeline


Click the "Use This Content Site Config" button, then proceed to configure the vector database connection by clicking "Configure Vector Db" in the left sidebar.


Now select the Qdrant vector database and specify the Qdrant API key you configured earlier.

Fig: Configure vector database


Click the "Use This Vector Db Config" button, then move to the next step by clicking "Configure LLM Service" in the left sidebar.


Select the SBert model family; the default model is all-mpnet-base-v2.

Fig: Configure SBert LLM Service for vector embeddings


Click on the button "Use This LLM Service Config" to create the final version of your vector embedding pipeline configuration file.

The contents of the configuration file and its location on your machine will be displayed.

You can test this configuration by clicking on "Test Config File" in the left sidebar.




Test the vector embedding pipeline configuration


You should now see a message showing the location of your vector embedding pipeline configuration file, along with a "Test Configfile" button.


Fig: Test configuration


Clicking the button starts a process that tests the pipeline configuration by:

  • connecting to the storage location

  • connecting to the vector database

  • encoding a sample query using the LLM Service


You should see a "NO ERRORS" message at the end of the logs, which confirms that this pipeline configuration can be used.




Fig: snapshot of end of testing of config file




Run the configured vector embedding pipeline


Click on "Run Indexing process" in the left sidebar to start the pipeline.


You should see a "Start Indexing" button in the main section.


Fig: Start Indexing


Click on this button to start the pipeline.


The logs are updated every 15 seconds. The default example indexes 2000+ BBC news stories, which takes approximately 20 minutes on an Apple M1 (8 GB RAM).

Don't navigate away from this page while the indexing process is running, i.e. while the Streamlit "Running" status is displayed at the top right.



Fig: snapshot of indexing process logs.


You can also check whether the indexing process is running on your machine:

ps -ef | grep python3 | grep index_content_site.py 

At the end of indexing you should see:


Fig: Success message at end of indexing process


Semantic Search


You can now run semantic search queries.

Alongside the text search results, a semantic plot displays the cosine distance of each result, together with a PCA analysis over the top three principal components.
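To make those plots concrete, here is the math behind them sketched in a few lines of NumPy (an illustration of cosine distance and a 3-component PCA projection, not AIWhispr's actual plotting code):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction, 2 means opposite.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pca_project(X, n_components=3):
    # Project rows of X (one embedding per row) onto the top
    # principal components, computed via SVD of the centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Each search result's embedding is compared to the query embedding with `cosine_distance`, and `pca_project` reduces the high-dimensional embeddings to three coordinates for plotting.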



 

Fig : Semantic Plot of the search results


Fig: PCA MAP for search result documents

 

You can also navigate the PCA Map for a selected search result by clicking on the search result.



Fig: PCA Map for unique words in a search result document



