There are four ways to set up semantic search for your files stored in Azure containers. You can:
Mount the Azure container (Blob Storage) as a filesystem using FUSE and configure AIWhispr to read the files from a directory path.
Mount the Azure container (file share) as a filesystem using the cifs-utils package.
Configure AIWhispr to read the files using a Storage Account Key.
Configure AIWhispr to read the files using an Azure SAS (Shared Access Signature) token.
Each approach has its own benefits.
In this post we cover how to index files in an Azure Blob Storage container that is mounted as a filesystem.
Indexing files on Azure for semantic search: FUSE mount
BlobFuse is a virtual file system driver for Azure Blob Storage allowing you to access your storage account through the Linux file system.
Install BlobFuse
sudo wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update
sudo apt-get install blobfuse
sudo apt install libfuse2
Use an SSD as a temporary path
In Azure, you may use the ephemeral disks (SSD) available on your VMs to provide a low-latency buffer for BlobFuse. Depending on the provisioning agent used, the ephemeral disk would be mounted on '/mnt' for cloud-init or '/mnt/resource' for waagent VMs.
Make sure your user has access to the temporary path:
sudo mkdir /mnt/resource/blobfusetmp -p
sudo chown <youruser> /mnt/resource/blobfusetmp
Authorize access to your storage account
You can authorize access to your storage account by using the account access key, a shared access signature, a managed identity, or a service principal. Authorization information can be provided on the command line, in a config file, or in environment variables.
In this example, suppose you are authorizing with the account access keys and storing them in a config file. The config file should have the following format:
accountName <myaccount>
accountKey <storageaccesskey>
containerName <mycontainer>
authType Key
The accountName is the name of your storage account, not the full URL. Replace <myaccount>, <storageaccesskey>, and <mycontainer> with your storage details.
Create this file and restrict access to it, since it holds your account key:
sudo touch /path/to/fuse_connection.cfg
sudo chmod 600 /path/to/fuse_connection.cfg
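The commands above create an empty, owner-only file. As a minimal sketch, the four-line format can also be written in one step. The function name write_fuse_cfg and its argument order are illustrative assumptions, not part of BlobFuse, and the values in the example call are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BlobFuse): write a connection file in the
# four-line key-auth format described above, readable only by its owner.
write_fuse_cfg() {
    local cfg_path="$1" account="$2" key="$3" container="$4"
    (
        # umask 077 keeps a newly created file private from the start.
        umask 077
        cat > "$cfg_path" <<EOF
accountName ${account}
accountKey ${key}
containerName ${container}
authType Key
EOF
    )
    # Also tighten permissions in case the file already existed.
    chmod 600 "$cfg_path"
}

# Example call with placeholder values; replace them with your storage details.
write_fuse_cfg "${TMPDIR:-/tmp}/fuse_connection.cfg" myaccount storageaccesskey mycontainer
```

Typing the key on the command line leaves it in your shell history, so you may prefer to read it from an environment variable and pass that variable to the function instead.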
Create an empty directory for mounting the blob storage container
mkdir ~/mycontainer
Mount
If you use an ADLS account, you must include --use-adls=true.
sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp \
--config-file=/path/to/fuse_connection.cfg \
-o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o allow_other
To persist the mount across machine restarts, refer to the full BlobFuse documentation.
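Before moving on, it can be worth confirming that the mount is active, because indexing an unmounted directory would silently index nothing. The sketch below assumes mountpoint(1) from util-linux is installed and that the mount directory is ~/mycontainer as created earlier. The helper name check_mount is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a directory is an active mount point.
# Relies on mountpoint(1) from util-linux, which is assumed to be installed.
check_mount() {
    local dir="$1"
    if mountpoint -q "$dir"; then
        echo "mounted: $dir"
    else
        echo "not mounted: $dir" >&2
        return 1
    fi
}

if check_mount "$HOME/mycontainer"; then
    # Listing a few entries confirms the container contents are readable.
    ls "$HOME/mycontainer" | head -n 5
fi
```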
AIWhispr configuration
A configuration file for AIWhispr would look like this:
#### CONFIGFILE FILEPATH, QDRANT VECTORDB ####
[content-site]
sitename=example_azure.fuse.qdrant
srctype=filepath
srcpath=/path/to/mycontainer
displaypath=http://127.0.0.1:9000/mycontainer
contentSiteModule=filepathContentSite
[content-site-auth]
authtype=filechecks
check-file-permission=Y
[vectordb]
vectorDbModule=qdrantVectorDb
api-address=localhost
api-port=6333
api-key=
[local]
working-dir=/tmp
index-dir=/tmp
[llm-service]
model-family=sbert
model-name=all-mpnet-base-v2
llm-service-api-key=
llmServiceModule=libSbertLlmService
You can store this config as
$AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg
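Since srcpath must point at the mounted container, a quick pre-flight check on the stored config can catch a wrong path before indexing starts. This is a rough sketch: ini_get is a hypothetical helper (not part of AIWhispr) and it assumes simple key=value lines with no spaces before the equals sign.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of AIWhispr): print the value of KEY inside
# [SECTION] of an INI-style file. Assumes "key=value" with no space before "=".
ini_get() {
    local file="$1" section="$2" key="$3"
    awk -F'=' -v section="$section" -v key="$key" '
        /^\[/ { in_section = ($0 == "[" section "]"); next }
        in_section && $1 == key { sub(/^[^=]*=[ \t]*/, ""); print; exit }
    ' "$file"
}

cfg="${AIWHISPR_HOME:-.}/config/content-site/sites-available/example_azure.fuse.qdrant.cfg"
if [ -f "$cfg" ]; then
    srcpath="$(ini_get "$cfg" content-site srcpath)"
    if [ -d "$srcpath" ]; then
        echo "srcpath exists: $srcpath"
    else
        echo "srcpath is missing or not a directory: $srcpath" >&2
    fi
else
    echo "config not found: $cfg" >&2
fi
```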
Start Indexing
To start indexing, run the command:
$AIWHISPR_HOME/shell/start-indexing-content-site.sh \
-C $AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg
Create an index.html
cd ~
cp $AIWHISPR_HOME/examples/nginx/aiwhispr.html ~/index.html
Start the search services
Search Service API Server
($AIWHISPR_HOME/shell/start-search-service.sh -H 127.0.0.1 -P 5002 \
-C $AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg &> /tmp/example_azure.fuse.qdrant.searchservice.log & )
API Results to HTTP formatter
(python3 $AIWHISPR_HOME/examples/http/exampleHttpResponder.py &> /tmp/example_azure.fuse.qdrant.exampleHttpResponder.log &)
Web Server
cd ~
(python3 -m http.server 9000 &> /tmp/example_azure.fuse.qdrant.httpServer.log &)
Ready for semantic search
Try the search on http://127.0.0.1:9000