
Setting up semantic search for files in Azure Storage: Part 1 (mount BlobStorage)

  • Writer: Arun Prasad
  • Aug 22, 2023

Updated: Aug 30, 2023

There are four ways to set up semantic search for your files stored in Azure containers. You can

  1. Mount the Azure container (blob storage) as a file mount using FUSE and configure AIWhispr to read the files from a directory path.

  2. Mount the Azure container (file share) as a file mount using the cifs-utils package.

  3. Configure AIWhispr to read the files using a Storage Account Key.

  4. Configure AIWhispr to read the files using an Azure SAS (Shared Access Signature) token.

Each approach has its own benefits.


In this post we cover the first approach: indexing the files in an Azure Blob Storage container that is mounted as a filesystem.


Indexing files on Azure for semantic search: FUSE mount

BlobFuse is a virtual file system driver for Azure Blob Storage allowing you to access your storage account through the Linux file system.


Install BlobFuse


sudo wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update
sudo apt-get install blobfuse
sudo apt install libfuse2
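
Before moving on, it can help to confirm that the mount helper actually landed on the PATH. The following is a minimal sketch rather than part of the official install steps; `require_cmd` is a hypothetical helper name, and it only assumes a POSIX shell.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper (not part of BlobFuse or AIWhispr):
# it succeeds only if the named command can be found on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example: check the binary installed by the packages above.
# require_cmd blobfuse
```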


Use an SSD as a temporary path

In Azure, you may use the ephemeral disks (SSD) available on your VMs to provide a low-latency buffer for BlobFuse. Depending on the provisioning agent used, the ephemeral disk would be mounted on '/mnt' for cloud-init or '/mnt/resource' for waagent VMs.

Make sure your user has access to the temporary path:

sudo mkdir /mnt/resource/blobfusetmp -p
sudo chown <youruser> /mnt/resource/blobfusetmp
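
Because the ephemeral disk location depends on the provisioning agent, a script that runs on different VMs may want to detect it. This is a rough sketch under that assumption; `pick_ephemeral_base` is a hypothetical helper, and it only checks that a directory exists, not that it is really backed by an SSD.

```shell
#!/bin/sh
# pick_ephemeral_base is a hypothetical helper: print the first candidate
# directory that exists, or return non-zero if none of them do.
pick_ephemeral_base() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Example: try the waagent location first, then the cloud-init one.
# base=$(pick_ephemeral_base /mnt/resource /mnt) && echo "$base/blobfusetmp"
```

Listing /mnt/resource first matters, since /mnt exists on most Linux systems regardless of the agent.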

Authorize access to your storage account

You can authorize access to your storage account by using the account access key, a shared access signature, a managed identity, or a service principal. Authorization information can be provided on the command line, in a config file, or in environment variables.

In this example, suppose you are authorizing with the account access keys and storing them in a config file. The config file should have the following format:

accountName <myaccount>
accountKey <storageaccesskey>
containerName <mycontainer>
authType Key


The accountName is the name of your storage account, and not the full URL. You need to update myaccount, storageaccesskey, and mycontainer with your storage information.

Create this file using:


sudo touch /path/to/fuse_connection.cfg
sudo chmod 600 /path/to/fuse_connection.cfg
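
As an alternative to creating the file and editing it by hand, the four settings can be written in one step. This is only a sketch: `write_fuse_config` is a hypothetical helper, and the `AZ_ACCOUNT`, `AZ_KEY` and `AZ_CONTAINER` variables in the example are assumed names, not something BlobFuse defines. The file ends up owned by whoever runs the function, so run it as the user that will perform the mount.

```shell
#!/bin/sh
# write_fuse_config is a hypothetical helper: write the four settings shown
# above to a file that only its owner can read or write.
# Usage: write_fuse_config <file> <account> <key> <container>
write_fuse_config() (
    # The function body is a subshell, so the umask change does not leak out.
    umask 077
    printf 'accountName %s\naccountKey %s\ncontainerName %s\nauthType Key\n' \
        "$2" "$3" "$4" > "$1"
    # Tighten the mode as well, in case the file already existed.
    chmod 600 "$1"
)

# Example (variable names are assumptions):
# write_fuse_config /path/to/fuse_connection.cfg "$AZ_ACCOUNT" "$AZ_KEY" "$AZ_CONTAINER"
```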

Create an empty directory for mounting the blob storage container

mkdir ~/mycontainer


Mount

If you use an ADLS account, you must include --use-adls=true.


sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp \
  --config-file=/path/to/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 \
  -o negative_timeout=120 -o allow_other
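
After running the mount command, it is worth checking that the directory is really a mount point before pointing AIWhispr at it, since indexing an empty local directory would silently produce an empty index. A minimal sketch, assuming `mountpoint` from util-linux is available; `is_mounted` is a hypothetical wrapper name.

```shell
#!/bin/sh
# is_mounted is a hypothetical helper built on mountpoint(1) from util-linux.
# It returns 0 if the given directory is a mount point, non-zero otherwise.
is_mounted() {
    mountpoint -q "$1"
}

# Example: list a few blobs only if the container is mounted.
# is_mounted "$HOME/mycontainer" && ls "$HOME/mycontainer" | head
```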

To persist the mount across machine restarts, refer to the full BlobFuse documentation.

AIWhispr configuration

A configuration file for AIWhispr would look like this:


#### CONFIGFILE FILEPATH, QDRANT VECTORDB #### 
[content-site] 
sitename=example_azure.fuse.qdrant 
srctype=filepath 
srcpath=/path/to/mycontainer
displaypath=http://127.0.0.1:9000/mycontainer
contentSiteModule=filepathContentSite 
[content-site-auth] 
authtype=filechecks 
check-file-permission=Y
[vectordb]
vectorDbModule=qdrantVectorDb 
api-address=localhost
api-port=6333 
api-key=
[local] 
working-dir=/tmp
index-dir=/tmp
[llm-service] 
model-family=sbert 
model-name=all-mpnet-base-v2 
llm-service-api-key=
llmServiceModule=libSbertLlmService

You can store this config as

$AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg
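
A quick sanity check before indexing is to read `srcpath` back out of the saved config and confirm that it points at the mounted directory. The sketch below is an assumption-laden convenience, not an AIWhispr tool: `cfg_get` is a hypothetical helper that ignores section headers and simply returns the first matching `key=value` line.

```shell
#!/bin/sh
# cfg_get is a hypothetical helper: print the value of the first "key=value"
# line in an INI-style file, with surrounding whitespace trimmed.
# It does not look at [section] headers, so keys must be unique in the file.
# Usage: cfg_get <file> <key>
cfg_get() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" |
        sed 's/[[:space:]]*$//' | head -n 1
}

# Example: warn if srcpath is not an existing directory.
# cfg=$AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg
# src=$(cfg_get "$cfg" srcpath)
# [ -d "$src" ] || echo "srcpath '$src' is not a directory" >&2
```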


Start Indexing

To start indexing run the command

$AIWHISPR_HOME/shell/start-indexing-content-site.sh \
-C $AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg

Create an index.html


cd ~
cp $AIWHISPR_HOME/examples/nginx/aiwhispr.html ~/index.html

Start the search services


Search Service API Server

($AIWHISPR_HOME/shell/start-search-service.sh -H 127.0.0.1 -P 5002 \
-C $AIWHISPR_HOME/config/content-site/sites-available/example_azure.fuse.qdrant.cfg &> /tmp/example_azure.fuse.qdrant.searchservice.log & )

API Results to HTTP formatter

(python3 $AIWHISPR_HOME/examples/http/exampleHttpResponder.py &> /tmp/example_azure.fuse.qdrant.exampleHttpResponder.log &)

Web Server

cd ~
(python3 -m http.server 9000 &> /tmp/example_azure.fuse.qdrant.httpServer.log &);
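
Since all three services are started in the background, a failure only shows up in the log files. One way to check that the search API (port 5002) and the web server (port 9000) are accepting connections is sketched below. It assumes bash, because it relies on bash's `/dev/tcp` redirection; `port_open` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# port_open is a hypothetical helper: succeed if a TCP connection to
# host:port can be opened. It relies on bash's /dev/tcp redirection, so it
# will not work in shells such as dash.
port_open() {
    # The subshell opens the connection on fd 3 and closes it again on exit.
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example: the ports used by the commands above.
# port_open 127.0.0.1 5002 && port_open 127.0.0.1 9000 && echo "services are up"
```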

Ready for semantic search

Try the search on http://127.0.0.1:9000


