Code Walkthrough : Adding support for indexing files on Google Cloud Storage

Arun Prasad
Sep 1, 2023

In the spirit of open source, we want users and collaborators of AIWhispr to understand how to add their own custom modules. For AIWhispr, open source does not mean just putting the source code out there; it also means sharing the modular design approach and practical examples of how to add your own modules.

This example is a code walkthrough of the modules added to support reading files stored on Google Cloud Storage and indexing them for semantic search.

1. Edit: python/common_objects/aiwhisprBaseClasses.py

Support for Google storage authentication configurations was added in the siteAuth base class.

class siteAuth:
    ....
    def __init__(self, auth_type: str, **kwargs):
        ....
        match self.auth_type:
            ....
            case 'google-cred-key':
                .....
                self.google_cred_path = kwargs['google_cred_path']
                self.google_project_id = kwargs['google_project_id']
                self.google_storage_api_key = kwargs['google_storage_api_key']
            .....

This means that the [content-site-auth] section for Google Cloud Storage looks like this:

-------------------------------------------------------------------------------------------------------------------

....................

[content-site-auth]

authtype=google-cred-key

google-cred-path=</path/to/google/credential/file.json>

google-project-id=<project-id-to-which-storage-is-billed-google>

google-storage-api-key=<api-key-to-access-google-storage>

.................

-------------------------------------------------------------------------------------------------------------------

***Note that the configuration keys use a dash "-" separator while the Python variable names use an underscore "_".
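
One way to picture the dash-to-underscore mapping is a small helper that reads the [content-site-auth] section and converts each option name into a keyword argument. This is only a sketch: the helper name auth_kwargs_from_config and the sample values are hypothetical and are not part of AIWhispr.

```python
import configparser

# Sample configuration with made-up values, for illustration only.
SAMPLE_CONFIG = """
[content-site-auth]
authtype=google-cred-key
google-cred-path=/tmp/example-credentials.json
google-project-id=example-project
google-storage-api-key=example-api-key
"""


def auth_kwargs_from_config(config: configparser.ConfigParser) -> dict:
    """Hypothetical helper: turn dash-separated option names into
    underscore-separated keyword arguments."""
    section = config["content-site-auth"]
    return {
        option.replace("-", "_"): value
        for option, value in section.items()
        if option != "authtype"  # the auth type is passed separately
    }


config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)
kwargs = auth_kwargs_from_config(config)
print(kwargs)
```

The resulting dictionary has the same key names that the siteAuth constructor reads from its keyword arguments.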

We also edit the base class method srcContentSite.pickle_me() so that these new variables are pickled. The pickled objects are used by the spawned indexing processes.

class srcContentSite:
    ....
    def pickle_me(self):
        ....
        self.self_description['site_auth']['google_cred_path'] = self.site_auth.google_cred_path
        self.self_description['site_auth']['google_project_id'] = self.site_auth.google_project_id
        self.self_description['site_auth']['google_storage_api_key'] = self.site_auth.google_storage_api_key
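
To see why these fields need to be in self_description, here is a minimal sketch of the round trip: a plain dictionary is pickled by the parent and unpickled by a worker, which then has everything it needs to rebuild the authentication object. The dictionary layout mirrors the snippet above; the sample values and the use of an in-memory bytes object instead of a file are assumptions for illustration.

```python
import pickle

# Stand-in for srcContentSite.self_description; all values are made up.
self_description = {
    "site_auth": {
        "auth_type": "google-cred-key",
        "google_cred_path": "/tmp/example-credentials.json",
        "google_project_id": "example-project",
        "google_storage_api_key": "example-api-key",
    }
}

# Parent process: serialize the description.
payload = pickle.dumps(self_description)

# Spawned indexing process: restore it and read the Google settings back.
restored = pickle.loads(payload)
auth = restored["site_auth"]
print(auth["auth_type"], auth["google_project_id"])
```

If a field is left out of self_description, the worker has no way to recover it, which is why each new authentication variable has to be added to pickle_me().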

2. Edit: python/common-functions/index_content_site_for_config.py

This Python module reads the config file. Support for the Google Cloud Storage access configuration is added here, and the values are then used to instantiate the siteAuth object. The source type is defined as "google-cloud" in the configuration.

-------------------------------------------------------------------------------------------------------------------

[content-site]

srctype=google-cloud

.................

-------------------------------------------------------------------------------------------------------------------


match src_type:
    ....
    case 'google-cloud':
        ....
        match auth_type:
            case 'google-cred-key':
                google_cred_path = config.get('content-site-auth', 'google-cred-path')
                google_project_id = config.get('content-site-auth', 'google-project-id')
                google_storage_api_key = config.get('content-site-auth', 'google-storage-api-key')
                logger.debug('google_cred_path : %s google_project_id : %s google_storage_api_key: %s', google_cred_path, google_project_id, google_storage_api_key)
                if len(google_cred_path) == 0 or len(google_project_id) == 0 or len(google_storage_api_key) == 0:
                    logger.error('Could not read google-cred-path, google-project-id, google-storage-api-key')
                else:
                    site_auth = siteAuth(auth_type=auth_type,
                                         google_cred_path=google_cred_path,
                                         google_project_id=google_project_id,
                                         google_storage_api_key=google_storage_api_key
                                         )
    ....

3. Add : python/common-objects/googleBlobDownloader.py

A class is added to handle downloading files from Google Cloud Storage.


# Imports the Google Cloud client library
from google.cloud import storage
from google.oauth2.service_account import Credentials


class googleBlobDownloader(object):

    # <Snippet_download_blob_file>
    def download_blob_to_file(self, storage_client: storage.Client, bucket_name:str, blob_flat_name:str, download_file_name:str):
        
        # Look up the bucket by the name passed in as bucket_name
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_flat_name)
        blob.download_to_filename(download_file_name)
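
The call chain is client, then bucket, then blob, then download. The sketch below exercises that chain with stand-in classes so it can run without credentials or network access. FakeClient, FakeBucket and FakeBlob are hypothetical test doubles, not part of the google-cloud-storage library, and the function is a simplified standalone version of the method that passes bucket_name to bucket().

```python
import os
import tempfile


class FakeBlob:
    """Hypothetical stand-in for a storage blob."""

    def __init__(self, name: str, data: bytes):
        self.name, self.data = name, data

    def download_to_filename(self, filename: str) -> None:
        with open(filename, "wb") as handle:
            handle.write(self.data)


class FakeBucket:
    """Hypothetical stand-in for a storage bucket."""

    def __init__(self, objects: dict):
        self.objects = objects

    def blob(self, blob_name: str) -> FakeBlob:
        return FakeBlob(blob_name, self.objects[blob_name])


class FakeClient:
    """Hypothetical stand-in for the storage client."""

    def __init__(self, buckets: dict):
        self.buckets = buckets

    def bucket(self, bucket_name: str) -> FakeBucket:
        return FakeBucket(self.buckets[bucket_name])


def download_blob_to_file(storage_client, bucket_name, blob_flat_name, download_file_name):
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_flat_name)
    blob.download_to_filename(download_file_name)


client = FakeClient({"example-bucket": {"docs/readme.txt": b"hello"}})
target = os.path.join(tempfile.mkdtemp(), "readme.txt")
download_blob_to_file(client, "example-bucket", "docs/readme.txt", target)
with open(target, "rb") as handle:
    content = handle.read()
print(content)
```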

4. Add: python/content-site/googleContentSite.py

A new class is added to orchestrate reading of files from Google Cloud storage and indexing them for semantic search. The existing azureContentSite.py is used as a starting point. The changes are minimal and limited to Google Cloud related operations to list contents of the bucket and download the files.


# Import the Google Cloud client library and newly added googleBlobDownloader
from google.cloud import storage
from google.oauth2.service_account import Credentials

from googleBlobDownloader import googleBlobDownloader

The init pattern is similar to other cloud sources (AWS S3, Azure).

The main changes are highlighted in red.


class createContentSite(srcContentSite):
            
    downloader:googleBlobDownloader

    def __init__(self,content_site_name:str,
                 src_path:str,
                 src_path_for_results:str,
                 working_directory:str,
                 index_log_directory:str,
                 site_auth:siteAuth,
                 vector_db:vectorDb,
                 llm_service:baseLlmService, 
                 do_not_read_dir_list:list = [], 
                 do_not_read_file_list:list = []):
       srcContentSite.__init__(self,content_site_name=content_site_name,
                               src_type="google-cloud",
                               src_path=src_path,
                               src_path_for_results=src_path_for_results,
                               working_directory=working_directory,
                               index_log_directory=index_log_directory,
                               site_auth=site_auth,
                               vector_db=vector_db, 
                               llm_service = llm_service, 
                               do_not_read_dir_list=do_not_read_dir_list,
                               do_not_read_file_list=do_not_read_file_list)
       self.bucket_name = src_path.split('/')[2]
       self.downloader = googleBlobDownloader()
       self.logger = logging.getLogger(__name__)
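
The expression src_path.split('/')[2] picks the third slash-separated piece, which is the bucket name when src_path has the form scheme://bucket/prefix. The sketch below assumes a path such as gs://example-bucket/reports/2023. The exact scheme AIWhispr expects is not shown in this walkthrough, so treat the sample path and the helper name bucket_name_from_src_path as hypothetical.

```python
def bucket_name_from_src_path(src_path: str) -> str:
    """Hypothetical helper mirroring src_path.split('/')[2],
    with an explicit error when the path has no bucket part."""
    parts = src_path.split("/")
    # 'gs://example-bucket/reports' -> ['gs:', '', 'example-bucket', 'reports']
    if len(parts) < 3 or not parts[2]:
        raise ValueError(f"cannot find a bucket name in {src_path!r}")
    return parts[2]


print(bucket_name_from_src_path("gs://example-bucket/reports/2023"))
```

A bare split('/')[2] raises IndexError on a path without the scheme prefix, so a check like this gives a clearer message when the configuration is wrong.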

The function connect_to_content_site() contains the steps to connect to Google Cloud.

It uses the values stored in self.site_auth to create the Google Cloud connection object and assign it to self.storage_client.


def connect_to_content_site(self):
    # Connect to Google Cloud and create the storage client
    match self.site_auth.auth_type:
        case "google-cred-key":
            self.logger.info('Connecting to Google Cloud using Credentials and Key')
            # Load the service account credentials from the configured JSON file
            self.google_creds = Credentials.from_service_account_file(
                self.site_auth.google_cred_path
            )
            self.storage_client = storage.Client(
                client_options={"api_key": self.site_auth.google_storage_api_key,
                                "quota_project_id": self.site_auth.google_project_id},
                credentials=self.google_creds
            )
        case _:
            # Wildcard pattern: any other auth_type is not supported here
            self.logger.error('No authentication provided for Google Cloud connection')

The index() function contains the steps to read the metadata of each object.


        # List the blobs in the bucket
        blob_list = self.storage_client.list_blobs(self.bucket_name)
        bucket = self.storage_client.bucket(self.bucket_name)
        for blob in blob_list:
            blob_metadata = bucket.get_blob(blob.name)
            #Insert this list in the index database
            #Get metadata for each file
            content_file_suffix = pathlib.PurePath(blob.name).suffix          
            content_index_flag = 'N' #default
            content_path = blob_metadata.name
            content_type = blob_metadata.content_type
            content_last_modified_date = blob_metadata.updated
            content_creation_date = content_last_modified_date
            content_uniq_id_src = blob_metadata.etag
            content_tags_from_src = ''
            content_size = blob_metadata.size
            content_processed_status = "N"
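
The mapping from blob attributes to index fields can be sketched as a small function. The attribute names (name, content_type, updated, etag, size) are the ones used above. The ExampleBlob dataclass is a hypothetical stand-in so the sketch runs without a Google Cloud connection, and returning a dictionary is only for illustration; it is not how AIWhispr stores these values.

```python
import pathlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ExampleBlob:
    """Hypothetical stand-in carrying the blob attributes used by index()."""
    name: str
    content_type: str
    updated: datetime
    etag: str
    size: int


def describe_blob(blob) -> dict:
    """Collect the per-file metadata fields from one blob."""
    return {
        "content_file_suffix": pathlib.PurePath(blob.name).suffix,
        "content_index_flag": "N",  # default
        "content_path": blob.name,
        "content_type": blob.content_type,
        "content_last_modified_date": blob.updated,
        # The creation date is set to the last modified date, as above
        "content_creation_date": blob.updated,
        "content_uniq_id_src": blob.etag,
        "content_tags_from_src": "",
        "content_size": blob.size,
        "content_processed_status": "N",
    }


sample = ExampleBlob("docs/guide.pdf", "application/pdf",
                     datetime(2023, 9, 1, tzinfo=timezone.utc), "example-etag", 1024)
record = describe_blob(sample)
print(record["content_file_suffix"], record["content_size"])
```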

In the index_from_list() function, the site_auth instance, the contentSite instance and the arguments of downloader.download_blob_to_file now reflect Google Cloud Storage parameters.



.....
site_auth= siteAuth( auth_type=self_description['site_auth']['auth_type'],
   google_cred_path=self_description['site_auth']['google_cred_path'],
   google_storage_api_key=self_description['site_auth']['google_storage_api_key'],
   google_project_id=self_description['site_auth']['google_project_id']
                                )
.....   
contentSite = initializeContentSite.initialize(content_site_module='googleContentSite',
.....
contentSite.downloader.download_blob_to_file(contentSite.storage_client, contentSite.bucket_name, content_path, download_file_path) 
.....

That's it, we are ready to index files on Google Cloud storage for semantic search in AIWhispr.
