In the spirit of open source, we want to ensure that users and collaborators of AIWhispr understand how to add their own custom modules. Open source for AIWhispr does not mean just putting the source code out there; it also means sharing the modular design approach and practical examples of how to add your own modules.
This example is a code walkthrough of the modules added to support reading files stored on Google Cloud Storage and indexing them for semantic search.
1. Edit: python/common-objects/aiwhisprBaseClasses.py
Support for Google storage authentication configuration was added to the siteAuth base class.
class siteAuth:
    ....
    def __init__(self, auth_type:str, **kwargs):
        ....
        match self.auth_type:
            ....
            case 'google-cred-key':
                .....
                self.google_cred_path = kwargs['google_cred_path']
                self.google_project_id = kwargs['google_project_id']
                self.google_storage_api_key = kwargs['google_storage_api_key']
                .....
This means that the [content-site-auth] section for Google Cloud Storage looks like:
-------------------------------------------------------------------------------------------------------------------
[content-site-auth]
authtype=google-cred-key
google-cred-path=</path/to/google/credential/file.json>
google-project-id=<project-id-to-which-google-storage-is-billed>
google-storage-api-key=<api-key-to-access-google-storage>
.................
-------------------------------------------------------------------------------------------------------------------
***Note: the configuration keys use a dash "-" separator while the Python variable names use an underscore "_".
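As an illustration of this naming convention, here is a minimal sketch of how the dashed configuration keys could map to the underscored keyword arguments passed to siteAuth. This is hypothetical, not how index_content_site_for_config.py actually reads them (that code is shown in step 2 below); the config file name is a placeholder.

import configparser

config = configparser.ConfigParser()
config.read('google_content_site.cfg')  # placeholder file name

# Dashed config keys become underscored Python keyword arguments
kwargs = {key.replace('-', '_'): value
          for key, value in config.items('content-site-auth')
          if key != 'authtype'}
# kwargs == {'google_cred_path': ..., 'google_project_id': ..., 'google_storage_api_key': ...}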
We also edit the base class method srcContentSite.pickle_me() to support pickling of these new variables. The pickled objects are used by the spawned indexing processes.
class srcContentSite:
    ....
    def pickle_me(self):
        ....
        self.self_description['site_auth']['google_cred_path'] = self.site_auth.google_cred_path
        self.self_description['site_auth']['google_project_id'] = self.site_auth.google_project_id
        self.self_description['site_auth']['google_storage_api_key'] = self.site_auth.google_storage_api_key
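To show why pickle_me() matters, here is a minimal, self-contained sketch of the round trip. The dictionary layout follows the keys above; the file path and the dump/load mechanics are illustrative assumptions, not AIWhispr's actual implementation.

import pickle

# Stand-in for srcContentSite.self_description after pickle_me() has run
self_description = {'site_auth': {'auth_type': 'google-cred-key',
                                  'google_cred_path': '/path/to/credential.json',
                                  'google_project_id': 'my-project-id',
                                  'google_storage_api_key': 'my-api-key'}}

# Parent process: serialize the description for the spawned indexing processes
with open('/tmp/site_description.pickle', 'wb') as f:  # illustrative path
    pickle.dump(self_description, f)

# Spawned indexing process: reload and read the Google auth fields back
with open('/tmp/site_description.pickle', 'rb') as f:
    loaded = pickle.load(f)
print(loaded['site_auth']['google_cred_path'])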
2. Edit: python/common-functions/index_content_site_for_config.py
This Python module reads the config file. Support is added for the Google Cloud Storage access configuration, which is then used to instantiate the siteAuth object. The source type is specified as "google-cloud" in the configuration:
-------------------------------------------------------------------------------------------------------------------
[content-site]
srctype=google-cloud
.................
-------------------------------------------------------------------------------------------------------------------
match src_type:
    ....
    case 'google-cloud':
        ....
        match auth_type:
            case 'google-cred-key':
                google_cred_path = config.get('content-site-auth','google-cred-path')
                google_project_id = config.get('content-site-auth','google-project-id')
                google_storage_api_key = config.get('content-site-auth','google-storage-api-key')
                logger.debug('google_cred_path : %s google_project_id : %s google_storage_api_key: %s', google_cred_path, google_project_id, google_storage_api_key)
                if (len(google_cred_path) == 0 or len(google_project_id) == 0 or len(google_storage_api_key) == 0):
                    logger.error('Could not read google-cred-path, google-project-id, google-storage-api-key')
                else:
                    site_auth = siteAuth(auth_type=auth_type,
                                         google_cred_path=google_cred_path,
                                         google_project_id=google_project_id,
                                         google_storage_api_key=google_storage_api_key)
    ....
3. Add: python/common-objects/googleBlobDownloader.py
A class is added to handle downloading files from Google Cloud Storage.
# Imports the Google Cloud client library
from google.cloud import storage
from google.oauth2.service_account import Credentials

class googleBlobDownloader(object):
    # Download a single blob from the bucket to a local file
    def download_blob_to_file(self, storage_client: storage.Client, bucket_name:str, blob_flat_name:str, download_file_name:str):
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_flat_name)
        blob.download_to_filename(download_file_name)
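A quick usage sketch of the downloader: the credential file, project id, bucket, blob, and local file names are all placeholders, and the client is built from a service-account file, mirroring the connect_to_content_site() shown in the next step.

from google.cloud import storage
from google.oauth2.service_account import Credentials
from googleBlobDownloader import googleBlobDownloader

# Placeholders: swap in your own credential file, project id and bucket
creds = Credentials.from_service_account_file('/path/to/credential.json')
client = storage.Client(project='my-project-id', credentials=creds)

downloader = googleBlobDownloader()
downloader.download_blob_to_file(client, 'my-bucket', 'docs/report.txt', '/tmp/report.txt')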
4. Add: python/content-site/googleContentSite.py
A new class is added to orchestrate reading files from Google Cloud Storage and indexing them for semantic search. The existing azureContentSite.py is used as a starting point; the changes are minimal and limited to the Google Cloud operations that list the contents of the bucket and download the files.
# Import the Google Cloud client library and newly added googleBlobDownloader
from google.cloud import storage
from google.oauth2.service_account import Credentials
from googleBlobDownloader import googleBlobDownloader
The init pattern is similar to the other cloud sources (AWS S3, Azure); the main changes are in the Google Cloud specific lines.
class createContentSite(srcContentSite):

    downloader: googleBlobDownloader

    def __init__(self, content_site_name:str,
                 src_path:str,
                 src_path_for_results:str,
                 working_directory:str,
                 index_log_directory:str,
                 site_auth:siteAuth,
                 vector_db:vectorDb,
                 llm_service:baseLlmService,
                 do_not_read_dir_list:list = [],
                 do_not_read_file_list:list = []):
        srcContentSite.__init__(self, content_site_name=content_site_name,
                                src_type="google-cloud",
                                src_path=src_path,
                                src_path_for_results=src_path_for_results,
                                working_directory=working_directory,
                                index_log_directory=index_log_directory,
                                site_auth=site_auth,
                                vector_db=vector_db,
                                llm_service=llm_service,
                                do_not_read_dir_list=do_not_read_dir_list,
                                do_not_read_file_list=do_not_read_file_list)
        self.bucket_name = src_path.split('/')[2]
        self.downloader = googleBlobDownloader()
        self.logger = logging.getLogger(__name__)
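Note that self.bucket_name is taken from the third path component of src_path, which assumes a URL-style source path such as gs://<bucket-name>/... (the exact prefix is an assumption here). A quick sketch of the parsing:

# Assuming a source path of the form 'gs://<bucket-name>/<prefix>':
src_path = 'gs://my-bucket/documents'
print(src_path.split('/'))     # ['gs:', '', 'my-bucket', 'documents']
print(src_path.split('/')[2])  # 'my-bucket'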
The function connect_to_content_site() contains the steps to connect to Google Cloud. It uses the variables set in self.site_auth to set the storage_client variable to the Google Cloud connection object.
def connect_to_content_site(self):
    # Connect to Google Cloud: create the storage.Client object
    match self.site_auth.auth_type:
        case "google-cred-key":
            self.logger.info('Connecting to Google Cloud using Credentials and Key')
            self.google_creds = Credentials.from_service_account_file(
                self.site_auth.google_cred_path
            )
            self.storage_client = storage.Client(
                client_options={"api_key": self.site_auth.google_storage_api_key,
                                "quota_project_id": self.site_auth.google_project_id},
                credentials=self.google_creds
            )
        case other:
            self.logger.error('No authentication provided for Google Cloud connection')
The index() function contains the steps to read each object's metadata:
# List the blobs in the bucket
blob_list = self.storage_client.list_blobs(self.bucket_name)
bucket = self.storage_client.bucket(self.bucket_name)
for blob in blob_list:
    # Get the metadata for each file, then insert the record in the index database
    blob_metadata = bucket.get_blob(blob.name)
    content_file_suffix = pathlib.PurePath(blob.name).suffix
    content_index_flag = 'N'  # default
    content_path = blob_metadata.name
    content_type = blob_metadata.content_type
    content_last_modified_date = blob_metadata.updated
    content_creation_date = content_last_modified_date
    content_uniq_id_src = blob_metadata.etag
    content_tags_from_src = ''
    content_size = blob_metadata.size
    content_processed_status = "N"
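Standalone, the same list-and-read-metadata pattern from the google-cloud-storage client library looks like this (project and bucket names are placeholders; credentials are assumed to come from the environment):

from google.cloud import storage

client = storage.Client(project='my-project-id')  # placeholder project
bucket = client.bucket('my-bucket')               # placeholder bucket

for blob in client.list_blobs('my-bucket'):
    meta = bucket.get_blob(blob.name)  # fetch the full metadata for the blob
    print(meta.name, meta.content_type, meta.size, meta.updated, meta.etag)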
In the index_from_list() function, the siteAuth instance, the contentSite instance, and the downloader.download_blob_to_file() arguments now reflect Google Cloud Storage related parameters.
.....
site_auth = siteAuth(auth_type=self_description['site_auth']['auth_type'],
                     google_cred_path=self_description['site_auth']['google_cred_path'],
                     google_storage_api_key=self_description['site_auth']['google_storage_api_key'],
                     google_project_id=self_description['site_auth']['google_project_id'])
.....
contentSite = initializeContentSite.initialize(content_site_module='googleContentSite',
                                               .....)
.....
contentSite.downloader.download_blob_to_file(contentSite.storage_client, contentSite.bucket_name, content_path, download_file_path)
.....
That's it, we are ready to index files on Google Cloud Storage for semantic search in AIWhispr.