Enhance Search with Azure SQL DB FAISS and Hugging Face Model

In today’s data-driven era, where we are dealing with massive data pertaining to different data sources having different data formats, it creates a very big challenge, especially in terms of efficient search and information retrieval. Dive into this deep ocean and locating the correct data is not at all an easy task with AI.

In this article, I’ll show you how you can integrate Azure SQL Database, FAISS, and advanced models from Hugging Face to enhance your search capabilities.

Let’s take a quick look at each of these pillars.

Azure SQL Database

Azure SQL Database is a cloud-based service that provides a totally managed database system. The high availability, scalability, and security of this particular service type make it the best option for handling large amounts of data.

FAISS

FAISS is short for Facebook AI Similarity Search. It is a library written by Facebook AI Research that enables efficient similarity search and clustering of dense vectors. It supports various indexing methods. You can find more details about it here.

Hugging face

Hugging Face is very well-known for its collection of natural language processing (NLP) models. These models are able to produce embeddings for text, which can then be sorted within FAISS and retrieved from it. You can find more details about the model here and we are going to use the same model.

Problem Statement

This article walks you through the process of querying your data stored in Azure SQL DB, generating the embeddings using a sentence transformer model named 'all-MiniLM-L6-v2' from Hugging Face, indexing the embeddings using FAISS library and then answering the user’s question based on the similarity search on the indexed embeddings.

Prerequisites

To replicate this flow, make sure that data is available in Azure SQL Database and that you’ve your connection string handy.

Integrating Azure SQL DB, FAISS, and Hugging Face

Here is the step-by-step walkthrough of how we can make these three systems talk to each other.

Import the Required Python Packages

Let's kick off the process by importing these packages and making sure to install them before importing.

import numpy as np  
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import pyodbc

Grab the Data from the Azure Sql Database

Here, the recommendation is to pick only those columns that are actually participating in answering the user's query. Here is the code to do so.

# Establish a connection
conn = "CONNECTION_STRING
conn = pyodbc.connect(conn)

# Define the SQL query to retrieve retrieve cols from the required columns
sql_query = 'SELECT title,description FROM dbo.MoviesAndSeries'

# Execute the query and read the results into a Pandas DataFrame
df = pd.read_sql_query(sql_query, conn)

# Close the connection
conn.close()

Here are a few rows from my data. I've taken this data set from Kaggle, and it contains data about TV shows and movies on Netflix.

Data

Select the Perfect Model from Hugging Face

I'm going with a sentence transformer model named all-MiniLM-L6-v2, which maps sentences and paragraphs to a given dimension. Here is the link to the model.

You can load this pre-trained model using the code below.

model = SentenceTransformer('all-MiniLM-L6-v2')

Convert Text to Vectors

Here, I am converting only the values from the description column to vectors as a title, I can get it from DataFrame named df directly.

# Convert all descriptions to vectors
vector = model.encode(df['description'].tolist(), convert_to_numpy=True)
description_vectors = vector / np.linalg.norm(vector, axis=1, keepdims=True)
# Create a FAISS index
dimension = description_vectors.shape[1]
index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(description_vectors)
# Add vectors to the index
index.add(description_vectors.astype('float32'))

If you want to know the theoretical aspects of IndexFlatIP, you can check it out on its official documentation.

Now, the next two lines are optional, depending on whether you want to save your indexed data and the data frame for future usage.

# Save the index to a file
faiss.write_index(index, "indexes/title_movie.index")

# Save the DataFrame to a CSV file for later use
df.to_csv("indexes/title_movie.csv", index=False)

Search the Top Matching Similar Records

Here comes the search part, wherein we will be doing multiple things.

  • Generate the embedding of user queries.
  • Search the indexed data for top_k matches.
  • For all the top_k matches, grab the respective title from the data frame.
  • Then, display the results to the user.

Here are the two functions taking care of all of these.

def get_data(rowNumber):
    row = df.iloc[rowNumber]
    response = {}
    response['title'] = row['title']
    response['description'] = row['description']
    return response

def search(query, top_k, index, model):
    query_vector = model.encode([query])
    top_k = index.search(query_vector,top_k)
    top_k_ids = top_k[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [get_data(id) for id in top_k_ids]
    return results

Answer the User's Query

Now we have all the setup ready, it is time to answer the user's query.

query = "Which movie is based on the story of Ali Baba and the Forty Thieves"  
results = search(query, top_k=2,index=faiss.read_index(index_file_path),model=model)
for result in results:
    print('\n', result)

And here is the outcome.

Outcome

I hope you got an idea about how we can use an open-source model with an open-source similarity search library to ask questions from the Azure Sql Database.


Similar Articles