Microsoft.Extensions.VectorData Using the PostgreSQL Vector Database

Vector database

Vector databases (VectorDBs) are specialized databases designed to store and query high-dimensional vectors efficiently. These vectors often represent embeddings generated by machine learning models, such as word embeddings, image embeddings, or other types of feature representations. VectorDBs excel at similarity search tasks, where the goal is to find items that are "close" to a given query vector in terms of some distance metric (e.g., cosine similarity, Euclidean distance).

Using pgvector for a PostgreSQL Vector Database

Pgvector is an open-source extension for PostgreSQL that enables storing and searching over machine learning-generated embeddings. It provides capabilities for both exact and approximate nearest-neighbor search, and it is designed to work seamlessly with other PostgreSQL features, including indexing and querying. Project page: https://github.com/pgvector/pgvector

Why Use Vector Embeddings?

Traditional keyword-based search systems rely on exact matches or predefined categories, which can miss nuanced connections between queries and data. By contrast, vector embeddings enable semantic search, where the meaning of the text is captured and compared. For example:

  • A query like "a lion cub grows up to become king" would match The Lion King, even if the exact words aren’t present in the description.
  • A query like "a man stranded on another planet" would match The Martian.

This approach significantly improves the relevance and flexibility of search results.
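The "closeness" behind these matches is just a distance metric over the embedding vectors. As a minimal illustration in plain C# (no external packages), cosine similarity can be computed like this:

```csharp
using System;

// Identical vectors score 1.0; orthogonal (unrelated) vectors score 0.0.
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 1, 0 }));
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 0, 1 }));

// Cosine similarity: dot(a, b) / (|a| * |b|).
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```

In practice the database computes this for you (pgvector's distance operators, shown later), but the principle is the same: semantically similar texts produce vectors that are close together.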

What is Microsoft.Extensions.VectorData?

Microsoft.Extensions.VectorData is a set of core .NET libraries developed in collaboration with Semantic Kernel and the broader .NET ecosystem. These libraries provide a unified layer of C# abstractions for interacting with vector stores.

The abstractions in Microsoft.Extensions.VectorData provide library authors and developers with the following functionality:

  • Perform Create-Read-Update-Delete (CRUD) operations on vector stores.
  • Use vector and text search on vector stores.
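As a sketch of what these abstractions look like in code (attribute names are from the Microsoft.Extensions.VectorData.Abstractions preview and may change between preview releases), a record type for the movie example used later in this article could be annotated like this:

```csharp
using System;
using Microsoft.Extensions.VectorData;

// Maps a plain C# class onto a vector store record:
// one key, ordinary data fields, and an embedding vector.
public class Movie
{
    [VectorStoreRecordKey]
    public int Key { get; set; }

    [VectorStoreRecordData]
    public string Title { get; set; }

    [VectorStoreRecordData]
    public string Description { get; set; }

    // The dimension must match the embedding model
    // (1536 for text-embedding-3-small).
    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Vector { get; set; }
}
```

Any connector in the table below can then store and search records of this shape through the same abstractions.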

Semantic Kernel Out-of-the-box Vector Store connectors (Preview)
 

| Vector Store Connector | C# | Uses officially supported SDK | Maintainer / Vendor |
| --- | --- | --- | --- |
| Azure AI Search | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Cosmos DB MongoDB | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Cosmos DB NoSQL | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Elasticsearch | ✅ | ✅ | Elastic |
| Chroma | Planned | | |
| In-Memory | ✅ | N/A | Microsoft Semantic Kernel Project |
| Milvus | Planned | | |
| MongoDB | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Pinecone | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Postgres | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Qdrant | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Redis | ✅ | ✅ | Microsoft Semantic Kernel Project |
| SQL Server | Planned | | |
| SQLite | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Volatile (In-Memory) | Deprecated (use In-Memory) | N/A | Microsoft Semantic Kernel Project |
| Weaviate | ✅ | ✅ | Microsoft Semantic Kernel Project |


GitHub Models

GitHub Models is an exciting feature that allows developers to interact with various large language models (LLMs) directly on GitHub. Here are some key points about GitHub Models:

  1. Integration with GitHub: GitHub Models enables developers to use industry-leading AI models within their development environment. This includes models from Meta, Mistral, Azure OpenAI Service, Microsoft, and others.
  2. Model Playground: Developers can experiment with different models in an interactive playground. This feature allows users to test prompts and model parameters before integrating them into their projects.
  3. Ease of Use: GitHub Models simplifies the process of accessing and using AI models. It provides a seamless experience from testing in the playground to deploying in production environments like Codespaces and Azure.
  4. Responsible AI: GitHub ensures that no prompts or outputs from GitHub Models are shared with model providers or used to train or improve the models, maintaining privacy and security.

GitHub Models: Find more information on the GitHub Marketplace

EmbeddingGenerator - AzureOpenAIClient

IEmbeddingGenerator<string, Embedding<float>> generator = 
    new AzureOpenAIClient(
        new Uri("https://models.inference.ai.azure.com"), 
        new AzureKeyCredential(githubKey)
    ).AsEmbeddingGenerator(modelId: "text-embedding-3-small");

var result = await generator.GenerateEmbeddingAsync("What is AI?");
Console.WriteLine($"Embedding length: {result.Vector.Length}");

// foreach (var value in result.Vector.ToArray())
// {
//     Console.WriteLine("{0:0.00}, ", value);
// }

EmbeddingGenerator - Ollama

IEmbeddingGenerator<string, Embedding<float>> generator = 
    new OllamaEmbeddingGenerator(
        new Uri("http://localhost:11434/"), 
        "all-minilm"
    );
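Usage then mirrors the Azure example above. Note that all-minilm produces 384-dimensional vectors (see the table below), so a pgvector column storing them would be declared vector(384) rather than vector(1536). Continuing from the snippet above:

```csharp
// Generate an embedding with the local Ollama model (requires Ollama running
// locally with the all-minilm model pulled).
var result = await generator.GenerateEmbeddingAsync("What is AI?");
Console.WriteLine($"Embedding length: {result.Vector.Length}"); // 384 for all-minilm
```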

Embedding dimensions (vector sizes) for some of the most popular models
 

| Model | Embedding Dimension (Vector Size) | Vendor |
| --- | --- | --- |
| Word2Vec | 300 | Google |
| GloVe (Common Crawl) | 50, 100, 200, 300 | Stanford University |
| FastText | 300 | Facebook (Meta) |
| BERT (Base) | 768 | Google |
| BERT (Large) | 1024 | Google |
| RoBERTa (Base) | 768 | Facebook (Meta) |
| RoBERTa (Large) | 1024 | Facebook (Meta) |
| DistilBERT | 768 | Hugging Face |
| ALBERT (Base) | 768 | Google |
| ALBERT (Large) | 1024 | Google |
| XLNet (Base) | 768 | Google & CMU |
| XLNet (Large) | 1024 | Google & CMU |
| T5 (Small) | 512 | Google |
| T5 (Base) | 768 | Google |
| T5 (Large) | 1024 | Google |
| GPT-2 (Small) | 768 | OpenAI |
| GPT-2 (Medium) | 1024 | OpenAI |
| GPT-2 (Large) | 1280 | OpenAI |
| GPT-3 | 12288 (12K) | OpenAI |
| CLIP (ViT-B/32) | 512 | OpenAI |
| CLIP (ViT-L/14) | 768 | OpenAI |
| Sentence-BERT (Base) | 768 | Hugging Face |
| DPR (Dense Passage Retrieval) | 768 | Facebook (Meta) |
| all-MiniLM-L6-v2 | 384 | Sentence-Transformers (Hugging Face) |
| all-MiniLM-L12-v2 | 384 | Sentence-Transformers (Hugging Face) |
| Llama 2 (7B) | 4096 | Meta (via Ollama) |
| Llama 2 (13B) | 5120 | Meta (via Ollama) |
| Llama 2 (70B) | 8192 | Meta (via Ollama) |
| Mistral 7B | 4096 | Mistral AI (via Ollama) |
| Vicuna 13B | 5120 | LMSYS (via Ollama) |


Setting Up your Development Environment

To get started with AI development in .NET, follow these steps.

  1. Choose a Language: For AI development in .NET, C# is the most common choice, though F# and Visual Basic are also supported.
  2. Install .NET: Ensure you have the latest version of .NET installed on your machine.
  3. Choose an IDE: Select an Integrated Development Environment (IDE) such as Visual Studio, Visual Studio Code, or JetBrains Rider.

Create a dotnet console application

dotnet new console -o pgvectorcs

Add packages

dotnet add package Azure.AI.OpenAI --version 2.1.0
dotnet add package Azure.Identity --version 1.13.2
dotnet add package DotNetEnv --version 3.1.1
dotnet add package Microsoft.Extensions.AI --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.AzureAIInference --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.Ollama --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.OpenAI --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.VectorData.Abstractions --version 9.0.0-preview.1.24523.1
dotnet add package Microsoft.SemanticKernel.Connectors.InMemory --version 1.29.0-preview
dotnet add package Newtonsoft.Json --version 13.0.3
dotnet add package Npgsql --version 9.0.2
dotnet add package Pgvector --version 0.3.0

Overview of the System

The goal of this application is to allow users to input a query (e.g., "a story about space exploration") and retrieve the most relevant movie from a database. The relevance is determined by comparing the embeddings of the query and the movie descriptions stored in the database.

Here’s how the system works:

  1. Embedding Generation: Descriptions of movies are converted into numerical vectors (embeddings) using Azure OpenAI's text-embedding-3-small model.
  2. Database Storage: These embeddings are stored in a PostgreSQL database using the pgvector extension, which supports efficient similarity searches.
  3. Semantic Search: When a user enters a query, its embedding is generated and compared to the stored movie embeddings using cosine similarity. The most similar movie is returned as the result.

Key Components of the Code
 

1. Environment Setup

The application relies on environment variables for configuration, such as the PostgreSQL connection string and the GitHub Models key used to authenticate the Azure OpenAI client. These are loaded using the DotNetEnv library:

Env.Load(".env");

string githubKey = Env.GetString("GITHUB_KEY");
string connectionString = Env.GetString("POSTGRES_CONNECTION_STRING");

This ensures sensitive information is not hardcoded and can be easily managed.
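A minimal `.env` file could look like the following. The key names match those read above; the values are placeholders you would replace with your own token and connection details:

```
GITHUB_KEY=<your GitHub Models token>
POSTGRES_CONNECTION_STRING=Host=localhost;Port=5432;Username=postgres;Password=<your password>;Database=moviesdb
```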

2. PostgreSQL Database Initialization

The database is initialized with the pgvector extension, which enables vector storage and similarity searches. A movies table is created to store the movie metadata and their embeddings:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS movies (
    key SERIAL PRIMARY KEY,
    title TEXT,
    description TEXT,
    vector vector(1536)
);

Each movie record includes:

  • key: A unique identifier for the movie.
  • title: The movie's title.
  • description: A brief description of the movie.
  • vector: The embedding vector representing the movie's description.

The vector column uses the vector(1536) type, where 1536 corresponds to the dimensionality of the embeddings generated by the text-embedding-3-small model.

3. Embedding Generation

The AzureOpenAIClient is used to generate embeddings for both movie descriptions and user queries. The GenerateEmbeddingVectorAsync method converts text into a vector representation.

var queryEmbedding = await generator.GenerateEmbeddingVectorAsync(query);

These embeddings capture the semantic meaning of the text, enabling meaningful comparisons between different pieces of text.

4. Saving Movie Data

Each movie’s embedding is saved in the PostgreSQL database using the SaveVector method.

await using (var command = new NpgsqlCommand(@"
    INSERT INTO movies (key, title, description, vector)
    VALUES (@key, @title, @description, @vector)
    ON CONFLICT (key) DO UPDATE SET
        title = EXCLUDED.title,
        description = EXCLUDED.description,
        vector = EXCLUDED.vector", conn))
{
    command.Parameters.AddWithValue("key", movie.Key);
    command.Parameters.AddWithValue("title", movie.Title);
    command.Parameters.AddWithValue("description", movie.Description);
    command.Parameters.Add(new NpgsqlParameter("vector", new Vector(movie.Vector)));
    await command.ExecuteNonQueryAsync();
}

This ensures that the database is updated with the latest embeddings, even if a movie already exists.


5. Performing a Vector Search

When a user enters a query, the system generates an embedding for the query and performs a similarity search in the database.

SELECT title, description, vector <-> @queryVector AS score
FROM movies
ORDER BY vector <-> @queryVector
LIMIT 1;

The <-> operator computes the distance between two vectors (smaller distances indicate higher similarity). The query returns the most relevant movie along with its similarity score.
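Note that pgvector offers several distance operators: `<->` is Euclidean (L2) distance, `<#>` is negative inner product, and `<=>` is cosine distance. To rank by cosine distance instead, the query above only needs a different operator:

```sql
-- Cosine distance: 0 = identical direction, larger = less similar.
SELECT title, description, vector <=> @queryVector AS score
FROM movies
ORDER BY vector <=> @queryVector
LIMIT 1;
```

For normalized embeddings (such as those produced by OpenAI embedding models), L2 and cosine rankings produce the same order, so either operator works here.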

The results are encapsulated in a SearchResult object.

public class SearchResult
{
    public string Title { get; set; }
    public string Description { get; set; }
    public double Score { get; set; }
}
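The SearchVector helper that produces this result is not listed in full in this article. A minimal sketch of what it could look like, assuming the pgvector type mapping has been registered with Npgsql (via the Pgvector.Npgsql package's UseVector() on the data source builder):

```csharp
using System;
using System.Threading.Tasks;
using Npgsql;
using Pgvector;

static async Task<SearchResult?> SearchVector(
    NpgsqlConnection conn, ReadOnlyMemory<float> queryEmbedding)
{
    // Same query as above: order by L2 distance, return the closest movie.
    await using var command = new NpgsqlCommand(@"
        SELECT title, description, vector <-> @queryVector AS score
        FROM movies
        ORDER BY vector <-> @queryVector
        LIMIT 1;", conn);
    command.Parameters.Add(new NpgsqlParameter("queryVector", new Vector(queryEmbedding)));

    await using var reader = await command.ExecuteReaderAsync();
    if (!await reader.ReadAsync())
        return null; // empty table

    return new SearchResult
    {
        Title = reader.GetString(0),
        Description = reader.GetString(1),
        Score = reader.GetDouble(2)
    };
}
```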

6. User Interaction

The application provides a simple console interface for users to input queries.

Console.Write("Enter your search query (press Enter to quit): ");
query = Console.ReadLine();

if (string.IsNullOrEmpty(query))
{
    return;
}

If the user enters a query, the system generates its embedding, performs the search, and displays the most relevant movie.

var queryEmbedding = await generator.GenerateEmbeddingVectorAsync(query);
var result = await SearchVector(conn, queryEmbedding);
if (result != null)
{
    Console.WriteLine($"Title: {result.Title}");
    Console.WriteLine($"Description: {result.Description}");
    Console.WriteLine($"Score: {result.Score}");
}


Challenges and Considerations

  1. Embedding Dimensionality: The text-embedding-3-small model generates 1536-dimensional vectors. While this captures rich semantic information, it requires sufficient storage and computational resources.
  2. Scalability: As the dataset grows, similarity searches may become slower. Techniques like indexing (e.g., IVF or HNSW) can help optimize performance.
  3. Cost: Using Azure OpenAI for embedding generation incurs costs, especially for large datasets. Caching embeddings locally can reduce API usage.
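For example, an approximate-nearest-neighbor index can be added in pgvector with a single statement. A sketch (the operator class must match the distance operator your queries use, and index parameters would be tuned for the dataset):

```sql
-- HNSW index for <-> (L2 distance) searches on movies.vector.
CREATE INDEX ON movies USING hnsw (vector vector_l2_ops);
```

With the index in place, the same SELECT queries are served by an approximate search, trading a small amount of recall for much lower latency on large tables.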
