Vector database
Vector databases (VectorDBs) are specialized databases designed to store and query high-dimensional vectors efficiently. These vectors often represent embeddings generated by machine learning models, such as word embeddings, image embeddings, or other types of feature representations. VectorDBs excel at similarity search tasks, where the goal is to find items that are "close" to a given query vector in terms of some distance metric (e.g., cosine similarity, Euclidean distance).
Using pgvector for a PostgreSQL Vector Database
Pgvector is an open-source PostgreSQL extension for storing and searching over machine learning-generated embeddings. It supports both exact and approximate nearest-neighbor search, and it works seamlessly with standard PostgreSQL features such as indexing and querying. https://github.com/pgvector/pgvector
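As a quick illustration of what pgvector adds to PostgreSQL, the sketch below creates a small table with a vector column and runs an exact nearest-neighbor query (the table and column names here are illustrative, not part of the application described later):

```sql
-- Enable the extension and create a table with a 3-dimensional vector column.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id SERIAL PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- pgvector provides several distance operators:
--   <->  Euclidean (L2) distance
--   <=>  cosine distance
--   <#>  negative inner product
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 1;
```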
Why Use Vector Embeddings?
Traditional keyword-based search systems rely on exact matches or predefined categories, which can miss nuanced connections between queries and data. By contrast, vector embeddings enable semantic search, where the meaning of the text is captured and compared. For example:
- A query like "a lion cub grows up to become king" would match The Lion King, even if the exact words aren’t present in the description.
- A query like "a man stranded on another planet" would match The Martian.
This approach significantly improves the relevance and flexibility of search results.
What is Microsoft.Extensions.VectorData?
Microsoft.Extensions.VectorData is a set of core .NET libraries developed in collaboration with Semantic Kernel and the broader .NET ecosystem. These libraries provide a unified layer of C# abstractions for interacting with vector stores.
The abstractions in Microsoft.Extensions.VectorData provide library authors and developers with the following functionality:
- Perform Create-Read-Update-Delete (CRUD) operations on vector stores.
- Use vector and text search on vector stores.
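As a sketch of what these abstractions look like in practice, a record type can be annotated so a vector store knows which property is the key, which hold data, and which holds the embedding. The attribute names below come from the preview Microsoft.Extensions.VectorData.Abstractions package and may change before a stable release:

```csharp
using Microsoft.Extensions.VectorData;

// A hypothetical movie record for a vector store; the attributes map
// properties to key, data, and vector fields in the underlying store.
public class Movie
{
    [VectorStoreRecordKey]
    public int Key { get; set; }

    [VectorStoreRecordData]
    public string Title { get; set; }

    [VectorStoreRecordData]
    public string Description { get; set; }

    // 1536 matches the dimensionality of text-embedding-3-small.
    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Vector { get; set; }
}
```

Any of the connectors listed below can then store and search records of this shape through the same abstraction.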
Semantic Kernel Out-of-the-box Vector Store connectors (Preview)
| Vector Store Connectors | C# | Uses officially supported SDK | Maintainer / Vendor |
|---|---|---|---|
| Azure AI Search | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Cosmos DB MongoDB | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Cosmos DB No SQL | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Elasticsearch | ✅ | ✅ | Elastic |
| Chroma | Planned | | |
| In-Memory | ✅ | N/A | Microsoft Semantic Kernel Project |
| Milvus | Planned | | |
| MongoDB | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Pinecone | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Postgres | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Qdrant | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Redis | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Sql Server | Planned | | |
| SQLite | ✅ | ✅ | Microsoft Semantic Kernel Project |
| Volatile (In-Memory) | Deprecated (use In-Memory) | N/A | Microsoft Semantic Kernel Project |
| Weaviate | ✅ | ✅ | Microsoft Semantic Kernel Project |
GitHub Models
GitHub Models is an exciting feature that allows developers to interact with various large language models (LLMs) directly on GitHub. Here are some key points about GitHub Models:
- Integration with GitHub: GitHub Models enables developers to use industry-leading AI models within their development environment. This includes models from Meta, Mistral, Azure OpenAI Service, Microsoft, and others.
- Model Playground: Developers can experiment with different models in an interactive playground. This feature allows users to test prompts and model parameters before integrating them into their projects.
- Ease of Use: GitHub Models simplifies the process of accessing and using AI models. It provides a seamless experience from testing in the playground to deploying in production environments like Codespaces and Azure.
- Responsible AI: GitHub ensures that no prompts or outputs from GitHub Models are shared with model providers or used to train or improve the models, maintaining privacy and security.
Find more information about GitHub Models on the GitHub Marketplace.
EmbeddingGenerator - AzureOpenAIClient
IEmbeddingGenerator<string, Embedding<float>> generator =
    new AzureOpenAIClient(
        new Uri("https://models.inference.ai.azure.com"),
        new AzureKeyCredential(githubKey)
    ).AsEmbeddingGenerator(modelId: "text-embedding-3-small");
var result = await generator.GenerateEmbeddingAsync("What is AI?");
Console.WriteLine($"Embedding length: {result.Vector.Length}");
// foreach (var value in result.Vector.ToArray())
// {
// Console.WriteLine("{0:0.00}, ", value);
// }
EmbeddingGenerator - Ollama
IEmbeddingGenerator<string, Embedding<float>> generator =
new OllamaEmbeddingGenerator(
new Uri("http://localhost:11434/"),
"all-minilm"
);
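The Ollama-backed generator is used the same way as the Azure OpenAI one (this sketch assumes the all-minilm model has already been pulled locally with `ollama pull all-minilm`):

```csharp
var embedding = await generator.GenerateEmbeddingAsync("What is AI?");
// all-minilm produces 384-dimensional vectors (see the table below).
Console.WriteLine($"Embedding length: {embedding.Vector.Length}");
```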
Embedding dimensions (vector sizes) for some of the most popular models
| Model | Embedding Dimension (Vector Size) | Vendor |
|---|---|---|
| Word2Vec | 300 | Google |
| GloVe (Common Crawl) | 50, 100, 200, 300 | Stanford University |
| FastText | 300 | Facebook (Meta) |
| BERT (Base) | 768 | Google |
| BERT (Large) | 1024 | Google |
| RoBERTa (Base) | 768 | Facebook (Meta) |
| RoBERTa (Large) | 1024 | Facebook (Meta) |
| DistilBERT | 768 | Hugging Face |
| ALBERT (Base) | 768 | Google |
| ALBERT (Large) | 1024 | Google |
| XLNet (Base) | 768 | Google & CMU |
| XLNet (Large) | 1024 | Google & CMU |
| T5 (Small) | 512 | Google |
| T5 (Base) | 768 | Google |
| T5 (Large) | 1024 | Google |
| GPT-2 (Small) | 768 | OpenAI |
| GPT-2 (Medium) | 1024 | OpenAI |
| GPT-2 (Large) | 1280 | OpenAI |
| GPT-3 | 12288 (12K) | OpenAI |
| CLIP (ViT-B/32) | 512 | OpenAI |
| CLIP (ViT-L/14) | 768 | OpenAI |
| Sentence-BERT (Base) | 768 | Hugging Face |
| DPR (Dense Passage Retrieval) | 768 | Facebook (Meta) |
| all-MiniLM-L6-v2 | 384 | Sentence-Transformers (Hugging Face) |
| all-MiniLM-L12-v2 | 384 | Sentence-Transformers (Hugging Face) |
| Llama 2 (7B) | 4096 | Meta (via Ollama) |
| Llama 2 (13B) | 5120 | Meta (via Ollama) |
| Llama 2 (70B) | 8192 | Meta (via Ollama) |
| Mistral 7B | 4096 | Mistral AI (via Ollama) |
| Vicuna 13B | 5120 | LMSYS (via Ollama) |
Setting Up your Development Environment
To get started with AI development in .NET, follow these steps.
- Choose a Language: Select a .NET language that suits your project's requirements, such as C#, F#, or Visual Basic.
- Install .NET: Ensure you have the latest version of .NET installed on your machine.
- Choose an IDE: Select an Integrated Development Environment (IDE) like Visual Studio, IntelliJ IDEA, or Visual Studio Code, depending on your language choice.
Create a dotnet console application
dotnet new console -o pgvectorcs
Add packages
dotnet add package Azure.AI.OpenAI --version 2.1.0
dotnet add package Azure.Identity --version 1.13.2
dotnet add package DotNetEnv --version 3.1.1
dotnet add package Microsoft.Extensions.AI --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.AzureAIInference --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.Ollama --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.AI.OpenAI --version 9.0.0-preview.9.24556.5
dotnet add package Microsoft.Extensions.VectorData.Abstractions --version 9.0.0-preview.1.24523.1
dotnet add package Microsoft.SemanticKernel.Connectors.InMemory --version 1.29.0-preview
dotnet add package Newtonsoft.Json --version 13.0.3
dotnet add package Npgsql --version 9.0.2
dotnet add package Pgvector --version 0.3.0
Overview of the System
The goal of this application is to allow users to input a query (e.g., "a story about space exploration") and retrieve the most relevant movie from a database. The relevance is determined by comparing the embeddings of the query and the movie descriptions stored in the database.
Here’s how the system works:
- Embedding Generation: Descriptions of movies are converted into numerical vectors (embeddings) using Azure OpenAI's text-embedding-3-small model.
- Database Storage: These embeddings are stored in a PostgreSQL database using the pgvector extension, which supports efficient similarity searches.
- Semantic Search: When a user enters a query, its embedding is generated and compared to the stored movie embeddings using cosine similarity. The most similar movie is returned as the result.
Key Components of the Code
1. Environment Setup
The application relies on environment variables for configuration, such as the PostgreSQL connection string and the Azure OpenAI API key. These are loaded using the DotNetEnv library:
Env.Load(".env");
string githubKey = Env.GetString("GITHUB_KEY");
string connectionString = Env.GetString("POSTGRES_CONNECTION_STRING");
This ensures sensitive information is not hardcoded and can be easily managed.
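A minimal .env file for this setup might look like the following. The variable names match the Env.GetString calls above; the values are placeholders to replace with your own token and connection details:

```
GITHUB_KEY=<your-github-models-token>
POSTGRES_CONNECTION_STRING=Host=localhost;Port=5432;Username=postgres;Password=<password>;Database=postgres
```

Remember to add .env to .gitignore so credentials are never committed.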
2. PostgreSQL Database Initialization
The database is initialized with the pgvector extension, which enables vector storage and similarity searches. A movies table is created to store the movie metadata and their embeddings:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS movies (
key SERIAL PRIMARY KEY,
title TEXT,
description TEXT,
vector vector(1536)
);
Each movie record includes:
- key: A unique identifier for the movie.
- title: The movie's title.
- description: A brief description of the movie.
- vector: The embedding vector representing the movie's description.
The vector column uses the vector(1536) type, where 1536 corresponds to the dimensionality of the embeddings generated by the text-embedding-3-small model.
3. Embedding Generation
The AzureOpenAIClient is used to generate embeddings for both movie descriptions and user queries. The GenerateEmbeddingVectorAsync method converts text into a vector representation.
var queryEmbedding = await generator.GenerateEmbeddingVectorAsync(query);
These embeddings capture the semantic meaning of the text, enabling meaningful comparisons between different pieces of text.
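Concretely, the similarity between two embeddings can be measured with cosine similarity. The following minimal sketch is independent of any SDK and simply shows the arithmetic:

```csharp
// Cosine similarity between two equal-length vectors:
// 1.0 = identical direction, 0 = orthogonal (semantically unrelated).
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Identical vectors score 1.0; unrelated vectors score near 0.
Console.WriteLine(CosineSimilarity(new float[] { 1, 2, 3 }, new float[] { 1, 2, 3 }));
```

In the application itself, this computation happens inside PostgreSQL via pgvector's distance operators rather than in C#.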
4. Saving Movie Data
Each movie’s embedding is saved in the PostgreSQL database using the SaveVector method.
await using (var command = new NpgsqlCommand(@"
INSERT INTO movies (key, title, description, vector)
VALUES (@key, @title, @description, @vector)
ON CONFLICT (key) DO UPDATE SET
title = EXCLUDED.title,
description = EXCLUDED.description,
vector = EXCLUDED.vector", conn))
{
command.Parameters.AddWithValue("key", movie.Key);
command.Parameters.AddWithValue("title", movie.Title);
command.Parameters.AddWithValue("description", movie.Description);
command.Parameters.Add(new NpgsqlParameter("vector", new Vector(movie.Vector)));
await command.ExecuteNonQueryAsync();
}
This ensures that the database is updated with the latest embeddings, even if a movie already exists.
![Database]()
5. Performing a Vector Search
When a user enters a query, the system generates an embedding for the query and performs a similarity search in the database.
SELECT title, description, vector <-> @queryVector AS score
FROM movies
ORDER BY vector <-> @queryVector
LIMIT 1;
The <-> operator computes the Euclidean (L2) distance between two vectors (smaller distances indicate higher similarity); because OpenAI embeddings are normalized to unit length, ranking by L2 distance produces the same order as ranking by cosine similarity. pgvector also offers a dedicated cosine-distance operator, <=>. The query returns the most relevant movie along with its distance score.
The results are encapsulated in a SearchResult object.
public class SearchResult
{
public string Title { get; set; }
public string Description { get; set; }
public double Score { get; set; }
}
6. User Interaction
The application provides a simple console interface for users to input queries.
Console.Write("Enter your search query (press Enter to quit): ");
query = Console.ReadLine();
if (string.IsNullOrEmpty(query))
{
return;
}
If the user enters a query, the system generates its embedding, performs the search, and displays the most relevant movie.
var queryEmbedding = await generator.GenerateEmbeddingVectorAsync(query);
var result = await SearchVector(conn, queryEmbedding);
if (result != null)
{
Console.WriteLine($"Title: {result.Title}");
Console.WriteLine($"Description: {result.Description}");
Console.WriteLine($"Score: {result.Score}");
}
![Search query]()
Challenges and Considerations
- Embedding Dimensionality: The text-embedding-3-small model generates 1536-dimensional vectors. While this captures rich semantic information, it requires sufficient storage and computational resources.
- Scalability: As the dataset grows, similarity searches may become slower. Techniques like indexing (e.g., IVF or HNSW) can help optimize performance.
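For example, pgvector can build an HNSW index over the vector column to speed up approximate nearest-neighbor queries. Note that the operator class must match the distance operator used in queries:

```sql
-- HNSW index for L2 distance searches (matches the <-> operator used above);
-- use vector_cosine_ops instead if querying with <=>.
CREATE INDEX ON movies USING hnsw (vector vector_l2_ops);
```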
- Cost: Using Azure OpenAI for embedding generation incurs costs, especially for large datasets. Caching embeddings locally can reduce API usage.