Concept #28 · Medium · python-for-gen-ai

Explain embeddings. How would you choose an embedding model?

#gen-ai #embeddings #vector-db

Answer

Embeddings & Choosing an Embedding Model

An embedding is a dense numerical vector that represents the semantic meaning of text. Texts with similar meaning map to nearby vectors, which is what makes semantic search, clustering, and RAG possible.

How Embeddings Work

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is sunny today",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384): three 384-dimensional vectors

# Similar meaning → similar vectors
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings)  # pairwise 3x3 similarity matrix
print(sim_matrix[0, 1])  # comparatively high: both sentences are about account access
print(sim_matrix[0, 2])  # close to zero: "password" and "weather" are unrelated
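
To make the similarity score less of a black box, here is a minimal sketch of cosine similarity written out by hand. The three-dimensional vectors are made up for illustration and are not real model outputs; real embeddings have hundreds of dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, chosen only to illustrate the idea)
password_reset = [0.9, 0.1, 0.0]
forgot_login = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_sim(password_reset, forgot_login))  # close to 1: vectors point the same way
print(cosine_sim(password_reset, weather))       # close to 0: vectors are nearly orthogonal
```

The score depends only on the angle between the vectors, not their length, which is why it is a common default for comparing embeddings.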

Popular Embedding Models

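| Model | Dimensions | Speed | Quality | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Fast | Good | Low ($0.02/1M tokens) | General RAG |
| text-embedding-3-large | 3072 | Fast | Best | Medium ($0.13/1M tokens) | High accuracy |
| text-embedding-ada-002 | 1536 | Fast | Good | Medium ($0.10/1M tokens) | Legacy OpenAI |
| all-MiniLM-L6-v2 | 384 | Very fast | Moderate | Free (local) | Low-latency apps |
| BAAI/bge-large-en-v1.5 | 1024 | Moderate | Excellent | Free (local) | Best open-source |
| intfloat/e5-large-v2 | 1024 | Moderate | Excellent | Free (local) | English retrieval (for multilingual text, use intfloat/multilingual-e5-large) |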
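
Dimension count matters beyond quality because it drives storage and search cost. As a rough sketch, assuming vectors are stored as float32 (4 bytes per value) and ignoring any index overhead a vector database adds, the raw size can be estimated like this. The helper name `index_size_mb` is hypothetical.

```python
def index_size_mb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the stored vectors in megabytes (no index overhead)."""
    return num_vectors * dimensions * bytes_per_value / 1_000_000

# Dimensions taken from the table above
models = [
    ("all-MiniLM-L6-v2", 384),
    ("BAAI/bge-large-en-v1.5", 1024),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]

for name, dims in models:
    print(f"{name}: {index_size_mb(1_000_000, dims):,.0f} MB per 1M vectors")
```

Under these assumptions, a 3072-dimensional model needs eight times the raw storage of a 384-dimensional one for the same corpus.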

How to Choose an Embedding Model

python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

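# Evaluate each candidate model on your own domain data.
# Toy data: with a single query and a single document every model scores 1.0,
# so replace these with a larger sample of real queries and labeled documents.
queries = {"q1": "What is the refund policy?"}           # query_id -> query text
corpus = {"d1": "Refunds are accepted within 30 days."}  # doc_id -> document text
relevant_docs = {"q1": {"d1"}}                           # query_id -> relevant doc_ids

for model_name in ["all-MiniLM-L6-v2", "BAAI/bge-large-en-v1.5"]:
    model = SentenceTransformer(model_name)
    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
    results = evaluator(model)  # a dict of metrics in sentence-transformers v3+
    print(f"{model_name}: NDCG@10 = {results['cosine_ndcg@10']:.4f}")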
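
If it helps to see what a retrieval metric actually measures, here is a minimal sketch of two simpler ones, recall@k and reciprocal rank, computed by hand. This is not how `InformationRetrievalEvaluator` is implemented; the function names and the example ranking are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical search result for one query, best match first
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, k=3))  # 0.5: only d1 is in the top 3
print(reciprocal_rank(ranked, relevant))   # 0.5: first relevant doc is at rank 2
```

Averaging these per-query numbers over a labeled sample gives a single score per model that can be compared directly.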

Choosing Based on Requirements

| Requirement | Recommended Model |
| --- | --- |
| Lowest cost, OpenAI API | text-embedding-3-small |
| Highest quality, OpenAI API | text-embedding-3-large |
| Free, best quality | BAAI/bge-large-en-v1.5 |
| Fastest local inference | all-MiniLM-L6-v2 |
| Multilingual documents | intfloat/multilingual-e5-large |
| Code search | flax-sentence-embeddings/st-codesearch-distilroberta-base |
| Long documents (>512 tokens) | jinaai/jina-embeddings-v2-base-en |
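
One way to keep this choice explicit in a codebase is a small lookup, so the model name is set in one place. The sketch below simply encodes the table above; the requirement keys and the `recommend` helper are hypothetical names, not part of any library.

```python
# Requirement key (hypothetical naming) -> model name from the table above
RECOMMENDATIONS = {
    "lowest_cost_openai": "text-embedding-3-small",
    "highest_quality_openai": "text-embedding-3-large",
    "free_best_quality": "BAAI/bge-large-en-v1.5",
    "fastest_local": "all-MiniLM-L6-v2",
    "multilingual": "intfloat/multilingual-e5-large",
    "code_search": "flax-sentence-embeddings/st-codesearch-distilroberta-base",
    "long_documents": "jinaai/jina-embeddings-v2-base-en",
}

def recommend(requirement):
    """Return the suggested starting model for a requirement key."""
    try:
        return RECOMMENDATIONS[requirement]
    except KeyError:
        options = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown requirement {requirement!r}; choose from: {options}")

print(recommend("multilingual"))
```

Treat the result as a starting candidate to evaluate on your own data, not a final decision.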

Matryoshka Embeddings (OpenAI text-embedding-3)

OpenAI's text-embedding-3 models support dimension reduction without retraining:

python
from openai import OpenAI
client = OpenAI()

# Full 1536 dimensions
full_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is RAG?"
).data[0].embedding

# Reduce to 512 dimensions (faster search, lower storage)
small_emb = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is RAG?",
    dimensions=512
).data[0].embedding
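
If embeddings were already stored at full size, they can also be shortened after the fact by keeping the leading values and re-normalizing to unit length. The sketch below shows that step in plain Python. It assumes a Matryoshka-style model whose leading dimensions carry most of the information, and the four-value vector is a stand-in, not a real embedding. Passing `dimensions` to the API, as above, remains the simpler route.

```python
import math

def truncate_and_normalize(embedding, dimensions):
    """Keep the first `dimensions` values, then rescale to unit length."""
    truncated = list(embedding[:dimensions])
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # avoid dividing by zero for an all-zero vector
    return [x / norm for x in truncated]

# Stand-in for a stored full-size embedding (hypothetical values)
full_vector = [0.5, 0.5, 0.5, 0.5]

short_vector = truncate_and_normalize(full_vector, 2)
print(len(short_vector))  # 2
print(short_vector)       # unit-length again, so cosine and dot product agree
```

Re-normalizing matters because a truncated vector is no longer unit length, which would skew dot-product similarity scores.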

Decision framework: Start with text-embedding-3-small for OpenAI-based apps (best cost-quality ratio). If you need local/private embeddings, use BAAI/bge-large-en-v1.5. Always evaluate on a sample of your domain data, because benchmark results don't always reflect real-world performance.