Choosing a right Embedding model for your AI use case is critical for quality outcomes

Embeddings

Just to quickly recap if you haven't gone through the link, embeddings form the foundation for achieving precise and contextually relevant LLM outputs. The semantic understanding captured by embeddings facilitates accurate matching between queries and context. While text embedding is a specific use case, embeddings can also be utilised for various types of data, such as images, audio, graphs, and more.

There’s an abundance of important factors and trade-offs that should be considered when choosing an embedding model for your use case.

Embedding models are neural net models (e.g., transformers) that convert unstructured and complex data, such as text documents, images, audios, videos, or even tabular data, into numerical vectors (i.e. embeddings). They also capture their semantic meanings. These vectors serve as representations/indices for data points and are essential building blocks for semantic search and retrieval-augmented generation (RAG), which is the predominant approach used by chatbots and many AI applications.

Embeddings, we generally work with,

Sparse —focusing on relative word weights per document, resulting in a more efficient and interpretable system
Dense — focusing on capturing the overall semantic meaning of words or phrases, making them suitable for tasks like dense retrieval
Multivector — model designs where the interaction between query and document representations occurs late in the process, after both have been independently encoded.
Long Context Dense — models have algorithms that can encapsulate information into dense representations in a multi-dimensional space
Code — offers semantic understanding, allowing it to interpret the intent behind queries related to code snippets or functionalities.

Embeddings Model Players among plethora

Cohere

has a variety of models that cover many different use cases. From generational models taking instructions to models that can be used for estimating semantic similarity between two sentences, choosing a sentence which is most likely to follow another sentence, or categorising user feedback Cohere offers –

Generate,
Command,
Embed,
Rerank

models are available on it’s proprietary, Amazon Sagemaker, Amazon Bedrock, Microsoft Azure, Oracle GenAI service platforms.

Open AI

OpenAI offers two powerful third-generation embedding models that can be used for use cases like,

Search (where results are ranked by relevance to a query string),
Clustering (where text strings are grouped by similarity),
Recommendations (where items with related text strings are recommended),
Anomaly detection (where outliers with little relatedness are identified),
Diversity measurement (where similarity distributions are analysed), and
Classification (where text strings are classified by their most similar label)

Voyage

Voyage AI provides cutting-edge embedding and rerankers. Voyage offers,

Instruction-tuned general-purpose embedding model optimised for clustering, classification, and retrieval.
Specific models for legal and long context retrieval,
code retrieval,
general-purpose embedding model which are optimized for either retrieval quality, or,
general-purpose embedding model which are balanced between cost, latency and retrieval quality.

Mistrel

offers both open-weight models that are highly efficient and available under a fully permissive Apache 2 license and optimised commercial models.

Open-weights models likes of Mistral 7B, Mixtral 8x7B, Mixtral 8x22B

Optimised commercial models that are designed for high performance, with customised deployment options.

Optimised commercial models likes of Mistral Small, Mistral Medium, Mistral Large, and Mistral Embeddings.

Why selection is crucial?

The selection of most appropriate embedding model has a significant impact on the overall relevance and usability of RAG application. Which encoder you select to generate embeddings is a critical decision, hugely impacting the overall success of the RAG system. Low quality embeddings lead to poor retrieval.

Let’s review some of the selection criteria to consider before making your decision.

What all you should be considering?

Vector Dimension and Performance Evaluation -When selecting an embedding model, consider the vector dimension. However, custom performance evaluation on your dataset is essential for accurate performance assessment.

Reliability of APIs - Ensure high availability of the embedding API service. OpenAI and similar providers offer reliable APIs, while open-source models may require additional engineering efforts. Indexing Cost - The cost of indexing documents is influenced by the chosen encoder service Storage Cost of Embeddings - Storage cost increases direcly with the number of dimensions, and the choice of embeddings, this impacts the overall cost. Calculate average units per document to estimate storage cost. Search Latency- The latency of semantic search grows with the dimension of embeddings. Opt for low dimensional embeddings if you have to minimize latency during search. Language Support - Choose a multilingual encoder or use a translation system alongside an English encoder to support non-English languages. Privacy Concerns- Stringent data privacy requirements, especially in sensitive domains like finance and healthcare, may influence the choice of embedding services. Evaluate privacy considerations before selecting a provider. Granularity of text- Various levels of granularity, including word-level, sentence-level, and document-level representations, influence the depth of semantic information embedded. For example, optimizing relevance and minimizing noise in the embedding process can be achieved by segmenting large text into smaller chunks.

Major factors you should watch out for

Average (Retrieval)

Represents average (Normalized Discounted Cumulative Gain @k = 10). It measures the performance of retrieval systems. A higher number indicates a model that is better at ranking relevant items higher in the list of retrieved results.

Model Size

It gives an idea of the computational resources required to run the model.

While retrieval performance scales with model size, it is important to note that model size also has a direct impact on latency.

Max Tokens

Number of tokens that can be compressed into a single embedding. You typically don’t want to put more than a single paragraph of text (~100 tokens) into a single embedding. So even models with max tokens of 512 should be more than enough.

Embedding Dimensions

It represents the length of the embedding vector.

Smaller dimensions offer faster inference and are more storage-efficient, while, more dimensions can capture nuanced details and relationships in the data.

Score

the score we should focus on is “average” and “retrieval average”.

Both are highly correlated, so focusing on either works.

Sequence Length

tells us how many tokens a model can consume and compress into a single embedding.

It should be equal or near to the tokens put into a single embedding.

Embedding Model Performance Leaderboards

MTEB —Evaluates the model performance across 8 tasks and 58 datasets.
BEIR — Benchmarking Information Retrieval supports metrics related to nine tasks
ESB — End-to-end Speech Benchmark compares embedding metrics used for matching speaking styles
HEAR — Holistic Evaluation of Audio Representations benchmarks audio classification and labelling tasks across speech, environmental sound and music

MTEB Leaderboard on Hugging Face

Benchmarks are a good place to begin but bear in mind that these results are self-reported and have been benchmarked on datasets that might not accurately represent the data you are dealing with. So even if you choose a model based on benchmark results, we recommend evaluating it on your dataset.

A good place to start when looking for embedding models to use is the MTEB Leaderboard on Hugging Face. MTEB evaluates the performance of the Embedding model across 8 tasks (text mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization) and 58 datasets.

MTEB Leaderboard has 9 tabs, covering Overall score plus 8 tabs for each task types:

Overall - representing an average performance across these tasks.
Bitext Mining (Metric: F1) - finding parallel sentences in two languages.
Classification (Metric: Accuracy)
Clustering (Metric: Validity Measure (v_measure))
Pair Classification (Metric: Average Precision based on Cosine Similarities (cos_sim_ap))
Reranking (Metric: Mean Average Precision (MAP))
Retrieval (Metric: Normalized Discounted Cumulative Gain @ k (ndcg_at_10))
STS (Metric: Spearman correlation based on cosine similarity)
Summarization (Metric: Spearman correlation based on cosine similarity)

Approach of choosing the Embedding Model

The tradeoff between quality and latency is real and we should always remember about it when choosing an embedding model.

1. [Match it ] Understand the use case, what business functionality actually wants to achieve. Look for appropriate tab in benchmark. 2. [Maximise] Assess Performance, how well embeddings align with my task requirements, ensuring their overall performance meets the specific needs of use case. 3. [Try to strike a balance] Consider Model Size, which correlates with the operational costs associated with storing and loading the model into memory. There’s a trade-off between a larger, more complex model offering enhanced performance and a smaller model with reduced storage and computational requirements. Find the right balance that aligns with the performance goals and resource constraints. 4. [Optimise it] Sequence Length [also called Context Vector Size], Determine the maximum number of input tokens the you need to support. Choose a model with an optimal sequence length that allows the model to effectively capture dependencies in the input data, ensuring robust performance in sequential tasks. 5. [Try to strike a balance] Dimension Size, While smaller dimensions may reduce computational load, they could impact the model’s ability to capture complex relationships in the data. Keep in mind the operational cost of storing and comparing embeddings based on their dimension size and striking a balance. 6. [Match it] Language: Ensure accurate and meaningful embeddings by aligning the model with my language requirements.

5 Conclusive remarks

There’s an abundance of factors and trade-offs that should be considered when choosing an embedding model for a use case.

1. Always remember Upgradability: When building a system with an embedding model you should plan for changes since better models are released all the time and often it’s the simplest way to improve the performance of your system. 2. Look out for Low Latency requirements: If you intend to use a model in a low-latency scenario, it’s better to focus on latency first and then see which models with acceptable latency have the best-in-class performance. Larger models have higher inference time which can severely limit their use in low latency scenarios as often an embedding model is a pre-processing step in a larger pipeline. 3. Lower Dimensional models: For tasks emphasising efficiency over capturing all aspects in a particular language, a smaller vocabulary with lower-dimensional embeddings may offer a favourable trade-off. 4. Wisely choose Larger models: One key aspect to emphasise here is that "the larger models have higher costing associated with them." The score of a potential model in common benchmarks is important, but we should not forget that it’s the larger models that have a better score. Also, larger models require GPUs to run. 5. Experimenting: with these 3–4 embedding models and selecting the best one is another recommendation from us. When we use an embedding model for search, we run it twice:

When doing offline indexing of available data

When embedding a user query for a search request We need to understand two important consequences of this - The 1st consequence is that when we change or upgrade an embedding model we have to reindex all existing data. With newer and better models are released all the time, keep model upgradability in mind. Upgrading a model is the easiest way to improve overall system performance. The 2nd consequence of using an embedding model for user queries is that as the number of users goes up, the inference latency becomes very crucial to manage. Model inference takes more time for better-performing models. It commonly turns out that smaller, leaner models are still very important in a higher-load production scenario. Also, it is becoming really common to create your own embedding models which are customised for use cases.