How to Choose the Right Embedding Model

Embeddings are critical for Retrieval-Augmented Generation (RAG) systems. They are often created using large language models (LLMs) trained on extensive datasets. Selecting the right embedding model involves evaluating your use case, data requirements, and performance needs.

Below are a few key considerations:

Static or Contextual Embeddings

Determine whether your use case requires static embeddings (which remain the same regardless of context) or contextual embeddings (which vary based on the surrounding content, producing different vectors for the same word depending on its context within a sentence).

Static Embeddings

Static embeddings assign a fixed vector to each word, independent of the context or the sequence in which the word appears.

Embedding Models: Word2Vec, GloVe, Doc2Vec, TF-IDF, etc.

Example:

User Query: “I need help with my account settings.”
Knowledge Base 1: “Settings for your account are currently unavailable.”

In this example, the words "account" and "settings" would have the same vector representations in both the user query and Knowledge Base 1, regardless of their specific context.

Techniques like Word2Vec, GloVe, Doc2Vec (dense vector-based), and TF-IDF (keyword/sparse vector-based) enable the system to find relevant results based on the cosine similarity of these vectors.
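
To make this concrete, here is a minimal sketch of sparse static retrieval, assuming scikit-learn's TfidfVectorizer (one of several possible choices) and reusing the query and knowledge-base strings from the example above:

```python
# Minimal sketch: sparse static retrieval with TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "I need help with my account settings."
knowledge_base = ["Settings for your account are currently unavailable."]

vectorizer = TfidfVectorizer()
# Fit on the knowledge base plus the query so both share one vocabulary.
vectors = vectorizer.fit_transform(knowledge_base + [query])

kb_vectors, query_vector = vectors[:-1], vectors[-1]
scores = cosine_similarity(query_vector, kb_vectors)[0]
print(scores)  # higher = more keyword overlap; context is ignored
```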

 

Limitations:

  • Polysemy Issue: The word “account” may have multiple meanings (e.g., a social media account or a financial account), leading to confusion since every sense shares the same vector.
  • Context Insensitivity: Static embeddings cannot differentiate between the various issues related to account settings. For instance, they cannot indicate whether the settings are unavailable due to technical problems, user permissions, or other reasons.

 

Contextual Embeddings

Contextual embeddings generate vectors that vary based on the surrounding text, capturing bidirectional or focused context.

 

  • Bidirectional: This approach captures context from both the left and right sides of a word within a sentence, resulting in a comprehensive understanding of the entire sentence.
  • Focused Context: This method is specifically tailored for grasping context in shorter text segments, such as sentences or paragraphs.

 

Embedding Models: BERT, RoBERTa, all-MiniLM-L6-v2, SBERT, ColBERT, etc.

 

Example:

User Query: “I need help with my account settings.”
Knowledge Base 1: “Settings for your account are currently unavailable.”
Knowledge Base 2: “Unable to update settings due to insufficient permissions.”
Knowledge Base 3: “Settings change failed due to system error.”
Knowledge Base 4: “User cannot access account settings page.”

Models like BERT and RoBERTa (masked language models), sentence-level encoders such as SBERT and all-MiniLM-L6-v2, and Paraphrase-MPNet-Base-v2 (built on a permuted language model) effectively understand context, linking phrases like “help with my account settings” to related issues such as “unable to update settings” and “cannot access settings page.” This makes them excellent choices for the retrieval process.
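
As an illustration, here is a minimal retrieval sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint mentioned above:

```python
# Minimal sketch: contextual retrieval with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I need help with my account settings."
knowledge_base = [
    "Settings for your account are currently unavailable.",
    "Unable to update settings due to insufficient permissions.",
    "Settings change failed due to system error.",
    "User cannot access account settings page.",
]

# Encode the query and documents into dense contextual vectors.
query_emb = model.encode(query, convert_to_tensor=True)
kb_embs = model.encode(knowledge_base, convert_to_tensor=True)

# Rank the knowledge-base entries by cosine similarity to the query.
scores = util.cos_sim(query_emb, kb_embs)[0]
for score, doc in sorted(zip(scores.tolist(), knowledge_base), reverse=True):
    print(f"{score:.3f}  {doc}")
```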

 

ColBERT (Contextualized Late Interaction over BERT) is commonly paired with a fast first-stage retriever such as BM25 and then applies BERT-based, token-level contextual embeddings for detailed re-ranking, optimizing both efficiency and contextual accuracy in information retrieval tasks.

  • Process Step 1: ColBERT first retrieves candidate documents using BM25.
  • Process Step 2: It identifies key information such as “Settings for your account are currently unavailable.”, “Unable to update settings due to insufficient permissions.”, “User cannot access account settings page.”, and “Settings change failed due to system error.”
  • Process Step 3: ColBERT then applies BERT-based contextual embeddings to re-rank the results, ensuring the most relevant and accurate responses are prioritized (a sketch of the scoring idea follows below).
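For intuition, here is a rough sketch of the late-interaction (MaxSim) scoring idea. It uses a generic bert-base-uncased encoder rather than a trained ColBERT checkpoint, so it illustrates only the mechanism, not production-quality ColBERT retrieval:

```python
# Late-interaction (MaxSim) sketch: each query token is matched to its
# most similar document token, and the per-token maxima are summed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state[0]  # (tokens, dim)
    return torch.nn.functional.normalize(out, dim=-1)

def maxsim(query: str, doc: str) -> float:
    q, d = token_embeddings(query), token_embeddings(doc)
    return (q @ d.T).max(dim=1).values.sum().item()

query = "I need help with my account settings."
for doc in ["Settings for your account are currently unavailable.",
            "Settings change failed due to system error."]:
    print(round(maxsim(query, doc), 2), doc)
```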

Limitations:

  • Context Limitation: Masked and permuted language models excel at understanding context within a specific text span, such as a sentence or paragraph. However, they cannot generate text or perform tasks beyond understanding and retrieving relevant documents.

GPT (Generative Pre-trained Transformer) Embeddings

GPT embedding models utilize a method called "transformer embeddings," which captures not only the individual meanings of words but also the context in which they occur. The GPT architecture processes a sequence of words (or tokens) through several layers of transformer blocks. Each block transforms the input into a new sequence of vectors that reflect both the meanings of the words and their relationships with one another. This makes GPT embeddings much more dynamic and context-aware than traditional word embeddings.

     

  • Unidirectional: GPT embeddings consider context only from the left side; they build understanding sequentially, as when generating text.
  • Broad Context: They can maintain coherence over longer text sequences, which makes them particularly effective for producing extended passages of text.

     

Embedding Models: GTR-T5, text-embedding-3-large, google-gecko-text-embedding, amazon-titan, etc.

Example:

User Query: “I need help with my account settings.”
Knowledge Base 1: “Settings for your account are currently unavailable.”
Knowledge Base 2: “Unable to update settings due to insufficient permissions.”
Knowledge Base 3: “Settings change failed due to system error.”
Knowledge Base 4: “User cannot access account settings page.”
Knowledge Base 5: “Please ensure your account is verified to access settings options.”

Generative-based embeddings are a good fit for the generation step of RAG. They understand that phrases like “Settings for your account are currently unavailable.” and “User cannot access account settings page.” describe related issues, and they facilitate generating responses that are contextually relevant and draw on broader context, for example “Please ensure your account is verified to access settings options.”
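
As a sketch, assuming the openai Python client, an OPENAI_API_KEY in the environment, and the text-embedding-3-large model listed above:

```python
# Minimal sketch: embeddings from a generative-model provider, scored
# with cosine similarity (numpy used for the vector math).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

query = "I need help with my account settings."
docs = [
    "Settings for your account are currently unavailable.",
    "Please ensure your account is verified to access settings options.",
]

resp = client.embeddings.create(model="text-embedding-3-large",
                                input=[query] + docs)
vecs = np.array([item.embedding for item in resp.data])

query_vec, doc_vecs = vecs[0], vecs[1:]
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(scores)
```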

     

Limitations:

  • GPT embeddings may struggle with generating accurate responses for highly specialized or niche topics due to limited exposure during training.
  • GPT models tend to require more computational resources than purely contextual models like BERT.

General vs. Domain-Specific Models

For your use case, decide whether you need a general-purpose model that can handle a wide range of topics or a domain-specific model tailored to a particular field or industry.

When to Use Generic Embedding Models:

  • General Knowledge Tasks: Suitable for broad applications that don’t require deep domain-specific knowledge, like general question-answering or summarization.
  • Rapid Development: Ideal for quickly building RAG systems when specific domain data is unavailable or when testing concepts.
  • Diverse Query Handling: Effective for handling a wide range of user queries that span multiple topics or domains.
  • Scalability: Works well when large, diverse datasets are available, allowing for better coverage of common knowledge.

When to Use Domain-Specific Embedding Models:

  • Specialized Knowledge Retrieval: Essential for retrieving and generating content in niche areas like healthcare, law, or scientific research, where accuracy is critical.
  • Contextual Understanding: Important when user queries require a nuanced understanding of specific terminology and concepts.
  • High Precision Requirements: Necessary when the quality and precision of responses are paramount, such as in medical diagnoses or legal interpretations.
  • Fine-Tuned Response Generation: Useful when responses need to be tailored to specific contexts or standards relevant to a particular domain.

Exploring Generic and Domain-Specific Embedding Models

Generic Embedding Models:

  • Word2Vec: A classic model for generating vector representations of words based on their contexts.
  • Global Vectors for Word Representation (GloVe): An unsupervised model that creates word embeddings from word co-occurrence statistics.
  • FastText: An extension of Word2Vec that accounts for sub-word information to improve understanding of rare words.
  • Bidirectional Encoder Representations from Transformers (BERT): A transformer-based model suitable for various NLP tasks, including those requiring contextual understanding.

Domain-Specific Embedding Models:

  • BioBERT: Fine-tuned for biomedical text retrieval and generation.
  • ClinicalBERT: Trained on clinical narratives for healthcare tasks.
  • LegalBERT: Tailored for legal documents and queries.
  • FinancialBERT: Fine-tuned for financial texts like news and reports.

Evaluating Accuracy of Domain-Specific Models

Compare BERT and BioBERT by creating embeddings for two sentences and measuring their cosine similarity.

[Figure: example code and results for evaluating BERT and BioBERT embeddings]
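
A minimal sketch of such a comparison, assuming the Hugging Face transformers package and the publicly available bert-base-uncased and dmis-lab/biobert-v1.1 checkpoints (the example sentences and the mean-pooling step are illustrative choices):

```python
# Compare BERT vs. BioBERT: embed two related biomedical sentences and
# measure cosine similarity; a domain-tuned model should score them
# as more similar.
import torch
from transformers import AutoModel, AutoTokenizer

sentences = [
    "The patient was administered 5 mg of warfarin.",
    "Anticoagulant dosage was adjusted for the patient.",
]

def similarity(model_name: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
        # Mean-pool token embeddings into one vector per sentence.
        mask = batch["attention_mask"].unsqueeze(-1)
        emb = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()

for name in ["bert-base-uncased", "dmis-lab/biobert-v1.1"]:
    print(name, round(similarity(name), 3))
```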

Choosing Between Open-Source and Closed-Source Embedding Models: A Practical Guide

Choose between open-source models (many of which are evaluated on the Massive Text Embedding Benchmark, MTEB) and closed-source models that offer proprietary benefits.

Open-Source Embedding Models

  • Local Accessibility: These models are easy to implement and run on your own cloud or local infrastructure.
  • Cost-Effective: Running them locally is free, and they can be more affordable than commercial options for paid inference.
  • Data Privacy: Ideal for scenarios where you want to keep your data private and avoid sharing it with external APIs.
  • Control Over Processing: They provide greater flexibility and control over your search pipeline and data handling.
  • Local Data Utilization: Particularly beneficial when you have substantial local datasets that you wish to analyse.

When to Use:

  • If you prioritize data privacy and do not want to share sensitive information.
  • When you have large volumes of data that need processing close to where they are stored.
  • When you require customization and control over your embedding and search processes.

When Not to Use:

  • If you lack the technical expertise to implement and maintain the model.
  • When you need guaranteed support and reliability, which may not be available for open-source solutions.
  • When you need access to the latest features and optimizations offered by commercial embedding services.

Comparison of Open-Source Embedding Models Using Mean Reciprocal Rank (MRR)

[Figure: MRR analysis result for open-source embedding models]
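
For reference, MRR averages the reciprocal rank of the first relevant result over all queries; a minimal sketch with made-up relevance judgments:

```python
# Mean Reciprocal Rank: average of 1 / rank of the first relevant hit.
def mrr(ranked_relevance: list[list[bool]]) -> float:
    total = 0.0
    for flags in ranked_relevance:
        rank = next((i + 1 for i, rel in enumerate(flags) if rel), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_relevance)

# Hypothetical results for three queries; True marks a relevant document.
print(mrr([[False, True], [True, False], [False, False, True]]))
# (1/2 + 1/1 + 1/3) / 3 ≈ 0.611
```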

Closed-Source Embedding Models

  • High Inference Speed: These models typically offer very fast inference times, although costs can accumulate with each token processed.
  • Optimized Performance: They often excel in specific applications, such as multilingual tasks or generating instruction-based embeddings.

When to Use:

  • If your project involves complex tasks like multilingual processing or specific embedding requirements that benefit from specialized optimization.
  • When you require rapid inference speeds for high-volume applications.
  • When you want to evaluate the model’s performance using a free trial before making a financial commitment.

When Not to Use:

  • When you need full control over your model and data, as closed-source options may limit customization.
  • If you have budget constraints, as costs can quickly add up with token-based pricing.
  • If you prefer an open-source solution that allows for greater transparency and community support.

Additional Key Considerations

The MTEB (Massive Text Embedding Benchmark) Leaderboard is an excellent resource for exploring the current landscape of both proprietary and open-source text embedding models, particularly for Retrieval-Augmented Generation (RAG) applications. It provides a comprehensive overview of each model, detailing important metrics such as model size, memory requirements, embedding dimensions, maximum token capacity, and performance scores across various tasks, including retrieval, summarization, clustering, reranking, and classification.

  • Max Tokens: Consider the maximum number of tokens the model can process, which can affect performance and relevance.

    The maximum token count is the upper limit of tokens that can be processed into a single embedding. In RAG, an ideal chunk size is usually around a single paragraph or less, typically about 100 tokens, so for most applications models with a maximum token capacity of 512 are more than sufficient. However, certain situations call for embedding longer texts, which requires a model with a larger context window.

  • Retrieval Average: Look at how effectively the model retrieves relevant information, as this impacts overall quality.

    The Retrieval Average metric reflects the average Normalized Discounted Cumulative Gain (NDCG) at rank 10 across multiple datasets. NDCG is widely used to evaluate the effectiveness of retrieval systems; a higher score indicates that the embedding model excels at placing relevant items at the top of the retrieved results. This is particularly important in RAG applications, where the quality of the retrieved information significantly impacts the relevance of the generated responses.
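
    As a sketch, NDCG@10 discounts each result’s relevance by the log of its rank and normalizes by an ideal ordering (the graded relevance values below are hypothetical):

```python
import math

# NDCG@k: discounted cumulative gain of the ranking, normalized by the
# gain of an ideal (perfectly sorted) ranking.
def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top results, in retrieved order.
print(round(ndcg_at_k([3, 0, 2, 1]), 3))  # ≈ 0.93
```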

  • Embedding Dimensionality: Assess the dimensionality of the embeddings, as higher dimensions can capture more detail but may also increase complexity.

    Embedding dimensionality refers to the length of the vector the model generates. Smaller dimensions lead to faster inference and are more efficient in terms of storage, making them ideal for quick retrieval in RAG applications, but they may sacrifice some accuracy of semantic representation. Larger dimensions allow for greater expressiveness, enabling the model to better capture intricate relationships and patterns within the data, at the cost of slower search times and increased memory requirements. The goal is an optimal balance between capturing the complexity of the data and maintaining operational efficiency, especially when determining chunk sizes for effective embedding in RAG tasks.

  • Model Size: Take into account the size of the model, which influences both computational resources and deployment capabilities.

    This refers to the size of the embedding model, measured in gigabytes (GB), which indicates the computational resources needed to run it. While larger models typically offer improved retrieval performance, increased model size also leads to higher latency. This latency-performance trade-off is particularly significant in production environments, where response time is critical for effective RAG applications; balancing model size and latency is essential for optimizing both performance and user experience.