Fine-tuning Passage Embeddings with GenQ

Semantic search applications such as extractive question answering and Retrieval-Augmented Generation (RAG) have gained significant attention recently because they let users query a large document corpus with natural language questions. One of the most important components of these systems is the embedding model, which maps input text to a vector: a numeric representation of the text's semantic meaning. An effective model is tuned to place passages with similar meanings close together in the embedding space. In a process known as retrieval, this space is searched efficiently to find the documents most relevant to an input query.
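To make the idea of retrieval concrete, here is a minimal sketch that ranks passages by cosine similarity to a query vector. The three-dimensional vectors and passage IDs are made up for illustration; in a real system the vectors would come from an embedding model, and the brute-force scan would usually be replaced by an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, passage_vectors, top_k=1):
    """Return the IDs of the top_k passages most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), passage_id)
        for passage_id, vector in passage_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [passage_id for _, passage_id in scored[:top_k]]


# Hypothetical, hand-written vectors standing in for real model output.
passage_vectors = {
    "instance-types": [0.9, 0.1, 0.0],
    "billing": [0.1, 0.9, 0.1],
    "security": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user's question

top = retrieve(query_vector, passage_vectors, top_k=2)
print(top)
```

The quality of the ranking depends entirely on how well the embedding model places related queries and passages near each other, which is what fine-tuning aims to improve.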

In practice, general-purpose language models often do not perform well on this retrieval task, particularly when the documents belong to a specialized domain such as medicine or law. The typical solution is to "fine-tune" a generic model for the particular task and field of interest by feeding it thousands of labeled training examples. For a semantic search engine, the most valuable dataset would be human-generated (query, passage) pairs.

For example, the following excerpt from the SageMaker Developer Guide contains an appropriate answer to the query below:

Query: What instance type should I use for an NLP model?

Relevant Passage: Choose the right instance type and infrastructure management tools for your use case. You can start from a small instance and scale up depending on your workload. For training a model on a tabular dataset, start with the smallest CPU instance of the C4 or C5 instance families. For training a large model for computer vision or natural language processing, start with the smallest GPU instance of the P2, P3, G4dn or G5 instance families.

However, it would be extremely expensive and time-consuming to have humans write questions for thousands of documents. In this series of notebooks, we demonstrate an approach called GenQ, in which a separate model automatically generates so-called "synthetic" queries for our dataset. For more details about GenQ, see the NeurIPS publication.
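The data-generation step can be sketched roughly as follows. This is only an illustration of the shape of the output, not the notebooks' actual code: `build_synthetic_pairs` and `placeholder_generator` are hypothetical names, and the placeholder simply fabricates strings where a real workflow would call a query-generation model.

```python
def build_synthetic_pairs(passages, generate_queries, queries_per_passage=3):
    """Pair each passage with the synthetic queries generated for it.

    `generate_queries` is any callable taking (passage, n) and returning
    a list of query strings; in practice it would wrap a generative model.
    """
    pairs = []
    for passage in passages:
        for query in generate_queries(passage, queries_per_passage):
            query = query.strip()
            if query:  # skip empty generations
                pairs.append((query, passage))
    return pairs


def placeholder_generator(passage, n):
    """Hypothetical stand-in for a real query-generation model."""
    first_words = " ".join(passage.split()[:4])
    return [f"Question {i + 1} about: {first_words}?" for i in range(n)]


passages = [
    "Choose the right instance type and infrastructure management tools for your use case.",
    "You can start from a small instance and scale up depending on your workload.",
]

pairs = build_synthetic_pairs(passages, placeholder_generator, queries_per_passage=2)
for query, passage in pairs:
    print(query, "->", passage[:40])
```

The resulting (query, passage) pairs play the same role as human-labeled pairs when fine-tuning the embedding model.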

Getting Started

This demonstration is split into three notebooks, which are designed to be completed in order.

  1. 01-Data-Preparation: initialization, dataset preparation, generating synthetic queries
  2. 02-Finetune-and-Deploy-Model: fine-tuning the embedding model and deploying it to a SageMaker endpoint
  3. 03-Applications-and-Evaluation: comparing the fine-tuned model to a baseline and using it for RAG and extractive QA

Throughout the demo, we use frameworks like LangChain and Hugging Face to simplify and operationalize the workflow. We also use Amazon SageMaker for fully managed, scalable training and inference (batch and real-time).

These notebooks are designed to be run on a SageMaker Notebook using the conda_pytorch_p39 kernel, which will greatly simplify configuration and initialization.


