Fine-tuning Passage Embeddings with GenQ

State-of-the-art semantic search applications like extractive question answering and Retrieval Augmented Generation (RAG) have gained significant attention recently, allowing users to query a large document corpus using natural language questions. One of the most important components in these systems is the embedding model, which maps input text into a vector, providing a numeric representation of the semantic meaning of the text. An effective model is tuned to place passages with similar meaning in similar locations in the embedding space. In a process known as retrieval, this space is efficiently interrogated to find the most relevant documents given an input query.
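As a minimal sketch of this idea (the model name and example texts below are illustrative placeholders, not the ones used in these notebooks), an off-the-shelf embedding model can map a query and a set of passages into the same vector space, and retrieval reduces to ranking the passages by cosine similarity:

```python
# Minimal retrieval sketch; the model and texts are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

passages = [
    "For training a large NLP model, start with the smallest GPU instance of the P3 or G5 families.",
    "Use lifecycle configurations to customize a notebook instance at startup.",
]
query = "What instance type should I use for an NLP model?"

# Map the query and passages into the shared embedding space.
passage_embeddings = model.encode(passages, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieval: rank passages by cosine similarity to the query.
scores = util.cos_sim(query_embedding, passage_embeddings)[0]
best = int(scores.argmax())
print(f"Best passage ({scores[best]:.3f}): {passages[best]}")
```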

In practice, general-purpose language models do not perform well on this retrieval task, particularly when dealing with documents related to a specialized domain, like medicine or law. The typical solution is to "fine-tune" a generic model to the particular task and field of interest by feeding it thousands of labeled training examples. For a semantic search engine, the most valuable dataset would be human-generated (query, passage) pairs.

For example, this excerpt from the SageMaker Developer Guide contains an appropriate answer to the following query:

Query: What instance type should I use for an NLP model?

Relevant Passage: Choose the right instance type and infrastructure management tools for your use case. You can start from a small instance and scale up depending on your workload. For training a model on a tabular dataset, start with the smallest CPU instance of the C4 or C5 instance families. For training a large model for computer vision or natural language processing, start with the smallest GPU instance of the P2, P3, G4dn or G5 instance families.

However, it would be extremely expensive and time-consuming to have humans generate questions for thousands of documents. In this series of notebooks, we will demonstrate an approach called GenQ, in which we leverage a separate model to automatically generate so-called "synthetic" queries for our dataset. For more details about GenQ, see the BEIR paper (Thakur et al., 2021) in the References below.
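To make this concrete, here is a rough sketch of synthetic query generation, assuming a doc2query-style T5 generator such as the publicly available BeIR/query-gen-msmarco-t5-base-v1 checkpoint; the exact generator model and decoding settings used in the notebooks may differ:

```python
# Sketch of synthetic query generation for GenQ; the generator checkpoint and
# decoding parameters are assumptions, not necessarily those used in the notebooks.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
generator = T5ForConditionalGeneration.from_pretrained(model_name)

passage = (
    "For training a large model for computer vision or natural language processing, "
    "start with the smallest GPU instance of the P2, P3, G4dn or G5 instance families."
)

inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
# Sample several candidate questions per passage to form (query, passage) training pairs.
outputs = generator.generate(
    **inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```

Each generated question is paired with the passage it was produced from, yielding the synthetic (query, passage) training set used for fine-tuning.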

Getting Started

This demonstration is divided into three notebooks, which are designed to be completed in order.

  1. 01-Data-Preparation: initialization, dataset preparation, generating synthetic queries
  2. 02-Finetune-and-Deploy-Model: fine-tuning the embedding model and deploying it to a SageMaker endpoint
  3. 03-Applications-and-Evaluation: comparing the fine-tuned model to a baseline and using it for RAG and extractive QA

Throughout the demo, we will leverage frameworks like LangChain and Hugging Face to simplify and operationalize the workflow. We will also use Amazon SageMaker for fully managed and scalable training and inference (both batch and real-time).
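For orientation, the fine-tuning step in notebook 02 can be sketched with the sentence-transformers library and a contrastive loss such as MultipleNegativesRankingLoss over the synthetic (query, passage) pairs; the base model, loss, and hyperparameters below are assumptions for illustration rather than the exact settings used in the notebook:

```python
# Rough fine-tuning sketch; base model, loss, and hyperparameters are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Each example pairs a synthetic query with the passage it was generated from.
train_examples = [
    InputExample(texts=[
        "What instance type should I use for an NLP model?",
        "For training a large model for natural language processing, start with the "
        "smallest GPU instance of the P2, P3, G4dn or G5 instance families.",
    ]),
    # ... thousands more synthetic (query, passage) pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives: other passages in the batch act as negatives for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("finetuned-passage-embedding-model")
```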

These notebooks are designed to be run on a SageMaker notebook instance using the conda_pytorch_p39 kernel, which greatly simplifies configuration and initialization.

License

This repository uses the MIT-0 license.

Although this repository is released under the MIT-0 license, its example dataset is drawn from the third-party PubMed summarization repository, whose licensing includes the Apache 2.0 license.

Developers

See our Code of Conduct and Contributing Guidelines for information on how to contribute to this repository.

References

Cohan, A., Dernoncourt, F., Kim, D. S., Bui, T., Kim, S., Chang, W., & Goharian, N. (2018). A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 615–621. doi:10.18653/v1/N18-2097

LangChain Documentation

SageMaker Python SDK

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.

Thakur, N., Reimers, N., Rückle, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. ArXiv, abs/2104.08663.
