fix: benchmark of embedding and reranking
LongxingTan committed Jun 29, 2024
1 parent 26ad6e9 commit df31d6e
Showing 21 changed files with 356 additions and 223 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -36,7 +36,7 @@ jobs:
- name: Install dependencies
shell: bash
run: |
pip install -r requirements.txt
pip install --no-cache-dir -r requirements.txt
pip install --extra-index-url https://pypi.org/simple --no-cache-dir coverage pytest codecov-cli>=0.4.1
- name: Run unittest
31 changes: 13 additions & 18 deletions README.md
@@ -29,18 +29,17 @@

![structure](./docs/source/_static/structure.png)

**Open-retrievals** simplifies text embeddings, retrievals, ranking, and RAG using PyTorch and Transformers. This user-friendly framework is designed for information retrieval and LLM generation.
- Embeddings, retrieval and rerank all-in-one: `AutoModelForEmbedding`
- Contrastive learning/LLM enhanced embeddings, with point-wise, pairwise and listwise fine-tuning
- Cross-encoder, ColBERT and LLM reranker
- Fast RAG easily integrated with Langchain and LlamaIndex

| Exp | Model | Original | Finetune | Demo |
|----------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.701** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed llm finetune (LoRA) | Qwen2-1.5B-Instruct | 0.541 | **0.690** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.691** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.683** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
**Open-retrievals** improves and unifies text embedding, retrieval, reranking, and RAG.
- Embedding fine-tuning with point-wise, pairwise, and listwise objectives, contrastive learning, and LLM backbones.
- Reranking fine-tuning with cross-encoder, ColBERT, and LLM rerankers.
- Easily build enhanced RAG, integrated with Transformers, Langchain, and LlamaIndex.

| Exp | Model | Original | Finetune | Demo |
|---------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |

* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT rows are zero-shot.
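
For reference, a rough sketch of how MAP can be computed over ranked relevance labels (a hypothetical helper, not part of open-retrievals; the benchmark's exact preprocessing may differ):

```python
def average_precision(ranked_labels):
    # ranked_labels: relevance labels (1/0) in the order produced by the model
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_ranked_labels):
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)

print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))  # ~0.667
```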

@@ -186,8 +185,6 @@ print(response)

**Embedding fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -228,8 +225,6 @@ trainer.train()

**Rerank Fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -251,14 +246,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=max_length),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
25 changes: 10 additions & 15 deletions README_ja-JP.md
@@ -34,6 +34,13 @@
- Contrastive learning embeddings, LLM embeddings
- Fast RAG demo

| Exp | Model | Size | Original | Finetune | Demo |
|---------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |


## Installation

@@ -49,14 +56,6 @@ pip install peft  # if needed
pip install open-retrievals
```

[//]: # (**With conda**)

[//]: # (```shell)

[//]: # (conda install open-retrievals -c conda-forge)

[//]: # (```)


## Quick start

@@ -180,8 +179,6 @@ print(response)

**Fine-tune transformer weights with contrastive learning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -226,15 +223,13 @@ from retrievals import AutoModelForEmbedding

model = AutoModelForEmbedding.from_pretrained(
"mistralai/Mistral-7B-v0.1",
pooling_method='cls',
pooling_method='last',
query_instruction=f'Instruct: Retrieve semantically similar text\nQuery: '
)
```

**Rerank**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -256,14 +251,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=128),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
40 changes: 13 additions & 27 deletions README_zh-CN.md
@@ -29,19 +29,19 @@

![structure](./docs/source/_static/structure.png)

**Open-Retrievals** helps developers conveniently apply text embeddings in information retrieval, large language models, and related areas, quickly building retrieval, ranking, and RAG applications.
- `AutoModelForEmbedding` unifies embedding, retrieval, and reranking
- Multiple fine-tuning options for embedding and reranking models: contrastive learning, LLM, point-wise, pairwise, listwise
- Customizable RAG framework, with fine-tuned models also usable directly in Langchain and LlamaIndex
**Open-Retrievals** unifies embedding, retrieval, and reranking, helping developers conveniently optimize information retrieval and LLM RAG applications
- Full embedding fine-tuning support: contrastive learning, LLM, point-wise, pairwise, listwise
- Full reranking fine-tuning support: cross encoder, ColBERT, LLM
- Customizable RAG framework, with fine-tuned models usable directly in Transformers, Langchain, and LlamaIndex

| Exp | Model | Original | Finetune | Demo |
|-----------------------------|---------------------------|----------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.701** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)|
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.541 | **0.690** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing)|
| rerank cross encoder | bge-reranker-base | 0.666 | **0.691** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)|
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.683** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing)|
| Exp | Model | Size | Original | Finetune | Demo |
|------------------------------|-------------------------|-----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |

* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores of the large model and LLM are zero-shot
* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT rows are zero-shot


## Installation
@@ -189,18 +189,6 @@ print(response)

**Embedding model fine-tuning**

[//]: # (- Model performance fine-tuned in [T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking))

[//]: # ()
[//]: # (| Model | Size | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | AP<sub>75</sub><sup>val</sup> |)

[//]: # (| :-- | :-: | :-: | :-: | :-: |)

[//]: # (| TripletLoss | 672 | 47.7% |52.6% | 61.4% |)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -268,8 +256,6 @@ torchrun --nproc_per_node 1 \

**Reranker fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -291,14 +277,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=max_length),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
Binary file modified docs/source/_static/structure.png
50 changes: 46 additions & 4 deletions docs/source/embed.rst
@@ -10,9 +10,40 @@ Pretrained
We can use `AutoModelForEmbedding` to get sentence embeddings from a pretrained transformer or large language model.


Fine-tune
------------------

point-wise

- `{(query, label), (document, label)}`


pairwise

- `{(query, positive, label), (query, negative, label)}`

- `{(query, positive, negative), (query, positive, negative), ...}`

- `{(query, positive, negative1, negative2, negative3...)}`

listwise

- `{(query+positive)}`
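
As a rough illustration of these formats (field names here are assumptions for the example, not a required schema), the records could look like:

.. code-block:: python

    # point-wise: each text carries its own label (e.g. for a classification-style loss such as arcface)
    pointwise_records = [
        {"text": "how to install open-retrievals", "label": 1},
        {"text": "tomorrow's weather forecast", "label": 0},
    ]

    # pairwise: a query with a positive document and one or more negatives
    pairwise_records = [
        {
            "query": "how to install open-retrievals",
            "positive": "pip install open-retrievals",
            "negative": "conda is a package manager",
        },
    ]

    # listwise / in-batch: only (query, positive) pairs are stored;
    # the candidate list comes from the other examples in the batch
    listwise_records = [
        {"query": "how to install open-retrievals", "positive": "pip install open-retrievals"},
    ]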


Loss function
~~~~~~~~~~~~~~~~~~~~~~

- binary classification (pairwise, sketched below):
  - goal: similarity(query, positive) > similarity(query, negative)
  - hinge loss: max(0, margin - similarity(query, positive) + similarity(query, negative))
  - logistic loss: log(1 + exp(similarity(query, negative) - similarity(query, positive)))
- multi-class classification (listwise):
  - softmax cross entropy over [similarity(query, positive), similarity(query, negative1), similarity(query, negative2), ...]
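
A minimal PyTorch sketch of the pairwise losses above, written against plain similarity-score tensors (an illustration, not the package's loss API):

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def hinge_loss(pos_sim, neg_sim, margin=0.5):
        # penalize pairs where the positive does not beat the negative by at least the margin
        return torch.clamp(margin - (pos_sim - neg_sim), min=0).mean()

    def logistic_loss(pos_sim, neg_sim):
        # softplus(neg - pos) == log(1 + exp(neg - pos)) == -log sigmoid(pos - neg)
        return F.softplus(neg_sim - pos_sim).mean()

    pos_sim = torch.tensor([0.82, 0.75])  # similarity(query, positive)
    neg_sim = torch.tensor([0.40, 0.77])  # similarity(query, negative)
    print(hinge_loss(pos_sim, neg_sim), logistic_loss(pos_sim, neg_sim))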


Pair wise
----------------------
~~~~~~~~~~~~~

.. code-block:: python
@@ -54,7 +85,7 @@ Pair wise
Point wise
-------------------
~~~~~~~~~~~~~

If the positive and negative labels are noisy, plain point-wise cross-entropy may not be the best choice. Pairwise objectives only compare examples relatively, and a hinge loss with a margin can be more robust.

Expand All @@ -67,9 +98,20 @@ arcface


List wise
-------------------
~~~~~~~~~~~~~~


Enhance the performance
--------------------------------------

* Pretrain
* In-batch negatives (sketched below)
* Hard negatives, multiple rounds of negative mining
* Cross-batch negatives
* Knowledge distillation from a cross-encoder
* MaxSim (multi-vector, as in ColBERT)
* Matryoshka representation learning
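
A hedged sketch of the in-batch negative idea (shapes and the temperature value are illustrative assumptions): every other query's positive in the batch serves as a negative, and the diagonal of the similarity matrix marks the true pairs.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def in_batch_negative_loss(query_emb, doc_emb, temperature=0.05):
        # query_emb, doc_emb: (batch_size, dim); row i of doc_emb is the positive for query i
        query_emb = F.normalize(query_emb, dim=-1)
        doc_emb = F.normalize(doc_emb, dim=-1)
        scores = query_emb @ doc_emb.T / temperature  # (batch_size, batch_size)
        labels = torch.arange(scores.size(0), device=scores.device)
        return F.cross_entropy(scores, labels)  # diagonal entries are the positives

    queries = torch.randn(8, 768)
    documents = torch.randn(8, 768)
    print(in_batch_negative_loss(queries, documents))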


Hard negative
--------------------
~~~~~~~~~~~~~~~~
14 changes: 1 addition & 13 deletions docs/source/retrieval.rst
@@ -2,22 +2,10 @@ Retrieval
========================


Offline document encoding
Offline indexing
----------------------------



Query retrieval
----------------------------



To enhance the retrieval performance:

* Pretrain
* In batch negative
* Hard negative, multiple rounds negative
* Cross batch negative
* knowledge distill from cross encoder
* maxsim (multi vector)
* Matryoshka