fix: benchmark of embedding and reranking
LongxingTan committed Jun 29, 2024
1 parent 26ad6e9 commit df31d6e
Showing 21 changed files with 356 additions and 223 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -36,7 +36,7 @@ jobs:
- name: Install dependencies
shell: bash
run: |
pip install -r requirements.txt
pip install --no-cache-dir -r requirements.txt
pip install --extra-index-url https://pypi.org/simple --no-cache-dir coverage pytest codecov-cli>=0.4.1
- name: Run unittest
31 changes: 13 additions & 18 deletions README.md
@@ -29,18 +29,17 @@

![structure](./docs/source/_static/structure.png)

**Open-retrievals** simplifies text embeddings, retrievals, ranking, and RAG using PyTorch and Transformers. This user-friendly framework is designed for information retrieval and LLM generation.
- Embeddings, retrieval and rerank all-in-one: `AutoModelForEmbedding`
- Contrastive learning/LLM enhanced embeddings, with point-wise, pairwise and listwise fine-tuning
- Cross-encoder, ColBERT and LLM reranker
- Fast RAG easily integrated with Langchain and LlamaIndex

| Exp | Model | Original | Finetune | Demo |
|----------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.701** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed llm finetune (LoRA) | Qwen2-1.5B-Instruct | 0.541 | **0.690** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.691** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.683** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
**Open-retrievals** improves and unifies text embedding, retrieval, reranking, and RAG.
- Embedding fine-tuning with point-wise, pairwise, and listwise objectives, contrastive learning, and LLM backbones.
- Reranking fine-tuning with cross-encoder, ColBERT, and LLM rerankers.
- Easily build enhanced RAG, integrated with Transformers, Langchain, and LlamaIndex.

| Exp | Model | Original | Finetune | Demo |
|---------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |

* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT rows are zero-shot.
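
For reference, a rough sketch of how MAP can be computed over ranked relevance labels (a hypothetical helper, not part of open-retrievals; the benchmark's exact preprocessing may differ):

```python
def average_precision(ranked_labels):
    # ranked_labels: relevance labels (1/0) in the order produced by the model
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_ranked_labels):
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)

print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))  # ~0.667
```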

@@ -186,8 +185,6 @@ print(response)

**Embedding fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -228,8 +225,6 @@ trainer.train()

**Rerank Fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -251,14 +246,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=max_length),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
25 changes: 10 additions & 15 deletions README_ja-JP.md
@@ -34,6 +34,13 @@
- Contrastive learning embeddings, LLM embeddings
- Fast RAG demo

| Exp | Model | Size | Original | Finetune | Demo |
|---------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |


## Installation

@@ -49,14 +56,6 @@ pip install peft  # if needed
pip install open-retrievals
```

[//]: # (**With conda**)

[//]: # (```shell)

[//]: # (conda install open-retrievals -c conda-forge)

[//]: # (```)


## Quick start

@@ -180,8 +179,6 @@ print(response)

**Fine-tune transformer weights with contrastive learning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -226,15 +223,13 @@ from retrievals import AutoModelForEmbedding

model = AutoModelForEmbedding.from_pretrained(
"mistralai/Mistral-7B-v0.1",
pooling_method='cls',
pooling_method='last',
query_instruction=f'Instruct: Retrieve semantically similar text\nQuery: '
)
```

**Rerank**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -256,14 +251,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=128),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
40 changes: 13 additions & 27 deletions README_zh-CN.md
@@ -29,19 +29,19 @@

![structure](./docs/source/_static/structure.png)

**Open-Retrievals** helps developers conveniently apply text embeddings in information retrieval, large language models, and related areas, quickly building retrieval, ranking, and RAG applications.
- `AutoModelForEmbedding` unifies embedding, retrieval, and reranking
- Multiple fine-tuning options for embedding and reranking models: contrastive learning, LLM, point-wise, pairwise, listwise
- Customizable RAG framework, with fine-tuned models also usable directly in Langchain and LlamaIndex
**Open-Retrievals** unifies embedding, retrieval, and reranking, helping developers conveniently optimize information retrieval and LLM RAG applications
- Full embedding fine-tuning support: contrastive learning, LLM, point-wise, pairwise, listwise
- Full reranking fine-tuning support: cross encoder, ColBERT, LLM
- Customizable RAG framework, with fine-tuned models usable directly in Transformers, Langchain, and LlamaIndex

| Exp | Model | Original | Finetune | Demo |
|-----------------------------|---------------------------|----------|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.701** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)|
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.541 | **0.690** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing)|
| rerank cross encoder | bge-reranker-base | 0.666 | **0.691** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)|
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.683** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing)|
| Exp | Model | Size | Original | Finetune | Demo |
|------------------------------|-------------------------|-----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |

* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores of the large model and LLM are zero-shot
* The metric is MAP on the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT rows are zero-shot


## Installation
@@ -189,18 +189,6 @@ print(response)

**Embedding model fine-tuning**

[//]: # (- Model performance fine-tuned in [T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking))

[//]: # ()
[//]: # (| Model | Size | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | AP<sub>75</sub><sup>val</sup> |)

[//]: # (| :-- | :-: | :-: | :-: | :-: |)

[//]: # (| TripletLoss | 672 | 47.7% |52.6% | 61.4% |)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing)

```python
import torch.nn as nn
from datasets import load_dataset
@@ -268,8 +256,6 @@ torchrun --nproc_per_node 1 \

**Reranker fine-tuning**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing)

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset
@@ -291,14 +277,14 @@ training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir = './checkpoints',
output_dir='./checkpoints',
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=RerankCollator(tokenizer, query_max_length=max_length, document_max_length=max_length),
data_collator=RerankCollator(tokenizer, max_length=max_length),
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
Binary file modified docs/source/_static/structure.png
50 changes: 46 additions & 4 deletions docs/source/embed.rst
@@ -10,9 +10,40 @@ Pretrained
We can use `AutoModelForEmbedding` to get sentence embeddings from a pretrained transformer or large language model.


Fine-tune
------------------

point-wise

- `{(query, label), (document, label)}`


pairwise

- `{(query, positive, label), (query, negative, label)}`

- `{(query, positive, negative), (query, positive, negative), ...}`

- `{(query, positive, negative1, negative2, negative3...)}`

listwise

- `{(query+positive)}`
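
As a rough illustration of these formats (field names here are assumptions for the example, not a required schema), the records could look like:

.. code-block:: python

    # point-wise: each text carries its own label (e.g. for a classification-style loss such as arcface)
    pointwise_records = [
        {"text": "how to install open-retrievals", "label": 1},
        {"text": "tomorrow's weather forecast", "label": 0},
    ]

    # pairwise: a query with a positive document and one or more negatives
    pairwise_records = [
        {
            "query": "how to install open-retrievals",
            "positive": "pip install open-retrievals",
            "negative": "conda is a package manager",
        },
    ]

    # listwise / in-batch: only (query, positive) pairs are stored;
    # the candidate list comes from the other examples in the batch
    listwise_records = [
        {"query": "how to install open-retrievals", "positive": "pip install open-retrievals"},
    ]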


Loss function
~~~~~~~~~~~~~~~~~~~~~~

- binary classification (pairwise, sketched below):
  - goal: similarity(query, positive) > similarity(query, negative)
  - hinge loss: max(0, margin - similarity(query, positive) + similarity(query, negative))
  - logistic loss: log(1 + exp(similarity(query, negative) - similarity(query, positive)))
- multi-class classification (listwise):
  - softmax cross entropy over [similarity(query, positive), similarity(query, negative1), similarity(query, negative2), ...]
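
A minimal PyTorch sketch of the pairwise losses above, written against plain similarity-score tensors (an illustration, not the package's loss API):

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def hinge_loss(pos_sim, neg_sim, margin=0.5):
        # penalize pairs where the positive does not beat the negative by at least the margin
        return torch.clamp(margin - (pos_sim - neg_sim), min=0).mean()

    def logistic_loss(pos_sim, neg_sim):
        # softplus(neg - pos) == log(1 + exp(neg - pos)) == -log sigmoid(pos - neg)
        return F.softplus(neg_sim - pos_sim).mean()

    pos_sim = torch.tensor([0.82, 0.75])  # similarity(query, positive)
    neg_sim = torch.tensor([0.40, 0.77])  # similarity(query, negative)
    print(hinge_loss(pos_sim, neg_sim), logistic_loss(pos_sim, neg_sim))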


Pair wise
----------------------
~~~~~~~~~~~~~

.. code-block:: python
@@ -54,7 +85,7 @@ Pair wise
Point wise
-------------------
~~~~~~~~~~~~~

If the positive and negative labels are noisy, plain point-wise cross-entropy may not be the best choice. Pairwise objectives only compare examples relatively, and a hinge loss with a margin can be more robust.

Expand All @@ -67,9 +98,20 @@ arcface


List wise
-------------------
~~~~~~~~~~~~~~


Enhance the performance
--------------------------------------

* Pretrain
* In-batch negatives (sketched below)
* Hard negatives, multiple rounds of negative mining
* Cross-batch negatives
* Knowledge distillation from a cross-encoder
* MaxSim (multi-vector, as in ColBERT)
* Matryoshka representation learning
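
A hedged sketch of the in-batch negative idea (shapes and the temperature value are illustrative assumptions): every other query's positive in the batch serves as a negative, and the diagonal of the similarity matrix marks the true pairs.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def in_batch_negative_loss(query_emb, doc_emb, temperature=0.05):
        # query_emb, doc_emb: (batch_size, dim); row i of doc_emb is the positive for query i
        query_emb = F.normalize(query_emb, dim=-1)
        doc_emb = F.normalize(doc_emb, dim=-1)
        scores = query_emb @ doc_emb.T / temperature  # (batch_size, batch_size)
        labels = torch.arange(scores.size(0), device=scores.device)
        return F.cross_entropy(scores, labels)  # diagonal entries are the positives

    queries = torch.randn(8, 768)
    documents = torch.randn(8, 768)
    print(in_batch_negative_loss(queries, documents))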


Hard negative
--------------------
~~~~~~~~~~~~~~~~
14 changes: 1 addition & 13 deletions docs/source/retrieval.rst
@@ -2,22 +2,10 @@ Retrieval
========================


Offline document encoding
Offline indexing
----------------------------



Query retrieval
----------------------------



To enhance the retrieval performance:

* Pretrain
* In batch negative
* Hard negative, multiple rounds negative
* Cross batch negative
* knowledge distill from cross encoder
* maxsim (multi vector)
* Matryoshka