feat: add basic file ocr pipe and contributing
LongxingTan committed Jul 3, 2024
1 parent df31d6e commit 3bc79e7
Showing 33 changed files with 433 additions and 108 deletions.
4 changes: 2 additions & 2 deletions .readthedocs.yml
@@ -8,7 +8,7 @@ version: 2
build:
os: ubuntu-22.04
tools:
python: "3.8"
python: "3.10"

# Build documentation in the docs/ directory with Sphinx
# reference: https://docs.readthedocs.io/en/stable/config-file/v2.html#sphinx
@@ -17,7 +17,7 @@ sphinx:
fail_on_warning: false

# Build documentation with MkDocs
#mkdocs:
# mkdocs:
# configuration: mkdocs.yml

# Optionally build your docs in additional formats such as PDF and ePub
21 changes: 21 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,21 @@
# Contributing to open-retrievals

If you are interested in contributing to open-retrievals,

- Feel free to send a Pull Request
- If you want to implement a new feature and are unsure about it, post an issue first

Once you finish implementing a feature or a bug fix, please send a Pull Request to https://github.com/LongxingTan/open-retrievals


## Developing open-retrievals

To develop open-retrievals on your machine, here are some tips (a sketch of the full workflow follows the list):

1. Uninstall any existing open-retrievals installations
2. Clone a copy of open-retrievals from source
3. Create a new branch and edit the code
4. Install the pre-commit hooks
5. Ensure your code is formatted correctly by checking it against the flake8 style guide
6. Ensure the entire test suite passes and code coverage stays roughly the same
7. Update and test the documentation
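
A minimal sketch of that workflow (the `pre-commit`, `flake8`, and `pytest` invocations are typical Python tooling, assumed here rather than mandated by this repository):

```shell
pip uninstall -y open-retrievals                      # 1. remove existing installs
git clone https://github.com/LongxingTan/open-retrievals.git
cd open-retrievals                                    # 2. clone from source
git checkout -b my-feature                            # 3. work on a new branch
pip install pre-commit && pre-commit install          # 4. enable the pre-commit hooks
flake8 .                                              # 5. style check
pytest                                                # 6. run the test suite
```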
35 changes: 21 additions & 14 deletions README.md
@@ -12,20 +12,26 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
[![Contributing][contributing-image]][contributing-url]

**[Documentation](https://open-retrievals.readthedocs.io)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)** | **[日本語](https://github.com/LongxingTan/open-retrievals/blob/master/README_ja-JP.md)**
**[Documentation](https://open-retrievals.readthedocs.io)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)** | **[日本語](https://github.com/LongxingTan/open-retrievals/blob/master/README_ja-JP.md)**

</div>

![structure](./docs/source/_static/structure.png)

@@ -34,14 +40,15 @@
- Reranking fine-tuned with Cross Encoder, ColBERT, and LLM.
- Easily build enhanced RAG, integrated with Transformers, Langchain, and LlamaIndex.

| Exp | Model | Original | Finetune | Demo |
|---------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| Exp | Model | Original | Finetuned | Demo |
|-------------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **embed** pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **embed** LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| **rerank** cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| **rerank** colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| **rerank** LLM (LoRA) | Qwen2-1.5B-Instruct | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |

* The metrics is MAP in [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). Original score of LLM and colbert original is Zero-shot
* The metric is MAP on 10% of the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT models are zero-shot
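
As a library-agnostic sketch of how these embedding models score a query against a passage (plain `transformers` with CLS pooling, which the bge family uses; the model name and texts are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "BAAI/bge-base-zh-v1.5"  # or your fine-tuned checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

texts = ["how to fine-tune an embedding model", "Fine-tune it with pairwise or LoRA training."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state                     # (batch, seq_len, dim)
embeddings = torch.nn.functional.normalize(hidden[:, 0], dim=-1)  # CLS pooling
print(float(embeddings[0] @ embeddings[1]))                       # cosine similarity
```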


## Installation
20 changes: 13 additions & 7 deletions README_ja-JP.md
@@ -12,20 +12,25 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]

[![Contributing][contributing-image]][contributing-url]

**[ドキュメント](https://open-retrievals.readthedocs.io)** | **[英語](https://github.com/LongxingTan/open-retrievals/blob/master/README.md)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)**
</div>

![structure](./docs/source/_static/structure.png)

@@ -34,12 +39,13 @@
- 対照学習エンベッディング, LLM エンベッディング
- 高速 RAG デモ

| Exp | Model | Size | Original | Finetune | Demo |
|---------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0 | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0 | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0 | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| Exp | Model | Size | Original | Finetuned | Demo |
|-------------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **embed** pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **embed** LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| **rerank** cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| **rerank** colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| **rerank** LLM (LoRA) | Qwen2-1.5B-Instruct | - | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |


## インストール
24 changes: 15 additions & 9 deletions README_zh-CN.md
@@ -12,20 +12,25 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
[![Contributing][contributing-image]][contributing-url]


**[中文wiki](https://github.com/LongxingTan/open-retrievals/wiki)** | **[英文文档](https://open-retrievals.readthedocs.io)** | **[Release Notes](https://open-retrievals.readthedocs.io/en/latest/CHANGELOG.html)**
**[中文wiki](https://github.com/LongxingTan/open-retrievals/wiki)** | **[英文文档](https://open-retrievals.readthedocs.io)**
</div>

![structure](./docs/source/_static/structure.png)

@@ -34,14 +39,15 @@
- 支持全套重排微调,cross encoder、ColBERT、LLM
- 支持定制化RAG框架,支持在Transformers、Langchain、LlamaIndex中便捷使用微调后的模型

| 实验 | 模型 | 尺寸 | 原分数 | 微调分数 | Demo代码 |
|------------------|-------------------------|-----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| 向量pairwise微调 | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| 向量大模型LoRA微调 | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| cross encoder重排 | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| colbert重排 | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| 实验 | 模型 | 尺寸 | 原分数 | 微调分数 | Demo代码 |
|---------------------|-------------------------|----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| **向量**pairwise微调 | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **向量**大模型LoRA微调 | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| cross encoder**重排** | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| colbert**重排** | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| LLM**重排** | Qwen2-1.5B-Instruct | - | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |

* 指标为[t2-reranking数据](https://huggingface.co/datasets/C-MTEB/T2Reranking)的MAP. 其中大模型与ColBERT原分数为Zero-shot
* 指标为10% [t2-reranking数据](https://huggingface.co/datasets/C-MTEB/T2Reranking)的MAP. 其中大模型与ColBERT原分数为Zero-shot


## 安装
11 changes: 9 additions & 2 deletions docs/source/embed.rst
@@ -113,5 +113,12 @@ Enhance the performance
* Matryoshka


Hard negative
~~~~~~~~~~~~~~~~
Hard mining
~~~~~~~~~~~~~~~~~~~~~~
Offline hard mining: before training, run an existing retriever over the corpus and keep the top-ranked passages that are not labeled positive for each query as hard negatives.

Online hard mining: during training, select the hardest negatives from within each mini-batch.
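
A minimal sketch of the offline variant (the input arrays are assumed to come from any encoder that returns L2-normalized vectors; this is not the library's API):

.. code-block:: python

    import numpy as np

    def mine_hard_negatives(query_emb, corpus_emb, positive_ids, top_k=10):
        # query_emb: (dim,), corpus_emb: (n, dim), both L2-normalized
        scores = corpus_emb @ query_emb   # cosine similarity with every passage
        ranked = np.argsort(-scores)      # most similar passages first
        # top-ranked passages not labeled positive are the hard negatives
        hard = [int(i) for i in ranked if int(i) not in positive_ids]
        return hard[:top_k]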


Ensemble embedding
~~~~~~~~~~~~~~~~~~~~~~
4 changes: 4 additions & 0 deletions docs/source/rag.rst
@@ -16,6 +16,10 @@ The basic RAG process is document indexing, query embedding, retrieval, optional
Integrated with Langchain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1fJC-8er-a4NRkdJkwWr4On7lGt9rAO4P?usp=sharing
:alt: Open In Colab


.. code-block:: python

   # …
45 changes: 35 additions & 10 deletions examples/README.md
@@ -6,11 +6,11 @@
- [embedding-llm pairwise finetune](./embedding_llm_finetune.py)
- [rerank-cross encoder](./rerank_cross_encoder.py)
- [rerank-colbert](./rerank_colbert.py)
- [rerank-llm finetune](../reference/rerank_llm_finetune.py)
- [rerank-llm finetune](rerank_llm_finetune.py)
- [RAG with Langchain](./rag_langchain_demo.py)


## Retrieval
## Embedding

**Data Format**
```
@@ -88,6 +88,26 @@ torchrun --nproc_per_node 1 \
```


## Retrieval

```shell
QUERY_ENCODE_DIR=nq-queries
OUT_DIR=temp
MODEL_DIR="BAAI/bge-base-zh-v1.5"
QUERY=nq-test-queries.json
mkdir $QUERY_ENCODE_DIR

python -m retrievals.pipelines.embed \
--model_name_or_path $MODEL_DIR \
--output_dir $OUT_DIR \
--do_encode \
--fp16 \
--per_device_eval_batch_size 256 \
--train_data $QUERY \
--is_query true
```


## Rerank

**Cross encoder reranking**
@@ -172,19 +192,24 @@ torchrun --nproc_per_node 1 \
--positive_key positive \
--negative_key negative \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 16 \
--dataloader_drop_last True \
--max_len 256 \
--train_group_size 4 \
--logging_steps 1 \
--save_steps 2000 \
--save_total_limit 2 \
--logging_steps 10 \
--save_steps 20000 \
--save_total_limit 1 \
--bf16
```


## Common questions
- If grad_norm during training is always zero, consider to change fp16 or bf16
- If the fine-tuned embedding performance during inference is worse, check whether the pooling_method is correct, and the prompt is the same as training
## FAQ

The grad_norm during training is always zero?
- Consider switching the mixed-precision setting between fp16 and bf16

The fine-tuned embedding performs worse than the original at inference?
- Check whether the pooling_method is correct
- For LLM-based models, check that the inference prompt matches the one used during training
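
For the grad_norm question, here is a sketch of the precision switch with Hugging Face `TrainingArguments` (`output_dir` is a placeholder; bf16 requires an Ampere-or-newer GPU):

```python
from transformers import TrainingArguments

# bf16 keeps fp32's exponent range, so gradients are less likely to
# underflow to zero than with fp16 loss scaling
args = TrainingArguments(
    output_dir="outputs",  # placeholder
    fp16=False,
    bf16=True,
)
```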
9 changes: 8 additions & 1 deletion examples/eval/README.md
@@ -1,4 +1,11 @@
# 评测
# Evaluation

**Prerequisites**
```shell
pip install datasets mteb[beir]
pip install C_MTEB
pip install open-retrievals
```
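
A minimal sketch of running one task with these packages (the task name, model, and output folder are assumptions, and `C_MTEB` is imported only because it is assumed to register the Chinese tasks with `mteb`):

```python
import C_MTEB  # noqa: F401  (assumed to register the Chinese MTEB tasks)
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")
evaluation = MTEB(tasks=["T2Reranking"])
evaluation.run(model, output_folder="results")
```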


```python
# …
```
Empty file added examples/rerank_llm_finetune.py
27 changes: 18 additions & 9 deletions examples/t2_ranking/README.md
@@ -1,6 +1,8 @@
# T2_ranking

## Performance
An end-to-end example using the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking).

## Experiment

bge-base-zh-v1.5
- "map": 0.6569549236524207, "mrr": 0.7683207806932297
@@ -11,29 +13,36 @@ bge-reranker-base
- rerank/cross-encoder: "map": 0.6906494118852755, "mrr": 0.8064902548320916


## Prepare dataset
## 1. Prepare dataset
```shell
python prepare_t2ranking_data.py
```

## Train
```shell
## 2. Finetune embedding

```shell
sh pairwise_embed_train.sh
```

## Encode corpus
## 3. Indexing
Encode the corpus:
```shell
sh encode_corpus.sh
```

## Encode Query
Encode the query:
```shell
sh encode_query.sh
```

## Search
## 4. Retrieve
```shell
sh retrieve.sh
```

## 5. Rerank
```shell
sh rerank.sh
```

## 6. Evaluate