feat: add basic file ocr pipe and contributing
LongxingTan committed Jul 3, 2024
1 parent df31d6e commit 3bc79e7
Showing 33 changed files with 433 additions and 108 deletions.
4 changes: 2 additions & 2 deletions .readthedocs.yml
@@ -8,7 +8,7 @@ version: 2
build:
os: ubuntu-22.04
tools:
python: "3.8"
python: "3.10"

# Build documentation in the docs/ directory with Sphinx
# reference: https://docs.readthedocs.io/en/stable/config-file/v2.html#sphinx
@@ -17,7 +17,7 @@ sphinx:
fail_on_warning: false

# Build documentation with MkDocs
#mkdocs:
# mkdocs:
# configuration: mkdocs.yml

# Optionally build your docs in additional formats such as PDF and ePub
21 changes: 21 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,21 @@
# Contributing to open-retrievals

If you are interested in contributing to open-retrievals,

- Feel free to send a Pull Request
- If you want to implement a new feature and are unsure about it, post an issue first

Once you finish implementing a feature or a bug fix, please send a Pull Request to https://github.com/LongxingTan/open-retrievals


## Developing open-retrievals

To develop open-retrievals on your machine, here are some tips (a sketch of the full workflow follows the list):

1. Uninstall any existing open-retrievals installations
2. Clone a copy of open-retrievals from source
3. Create a new branch and edit the code
4. Install the pre-commit hooks
5. Ensure your code is formatted correctly by checking it against the flake8 style guide
6. Ensure the entire test suite passes and code coverage stays roughly the same
7. Update and test the documentation
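
A minimal sketch of that workflow (the `pre-commit`, `flake8`, and `pytest` invocations are typical Python tooling, assumed here rather than mandated by this repository):

```shell
pip uninstall -y open-retrievals                      # 1. remove existing installs
git clone https://github.com/LongxingTan/open-retrievals.git
cd open-retrievals                                    # 2. clone from source
git checkout -b my-feature                            # 3. work on a new branch
pip install pre-commit && pre-commit install          # 4. enable the pre-commit hooks
flake8 .                                              # 5. style check
pytest                                                # 6. run the test suite
```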
35 changes: 21 additions & 14 deletions README.md
@@ -12,20 +12,26 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
[![Contributing][contributing-image]][contributing-url]

**[Documentation](https://open-retrievals.readthedocs.io)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)** | **[日本語](https://github.com/LongxingTan/open-retrievals/blob/master/README_ja-JP.md)**
**[Documentation](https://open-retrievals.readthedocs.io)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)** | **[日本語](https://github.com/LongxingTan/open-retrievals/blob/master/README_ja-JP.md)**

</div>

![structure](./docs/source/_static/structure.png)

@@ -34,14 +40,15 @@
- Reranking fine-tuned with Cross Encoder, ColBERT, and LLM.
- Easily build enhanced RAG, integrated with Transformers, Langchain, and LlamaIndex.

| Exp | Model | Original | Finetune | Demo |
|---------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| Exp | Model | Original | Finetuned | Demo |
|-------------------------------|-------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **embed** pairwise finetune | bge-base-zh-v1.5 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **embed** LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| **rerank** cross encoder | bge-reranker-base | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| **rerank** colbert | chinese-roberta-wwm-ext | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| **rerank** LLM (LoRA) | Qwen2-1.5B-Instruct | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |

* The metrics is MAP in [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). Original score of LLM and colbert original is Zero-shot
* The metric is MAP on 10% of the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking). The original scores for the LLM and ColBERT models are zero-shot
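
As a library-agnostic sketch of how these embedding models score a query against a passage (plain `transformers` with CLS pooling, which the bge family uses; the model name and texts are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "BAAI/bge-base-zh-v1.5"  # or your fine-tuned checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

texts = ["how to fine-tune an embedding model", "Fine-tune it with pairwise or LoRA training."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state                     # (batch, seq_len, dim)
embeddings = torch.nn.functional.normalize(hidden[:, 0], dim=-1)  # CLS pooling
print(float(embeddings[0] @ embeddings[1]))                       # cosine similarity
```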


## Installation
20 changes: 13 additions & 7 deletions README_ja-JP.md
@@ -12,20 +12,25 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]

[![Contributing][contributing-image]][contributing-url]

**[ドキュメント](https://open-retrievals.readthedocs.io)** | **[英語](https://github.com/LongxingTan/open-retrievals/blob/master/README.md)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)**
</div>

![structure](./docs/source/_static/structure.png)

@@ -34,12 +39,13 @@
- 対照学習エンベッディング, LLM エンベッディング
- 高速 RAG デモ

| Exp | Model | Size | Original | Finetune | Demo |
|---------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| embed pairwise finetune | bge-base-zh-v1.5 | 0 | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| embed LLM finetune (LoRA) | Qwen2-1.5B-Instruct | 0 | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| rerank cross encoder | bge-reranker-base | 0 | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| rerank colbert | chinese-roberta-wwm-ext | 0 | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| Exp | Model | Size | Original | Finetuned | Demo |
|-------------------------------|-------------------------|------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **embed** pairwise finetune | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **embed** LLM finetune (LoRA) | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| **rerank** cross encoder | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| **rerank** colbert | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| **rerank** LLM (LoRA) | Qwen2-1.5B-Instruct | - | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |


## インストール
24 changes: 15 additions & 9 deletions README_zh-CN.md
@@ -12,20 +12,25 @@
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
[contributing-image]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[contributing-url]: https://github.com/longxingtan/open-retrievals/blob/master/CONTRIBUTING.md

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

<div align="center">

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
[![Contributing][contributing-image]][contributing-url]


**[中文wiki](https://github.com/LongxingTan/open-retrievals/wiki)** | **[英文文档](https://open-retrievals.readthedocs.io)** | **[Release Notes](https://open-retrievals.readthedocs.io/en/latest/CHANGELOG.html)**
**[中文wiki](https://github.com/LongxingTan/open-retrievals/wiki)** | **[英文文档](https://open-retrievals.readthedocs.io)**
</div>

![structure](./docs/source/_static/structure.png)

@@ -34,14 +39,15 @@
- 支持全套重排微调,cross encoder、ColBERT、LLM
- 支持定制化RAG框架,支持在Transformers、Langchain、LlamaIndex中便捷使用微调后的模型

| 实验 | 模型 | 尺寸 | 原分数 | 微调分数 | Demo代码 |
|------------------|-------------------------|-----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| 向量pairwise微调 | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| 向量大模型LoRA微调 | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| cross encoder重排 | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| colbert重排 | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| 实验 | 模型 | 尺寸 | 原分数 | 微调分数 | Demo代码 |
|---------------------|-------------------------|----|-------|-----------|-------------------------------------------------------------------------------------------------------------------------------------|
| **向量**pairwise微调 | bge-base-zh-v1.5 | - | 0.657 | **0.703** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17KXe2lnNRID-HiVvMtzQnONiO74oGs91?usp=sharing) |
| **向量**大模型LoRA微调 | Qwen2-1.5B-Instruct | - | 0.546 | **0.694** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jj1kBQWFcuQ3a7P9ttnl1hgX7H8WA_Za?usp=sharing) |
| cross encoder**重排** | bge-reranker-base | - | 0.666 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QvbUkZtG56SXomGYidwI4RQzwODQrWNm?usp=sharing) |
| colbert**重排** | chinese-roberta-wwm-ext | - | 0.643 | **0.687** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| LLM**重排** | Qwen2-1.5B-Instruct | - | 0.531 | **0.699** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |

* 指标为[t2-reranking数据](https://huggingface.co/datasets/C-MTEB/T2Reranking)的MAP. 其中大模型与ColBERT原分数为Zero-shot
* 指标为10% [t2-reranking数据](https://huggingface.co/datasets/C-MTEB/T2Reranking)的MAP. 其中大模型与ColBERT原分数为Zero-shot


## 安装
11 changes: 9 additions & 2 deletions docs/source/embed.rst
@@ -113,5 +113,12 @@ Enhance the performance
* Matryoshka


Hard negative
~~~~~~~~~~~~~~~~
Hard mining
~~~~~~~~~~~~~~~~~~~~~~
Offline hard mining: before training, run an existing retriever over the corpus and keep the top-ranked passages that are not labeled positive for each query as hard negatives.

Online hard mining: during training, select the hardest negatives from within each mini-batch.
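
A minimal sketch of the offline variant (the input arrays are assumed to come from any encoder that returns L2-normalized vectors; this is not the library's API):

.. code-block:: python

    import numpy as np

    def mine_hard_negatives(query_emb, corpus_emb, positive_ids, top_k=10):
        # query_emb: (dim,), corpus_emb: (n, dim), both L2-normalized
        scores = corpus_emb @ query_emb   # cosine similarity with every passage
        ranked = np.argsort(-scores)      # most similar passages first
        # top-ranked passages not labeled positive are the hard negatives
        hard = [int(i) for i in ranked if int(i) not in positive_ids]
        return hard[:top_k]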


Ensemble embedding
~~~~~~~~~~~~~~~~~~~~~~
4 changes: 4 additions & 0 deletions docs/source/rag.rst
@@ -16,6 +16,10 @@ The basic RAG process is document indexing, query embedding, retrieval, optional
Integrated with Langchain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1fJC-8er-a4NRkdJkwWr4On7lGt9rAO4P?usp=sharing
:alt: Open In Colab


.. code-block:: python

   # …
45 changes: 35 additions & 10 deletions examples/README.md
@@ -6,11 +6,11 @@
- [embedding-llm pairwise finetune](./embedding_llm_finetune.py)
- [rerank-cross encoder](./rerank_cross_encoder.py)
- [rerank-colbert](./rerank_colbert.py)
- [rerank-llm finetune](../reference/rerank_llm_finetune.py)
- [rerank-llm finetune](rerank_llm_finetune.py)
- [RAG with Langchain](./rag_langchain_demo.py)


## Retrieval
## Embedding

**Data Format**
```
@@ -88,6 +88,26 @@ torchrun --nproc_per_node 1 \
```


## Retrieval

```shell
QUERY_ENCODE_DIR=nq-queries
OUT_DIR=temp
MODEL_DIR="BAAI/bge-base-zh-v1.5"
QUERY=nq-test-queries.json
mkdir $QUERY_ENCODE_DIR

python -m retrievals.pipelines.embed \
--model_name_or_path $MODEL_DIR \
--output_dir $OUT_DIR \
--do_encode \
--fp16 \
--per_device_eval_batch_size 256 \
--train_data $QUERY \
--is_query true
```


## Rerank

**Cross encoder reranking**
@@ -172,19 +192,24 @@ torchrun --nproc_per_node 1 \
--positive_key positive \
--negative_key negative \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 16 \
--dataloader_drop_last True \
--max_len 256 \
--train_group_size 4 \
--logging_steps 1 \
--save_steps 2000 \
--save_total_limit 2 \
--logging_steps 10 \
--save_steps 20000 \
--save_total_limit 1 \
--bf16
```


## Common questions
- If grad_norm during training is always zero, consider to change fp16 or bf16
- If the fine-tuned embedding performance during inference is worse, check whether the pooling_method is correct, and the prompt is the same as training
## FAQ

The grad_norm during training is always zero?
- Consider switching the mixed-precision setting between fp16 and bf16

The fine-tuned embedding performs worse than the original at inference?
- Check whether the pooling_method is correct
- For LLM-based models, check that the inference prompt matches the one used during training
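
For the grad_norm question, here is a sketch of the precision switch with Hugging Face `TrainingArguments` (`output_dir` is a placeholder; bf16 requires an Ampere-or-newer GPU):

```python
from transformers import TrainingArguments

# bf16 keeps fp32's exponent range, so gradients are less likely to
# underflow to zero than with fp16 loss scaling
args = TrainingArguments(
    output_dir="outputs",  # placeholder
    fp16=False,
    bf16=True,
)
```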
9 changes: 8 additions & 1 deletion examples/eval/README.md
@@ -1,4 +1,11 @@
# 评测
# Evaluation

**Prerequisites**
```shell
pip install datasets mteb[beir]
pip install C_MTEB
pip install open-retrievals
```
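
A minimal sketch of running one task with these packages (the task name, model, and output folder are assumptions, and `C_MTEB` is imported only because it is assumed to register the Chinese tasks with `mteb`):

```python
import C_MTEB  # noqa: F401  (assumed to register the Chinese MTEB tasks)
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")
evaluation = MTEB(tasks=["T2Reranking"])
evaluation.run(model, output_folder="results")
```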


```python
# …
```
Empty file added examples/rerank_llm_finetune.py
27 changes: 18 additions & 9 deletions examples/t2_ranking/README.md
@@ -1,6 +1,8 @@
# T2_ranking

## Performance
An end-to-end example using the [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking).

## Experiment

bge-base-zh-v1.5
- "map": 0.6569549236524207, "mrr": 0.7683207806932297
@@ -11,29 +13,36 @@ bge-reranker-base
- rerank/cross-encoder: "map": 0.6906494118852755, "mrr": 0.8064902548320916


## Prepare dataset
## 1. Prepare dataset
```shell
python prepare_t2ranking_data.py
```

## Train
```shell
## 2. Finetune embedding

```shell
sh pairwise_embed_train.sh
```

## Encode corpus
## 3. Indexing
Encode the corpus:
```shell
sh encode_corpus.sh
```

## Encode Query
Encode the query:
```shell
sh encode_query.sh
```

## Search
## 4. Retrieve
```shell
sh retrieve.sh
```

## 5. Rerank
```shell
sh rerank.sh
```

## 6. Evaluate