
# Legal Feature Enhanced Semantic Matching Network for Similar Case Matching

## Description

This repository contains the source code of the paper *Legal Feature Enhanced Semantic Matching Network for Similar Case Matching*, implemented in PyTorch.

## Model Overview

*Fig. 1: Overview of LFESM*

## Install and Run

### Install

- Python 3.6+
- PyTorch 1.1.0+
- Python requirements: run `pip install -r requirements.txt`.
- Nvidia Apex (optional): Apex enables mixed-precision training, which speeds up training and reduces memory usage. See the official documentation for installation; set `fp16 = False` in `train.py` to disable it (a usage sketch follows this list).
- Hardware: we recommend training LFESM on a GPU. In our experiments, training with `batch_size = 3` and `fp16 = True` on 2× GeForce RTX 2080 takes 1–1.5 hours per epoch.
- Dataset: see [Dataset](#dataset).
- BERT pretrained model: download the pretrained model here, and unzip it into the `./bert` folder. See OpenCLaP for more details.
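
For reference, below is a minimal sketch of how Apex "O1" mixed precision is typically wired into a training step. It is illustrative only, with a dummy model standing in for LFESM; the repository's `train.py` is the authoritative version.

```python
# Illustrative Apex "O1" mixed-precision step (a sketch, not the repo's
# exact train.py; the Linear layer is a stand-in for LFESM).
import torch
from apex import amp  # requires Nvidia Apex and a CUDA GPU

model = torch.nn.Linear(768, 2).cuda()                     # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# "O1" matches fp16_opt_level in config.py: whitelisted ops run in fp16,
# while master weights are kept in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(3, 768).cuda()                        # batch_size = 3
loss = model(inputs).mean()

# Scale the loss before backward so small fp16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
optimizer.step()
```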

### Train

```bash
python train.py
```

Our default parameters:

```python
config = {
    "max_length": 512,
    "epochs": 6,
    "batch_size": 3,
    "learning_rate": 2e-5,
    "fp16": True,
    "fp16_opt_level": "O1",
    "max_grad_norm": 1.0,
    "warmup_steps": 0.1,
}
```
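
For orientation, the sketch below shows one plausible way these values feed an optimizer and learning-rate scheduler. It assumes the `transformers` library and reads `warmup_steps = 0.1` as a warmup *ratio* of total training steps (the fractional value suggests a ratio rather than an absolute step count); the actual wiring lives in `train.py`.

```python
# Sketch: one plausible way the config feeds an optimizer/scheduler
# (assumes the transformers library; see train.py for the real wiring).
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 2)              # stand-in for LFESM
epochs, batch_size = 6, 3

steps_per_epoch = (5102 * 2) // batch_size   # train set doubled by augmentation
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Reading warmup_steps = 0.1 as a warmup ratio of total steps:
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
```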

### Predict

```bash
python predict.py
```

The prediction output is stored in `./data/test/output.txt`.

Run `./scripts/judger.py` to calculate the accuracy score.
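
The exact input format of `judger.py` is not documented here; the sketch below computes the same accuracy under the assumption that the prediction file and a gold-label file (the `gold.txt` name is hypothetical) each hold one label per line.

```python
# Sketch of the accuracy computation (hypothetical file layout: one label
# per line in each file; ./scripts/judger.py may differ in its details).
def accuracy(pred_path: str, gold_path: str) -> float:
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.strip() for line in f if line.strip()]
    with open(gold_path, encoding="utf-8") as f:
        golds = [line.strip() for line in f if line.strip()]
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(accuracy("./data/test/output.txt", "./data/test/gold.txt"))
```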

## Dataset

Download the CAIL2019-SCM dataset here. Check CAIL2019 for more details about the dataset.

Table 1: The amount of data in CAIL2019-SCM

| Dataset | sim(a, b) > sim(a, c) | sim(a, b) < sim(a, c) | Total Amount |
| ------- | --------------------- | --------------------- | ------------ |
| Train   | 2,596                 | 2,506                 | 5,102        |
| Valid   | 837                   | 663                   | 1,500        |
| Test    | 803                   | 733                   | 1,536        |

Unzip the dataset and put the train, valid, and test sets into the `raw`, `valid`, and `test` folders under `./data`.
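
For orientation, the sketch below reads the triplets, assuming the CAIL2019-SCM JSON-lines layout in which each line is a record with `A`, `B`, and `C` case descriptions (the file name used is hypothetical); `data.py` defines the actual dataset handling.

```python
# Sketch of reading CAIL2019-SCM triplets (assumes the JSON-lines format,
# one {"A": ..., "B": ..., "C": ...} record per line; the file name is
# hypothetical -- data.py defines the actual dataset handling).
import json

def load_triplets(path: str):
    triplets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                triplets.append((record["A"], record["B"], record["C"]))
    return triplets

train = load_triplets("./data/raw/train.json")
print(len(train))  # 5,102 before augmentation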

## Data Augmentation

To balance the label distribution of the dataset and improve model training, we apply data augmentation.

Denote the original triplet as (A, B, C). We add (A, C, B) to the dataset, which doubles its size. We also tried other permutations such as (B, C, A) and (B, A, C), but they did not help.
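
A minimal sketch of this swap augmentation is shown below. Field names follow the assumed JSON layout above; note that swapping B and C also flips the gold label.

```python
# Sketch of the (A, B, C) -> (A, C, B) swap augmentation. Swapping B and C
# flips which candidate is more similar, so the label flips too.
# Field names follow the assumed JSON layout above.
def augment(examples):
    augmented = list(examples)
    for ex in examples:
        augmented.append({
            "A": ex["A"],
            "B": ex["C"],          # swap the two candidate cases
            "C": ex["B"],
            "label": "C" if ex["label"] == "B" else "B",
        })
    return augmented               # doubles the dataset size

print(augment([{"A": "a", "B": "b", "C": "c", "label": "B"}]))
```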

## Project Files

```
lfesm
├── bert                   # BERT pretrained model
├── config.py              # Model config and hyperparameters
├── data                   # Store the dataset
│   └── ...
├── data.py                # Define the dataset
├── model.py               # Define the model trainer
├── models                 # Define the models
│   ├── baseline
│   ├── esim
│   ├── feature.py
│   └── lfesm.py
├── predict.py             # Predict
├── scripts                # Utilities
│   └── ...
├── train.py               # Train
└── util.py                # Utility functions
```

## Results

Table 2: Experimental results of methods on CAIL2019-SCM

| Method       | Model      | Valid | Test  |
| ------------ | ---------- | ----- | ----- |
| Baseline     | BERT       | 61.93 | 67.32 |
|              | LSTM       | 62.00 | 68.00 |
|              | CNN        | 62.27 | 69.53 |
| Our Baseline | BERT       | 64.53 | 65.59 |
|              | LSTM       | 64.33 | 66.34 |
|              | CNN        | 64.73 | 67.25 |
| Best Score   | 11.2yuan   | 66.73 | 72.07 |
|              | backward   | 67.73 | 71.81 |
|              | AlphaCourt | 70.07 | 72.66 |
| Our Method   | LFESM      | 70.01 | 74.15 |

## Reference

[1] coetaur0/ESIM

[2] padeoe/cail2019

[3] thunlp/OpenCLaP

[4] CAIL2019-SCM

[5] Taoooo9/Cail_Text_similarity_esimtribert

## Acknowledgement

We sincerely appreciate Taoooo9's help.


Authors: Zhilong Hong, Qifei Zhou, Rong Zhang, Weiping Li, and Tong Mo.