
# Legal Feature Enhanced Semantic Matching Network for Similar Case Matching

## Description

This repository contains the source code of the paper *Legal Feature Enhanced Semantic Matching Network for Similar Case Matching*, implemented in PyTorch.

## Model Overview

*Fig. 1: Overview of LFESM*

## Install and Run

### Install

- Python 3.6+
- PyTorch 1.1.0+
- Python requirements: run `pip install -r requirements.txt`.
- Nvidia Apex (optional): Apex enables mixed-precision training, which speeds up training and reduces memory usage. See the official documentation for installation; set `fp16 = False` in `train.py` to disable it (a usage sketch follows this list).
- Hardware: we recommend training LFESM on a GPU. In our experiments, training with `batch_size = 3` and `fp16 = True` on 2× GeForce RTX 2080 takes 1–1.5 hours per epoch.
- Dataset: see [Dataset](#dataset).
- BERT pretrained model: download the pretrained model here, and unzip it into the `./bert` folder. See OpenCLaP for more details.
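
For reference, below is a minimal sketch of how Apex "O1" mixed precision is typically wired into a training step. It is illustrative only, with a dummy model standing in for LFESM; the repository's `train.py` is the authoritative version.

```python
# Illustrative Apex "O1" mixed-precision step (a sketch, not the repo's
# exact train.py; the Linear layer is a stand-in for LFESM).
import torch
from apex import amp  # requires Nvidia Apex and a CUDA GPU

model = torch.nn.Linear(768, 2).cuda()                     # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# "O1" matches fp16_opt_level in config.py: whitelisted ops run in fp16,
# while master weights are kept in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(3, 768).cuda()                        # batch_size = 3
loss = model(inputs).mean()

# Scale the loss before backward so small fp16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
optimizer.step()
```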

### Train

```bash
python train.py
```

Our default parameters:

```python
config = {
    "max_length": 512,
    "epochs": 6,
    "batch_size": 3,
    "learning_rate": 2e-5,
    "fp16": True,
    "fp16_opt_level": "O1",
    "max_grad_norm": 1.0,
    "warmup_steps": 0.1,
}
```
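
For orientation, the sketch below shows one plausible way these values feed an optimizer and learning-rate scheduler. It assumes the `transformers` library and reads `warmup_steps = 0.1` as a warmup *ratio* of total training steps (the fractional value suggests a ratio rather than an absolute step count); the actual wiring lives in `train.py`.

```python
# Sketch: one plausible way the config feeds an optimizer/scheduler
# (assumes the transformers library; see train.py for the real wiring).
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 2)              # stand-in for LFESM
epochs, batch_size = 6, 3

steps_per_epoch = (5102 * 2) // batch_size   # train set doubled by augmentation
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Reading warmup_steps = 0.1 as a warmup ratio of total steps:
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
```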

### Predict

```bash
python predict.py
```

The prediction output is stored in `./data/test/output.txt`.

Run `./scripts/judger.py` to calculate the accuracy score.
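
The exact input format of `judger.py` is not documented here; the sketch below computes the same accuracy under the assumption that the prediction file and a gold-label file (the `gold.txt` name is hypothetical) each hold one label per line.

```python
# Sketch of the accuracy computation (hypothetical file layout: one label
# per line in each file; ./scripts/judger.py may differ in its details).
def accuracy(pred_path: str, gold_path: str) -> float:
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.strip() for line in f if line.strip()]
    with open(gold_path, encoding="utf-8") as f:
        golds = [line.strip() for line in f if line.strip()]
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(accuracy("./data/test/output.txt", "./data/test/gold.txt"))
```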

## Dataset

Download the CAIL2019-SCM dataset here. Check CAIL2019 for more details about the dataset.

Table 1: The amount of data in CAIL2019-SCM

| Dataset | sim(a, b) > sim(a, c) | sim(a, b) < sim(a, c) | Total Amount |
| ------- | --------------------- | --------------------- | ------------ |
| Train   | 2,596                 | 2,506                 | 5,102        |
| Valid   | 837                   | 663                   | 1,500        |
| Test    | 803                   | 733                   | 1,536        |

Unzip the dataset and put the train, valid, and test sets into the `raw`, `valid`, and `test` folders under `./data`.
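
For orientation, the sketch below reads the triplets, assuming the CAIL2019-SCM JSON-lines layout in which each line is a record with `A`, `B`, and `C` case descriptions (the file name used is hypothetical); `data.py` defines the actual dataset handling.

```python
# Sketch of reading CAIL2019-SCM triplets (assumes the JSON-lines format,
# one {"A": ..., "B": ..., "C": ...} record per line; the file name is
# hypothetical -- data.py defines the actual dataset handling).
import json

def load_triplets(path: str):
    triplets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                triplets.append((record["A"], record["B"], record["C"]))
    return triplets

train = load_triplets("./data/raw/train.json")
print(len(train))  # 5,102 before augmentation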

## Data Augmentation

To balance the label distribution of the dataset and improve model training, we apply data augmentation.

Denote the original triplet as (A, B, C). We add (A, C, B) to the dataset, which doubles its size. We also tried other permutations such as (B, C, A) and (B, A, C), but they did not help.
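
A minimal sketch of this swap augmentation is shown below. Field names follow the assumed JSON layout above; note that swapping B and C also flips the gold label.

```python
# Sketch of the (A, B, C) -> (A, C, B) swap augmentation. Swapping B and C
# flips which candidate is more similar, so the label flips too.
# Field names follow the assumed JSON layout above.
def augment(examples):
    augmented = list(examples)
    for ex in examples:
        augmented.append({
            "A": ex["A"],
            "B": ex["C"],          # swap the two candidate cases
            "C": ex["B"],
            "label": "C" if ex["label"] == "B" else "B",
        })
    return augmented               # doubles the dataset size

print(augment([{"A": "a", "B": "b", "C": "c", "label": "B"}]))
```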

## Project Files

```
lfesm
├── bert                   # BERT pretrained model
├── config.py              # Model config and hyperparameters
├── data                   # Store the dataset
│   └── ...
├── data.py                # Define the dataset
├── model.py               # Define the model trainer
├── models                 # Define the models
│   ├── baseline
│   ├── esim
│   ├── feature.py
│   └── lfesm.py
├── predict.py             # Predict
├── scripts                # Utilities
│   └── ...
├── train.py               # Train
└── util.py                # Utility functions
```

## Results

Table 2: Experimental results of methods on CAIL2019-SCM

| Method       | Model      | Valid | Test  |
| ------------ | ---------- | ----- | ----- |
| Baseline     | BERT       | 61.93 | 67.32 |
|              | LSTM       | 62.00 | 68.00 |
|              | CNN        | 62.27 | 69.53 |
| Our Baseline | BERT       | 64.53 | 65.59 |
|              | LSTM       | 64.33 | 66.34 |
|              | CNN        | 64.73 | 67.25 |
| Best Score   | 11.2yuan   | 66.73 | 72.07 |
|              | backward   | 67.73 | 71.81 |
|              | AlphaCourt | 70.07 | 72.66 |
| Our Method   | LFESM      | 70.01 | 74.15 |

## Reference

[1] coetaur0/ESIM

[2] padeoe/cail2019

[3] thunlp/OpenCLaP

[4] CAIL2019-SCM

[5] Taoooo9/Cail_Text_similarity_esimtribert

## Acknowledgement

We sincerely appreciate Taoooo9's help.


Authors: Zhilong Hong, Qifei Zhou, Rong Zhang, Weiping Li, and Tong Mo.