LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Last update: Jan 12, 2022

Related tags

Deep Learning ZaloAI2021_LTR

Overview

We propose a cross-encoder model (LTR_CrossEncoder) for information retrieval. It re-retrieves the relevant texts from the candidate results returned by Elasticsearch.

The model achieved an F2 score of 0.747 on the public test set (Legal Text Retrieval, Zalo AI Challenge 2021).
With Elasticsearch alone, our F2 score is 0.54.
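For readers unfamiliar with the metric, the sketch below shows how an F2 score can be computed for one question and averaged over a question set. The function names are illustrative, and averaging per question is an assumption; the challenge's official scoring script may differ in details.

```python
def f_beta(predicted, relevant, beta=2.0):
    """F-beta for one question; beta=2 favors recall over precision."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f2(predictions, gold):
    """Average the per-question F2 (per-question averaging is an assumption)."""
    total = sum(f_beta(predictions.get(q, ()), gold[q]) for q in gold)
    return total / len(gold)
```

With F2, returning one extra wrong article costs less than missing a relevant one, which is why a re-ranking step that keeps recall high while trimming false positives can lift the score well above the Elasticsearch baseline.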

Algorithm design

Our algorithm includes two key components:

Elasticsearch
Cross Encoder Model

Elasticsearch

Elasticsearch is used to filter the top-k most relevant articles by BM25 score.
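To make the filtering step concrete, here is a minimal, self-contained sketch of textbook BM25 ranking over pre-tokenized documents. It is not the Elasticsearch implementation: the function name is hypothetical, tokenization is left to the caller, and `k1=1.2`, `b=0.75` are the commonly cited defaults rather than values confirmed by this repository.

```python
import math
from collections import Counter


def bm25_top_k(query, docs, k=10, k1=1.2, b=0.75):
    """Rank docs (lists of tokens) against query tokens; return top-k (index, score)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the pipeline described here, the returned indices would correspond to the candidate articles handed to the cross encoder.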

Cross Encoder Model

Our model takes a query, the article text (passage), and the article title as inputs and outputs a relevance score for that query-article pair. The higher the score, the more relevant the article. We use the pretrained vinai/phobert-base model, with CrossEntropyLoss or BCELoss as the loss function.
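The two loss options are closely related for a binary relevant/non-relevant decision. The sketch below, in plain Python with illustrative function names, checks numerically that a two-class softmax cross-entropy over logits `(non_relevant, relevant)` equals binary cross-entropy applied to the sigmoid of the logit difference. How the repository wires its output head to either loss is not stated in this README, so treat this as background, not as the project's code.

```python
import math


def cross_entropy_2class(logits, label):
    """Softmax cross-entropy for logits (non_relevant, relevant), label 0 or 1."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def bce_with_logit(logit, label):
    """Binary cross-entropy on sigmoid(logit), label 0 or 1."""
    # softplus(x) = log(1 + exp(x)), computed in a numerically stable way.
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def relevance_probability(logits):
    """Probability of the relevant class: sigmoid of the logit difference."""
    return 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))
```

In PyTorch, `CrossEntropyLoss` expects raw logits while `BCELoss` expects probabilities that have already passed through a sigmoid, so the output layer has to match whichever loss is chosen.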

Train dataset

Non-relevant samples in the dataset are taken from the top-10 results of Elasticsearch. The training data (train_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ],
        "non_relevant_articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
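As an illustration of how this layout maps to training examples, the following sketch flattens each record into labeled (question, title, text) pairs, with label 1 for relevant articles and 0 for non-relevant ones. The function names and the output dictionary shape are assumptions made for this example; the repository's own data loader may be organized differently.

```python
import json


def build_training_pairs(records):
    """Turn records in the train_data_model.json layout into labeled pairs."""
    pairs = []
    for record in records:
        for key, label in (("relevant_articles", 1), ("non_relevant_articles", 0)):
            for article in record.get(key, []):
                pairs.append({
                    "question": record["question"],
                    "title": article["title"],
                    "text": article["text"],
                    "label": label,
                })
    return pairs


def load_training_pairs(path="train_data_model.json"):
    """Read the JSON file and flatten it (the default path follows the README)."""
    with open(path, encoding="utf-8") as handle:
        return build_training_pairs(json.load(handle))
```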

Test dataset

First, we use Elasticsearch to obtain k candidate articles (the top-50 results, so k=50). LTR_CrossEncoder then classifies which of these candidates are actually relevant. The test data (test_data_model.json) has the following format:

[
    {
        "question_id": "...",
        "question": "...",
        "articles": [
            {
                "law_id": "...",
                "article_id": "...",
                "title": "...",
                "text": "..."
            },
            ...
        ]
    },
    ...
]
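The second stage can be sketched as follows: score every candidate in `articles` and keep those whose score clears a threshold. Here `score_fn` is a hypothetical stand-in for the trained cross encoder, and both the 0.5 threshold and the fallback to the single best candidate are assumptions for illustration, not behavior confirmed by the repository.

```python
def rerank(question, articles, score_fn, threshold=0.5):
    """Score each candidate and keep those at or above the threshold.

    score_fn(question, title, text) -> float is a hypothetical stand-in
    for the trained cross encoder.
    """
    scored = [
        (score_fn(question, a["title"], a["text"]), a["law_id"], a["article_id"])
        for a in articles
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = [item for item in scored if item[0] >= threshold]
    # Assumption: if nothing clears the threshold, return the best candidate.
    if not kept and scored:
        kept = scored[:1]
    return [{"law_id": law, "article_id": art, "score": s} for s, law, art in kept]
```

Because the evaluation metric favors recall, a lower threshold may be worth trying, at the cost of more false positives.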

Training

Run the following bash script to train the model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Download them if you want to run inference on a new text file without training the model from scratch. Create a new checkpoint folder, unzip the model file, and place it in that folder: https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash script to run inference on the test dataset:

bash run_predict.sh

Owner

Xuan Hieu Duong
