Legal text retrieval for Python

Last update: Dec 06, 2022

legal-text-retrieval

Overview

The system works in two steps:

- Generate training data in which negative samples are selected by a mixed score, cosine(tfidf) + bm25, taken from the 150 most similar law articles.
- Fine-tune the PhoBERT model (and, optionally, the NlpHUST model) on the generated data.
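
The negative-sampling step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/data_generator.py`: the whitespace tokenizer, the shared smoothed IDF, the BM25 parameters (`k1=1.5`, `b=0.75`), the scaling of BM25 into [0, 1] before the sum, and the function names `tfidf_cosine`, `bm25_scores` and `hard_negatives` are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokens; the project segments text with VnCoreNLP.
    return text.lower().split()


def idf_table(docs_tokens):
    # Smoothed IDF, shared by both scorers (an assumption of this sketch).
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    return {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}


def tfidf_cosine(query, docs_tokens, idf):
    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def norm(v):
        return math.sqrt(sum(x * x for x in v.values()))

    q = vec(query)
    qn = norm(q)
    scores = []
    for doc in docs_tokens:
        d = vec(doc)
        dn = norm(d)
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        scores.append(dot / (qn * dn) if qn and dn else 0.0)
    return scores


def bm25_scores(query, docs_tokens, idf, k1=1.5, b=0.75):
    avgdl = sum(len(d) for d in docs_tokens) / len(docs_tokens)
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            f = tf.get(t, 0)
            if f:
                denom = f + k1 * (1 - b + b * len(doc) / avgdl)
                s += idf.get(t, 0.0) * f * (k1 + 1) / denom
        scores.append(s)
    return scores


def hard_negatives(question, articles, positive_ids, topk=150):
    # Rank articles by cosine(tfidf) + bm25 and keep the top-k non-relevant ones.
    docs = [tokenize(a) for a in articles]
    idf = idf_table(docs)
    q = tokenize(question)
    cos = tfidf_cosine(q, docs, idf)
    bm = bm25_scores(q, docs, idf)
    top = max(bm) or 1.0  # assumption: scale BM25 into [0, 1] before summing
    mixed = [c + s / top for c, s in zip(cos, bm)]
    ranked = sorted(range(len(articles)), key=lambda i: mixed[i], reverse=True)
    return [i for i in ranked[:topk] if i not in positive_ids]


articles = [
    "penalty for driving without a license",
    "registration of a new company",
    "penalty for driving over the speed limit",
]
question = "what is the penalty for driving without a license"
negatives = hard_negatives(question, articles, positive_ids={0}, topk=2)
```

Here article 0 is the labelled answer, so the returned indices are the highest-scoring articles that are not relevant. Such "hard" negatives look similar to the question on the surface, which is what makes them useful for fine-tuning.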

Environments

git clone https://github.com/vncorenlp/VnCoreNLP.git vncorenlp_data # for the VnCoreNLP tokenizer library

conda create -n legal_retrieval_env python=3.8
conda activate legal_retrieval_env
pip install -r requirements.txt

Run

Generate data from the folder data/zac2021-ltr-data/, which contains public_test_question.json and train_question_answer.json:
```
python3 src/data_generator.py --path_folder_base data/zac2021-ltr-data/ --test_file public_test_question.json --topk 150  --tok --path_output_dir data/zalo-tfidfbm25150-full
```
Note:
- --test_file public_test_question.json is optional. If it is omitted, a random 33% of train_question_answer.json is used as the test set.
- --path_output_dir is the folder where the 3 output files (train.csv, dev.csv, test.csv) and the tfidf classifier (tfidf_classifier.pkl), used to retrieve the top k most relevant documents, are saved.
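
When `--test_file` is not given, the hold-out behaviour amounts to a random split of the training questions. A rough sketch, assuming a fixed seed and a hypothetical helper name `split_questions` (the actual logic lives in `src/data_generator.py` and may differ):

```python
import random


def split_questions(items, test_ratio=0.33, seed=42):
    # Shuffle a copy so the caller's sequence is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train_items, test_items = split_questions(range(100))
```

Fixing the seed keeps the split reproducible across runs, which matters when results from different model configurations are compared on the same test set.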

Train model

bash scripts/run_finetune_bert.sh "magic"  vinai/phobert-base  ../  data/zalo-tfidfbm25150-full Tfbm150E5-full 5

Predict
```
python3 src/infer.py 
```
Note: This script loads the model and runs prediction. Check the model_configs variable in src/infer.py and modify it as needed.

License

MIT-licensed.

Citation

Please cite as:

@article{DBLP:journals/corr/abs-2106-13405,
  author    = {Ha{-}Thanh Nguyen and
               Phuong Minh Nguyen and
               Thi{-}Hai{-}Yen Vuong and
               Quan Minh Bui and
               Chau Minh Nguyen and
               Tran Binh Dang and
               Vu Tran and
               Minh Le Nguyen and
               Ken Satoh},
  title     = {{JNLP} Team: Deep Learning Approaches for Legal Processing Tasks in
               {COLIEE} 2021},
  journal   = {CoRR},
  volume    = {abs/2106.13405},
  year      = {2021},
  url       = {https://arxiv.org/abs/2106.13405},
  eprinttype = {arXiv},
  eprint    = {2106.13405},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2106-13405.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2011-08071,
  author    = {Ha{-}Thanh Nguyen and
               Hai{-}Yen Thi Vuong and
               Phuong Minh Nguyen and
               Tran Binh Dang and
               Quan Minh Bui and
               Vu Trong Sinh and
               Chau Minh Nguyen and
               Vu D. Tran and
               Ken Satoh and
               Minh Le Nguyen},
  title     = {{JNLP} Team: Deep Learning for Legal Processing in {COLIEE} 2020},
  journal   = {CoRR},
  volume    = {abs/2011.08071},
  year      = {2020},
  url       = {https://arxiv.org/abs/2011.08071},
  eprinttype = {arXiv},
  eprint    = {2011.08071},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2011-08071.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Owner

Nguyễn Minh Phương
