chaii - hindi & tamil question answering

Overview

This is the 5th-place solution for the Kaggle competition chaii - Hindi and Tamil Question Answering. The competition can be found here: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering

Datasets required

Download the SQuAD v2.0 data from https://rajpurkar.github.io/SQuAD-explorer/ into a new input folder:

$ mkdir input && cd input
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
$ wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Download the TyDi QA (GoldP) data into the input folder:

$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-train.json
$ wget https://storage.googleapis.com/tydiqa/v1.1/tydiqa-goldp-v1.1-dev.json

Download the Google-translated SQuAD 2.0 data (Hindi and Tamil) into the input folder: https://www.kaggle.com/tkm2261/google-translated-squad20-to-hindi-and-tamil

Download the original competition dataset into the input folder: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/data

Download the outputs of this kernel into the input folder: https://www.kaggle.com/rhtsingh/external-data-mlqa-xquad-preprocessing/
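
Before moving on, it can help to confirm that the downloads landed where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and it only checks the four files fetched with wget above, because the file names inside the Kaggle downloads are not stated here.

```python
from pathlib import Path

# File names taken from the wget commands above. The Kaggle downloads are
# left out because their file names are not stated in this README.
EXPECTED_FILES = [
    "train-v2.0.json",
    "dev-v2.0.json",
    "tydiqa-goldp-v1.1-train.json",
    "tydiqa-goldp-v1.1-dev.json",
]


def missing_inputs(input_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in input_dir."""
    root = Path(input_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains "input".
    missing = missing_inputs("input")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All wget-downloaded files are present.")
```

An empty list from `missing_inputs` means the wget-downloaded files are in place; the Kaggle datasets still need to be checked by hand.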

You now have all the data needed to train the models. The next step is to create folds and munge the data a bit.

To create folds, please use the following command:

$ cd src
$ python create_folds.py
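
The contents of create_folds.py are not shown here, so the following is only a sketch of what fold assignment could look like, not the repository's implementation. It assumes five folds (matching the `--fold 0` to `--fold 4` arguments used in training), a hypothetical `kfold` output field, and a `language` field on each row used to keep Hindi and Tamil examples balanced across folds.

```python
import random
from collections import defaultdict


def assign_folds(rows, n_folds=5, seed=42, group_key="language"):
    """Return copies of rows with a "kfold" field, balanced within each group.

    The "kfold" and "language" field names are assumptions for illustration.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, row in enumerate(rows):
        by_group[row[group_key]].append(index)

    folds = [None] * len(rows)
    for indices in by_group.values():
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so each fold gets a near-equal share.
        for position, index in enumerate(indices):
            folds[index] = position % n_folds

    return [dict(row, kfold=fold) for row, fold in zip(rows, folds)]


if __name__ == "__main__":
    sample = [{"id": i, "language": "hindi" if i % 3 else "tamil"} for i in range(30)]
    for row in assign_folds(sample)[:5]:
        print(row)
```

Dealing indices round-robin within each language keeps the per-fold language mix close to that of the full dataset, which matters when one language has far fewer examples than the other.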

To munge the datasets and prepare them for training, please run the following command:

$ cd src
$ python munge_data.py
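
The contents of munge_data.py are not shown here. One step such a script plausibly needs is flattening the SQuAD-style JSON files (SQuAD v2.0 and TyDi QA GoldP both use the nested `data` / `paragraphs` / `qas` layout) into one row per question. The sketch below illustrates that step only. The function names and output field names are hypothetical, and dropping unanswerable questions is an assumption, not something this README states.

```python
import json


def flatten_squad(dataset):
    """Turn a SQuAD-style dict into a flat list of question-answer rows."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                if not answers:
                    # SQuAD v2.0 contains unanswerable questions; this sketch
                    # skips them (an assumption, not the repository's rule).
                    continue
                first = answers[0]
                rows.append({
                    "id": qa["id"],
                    "context": context,
                    "question": qa["question"],
                    "answer_text": first["text"],
                    "answer_start": first["answer_start"],
                })
    return rows


def load_squad_file(path):
    """Read a SQuAD-style JSON file from disk and flatten it."""
    with open(path, encoding="utf-8") as handle:
        return flatten_squad(json.load(handle))
```

For example, `load_squad_file("input/train-v2.0.json")` would return a list of dictionaries that can be written to CSV or merged with the competition data.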

Training

Two of the models are trained on GPUs and one requires TPUs.

- GPU models: XLM-Roberta and Rembert
- TPU model: Muril-Large

XLM-Roberta:

$ cd src
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 0
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 1
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 2
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 3
$ TOKENIZERS_PARALLELISM=false python xlm_roberta.py --fold 4

Rembert:

$ cd src
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 0
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 1
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 2
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 3
$ TOKENIZERS_PARALLELISM=false python rembert.py --fold 4

Muril-Large:

Please note that training this model requires TPUs.

$ cd src
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 0
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 1
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 2
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 3
$ TOKENIZERS_PARALLELISM=false python muril_large.py --fold 4
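
The fifteen training commands above follow one pattern (three scripts, folds 0 to 4), so they can be generated instead of typed. The following is a minimal sketch and not part of the repository: `build_commands` and `run_all` are hypothetical helpers, and the script names and the `--fold` flag are taken from the commands above. It defaults to a dry run that only prints the commands, since muril_large.py needs a TPU machine while the other two scripts need GPUs.

```python
import os
import subprocess

# Script names as they appear in the training commands above.
SCRIPTS = ["xlm_roberta.py", "rembert.py", "muril_large.py"]
FOLDS = range(5)


def build_commands(scripts=SCRIPTS, folds=FOLDS):
    """Return one argument list per (script, fold) pair, script-major order."""
    return [
        ["python", script, "--fold", str(fold)]
        for script in scripts
        for fold in folds
    ]


def run_all(commands, cwd="src", dry_run=True):
    """Print each command and, unless dry_run is set, run it inside cwd."""
    env = dict(os.environ, TOKENIZERS_PARALLELISM="false")
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing fold.
            subprocess.run(command, cwd=cwd, env=env, check=True)


if __name__ == "__main__":
    # Dry run over the GPU models only; muril_large.py is left for a TPU host.
    run_all(build_commands(scripts=["xlm_roberta.py", "rembert.py"]))
```

Passing `dry_run=False` runs the folds one after another, which is the same as executing the listed commands in order.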

Inference

After training all the models, the outputs were pushed to Kaggle Datasets.

The final model datasets can be found here:

- https://www.kaggle.com/abhishek/xlmrobertalargewithsquadv2tydiqasqdtrans384f
- https://www.kaggle.com/ubamba98/modelsrembertwithsquadv2tydiqa384
- https://www.kaggle.com/ubamba98/murillargecasedchaii

And the final inference kernel can be found here: https://www.kaggle.com/abhishek/chaii-xlm-roberta-x-muril-x-rembert-score-based

Solution writeup: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/288049

Owner
abhishek thakur
Kaggle: www.kaggle.com/abhishek