RoBERTa Marathi language model trained from scratch during the Hugging Face 🤗 x Flax community week

Overview

RoBERTa base model for Marathi Language (मराठी भाषा)

Pretrained model on the Marathi language using a masked language modeling (MLM) objective. RoBERTa was introduced in this paper and first released in this repository. We trained a RoBERTa model for Marathi during the community week hosted by Hugging Face 🤗, using JAX/Flax for NLP & CV.


Model description

Marathi RoBERTa is a transformers model pretrained on a large corpus of Marathi data in a self-supervised fashion.

Intended uses & limitations ❗️

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. We fine-tuned this model on the iNLTK and IndicNLP news text classification tasks. Since the Marathi mC4 dataset was built by scraping Marathi newspaper text, it contains biases which will also affect all fine-tuned versions of this model.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-base-mr')
>>> unmasker("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल")
[{'score': 0.057209037244319916,
  'sequence': 'मोठी बातमी! उद्या दुपारी आठ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 2226,
  'token_str': 'आठ'},
 {'score': 0.02796074189245701,
  'sequence': 'मोठी बातमी! उद्या दुपारी २० वाजता जाहीर होणार दहावीचा निकाल',
  'token': 987,
  'token_str': '२०'},
 {'score': 0.017235398292541504,
  'sequence': 'मोठी बातमी! उद्या दुपारी नऊ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 4080,
  'token_str': 'नऊ'},
 {'score': 0.01691395975649357,
  'sequence': 'मोठी बातमी! उद्या दुपारी २१ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 1944,
  'token_str': '२१'},
 {'score': 0.016252165660262108,
  'sequence': 'मोठी बातमी! उद्या दुपारी  ३ वाजता जाहीर होणार दहावीचा निकाल',
  'token': 549,
  'token_str': ' ३'}]
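
You can also load the tokenizer and model directly, e.g., to inspect the logits yourself. A minimal sketch (this assumes PyTorch weights are available on the hub; pass from_flax=True to from_pretrained if only Flax weights exist):

>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-base-mr')
>>> model = AutoModelForMaskedLM.from_pretrained('flax-community/roberta-base-mr')
>>> inputs = tokenizer("मोठी बातमी! उद्या दुपारी <mask> वाजता जाहीर होणार दहावीचा निकाल", return_tensors='pt')
>>> logits = model(**inputs).logits  # vocabulary scores for every position, including the <mask> slot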

Training data 🏋🏻‍♂️

The RoBERTa Marathi model was pretrained on the mr split of the multilingual C4 (mC4) dataset:

C4 (Colossal Clean Crawled Corpus), introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The dataset can be downloaded in pre-processed form from AllenNLP or from Hugging Face's datasets library (the mC4 dataset). The Marathi (mr) split consists of 14 billion tokens across 7.8 million documents, weighing ~70 GB of text.

Data Cleaning 🧹

Though the initial mC4 Marathi corpus is ~70 GB, data exploration showed that it contains documents in other languages, especially Thai, Chinese, etc. So we had to clean the dataset before training the tokenizer and model. The results after cleaning the Marathi mC4 corpus were surprising:

Train set:

  • Clean docs count: 1,581,396 out of 7,774,331.
  • ~20.34% of the whole Marathi train split is actually Marathi.

Validation set:

  • Clean docs count: 1,700 out of 7,928.
  • ~19.90% of the whole Marathi validation split is actually Marathi.
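
The exact cleaning rule isn't spelled out above; purely as an illustration, a simple filter in this spirit keeps only documents dominated by Devanagari characters (the helper names and the 0.5 threshold are assumptions, not the rule actually used):

def devanagari_ratio(text):
    """Fraction of non-space characters in the Devanagari block (U+0900–U+097F)."""
    chars = [c for c in text if not c.isspace()]
    return sum('\u0900' <= c <= '\u097f' for c in chars) / len(chars) if chars else 0.0

def is_marathi(doc, threshold=0.5):
    # Keep documents that are mostly Devanagari; this drops Thai, Chinese, Latin, etc.
    # (It would not separate Marathi from Hindi, which shares the script.)
    return devanagari_ratio(doc) >= threshold

raw_docs = ['मोठी बातमी! निकाल उद्या.', 'สวัสดีครับ']  # illustrative stand-in for the raw mC4 'mr' docs
clean_docs = [doc for doc in raw_docs if is_marathi(doc)]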

Training procedure 👨🏻‍💻

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with <s> and the end of one by </s>. The details of the masking procedure for each sentence are the following (see the sketches after the list):

  • 15% of the tokens are masked.
  • In 80% of the cases, the masked tokens are replaced by <mask>.
  • In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
  • In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
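
For illustration, a byte-level BPE tokenizer with this vocabulary size can be trained with the Hugging Face tokenizers library. A minimal sketch (the corpus file name and min_frequency are assumptions, not the exact settings used):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=['mr_clean_train.txt'],   # assumed path to the cleaned Marathi corpus
    vocab_size=50265,
    min_frequency=2,                # illustrative choice
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],
)
tokenizer.save_model('roberta-base-mr-tokenizer')

The 80/10/10 dynamic masking rule can likewise be sketched in NumPy (not the exact data collator used for this run; in practice special tokens are also excluded from masking):

import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, rng, mlm_prob=0.15):
    """Resampled on every call, so masks differ across epochs (dynamic masking)."""
    input_ids = input_ids.copy()                      # input_ids: int NumPy array
    labels = np.full_like(input_ids, -100)            # -100 = ignored by the MLM loss
    masked = rng.random(input_ids.shape) < mlm_prob   # select 15% of the tokens
    labels[masked] = input_ids[masked]
    replaced = masked & (rng.random(input_ids.shape) < 0.8)                # 80% -> <mask>
    input_ids[replaced] = mask_token_id
    random_tok = masked & ~replaced & (rng.random(input_ids.shape) < 0.5)  # 10% -> random token
    input_ids[random_tok] = rng.integers(0, vocab_size, input_ids.shape)[random_tok]
    # the remaining 10% of selected tokens are left unchanged
    return input_ids, labels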

Pretraining

The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of disk, 96 CPU cores), i.e., 8 TPU v3 cores, for 42K steps with a batch size of 128 and a sequence length of 128. The optimizer used is Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-8, a weight decay of 0.01, learning rate warmup for 1,000 steps and linear decay of the learning rate after.
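
In optax terms (the optimizer library commonly used with Flax), the stated hyperparameters look roughly like the sketch below; this illustrates the schedule and optimizer only, not the exact training script:

import optax

total_steps, warmup_steps = 42_000, 1_000
# Linear warmup from 0 to 3e-4 over the first 1,000 steps, then linear decay to 0.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 3e-4, transition_steps=warmup_steps),
        optax.linear_schedule(3e-4, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)
# Adam with decoupled weight decay (AdamW), matching the β1, β2, ε and weight decay above.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01)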

We tracked experiments and hyperparameter tuning on the Weights & Biases platform. Here is the link to the main dashboard:
Link to Weights and Biases Dashboard for Marathi RoBERTa model

Pretraining Results 📊

The RoBERTa model reached an eval accuracy of 85.28% at around 35K steps, with a train loss of 0.6507 and an eval loss of 0.6219.

Fine-tuning on downstream tasks

We performed fine-tuning on downstream tasks, using the following datasets for classification (a fine-tuning sketch follows the list):

  1. IndicNLP Marathi news classification
  2. iNLTK Marathi news headline classification
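
A hedged sketch of how such a fine-tuning run can be set up with the transformers Trainer API; the dataset variables (train_ds, eval_ds), output path, and hyperparameters below are illustrative assumptions, not the exact setup used:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('flax-community/roberta-base-mr')
model = AutoModelForSequenceClassification.from_pretrained(
    'flax-community/roberta-base-mr', num_labels=3)  # both datasets have 3 classes

args = TrainingArguments(
    output_dir='mr-news-classifier',   # assumed output path
    num_train_epochs=3,                # illustrative choice
    per_device_train_batch_size=32,    # illustrative choice
    learning_rate=2e-5,                # illustrative choice
)
# train_ds / eval_ds: datasets tokenized with the tokenizer above,
# carrying 'input_ids', 'attention_mask' and 'label' columns.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
trainer.evaluate()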

Fine-tuning results on downstream tasks (segregated)

1. IndicNLP Marathi news classification

The IndicNLP Marathi news dataset consists of 3 classes - ['lifestyle', 'entertainment', 'sports'] - with the following document distribution across splits:

| train | eval | test |
|-------|------|------|
| 9672  | 477  | 478  |

💯 Our Marathi RoBERTa (roberta-base-mr) model outperformed both classifiers mentioned in Arora, G. (2020) (iNLTK) and Kunchukuttan, Anoop et al. (AI4Bharat-IndicNLP).

| Dataset | FT-W | FT-WC | INLP | iNLTK | roberta-base-mr 🏆 |
|---------|------|-------|------|-------|---------------------|
| iNLTK Headlines | 83.06 | 81.65 | 89.92 | 92.4 | 97.48 |

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK Headlines classification dataset:

flax-community/mr-indicnlp-classifier

🧪 Fine-tuning experiment's Weights & Biases dashboard link

2. iNLTK Marathi news headline classification

This dataset consists of 3 classes - ['state', 'entertainment', 'sports'] - with the following document distribution across splits:

| train | eval | test |
|-------|------|------|
| 9658  | 1210 | 1210 |

💯 Here as well, roberta-base-mr outperformed the iNLTK Marathi news text classifier.

| Dataset | iNLTK ULMFiT | roberta-base-mr 🏆 |
|---------|--------------|---------------------|
| iNLTK news dataset (kaggle) | 92.4 | 94.21 |

🤗 Hugging Face Model Hub repo:
roberta-base-mr fine-tuned on the iNLTK news classification dataset:

flax-community/mr-inltk-classifier

Fine-tuning experiment's Weights & Biases dashboard link

Want to check how the above models generalize on real-world Marathi data?

Head to 🤗 Hugging Face Spaces 🪐 to play with all three models:

  1. Masked Language Modelling with the pretrained Marathi RoBERTa model:
    flax-community/roberta-base-mr
  2. Marathi Headline classifier:
    flax-community/mr-indicnlp-classifier
  3. Marathi news classifier:
    flax-community/mr-inltk-classifier
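
If you would rather query the classifiers locally than through Spaces, a minimal sketch using the text-classification pipeline (the example headline is illustrative):

>>> from transformers import pipeline
>>> classifier = pipeline('text-classification', model='flax-community/mr-indicnlp-classifier')
>>> classifier('मोठी बातमी! उद्या दुपारी जाहीर होणार दहावीचा निकाल')  # returns the predicted label and score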

[Image: Streamlit app of the pretrained Marathi RoBERTa model on Hugging Face Spaces]


Team Members

Credits

Huge thanks to Hugging Face 🤗 and the Google JAX/Flax team for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to @patil-suraj & @patrickvonplaten for mentoring during the whole week.

Owner
Nipun Sadvilkar
I like to explore the jungle of data with Python as my Swiss knife, with pandas, numpy, matplotlib and scikit-learn as its multi-tools 😅