DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Last update: Jan 07, 2023

DziriBERT

DziriBERT is the first Transformer-based language model pre-trained specifically for the Algerian dialect. It handles Algerian text written in both Arabic and Latin characters. It sets new state-of-the-art results on Algerian text classification datasets, even though it was pre-trained on much less data (~1 million tweets).

The model is publicly available at: https://huggingface.co/alger-ia/dziribert.

For more information, please see our paper: https://arxiv.org/pdf/2109.12346.pdf

Evaluation

The Twifil dataset was used to compare DziriBERT with current multilingual, standard Arabic and dialectal Arabic models:

Model	Sentiment acc.	Emotion acc.
bert-base-multilingual-cased	73.6 %	59.4 %
aubmindlab/bert-base-arabert	72.1 %	61.2 %
CAMeL-Lab/bert-base-arabic-camelbert-mix	77.1 %	65.7 %
qarib/bert-base-qarib	77.7 %	67.6 %
UBC-NLP/MARBERT	80.1 %	68.4 %
alger-ia/dziribert	80.3 %	69.3 %
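
To make the size of the improvement concrete, the short sketch below computes DziriBERT's margin over each baseline in percentage points. The figures are copied from the table above; the names `RESULTS` and `margins_over_baselines` are illustrative and not part of the repository.

```python
# Accuracy figures (in %) copied from the table above: (sentiment, emotion).
RESULTS = {
    "bert-base-multilingual-cased": (73.6, 59.4),
    "aubmindlab/bert-base-arabert": (72.1, 61.2),
    "CAMeL-Lab/bert-base-arabic-camelbert-mix": (77.1, 65.7),
    "qarib/bert-base-qarib": (77.7, 67.6),
    "UBC-NLP/MARBERT": (80.1, 68.4),
    "alger-ia/dziribert": (80.3, 69.3),
}


def margins_over_baselines(results, target="alger-ia/dziribert"):
    """Return {baseline: (sentiment_gain, emotion_gain)} in percentage points."""
    target_sent, target_emo = results[target]
    return {
        name: (round(target_sent - sent, 1), round(target_emo - emo, 1))
        for name, (sent, emo) in results.items()
        if name != target
    }


# Print one line per baseline with DziriBERT's gain on each task.
for name, (sent_gain, emo_gain) in margins_over_baselines(RESULTS).items():
    print(f"{name}: +{sent_gain:.1f} sentiment, +{emo_gain:.1f} emotion")
```

On these numbers, the closest baseline is UBC-NLP/MARBERT, which DziriBERT exceeds by 0.2 points on sentiment and 0.9 points on emotion.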

To reproduce these results, first install the requirements:

pip install -r requirements.txt

Then, run the following evaluation script:

python3 evaluate_model.py

These results were obtained on a Tesla K80 GPU.

Pretrained DziriBERT

DziriBERT has been uploaded to the HuggingFace hub in order to facilitate its use: https://huggingface.co/alger-ia/dziribert.

It can be easily downloaded and loaded using the transformers library:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("alger-ia/dziribert")
model = BertForMaskedLM.from_pretrained("alger-ia/dziribert")
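
Since the checkpoint is loaded as a masked language model, one way to try it is through the transformers `fill-mask` pipeline. The following is a minimal sketch, not the authors' evaluation code: the helper names `predict_masked` and `format_predictions` are hypothetical, and it assumes the tokenizer uses the standard BERT `[MASK]` token.

```python
def predict_masked(text, top_k=5, model_name="alger-ia/dziribert"):
    """Return the pipeline's top_k predictions for the single [MASK] in `text`."""
    # Imported lazily so that importing this module does not require transformers.
    from transformers import pipeline

    # Downloads the model from the HuggingFace hub on first use.
    fill_mask = pipeline("fill-mask", model=model_name)
    return fill_mask(text, top_k=top_k)


def format_predictions(predictions):
    """Turn fill-mask output dicts into 'token<TAB>score' lines."""
    return [f"{p['token_str']}\t{p['score']:.3f}" for p in predictions]
```

For example, `format_predictions(predict_masked("أنا من الجزائر من ولاية [MASK]"))` would list the most likely completions for the masked word; the sentence here is only an illustrative input, and the first call requires network access to fetch the model.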

How to cite

@article{dziribert,
  title={DziriBERT: a Pre-trained Language Model for the Algerian Dialect},
  author={Abdaoui, Amine and Berrimi, Mohamed and Oussalah, Mourad and Moussaoui, Abdelouahab},
  journal={arXiv preprint arXiv:2109.12346},
  year={2021}
}

Contact

Please contact [email protected] with any questions, feedback, or requests.
