DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Last update: Jan 07, 2023

Overview

DziriBERT

DziriBERT is the first Transformer-based language model pre-trained specifically for the Algerian Dialect. It handles Algerian text written in both Arabic and Latin characters. It sets new state-of-the-art results on Algerian text classification datasets, even though it was pre-trained on much less data (~1 million tweets).

The model is publicly available at: https://huggingface.co/alger-ia/dziribert.

For more information, please see our paper: https://arxiv.org/pdf/2109.12346.pdf

Evaluation

The Twifil dataset was used to compare DziriBERT with current multilingual, standard Arabic and dialectal Arabic models:

Model	Sentiment acc.	Emotion acc.
bert-base-multilingual-cased	73.6 %	59.4 %
aubmindlab/bert-base-arabert	72.1 %	61.2 %
CAMeL-Lab/bert-base-arabic-camelbert-mix	77.1 %	65.7 %
qarib/bert-base-qarib	77.7 %	67.6 %
UBC-NLP/MARBERT	80.1 %	68.4 %
alger-ia/dziribert	80.3 %	69.3 %

To reproduce these results, first install the requirements:

pip install -r requirements.txt

Then, run the following evaluation script:

python3 evaluate_model.py

These results were obtained on a Tesla K80 GPU.
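
The accuracies in the table above are the percentage of correctly classified tweets. The following is a minimal sketch of how a classifier built on DziriBERT could be set up and scored. It is not the repository's evaluate_model.py: the helper names `build_classifier` and `accuracy_percent` are hypothetical, and `num_labels` must be set to the number of classes in the task at hand.

```python
def build_classifier(num_labels, model_name="alger-ia/dziribert"):
    """Load DziriBERT with a freshly initialised classification head.

    Hypothetical helper: the head is untrained and must be fine-tuned
    on labelled data before its predictions are meaningful.
    """
    # Imported lazily so accuracy_percent() works without transformers installed.
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


def accuracy_percent(logits, labels):
    """Return the percentage of rows whose highest-scoring class matches the label.

    `logits` is a list of per-class score lists, `labels` a list of class indices.
    """
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and the same length")
    # Predicted class = index of the largest score in each row.
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return 100.0 * correct / len(labels)
```

`accuracy_percent` takes plain Python lists, so model outputs held as tensors would first need converting, for example with `.tolist()`.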

Pretrained DziriBERT

DziriBERT has been uploaded to the HuggingFace hub in order to facilitate its use: https://huggingface.co/alger-ia/dziribert.

It can be easily downloaded and loaded using the transformers library:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("alger-ia/dziribert")
model = BertForMaskedLM.from_pretrained("alger-ia/dziribert")
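
Once loaded, the model can be used for masked-token prediction. The following is a minimal sketch, assuming the `fill-mask` pipeline of the transformers library. The helper names `load_fill_mask` and `top_tokens` are hypothetical and not part of the DziriBERT repository, and the sentence in the usage comment is only illustrative.

```python
def load_fill_mask(model_name="alger-ia/dziribert"):
    """Build a fill-mask pipeline (downloads the model on first call)."""
    # Imported lazily so top_tokens() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)


def top_tokens(fill_mask, text, k=3):
    """Return the k most likely (token, score) pairs for the single [MASK] in `text`.

    `fill_mask` is any callable returning a list of dicts with "token_str"
    and "score" keys, as the transformers fill-mask pipeline does when the
    input contains exactly one mask token.
    """
    predictions = fill_mask(text)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


# Usage (requires network access to fetch the model):
#   fill_mask = load_fill_mask()
#   print(top_tokens(fill_mask, "أنا من الجزائر من ولاية [MASK]"))
```

Keeping the ranking logic separate from model loading makes it possible to check the helper with a stub before downloading any weights.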

How to cite

@article{dziribert,
  title={DziriBERT: a Pre-trained Language Model for the Algerian Dialect},
  author={Abdaoui, Amine and Berrimi, Mohamed and Oussalah, Mourad and Moussaoui, Abdelouahab},
  journal={arXiv preprint arXiv:2109.12346},
  year={2021}
}

Contact

Please contact [email protected] with any questions, feedback, or requests.
