darija <-> english dictionary

Last update: Jan 01, 2023

Overview

darija-dictionary

Building advanced IT solutions that are well adapted to the Moroccan context inevitably requires understanding the Moroccan dialect. Hence, darija (the Moroccan dialect) should be an active player in the domain of Natural Language Processing (NLP).

However, step 0 in any serious engagement with darija in NLP is translating its vocabulary into English, the most widely used and best documented language in this field.

This open source project aims to be a reference in addressing this issue. We hope the Moroccan IT community will contribute to building the largest dataset of darija-English vocabulary, which will serve as a foundation for future NLP applications that benefit Moroccan people.

How to contribute

We've made a tutorial for you on DODa's website.

Guidelines / Recommendations

3ndk ح dir ح xD (shout-out to this guy 😆). As a rule, try to use:

darija	3	7	9	8	2 - 'a' - 'i'	5 - 'kh'
arabic	ع	ح	ق	ه	همزة	خ

Try to use capitalization to differentiate between the following letters:

t	T	s	S	d	D
ت	ط	س	ص	د	ض

Arabic characters with a two-letter Latin equivalent:

Arabic alphabet	ش	غ	خ
Latin alphabet	ch	gh	kh

Double a character to mark emphasis, or "الشدة":

darija	7mam	7mmam
english	pigeons	bathroom
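
The conventions in the tables above can be sketched as a small lookup. This is only an illustration, not part of the DODa tooling: the names `DIGRAPHS`, `SINGLE`, and `map_known_letters` are hypothetical, the digit 2 is assumed to stand for the hamza character ء, and any letter not listed in the tables (including vowels and doubled letters) is left untouched.

```python
# Hypothetical lookup tables built from the guideline tables above.
# Two-letter sequences are checked first so "kh" wins over a lone "k" or "h".
DIGRAPHS = {"ch": "ش", "gh": "غ", "kh": "خ"}
SINGLE = {
    "3": "ع", "7": "ح", "9": "ق", "8": "ه", "5": "خ",
    "2": "ء",  # assumption: 2 stands for the hamza character
    "t": "ت", "T": "ط", "s": "س", "S": "ص", "d": "د", "D": "ض",
}

def map_known_letters(word):
    """Replace only the letters covered by the guidelines; keep the rest as-is."""
    result = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            result.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in SINGLE:
            result.append(SINGLE[word[i]])
            i += 1
        else:
            result.append(word[i])  # not covered by the tables
            i += 1
    return "".join(result)

for example in ["7mam", "7mmam", "sou9", "khobz"]:
    print(example, "->", map_known_letters(example))
```

Because capitalization carries meaning here (t vs T, s vs S, d vs D), the sketch deliberately does not lowercase its input.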

We usually don't add "e" at the end of darija words: louz instead of louze
We usually don't use "Z" or "th" for ظ ، ذ ، ث, because these letters are generally not used in darija (except in northern Morocco, but for the sake of simplicity, we focus primarily on standard darija)
We do NOT use apostrophes. Since we are working with csv files, apostrophes would break words apart
We use spaces as word delimiters, not _ or -: thank you instead of thank_you
Respect the number of columns in every row you add. You can use empty quotation marks "" when you don't have extra variations
In every row, always start with the most used form (in your opinion, of course) of the word in question
Because this dataset may later be used to train deep neural networks, try to reserve each row for similar variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two different rows:

"sou9","souk","souq","market"

"marchi","","","market"
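
Before submitting new rows, a quick check along the lines of the rules above may help. The following is a minimal sketch using Python's standard `csv` module; the function name `check_rows` and the `expected_columns` parameter are hypothetical, and the check is deliberately strict (it flags every `_` or `-`, even where a hyphen might be legitimate).

```python
import csv
import io

def check_rows(csv_text, expected_columns):
    """Return (row_number, message) pairs for rows that break the guidelines."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines between rows
        if len(row) != expected_columns:
            problems.append(
                (row_number, f"expected {expected_columns} columns, found {len(row)}")
            )
        for cell in row:
            if "'" in cell:
                problems.append((row_number, f"apostrophe in {cell!r}"))
            if "_" in cell or "-" in cell:
                problems.append((row_number, f"use spaces, not _ or -, in {cell!r}"))
    return problems

sample = (
    '"sou9","souk","souq","market"\n'
    '"marchi","","","market"\n'
    '"thank_you","","chokran"\n'  # wrong column count and an underscore
)
for problem in check_rows(sample, expected_columns=4):
    print(problem)
```

The expected column count differs between files, so it is passed in rather than hard-coded.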

verbs.csv: The darija translation is reserved for the past tense with the third-person pronoun "he", whereas the other pronouns and tenses are handled in separate files. The English translation presents the basic form (or root) of the English verb.

"ghnna","ghenna","ghanna","","","","sing"
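
As an illustration of how such a row might be read, here is a short sketch. It assumes, based only on the examples shown here, that the last column holds the English translation and that every earlier column is a darija variation (possibly empty); the function name `parse_row` is hypothetical.

```python
import csv
import io

def parse_row(line):
    """Split one csv row into (non-empty darija variations, English translation)."""
    cells = next(csv.reader(io.StringIO(line)))
    *variations, english = cells  # assumption: English is the last column
    return [v for v in variations if v], english

variations, english = parse_row('"ghnna","ghenna","ghanna","","","","sing"')
print(variations, english)
```

Dropping the empty cells only affects the parsed result; the file itself should keep them so that every row has the same number of columns.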

masculine_feminine_plural.csv: The feminine-plural translation column, when a value exists, is for nouns. For adjectives, the feminine plural is the same as the feminine form.

Citation

@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Owner

DODa
