Maha is a text processing library specially developed to deal with Arabic text.


An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to join our Discord server. To submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

The documentation is hosted at ReadTheDocs.

Contributing

Maha welcomes contributions from everyone. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades to new type annotations style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
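As a rough illustration of what deprecation decorators for functions and parameters can look like, here is a minimal sketch built only on the standard library. The names deprecated_fn, deprecated_param, old_clean, and normalize are hypothetical and are not taken from Maha; the decorators Maha actually ships may differ in name, signature, and behavior.

```python
import functools
import warnings


def deprecated_fn(since: str, message: str = ""):
    """Hypothetical decorator: warn whenever the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {since}. {message}".strip(),
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def deprecated_param(name: str, since: str):
    """Hypothetical decorator: warn only when the named keyword argument is passed."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name in kwargs:
                warnings.warn(
                    f"Parameter '{name}' of {func.__name__} is deprecated since {since}.",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Illustrative functions, not part of Maha.
@deprecated_fn(since="0.2.0", message="Use a newer function instead.")
def old_clean(text):
    return text.strip()


@deprecated_param("lowercase", since="0.2.0")
def normalize(text, **kwargs):
    return text.strip()
```

Calling old_clean(...) or normalize(..., lowercase=True) emits a DeprecationWarning, while normalize(...) without the deprecated keyword stays silent.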
  • Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Check support for Python 3.10

    What does this pull request change? It checks whether the library supports Python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >>> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >>> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
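One way to get Harakat-insensitive matching can be sketched with the standard re module alone: strip the Harakat from the search expression, then allow any run of Harakat after each remaining character. This is only an illustration of the idea under stated assumptions, not Maha's implementation. The helper names below are hypothetical, the Harakat range is assumed to be U+064B through U+0652, and the expression is treated as literal text rather than a regular expression.

```python
import re

# Assumed Harakat range: fathatan (U+064B) through sukun (U+0652).
HARAKAT = "\u064B-\u0652"


def harakat_insensitive_pattern(expression: str) -> str:
    """Build a regex matching `expression` regardless of Harakat in the text."""
    # Drop Harakat from the expression, then allow any Harakat after each character.
    bare = re.sub(f"[{HARAKAT}]", "", expression)
    return "".join(re.escape(ch) + f"[{HARAKAT}]*" for ch in bare)


def remove_ignoring_harakat(text: str, expression: str) -> str:
    """Remove literal `expression` from `text`, ignoring Harakat when matching."""
    result = re.sub(harakat_insensitive_pattern(expression), "", text)
    # Collapse the double space left behind by the removed word.
    return re.sub(r" {2,}", " ", result).strip()
```

With this sketch, remove_ignoring_harakat(text, "اللغة") drops the vocalized word from the sentence in the example above while leaving the Harakat of the remaining words untouched.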
  • Wrong parsed name using name dimension

    What happened?

    The name parser extracts words that are not names, such as بي and شكرا.

    Example: text: أريد البحث في سجل الإنفاق الخاص بي [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

    I expected only names that appear in the names dataset to be extracted.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = "أريد البحث في سجل الإنفاق الخاص بي"
    >>> extracted = parse_dimension(text, names=True)
    >>> extracted
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    What problem are you trying to solve?

    Parsing a duration from text that expresses it as the span between two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the existing dimensions, particularly the duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
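The arithmetic behind the requested behavior can be sketched as follows: find a "between X and Y" year range and return the difference in years. This is a simplified standalone illustration, not Maha's parser. The pattern and the function name years_between are assumptions, and it handles only the literal form "بين <year> و <year>".

```python
import re
from typing import Optional

# Hypothetical pattern for "بين <year> و <year>" ("between <year> and <year>").
# \d also matches Arabic-Indic digits such as ١٧٠٠.
BETWEEN_YEARS = re.compile(r"بين\s+(\d+)\s+و\s*(\d+)")


def years_between(text: str) -> Optional[int]:
    """Return the number of years spanned by a 'between X and Y' range, if any."""
    match = BETWEEN_YEARS.search(text)
    if match is None:
        return None
    start, end = (int(group) for group in match.groups())
    return abs(end - start)
```

For the sentence in the example above, which mentions 1700 and 1900, this returns 200, matching the value in the requested DurationValue.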
  • Adding the parser functionality to Processors

    What problem are you trying to solve?

    Adding the parser functionality to Processors to parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    >>> parsed
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases(v0.3.0)
Owner
Mohammad Al-Fetyani
Machine Learning Engineer