MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

Last update: Oct 19, 2022

Overview

MILES

Multilingual Lexical Simplifier
Explore the docs »

Read LSBert Paper · Report Bug · Request Feature

About The Project

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking. MILES currently supports 22 languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Ukrainian.

As a result of not using any language-specific resources (WordNets, POS taggers, parallel corpora, etc.), MILES does not always offer synonymous substitutions for complex words. Although almost always simpler than the original, selected substitutions may alter the meaning of the text. Please keep this in mind, and feel free to download and tailor MILES to a language of your choosing!

Prerequisites

FastText Embeddings

It is recommended that fastText embeddings are downloaded for your target language/s. These will be used by MILES to make notably more accurate simplifications. To install fastText embeddings for MILES, download the .vec embeddings for you target language here. Once done, place the .vec file in simplifier/embeddings/ before running the key vector generation script with the ISO 639-1 code for the selected language:

python simplifier/embeddings/gen_keyed_vectors.py <ISO 639-1 code>

Usage

Flask App

MILES simplifications can be done using either a simple Flask app provided or the command line. To start using the Flask app, run app.py with ISO 639-1 language code:

python app.py -l <ISO 639-1 code>

Once running, open 127.0.0.1 in your browser and start simplifying!

Command Line

If you would prefer to use the command line, there are a couple of options available:

Simplifying sentences:

python simplify.py -t <sentence> -l <ISO 639-1 code>

Simplifying text files:

python simplify.py -f <text_file> -l <ISO 639-1 code>

Note: If no language code is provided, text will be simplified assuming it's English. The default language can be changed in simplifier/config.py.

Framework

Roadmap

See the open issues for a list of proposed features (and known issues).

Contact

If you have any questions or concerns, message me on LinkedIn or email me at [email protected].

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

Related tags

Overview

MILES

About The Project

Prerequisites

FastText Embeddings

Usage

Flask App

Command Line

Framework

Roadmap

Contact

Owner

Kane

Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

Searching keywords in PDF file folders

A demo of chinese asr

Simple Python library, distributed via binary wheels with few direct dependencies, for easily using wav2vec 2.0 models for speech recognition

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

COVID-19 Related NLP Papers

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Resources for "Natural Language Processing" Coursera course.

Train and use generative text models in a few lines of code.

Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

本插件是pcrjjc插件的重置版，可以独立于后端api运行

Spooky Skelly For Python

The tool to make NLP datasets ready to use

Biterm Topic Model (BTM): modeling topics in short texts

Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

The first online catalogue for Arabic NLP datasets.

[KBS] Aspect-based sentiment analysis via affective knowledge enhanced graph convolutional networks

Auto translate textbox from Japanese to English or Indonesia

Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch