Last update: Oct 19, 2022

Overview

MILES

Multilingual Lexical Simplifier

About The Project

MILES is a multilingual text simplifier inspired by LSBert, a BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model together with simple, language-agnostic approaches to complex word identification (CWI) and candidate ranking. MILES currently supports 22 languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Ukrainian.
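Since the scripts in this README take an ISO 639-1 code, the snippet below maps the 22 languages listed above to their standard two-letter codes. It is only a convenience sketch: the dictionary and the check_language helper are illustrative names that are not part of MILES, and the codes MILES itself accepts may differ in places (Norwegian, for example, also has the more specific codes nb and nn).

```python
# ISO 639-1 codes for the 22 languages listed in this README.
# SUPPORTED_LANGUAGES and check_language are illustrative names, not MILES APIs.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "bg": "Bulgarian", "ca": "Catalan", "cs": "Czech",
    "da": "Danish", "nl": "Dutch", "en": "English", "fi": "Finnish",
    "fr": "French", "de": "German", "hu": "Hungarian", "id": "Indonesian",
    "it": "Italian", "no": "Norwegian", "pl": "Polish", "pt": "Portuguese",
    "ro": "Romanian", "ru": "Russian", "es": "Spanish", "sv": "Swedish",
    "tr": "Turkish", "uk": "Ukrainian",
}


def check_language(code: str) -> str:
    """Return the normalised code, or raise ValueError if it is not listed."""
    normalised = code.strip().lower()
    if normalised not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalised
```

Calling check_language before launching a long-running script gives an early, readable error for a mistyped code.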

As a result of not using any language-specific resources (WordNets, POS taggers, parallel corpora, etc.), MILES does not always offer synonymous substitutions for complex words. Although almost always simpler than the original, selected substitutions may alter the meaning of the text. Please keep this in mind, and feel free to download and tailor MILES to a language of your choosing!
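To make "language-agnostic" more concrete, here is a minimal sketch of how CWI and candidate ranking could work without any language-specific resources. This README does not describe the rules MILES actually uses, so everything below is an assumption for illustration: the frequency-plus-length rule, the thresholds, and all function names are hypothetical.

```python
from collections import Counter


def build_frequencies(corpus_tokens):
    """Count lowercase token frequencies from any plain-text corpus."""
    return Counter(token.lower() for token in corpus_tokens)


def is_complex(word, freqs, min_count=2, min_length=7):
    """Hypothetical CWI rule: a word is complex if it is both rare and long."""
    w = word.lower()
    return freqs[w] < min_count and len(w) >= min_length


def rank_candidates(candidates, freqs):
    """Hypothetical ranking: most frequent first, then shorter, then alphabetical."""
    return sorted(candidates, key=lambda c: (-freqs[c.lower()], len(c), c))


# Toy corpus standing in for a large monolingual word list.
corpus = "the cat sat on the mat and the dog used the mat too".split()
freqs = build_frequencies(corpus)

flagged = is_complex("utilised", freqs)
ranked = rank_candidates(["employed", "used", "the"], freqs)
```

Note that a frequency-only ranking happily prefers "the" over "used", which illustrates the caveat above: a simpler word is not necessarily a synonym.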

Prerequisites

FastText Embeddings

It is recommended that you download fastText embeddings for your target language(s). MILES uses them to make notably more accurate simplifications. To install fastText embeddings for MILES, download the .vec embeddings for your target language. Then place the .vec file in simplifier/embeddings/ and run the key vector generation script with the ISO 639-1 code for the selected language:

python simplifier/embeddings/gen_keyed_vectors.py <ISO 639-1 code>
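For orientation, a fastText .vec file is plain text: a header line with the vocabulary size and vector dimension, followed by one line per word with its vector components. The sketch below reads that format using only the standard library and compares two words by cosine similarity. It is not what gen_keyed_vectors.py does (the README does not show that script); read_vec and cosine are hypothetical helpers, and the three-word sample is made up.

```python
import io
import math


def read_vec(handle, limit=None):
    """Parse fastText .vec text: a 'count dim' header, then 'word v1 ... vd' lines."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up sample in .vec layout; real files have far more words and dimensions.
sample = io.StringIO("3 2\nbig 1.0 0.0\nlarge 0.9 0.1\nblue 0.0 1.0\n")
dim, vecs = read_vec(sample)
near = cosine(vecs["big"], vecs["large"])
far = cosine(vecs["big"], vecs["blue"])
```

A similarity score like this is one way embeddings can help reject substitution candidates that drift too far from the original word's meaning.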

Usage

Flask App

MILES simplifications can be run either through the provided Flask app or from the command line. To start the Flask app, run app.py with the ISO 639-1 code for your language:

python app.py -l <ISO 639-1 code>

Once running, open 127.0.0.1 in your browser and start simplifying!

Command Line

If you would prefer to use the command line, there are a couple of options available:

Simplifying sentences:

python simplify.py -t <sentence> -l <ISO 639-1 code>

Simplifying text files:

python simplify.py -f <text_file> -l <ISO 639-1 code>

Note: If no language code is provided, text will be simplified assuming it's English. The default language can be changed in simplifier/config.py.
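The flags above can be modelled with argparse as in the sketch below. This is an illustration of the documented interface, not the contents of simplify.py: the parser layout, the dest names, and the DEFAULT_LANGUAGE constant are assumptions (the README only says the real default lives in simplifier/config.py).

```python
import argparse

# Assumed stand-in for the default configured in simplifier/config.py.
DEFAULT_LANGUAGE = "en"


def build_parser():
    """Build a parser mirroring the -t / -f / -l flags described above."""
    parser = argparse.ArgumentParser(description="Simplify a sentence or a text file.")
    # Exactly one input source: a sentence (-t) or a text file (-f).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-t", dest="text", help="sentence to simplify")
    source.add_argument("-f", dest="file", help="path to a text file to simplify")
    parser.add_argument("-l", dest="lang", default=DEFAULT_LANGUAGE,
                        help="ISO 639-1 language code (defaults to English)")
    return parser


# Omitting -l falls back to the default language.
args = build_parser().parse_args(["-t", "An example sentence."])
```

With this layout, passing both -t and -f, or neither, makes argparse exit with a usage error instead of guessing which input to use.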

Framework

Roadmap

See the open issues for a list of proposed features (and known issues).

Contact

If you have any questions or concerns, message me on LinkedIn or email me at [email protected].

Owner

Kane
