Lingtrain Alignment Studio is an ML based app for texts alignment on different languages.

Overview

Lingtrain Alignment Studio

asd

Intro

Lingtrain Alignment Studio is the ML based app for accurate texts alignment on different languages.

  • Extracts parallel corpora from two texts.
  • Makes the formatted parallel book from it with sentence highlightning.

Models

Automated alignment process relies on the sentence embeddings models. Embeddings are multidimensional vectors of a special kind which are used to calculate a distance between the sentences. You can also plug your own model using the interface described in models directory. Supported languages list depend on the selected backend model.

  • distiluse-base-multilingual-cased-v2
    • more reliable and fast
    • moderate weights size — 500MB
    • supports 50+ languages
    • full list of supported languages can be found in this paper
  • LaBSE (Language-agnostic BERT Sentence Embedding)
    • can be used for rare languages
    • pretty heavy weights — 1.8GB
    • supports 100+ languages
    • full list of supported languages can be found here

Running on local machine

You can run the application on your computer using docker.

  1. Make sure that docker is installed by typing the docker version command in your console.

  2. Images configured to run locally are available on Docker Hub.

  3. Run the following commads in your console:

    • docker pull lingtrain/aligner:v6
    • docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v6
    • Use lingtrain/aligner:v6-labse for LaBSE version (109 languages).
  4. App will be available in your browser on the localhost address.

  5. If you need to run the container on another port (e.g. localhost:8081):

    • Change the API_URL parameter in config.js
    • Rebuild the docker container
    • Start it with changed -p parameter (e.g. -p 8081:80)

Running in development mode

Clone this repo on your machine.

Backend

Flask/uwsgi backend REST API service. It's pretty simple and contains all the alignment logic.

cd /be python main.py

Frontend

SPA. Vue + vuex + vuetify. UI for managing alignment process using BE and a tool for translators to edit processing documents.

cd /fe

Setup

npm install

Compile and run with hot-reloads for development

npm run serve

Feedback

You can crate an issue or send me a message in telegram: @averkij

License

This work is licensed under a Attribution-NonCommercial-NoDerivatives 4.0 International license. See LICENSE.

Creative Commons License

Owner
Sergei Averkiev
Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.
Sergei Averkiev
Python/Sage Tool for deriving Scattering Matrices for WDF R-Adaptors

R-Solver A Python tools for deriving R-Type adaptors for Wave Digital Filters. This code is not quite production-ready. If you are interested in contr

8 Sep 19, 2022
2021 Machine Learning Security Evasion Competition

2021 Machine Learning Security Evasion Competition This repository contains code samples for the 2021 Machine Learning Security Evasion Competition. P

Fabrício Ceschin 8 May 01, 2022
fastFM: A Library for Factorization Machines

Citing fastFM The library fastFM is an academic project. The time and resources spent developing fastFM are therefore justified by the number of citat

1k Dec 24, 2022
My capstone project for Udacity's Machine Learning Nanodegree

MLND-Capstone My capstone project for Udacity's Machine Learning Nanodegree Lane Detection with Deep Learning In this project, I use a deep learning-b

Michael Virgo 407 Dec 12, 2022
A toolkit for geo ML data processing and model evaluation (fork of solaris)

An open source ML toolkit for overhead imagery. This is a beta version of lunular which may continue to develop. Please report any bugs through issues

Ryan Avery 4 Nov 04, 2021
Simple and flexible ML workflow engine.

This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable flow to handle requests. Engine is designed to be configurable wit

Katana ML 295 Jan 06, 2023
Turns your machine learning code into microservices with web API, interactive GUI, and more.

Turns your machine learning code into microservices with web API, interactive GUI, and more.

Machine Learning Tooling 2.8k Jan 02, 2023
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

4.1k Jan 09, 2023
This repo implements a Topological SLAM: Deep Visual Odometry with Long Term Place Recognition (Loop Closure Detection)

This repo implements a topological SLAM system. Deep Visual Odometry (DF-VO) and Visual Place Recognition are combined to form the topological SLAM system.

Best of Australian Centre for Robotic Vision (ACRV) 32 Jun 23, 2022
slim-python is a package to learn customized scoring systems for decision-making problems.

slim-python is a package to learn customized scoring systems for decision-making problems. These are simple decision aids that let users make yes-no p

Berk Ustun 37 Nov 02, 2022
50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

[Due to the time taken @ uni, work + hell breaking loose in my life, since things have calmed down a bit, will continue commiting!!!] [By the way, I'm

Daniel Han-Chen 1.4k Jan 01, 2023
Python based GBDT implementation

Py-boost: a research tool for exploring GBDTs Modern gradient boosting toolkits are very complex and are written in low-level programming languages. A

Sberbank AI Lab 20 Sep 21, 2022
Python implementation of the rulefit algorithm

RuleFit Implementation of a rule based prediction algorithm based on the rulefit algorithm from Friedman and Popescu (PDF) The algorithm can be used f

Christoph Molnar 326 Jan 02, 2023
A Time Series Library for Apache Spark

Flint: A Time Series Library for Apache Spark The ability to analyze time series data at scale is critical for the success of finance and IoT applicat

Two Sigma 970 Jan 04, 2023
Production Grade Machine Learning Service

This project is made to help you scale from a basic Machine Learning project for research purposes to a production grade Machine Learning web service

Abdullah Zaiter 10 Apr 04, 2022
Binary Classification Problem with Machine Learning

Binary Classification Problem with Machine Learning Solving Approach: 1) Ultimate Goal of the Assignment: This assignment is about solving a binary cl

Dinesh Mali 0 Jan 20, 2022
Covid-polygraph - a set of Machine Learning-driven fact-checking tools

Covid-polygraph, a set of Machine Learning-driven fact-checking tools that aim to address the issue of misleading information related to COVID-19.

1 Apr 22, 2022
Lseng-iseng eksplor Machine Learning dengan menggunakan library Scikit-Learn

Kalo dengar istilah ML, biasanya rada ambigu. Soalnya punya beberapa kepanjangan, seperti Mobile Legend, Makan Lontong, Ma**ng L*v* dan lain-lain. Tapi pada repo ini membahas Machine Learning :)

Alfiyanto Kondolele 1 Apr 06, 2022
Practical Time-Series Analysis, published by Packt

Practical Time-Series Analysis This is the code repository for Practical Time-Series Analysis, published by Packt. It contains all the supporting proj

Packt 325 Dec 23, 2022