Text language identification using Wikipedia data

The aim of this project is to provide high-quality language detection for all the languages used on the web, with Wikipedia serving as the proxy for that set. Currently, 156 languages that have their own Wikipedia are supported.

Usage

The main function is text-langs, which returns two values:

an alist mapping each language to its probability (languages are represented by their ISO 639-1 codes)
a vector of tokens, each with its inferred language

WILD> (text-langs "це тест")
((:UK . 0.5000003) (:RU . 0.4999998))
#(<це - UK:1.00> <тест - RU:1.00>)

Running as a service

Installation

Install SBCL
Get Quicklisp
Clone this project with Git
$ cd wiki-lang-detect; sbcl --load run.lisp

Running in Docker

docker build -t wiki-lang-detect:latest .
docker run -it -p 5000:5000 wiki-lang-detect:latest

curl -X POST -H "Content-Type: application/json" -d '{"text": "Несе Галя"}' http://localhost:5000/detect | jq '.'
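
For callers that would rather not shell out to curl, the same request can be sketched in Python using only the standard library. This is an illustrative client, not part of the project: the helper names build_detect_request and detect are hypothetical, the URL assumes the port mapping from the docker run command above, and the response is returned as parsed JSON without assuming anything about its fields.

```python
import json
import urllib.request

# Assumption: the container started above is reachable on localhost:5000.
DETECT_URL = "http://localhost:5000/detect"


def build_detect_request(text, url=DETECT_URL):
    """Build a POST request carrying {"text": ...} as UTF-8 encoded JSON."""
    # json.dumps always emits double-quoted keys and strings, as JSON requires.
    # ensure_ascii=False keeps non-Latin text readable in the request body.
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text, url=DETECT_URL, timeout=10):
    """Send the text to the service and return the decoded JSON response as-is."""
    request = build_detect_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, calling detect("Несе Галя") should return whatever the /detect endpoint responds with. Building the request separately from sending it makes the payload easy to inspect without a live server.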

Alternatively, you can use a prebuilt Docker image that is maintained outside of this repository:

docker run -it -p 5000:5000 chaliy/wiki-lang-detect:latest

API

See the Swagger definition.

Owner

Vsevolod Dyomkin
