The open source extract transaction infomation by using OCR.

Overview

Transaction OCR

Mã nguồn trích xuất thông tin transaction từ file scaned pdf, ở đây tôi lựa chọn tài liệu sao kê công khai của Thuy Tien. Mã nguồn có thể ứng dụng để giải quyết bài toán liên quan đến trích xuất thông tin văn bản từ hình ảnh (OCR - Optical Character Recognition) có cấu trúc nội dung xác định và với độ dài các dòng thông tin (row) bất kì như thông tin giao dịch, hóa đơn mua hàng,... Mã nguồn lựa chọn Cloud Vision API đại diện cho OCR model để có được độ chính xác cao, hoặc bạn có thể sử dụng model có sẵn như Vietocr hoặc có thể tự build custom OCR tiếng Việt từ clovaai: text-detectiontext-recognition) mà tôi cho là khá tốt.

Getting Started

Dependency

git clone https://github.com/hungtooc/transaction_ocr.git

pip install -r requirements.txt

1. Repair data input

1.1 Download raw data

1.2 Convert pdf files to image

PDF password: [email protected]

python tools/pdf-to-images.py --pdf-password [email protected]
usage: pdf-to-images.py [-h] [--pdf-dir PDF_DIR] [--output-dir OUTPUT_DIR] [--pdf-password PDF_PASSWORD] [--from-page-no FROM_PAGE_NO] [--to-page-no TO_PAGE_NO] [--fix-page-number FIX_PAGE_NUMBER]

optional arguments:
  -h, --help            show this help message and exit
  --pdf-dir PDF_DIR     dir to pdf files
  --output-dir OUTPUT_DIR
                        dir to save images
  --pdf-password PDF_PASSWORD
                        pdf password
  --from-page-no FROM_PAGE_NO
                        extra image from page
  --to-page-no TO_PAGE_NO
                        extra image to page
  --fix-page-number FIX_PAGE_NUMBER
                        fix page number (page_no += fix_page_number)

2. Extract transaction information

The source perform the basic steps to extract transaction information, you may want to add additional processing to optimize the source code in lines marked #todo.

python run.py 
usage: run.py [-h] [--image-dir IMAGE_DIR] [--output-respone-dir OUTPUT_RESPONE_DIR] [--output-content-dir OUTPUT_CONTENT_DIR] [--processed-log-file PROCESSED_LOG_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --image-dir IMAGE_DIR
                        dir to images
  --output-respone-dir OUTPUT_RESPONE_DIR
                        dir to save api respone
  --output-content-dir OUTPUT_CONTENT_DIR
                        dir to save transaction content
  --processed-log-file PROCESSED_LOG_FILE
                        path to log file

File run.py perform 7 main stages:

  • Step 1. Find header & footer.
  • Step 2. Re-rotate image based on header-corner.
  • Step 3. Clean image.
  • Step 4. Call request google-ocr api. (include:text-detection & text-recognition
  • Step 5. Detect transaction line.
    processing-step-boder

  • Step 6. Classify transaction content each line & each content type.
    read-transactions-border

  • Step 7. Save transactions content to csv.
TNX Date Doc No Debit Credit Balance Transaction in detail (note)
13/10/2020 5091.55821 100.000 586062.131020.075756.Ung ho mien trung FT20287151644070 page_1
13/10/2020 5091.56080 1.000.000 586279.131020.075829.Ung ho dong bao mien Trung FT20287592192480 page_1
13/10/2020 5091.56138 200.000 219987.131020.075839.Trinh Thi Thu Thuy chuyen tien ung ho mien Trung page_1
13/10/2020 5091.56155 100.000 586295.131020.075826.UH mien trung FT20287432289640 page_1
13/10/2020 5078.68388 500.000 MBVCB.807033343.PHAM THUY TRANG chuyen tien ung ho tu thien.CT tu 0561000606153 PHAM THUY TRANG toi 0181003469746 TRAN THI THUY TIEN page_1
13/10/2020 5091.56261 1.000.000 184997.131020.075853.Em gui giup do ba con vung lu page_1
13/10/2020 5078.68496 200.000 MBVCB.807033583.Ung ho mien trung.CT tu 0051000531310 HUYNH THI NHU Y toi 0181003469746 TRAN THI THUY TIEN page_1
13/10/2020 5078.68526 100.000 MBVCB.807033514.ung ho mien trung.CT tu 0481000903279 NGUYEN THI HUONG AN toi 0181003469746 TRAN THI THUY TIEN page_1
13/10/2020 5091.56381 100.000 479592.131020.075909.ho tro mien trung page_1
13/10/2020 5078.68537 500.000 MBVCB.807034561.Ung ho Mien trung.CT tu 0721000588146 LE THI HONG DIEM toi 0181003469746 TRAN THI THUY TIEN page_1
13/10/2020 5091.56405 200.000 292363.131020.075845.Ngan hang TMCP Ngoai Thuong Viet Nam 0181003469746 LUC NGHIEM LE chuyen khoan ung ho mien trung page_1
13/10/2020 5091.56410 500.000 479627.131020.075913.Ung ho mien trung page_1

3. Export Excel

Export each csv directory to an excel file. Example:

python tools/export-excel.py --csv-dir "data/content/TÀI KHOẢN XXX746 (Pass_ [email protected])/TỪ 13.10.20 ĐẾN 23.11.20/1. TRANG 1 -1000.pdf"
usage: export-excel.py [-h] --csv-dir CSV_DIR [--output-dir OUTPUT_DIR] [--transaction-template TRANSACTION_TEMPLATE] [--filename FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  --csv-dir CSV_DIR     csv dir
  --output-dir OUTPUT_DIR
                        output dir
  --transaction-template TRANSACTION_TEMPLATE
                        dir to save transaction content
  --filename FILENAME   output filename, leave blank to set default

4. Extract dataset

From api responed data, you can extract dataset to train text-recognization model:

 python tools/export-dataset.py 
usage: extract-dataset.py [-h] [--respone-dir RESPONE_DIR] [-a OUTPUT_ANNOTATION] [-i OUTPUT_IMAGE_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --respone-dir RESPONE_DIR
                        dir to api respone
  -a OUTPUT_ANNOTATION, --output-annotation OUTPUT_ANNOTATION
                        path to save annotation file
  -i OUTPUT_IMAGE_DIR, --output-image-dir OUTPUT_IMAGE_DIR
                        path to save annotation file
  • Dataset of first 1000 pages lalebed by google-ocr (~336k): Google Drive
  • Tips: you may want to balance data text type before extract

5. Result

18107 transaction statement pages have been extracted from pdf format: Google Drive - Accuracy >99%.

You might also like...
Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

gosseract OCR Golang OCR package, by using Tesseract C++ library. OCR Server Do you just want OCR server, or see the working example of this package?

Convert PDF/Image to TXT using EasyOcr - the best OCR engine available!
Convert PDF/Image to TXT using EasyOcr - the best OCR engine available!

PDFImage2TXT - DOWNLOAD INSTALLER HERE What can you do with it? Convert scanned PDFs to TXT. Convert scanned Documents to TXT. No coding required!! In

A bot that plays TFT using OCR. Keeps track of bench, board, items, and plays the user defined team comp.
A bot that plays TFT using OCR. Keeps track of bench, board, items, and plays the user defined team comp.

NOTES: To ensure best results, make sure you are running this on a computer that has decent specs. 1920x1080 fullscreen is required in League, game mu

This pyhton script converts a pdf to Image then using tesseract as OCR engine converts Image to Text

Script_Convertir_PDF_IMG_TXT Este script de pyhton convierte un pdf en Imagen luego utilizando tesseract como motor OCR convierte la Imagen a Texto. p

Extract tables from scanned image PDFs using Optical Character Recognition.

ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition. Install Requirements Tesseract OCR sudo apt

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

EasyOCR Ready-to-use OCR with 80+ languages supported including Chinese, Japanese, Korean and Thai. What's new 1 February 2021 - Version 1.2.3 Add set

A Python wrapper for the tesseract-ocr API

tesserocr A simple, Pillow-friendly, wrapper around the tesseract-ocr API for Optical Character Recognition (OCR). tesserocr integrates directly with

FastOCR is a desktop application for OCR API.

FastOCR FastOCR is a desktop application for OCR API. Installation Arch Linux fastocr-git @ AUR Build from AUR or install with your favorite AUR helpe

OCR-D-compliant page segmentation

ocrd_segment This repository aims to provide a number of OCR-D-compliant processors for layout analysis and evaluation. Installation In your virtual e

Owner
Nguyen Xuan Hung
R&D
Nguyen Xuan Hung
Select range and every time the screen changes, OCR is activated.

ASOCR(Auto Screen OCR) Select range and every time you press Space key, OCR is activated. 範囲を選ぶと、あなたがスペースキーを押すたびに、画面が変わる度にOCRが起動します。 usage1: simple OC

1 Feb 13, 2022
Source code of RRPN ---- Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Paper source Arbitrary-Oriented Scene Text Detection via Rotation Proposals https://arxiv.org/abs/1703.01086 News We update RRPN in pytorch 1.0! View

428 Nov 22, 2022
CNN+Attention+Seq2Seq

Attention_OCR CNN+Attention+Seq2Seq The model and its tensor transformation are shown in the figure below It is necessary ch_ train and ch_ test the p

Tsukinousag1 2 Jul 14, 2022
A Vietnamese personal card OCR website built with Django.

Django VietCardOCR Installation Creation of virtual environments is done by executing the command venv: python -m venv venv That will create a new fol

Truong Hoang Thuan 4 Sep 04, 2021
Rotational region detection based on Faster-RCNN.

R2CNN_Faster_RCNN_Tensorflow Abstract This is a tensorflow re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detecti

UCAS-Det 581 Nov 22, 2022
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

CUTIE TensorFlow implementation of the paper "CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor." Xiaohu

Zhao,Xiaohui 147 Dec 20, 2022
基于openpose和图像分类的手语识别项目

手语识别 0、使用到的模型 (1). openpose,作者:CMU-Perceptual-Computing-Lab https://github.com/CMU-Perceptual-Computing-Lab/openpose (2). 图像分类classification,作者:Bubbl

20 Dec 15, 2022
Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

gosseract OCR Golang OCR package, by using Tesseract C++ library. OCR Server Do you just want OCR server, or see the working example of this package?

Hiromu OCHIAI 1.9k Dec 28, 2022
Repository collecting all the submodules for the new PyTorch-based OCR System.

OCRopus3 is being replaced by OCRopus4, which is a rewrite using PyTorch 1.7; release should be soonish. Please check github.com/tmbdev/ocropus for up

NVIDIA Research Projects 138 Dec 09, 2022
Volume Control using OpenCV

Gesture-Volume-Control Volume Control using OpenCV Here i made volume control using Python and OpenCV in which we can control the volume of our laptop

Mudit Sinha 3 Oct 10, 2021
Code release for Hu et al., Learning to Segment Every Thing. in CVPR, 2018.

Learning to Segment Every Thing This repository contains the code for the following paper: R. Hu, P. Dollár, K. He, T. Darrell, R. Girshick, Learning

Ronghang Hu 417 Oct 03, 2022
[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Grounded Situation Recognition with Transformers Paper | Model Checkpoint This is the official PyTorch implementation of Grounded Situation Recognitio

Junhyeong Cho 18 Jul 19, 2022
WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching Code based on our WACV 2022 Accepted Paper: https://arxiv.org/pdf/

Andres 13 Dec 17, 2022
Using computer vision method to recognize and calcutate the features of the architecture.

building-feature-recognition In this repository, we accomplished building feature recognition using traditional/dl-assisted computer vision method. Th

4 Aug 11, 2022
Motion Detection Squid Game with OpenCV Python

*Motion Detection Squid Game with OpenCV Python i am newbie in python. In this project I made a simple game to follow the trend about the red light gr

Nayan 17 Nov 22, 2022
Natural language detection

Detect the language of text. What’s so cool about franc? franc can support more languages(†) than any other library franc is packaged with support for

Titus 3.8k Jan 02, 2023
The first open-source library that detects the font of a text in a image.

Typefont Typefont is an experimental library that detects the font of a text in a image. Usage Import the main function and invoke it like in the foll

Vasile Pește 1.6k Feb 24, 2022
Line based ATR Engine based on OCRopy

OCR Engine based on OCRopy and Kraken using python3. It is designed to both be easy to use from the command line but also be modular to be integrated

948 Dec 23, 2022
The world's simplest facial recognition api for Python and the command line

Face Recognition You can also read a translated version of this file in Chinese 简体中文版 or in Korean 한국어 or in Japanese 日本語. Recognize and manipulate fa

Adam Geitgey 47k Jan 07, 2023
It is a image ocr tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

OCR-Tool It is a image ocr tool made in Python using the Tesseract-OCR engine with the pytesseract package and has a GUI. This is my second ever pytho

Khant Htet Aung 4 Jul 11, 2022