Generic framework for historical document processing

Last update: Dec 24, 2022

Overview

dhSegment

dhSegment is a tool for historical document processing. Its generic approach lets you segment regions and extract content from different types of documents. See some examples here.

The complete description of the system can be found in the corresponding paper.

It was created by Benoit Seguin and Sofia Ares Oliveira at DHLAB, EPFL.

Installation and usage

The installation procedure and examples of usage can be found in the documentation (see section below).

Demo

Try the demo, which trains (optionally) and applies dhSegment to page extraction using the demo.py script.

Documentation

The documentation is available on readthedocs.

If you use this code for your research, you can cite the corresponding paper as:

@inproceedings{oliveiraseguinkaplan2018dhsegment,
  title={dhSegment: A generic deep-learning approach for document segmentation},
  author={Ares Oliveira, Sofia and Seguin, Benoit and Kaplan, Frederic},
  booktitle={Frontiers in Handwriting Recognition (ICFHR), 2018 16th International Conference on},
  pages={7--12},
  year={2018},
  organization={IEEE}
}

Comments

How to use multilabel prediction type?
When I change prediction_type from 'CLASSIFICATION' to 'MULTILABEL', this check fails:

result.shape[1] > 3, "The number of columns should be greater in multi-label framework"

So how do I use multi-label?

Thanks!
opened by duchengyao 32
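
The assertion quoted in this issue suggests that, in MULTILABEL mode, every row of the class file must carry more than the three RGB columns. The sketch below only mirrors that column check. It assumes, without confirmation from this page, that the extra columns are one 0/1 flag per label; `parse_classes` and both sample files are hypothetical and not part of dhSegment.

```python
def parse_classes(text, multilabel=False):
    """Parse whitespace-separated integer rows; the first three columns are R, G, B."""
    rows = [[int(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("class file is empty")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same number of columns")
    if multilabel and n_cols <= 3:
        # Same condition as the quoted assertion: result.shape[1] > 3
        raise ValueError("multilabel rows need more than 3 columns")
    return rows


# Hypothetical class files: background plus two labels.
classification_txt = "0 0 0\n255 0 0\n0 255 0\n"
# Assumed multilabel layout: R G B followed by one flag per label,
# so the last row describes pixels that carry both labels.
multilabel_txt = "0 0 0 0 0\n255 0 0 1 0\n0 255 0 0 1\n255 255 0 1 1\n"

print(len(parse_classes(classification_txt)))
print(len(parse_classes(multilabel_txt, multilabel=True)))
```

Passing the three-column file with `multilabel=True` raises a `ValueError`, which matches the situation the poster describes.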
Can't be installed under Windows
dhSegment is AWESOME and EXACTLY what my wife and I need for our post-cancer #PayItForward Bonus Round activity doing grassroots #CitizenScience #digitalhumanities research in support of eResearch and machine-learning in the domain of digitization of serial publications, primarily modern commercial magazines. We are working on the development of the #MAGAZINEgts ground-truth storage format providing standards-based (#cidocCRM/FRBRoo/PRESSoo) integrated complex document structure and content depiction models.

When a tweet about dhSegment surfaced through my feed, I could barely contain myself... we have detailed, multi-valued metadata -- based on a metamodel of fine-grained use of PRESSoo's Issuing Rules -- that describe the location, bounding box, size, shape, number of colors, products featured, etc. for 7,157 advertisements appearing in the 48 issues of Softalk magazine (https://archive.org/details/softalkapple). It will be trivial for me to generate the annotated label images for all these ads as we have already programmatically extracted the ad sub-images from the full pages once we used our "Ad Ferret" to discover and curate the specification for every ad.

Once we have a dhSegment instance trained on the Softalk ads, there are over 1.5M pages just within the "collection of collections" of computer magazines at the Internet Archive, and many millions more pages of content in magazines of all types over considerable time periods of their serial publication. The #MAGAZINEgts format, together with brilliant technical achievements like dhSegment, can open new levels of scholarship and machine access to digital collections. We believe dhSegment will be a valuable component for our research platform/framework.

With great excitement I chased down and have installed and tested the prerequisite CUDA and cuDNN frameworks/platforms under Windows. I have these features now working at the 9.1 version. (This alone was tricky, but I got it working.)

Unfortunately, the current implementation of the incredibly important dhSegment environment cannot be installed under Windows 10. After the stock Anaconda environment yml file died somewhat dramatically, I then took that file and attempted to search for and install each package individually. (NOTE: I am not a Python expert, so what I report here is subject to refinement by someone who knows better...) Here is what is NOT available under Windows:

# Python packages for dh_segment not available under Windows
- dbus=1.12.2
- fontconfig
- glib=2.53.6
- gmp=6.1.2
- graphite2=1.3.10
- gst-plugins-base
- gstreamer=1.12.4
- harfbuzz=1.7.4
- jasper=1.900.1
- libedit=3.1
- libffi=3.2.1
- libgcc-ng=7.2.0
- libgfortran-ng=7.2.0
- libopus=1.2.1
- libstdcxx-ng=7.2.0
- libvpx=1.6.1
- ncurses=6.0
- ptyprocess=0.5.2
- readline=7.0
- pip:
  - tensorflow-gpu==1.4.1 (I did find and installed 1.8.0 instead)

Anything not on this list made it into my Windows-based Anaconda environment, the yml for which I have included here as a file attachment.

win10_dh_segment.yml.txt

I am so disappointed to not be able to install and use dhSegment under Windows. While a docker image would likely be possible to create, I am skeptical that it would work at the level needed for interfacing with the NVIDIA hardware and its CUDA/cuDNN frameworks, etc. Alternatively, perhaps a cloud-based dev platform would work for us (that is affordable as we are independent and unfunded #CitizenScientists). Your workaround/alternative suggestions are welcome.

At any rate, sorry for the overly long initial issue posting. But I wanted to explain my and my wife's great interest in this important technology as well as provide what I hope is useful feedback with regard to its potential use under Windows. Looking forward, I am very interested in evolving a collaborative relationship with you good folks of DHLAB.

ITMT, I am going to generate the labeled training images. :-)

Happy-Healthy Vibes, FactMiner Jim

P.S. Here is our #DATeCH2017 poster that will further explain the focus of our research.

P.P.S. And here is a screenshot showing a typical metadata "spec" for an ad. The simple integer value for the AdLocation is used in concert with an embedded DSL in the fine-grained Issuing Rules of the Advertising Model. This DSL provides a resolution-independent means to describe and compute the upper-left corner and bounding box of an ad. For example, the four locations of a 1/4 pg sized ad on a page with a 2-column page grid are numbered 1-4, left-to-right top-to-bottom. The proportions of these page segments are based on simple geometric proportional computations.

And finally, the evolving #MAGAZINEgts for the Softalk magazine collection at the Internet Archive is available here: https://archive.org/download/softalkapple/softalkapple_publication.xml
opened by Jim-Salmons 9
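
The grid numbering described in the P.P.S. above (locations 1-4, left-to-right and top-to-bottom on a 2-column grid) can be turned into a bounding box with a few lines of arithmetic. This is only an illustrative sketch: `ad_bounding_box` and the page size are hypothetical, and it does not reproduce the actual #MAGAZINEgts DSL.

```python
def ad_bounding_box(location, columns, rows, page_width, page_height):
    """Return (left, top, width, height) in pixels for a 1-based grid location.

    Locations are numbered left-to-right, then top-to-bottom.
    """
    if not 1 <= location <= columns * rows:
        raise ValueError("location is outside the grid")
    row, col = divmod(location - 1, columns)
    cell_w = page_width / columns
    cell_h = page_height / rows
    return (round(col * cell_w), round(row * cell_h), round(cell_w), round(cell_h))


# A 1/4-page ad on a 2-column grid occupies one cell of a 2 x 2 layout.
# The 1700 x 2200 pixel page size is made up for the example.
print(ad_bounding_box(3, columns=2, rows=2, page_width=1700, page_height=2200))
# (0, 1100, 850, 1100)
```

Because the cell size is derived from the page dimensions, the same location number works at any scan resolution.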
detecting multiple instances of same object

This page shows multiple ornaments extracted from the same page, but my model never detects more than one instance of a similar object.

I am using the same demo.py as in master branch.

Can someone help me?

opened by ankur7721 4
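
One general way to get several instances out of a single binarized prediction mask is to take one box per connected blob instead of one box for the whole mask. The sketch below does this in pure Python with a breadth-first search. It says nothing about what demo.py does internally, and `component_boxes` is a hypothetical helper, not a dhSegment function.

```python
from collections import deque


def component_boxes(mask):
    """Return one (left, top, right, bottom) box, inclusive, per 4-connected blob."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill this blob and track its extent.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0, y0
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(component_boxes(mask))
# [(0, 0, 1, 1), (3, 1, 4, 2)]
```

On real images a library routine for connected components or contours would be much faster; the loop here only shows the idea.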
HOW TO USE IT ON TF SERVING BATCH PREDICTION

I have retrained the model on my own dataset. When I request predictions from TF Serving through a gRPC API call, I cannot pass the images as a batch: it raises a dimensions error. When I pass a single image, I get predictions. Can someone help me use this model for batch prediction when served?

opened by anish9 4
Original Training image with XML labels to extract data from documents

Hi,

I'm working on a page layout analysis and information extractor and I found that dhSegment might work well for this task. However, I don't know whether dhSegment can work with XML-based annotations (TextRegion, SeparatorRegion, TableRegion, ImageRegion, points defining the bounds of each region...) for training, besides the RGB-styled section definitions. I see on the main page of the project that there is a Layout Analysis example under the Use Cases section. That is the case that most closely resembles the one I want to implement. Also, I want to extract text from the detected regions.

How can I do that? Can I still use dhSegment or do I have to implement my own detector?

Thanks.

Regards.

opened by Omua 4
PredictionType.CLASSIFICATION and extracting rectangles
I am attempting CLASSIFICATION now, not MULTILABEL (issue https://github.com/dhlab-epfl/dhSegment/issues/29 was helpful in mentioning that mutually-exclusive areas mean classification, not multilabel. This is clear in retrospect ;^)

Now I need to extract rectangles and I have hit a big gap in dhSegment. The demo.py code shows how to generate the rectangle corresponding to a skewed page, but there is only one class. I modified demo.py to identify rectangles for each label. When there are multiple classes, there can be spurious, overlapping rectangles.

How can I:

Identify the highest confidence class instances

That are not overlapping

The end result I want is one or more jpegs associated with a particular class label plus the coordinates within the input image.

Perhaps the labels plane in the prediction result offers some help here? demo.py does not use the labels plane.
opened by tralfamadude 3
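
The two requirements listed in this issue (highest-confidence instances, no overlaps) match greedy non-maximum suppression applied across all classes. A minimal sketch follows, assuming each candidate rectangle already has a score, for instance the mean class probability inside the box. The `score` field, the sample data, and both helper functions are hypothetical and not part of dhSegment. Boxes are (left, top, right, bottom) with right and bottom exclusive.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def keep_best(candidates, max_iou=0.0):
    """Visit candidates by descending score; keep each one whose IoU with
    every box kept so far is at most max_iou."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(iou(cand["box"], k["box"]) <= max_iou for k in kept):
            kept.append(cand)
    return kept


candidates = [
    {"label": "title", "score": 0.91, "box": (10, 10, 200, 60)},
    {"label": "author", "score": 0.55, "box": (20, 20, 180, 70)},  # overlaps the title box
    {"label": "author", "score": 0.80, "box": (10, 80, 200, 120)},
]
for c in keep_best(candidates):
    print(c["label"], c["box"])
# title (10, 10, 200, 60)
# author (10, 80, 200, 120)
```

With `max_iou=0.0` any overlap at all disqualifies the lower-scoring box; a small positive threshold would tolerate rectangles that merely touch at the edges. Each kept box can then be used to crop the input image and save one JPEG per class instance.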
Feature/table cells

The PAGE-XML functionality has been extended in order to be able to create TableCell elements. Also, an error was fixed which occurred when trying to transform a list to points.

opened by CrazyCrud 3
Need a short guide of layout detection and line detection

Hello, I have a large collection of scans of written text in table forms with a complex layout structure and only the vertical borders printed. My plan is to segment the table rows cell by cell, detect lines inside each cell, and then attempt recognition. I went through the dhSegment demo; it's OK, but I ran into problems with operations. Could you please provide examples of the use cases described in the overview https://dhlab-epfl.github.io/dhSegment/ ? I'm ready to label a training dataset from my collection but cannot get started. Is there a notebook or video guide? One more question is about the READ-BAD dataset that was suggested in a couple of issue discussions. I see the article PDF on arxiv.org but didn't find a link to download the image collection. What did I miss?

opened by longwall 3
A more efficient neural architecture

@solivr @SeguinBe Thank you for your hard work,

Can you merge MobileNet v2 into master and add a demo for using it? Thank you.

Waiting for your reply

opened by mrocr 3
Convert generated VIA binary masks (black and white) into RGB expected format

First, thanks for your work!

I tried to create masks from a VIA project file (doc here). It works, but how do I convert the generated black-and-white masks into RGB masks (with classes.txt)?

I may have missed something but I did not find the code to do it.

Thanks for your help!

opened by loic001 3
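
The conversion asked about here amounts to painting each class's binary mask with that class's color. A minimal sketch in pure Python follows, assuming one black-and-white mask per class and a classes.txt with one "R G B" line per class, background first. That file layout is an assumption, and `masks_to_rgb` and `classes_txt` are hypothetical helpers, not part of dhSegment.

```python
def masks_to_rgb(masks, colors, background=(0, 0, 0)):
    """Combine per-class binary masks into one RGB label image.

    masks  -- list of 2-D lists, one per class; non-zero marks the class
    colors -- list of (R, G, B) tuples, one per mask
    Later masks overwrite earlier ones where they overlap.
    """
    height, width = len(masks[0]), len(masks[0][0])
    image = [[background] * width for _ in range(height)]
    for mask, color in zip(masks, colors):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    image[y][x] = color
    return image


def classes_txt(colors, background=(0, 0, 0)):
    """Build the text of a class file: one 'R G B' line per color, background first."""
    return "\n".join(" ".join(map(str, c)) for c in [background, *colors]) + "\n"


# Two tiny 2 x 3 masks with 0/255 values, standing in for exported masks.
text_mask = [[255, 255, 0], [0, 0, 0]]
image_mask = [[0, 0, 0], [0, 255, 255]]
colors = [(255, 0, 0), (0, 0, 255)]

rgb = masks_to_rgb([text_mask, image_mask], colors)
print(rgb[0])
print(classes_txt(colors), end="")
```

For full-size scans the same logic would normally be vectorized with an array library and the result written out with an image library; the nested loops are kept here so the sketch has no dependencies.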
Model loading/training error

When executing the following command: python train.py with demo/demo_config.json I get this error. FYI I've followed the installation instructions with conda.

InternalError (see above for traceback): cuDNN launch failure : input shape([1,3,1095,538]) filter shape([7,7,3,64]) [[{{node resnet_v1_50/conv1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/resnet_v1_50/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, resnet_v1_50/conv1/weights/read)]]

opened by lvaleriu 3
Suggest to loosen the dependency on sacred

Hi, your project dhSegment (commit id: cca94e94aec52baa9350eaaa60c006d7fde103b7) requires "sacred==0.7.4" in its dependencies. After analyzing the source code, we found that sacred 0.7.3 is also suitable, since the functions that you use from the package directly (1 API: sacred.experiment.Experiment.init) or indirectly (propagating to 19 of sacred's internal APIs and 17 outside APIs) have not changed between these versions, so your usage is not affected.

Therefore, we believe that it is quite safe to loosen your dependency on sacred from "sacred==0.7.4" to "sacred>=0.7.3,<=0.7.4". This will improve the applicability of dhSegment and reduce the possibility of further dependency conflicts with other projects.

May I open a pull request to loosen the dependency on sacred?

By the way, could you please tell us whether such an automatic tool for dependency analysis would be helpful for making dependencies easier to maintain during your development?

opened by Agnes-U 0
Performance issue in the definition of model_fn, dh_segment/estimator_fn.py(P1)

Hello, I found a performance issue in the definition of model_fn in dh_segment/estimator_fn.py: tf.cast(tf.shape(network_output)[1:3] is calculated repeatedly during program execution, which reduces efficiency. I think it should be created before the loop.

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

opened by DLPerf 0
Tensorflow 2.4 (request for permission to upgrade this repo to this)

Hi!

I have locally upgraded this repo to Tensorflow 2.4.1. I thought it might be helpful if I shared this code with you. If you would like I can create a pull request with this update for the repo, I would just need permissions to do so. Let me know!

Toby

opened by tobyDickinson 3
Reproducing baseline detection results

Hello,

I'm trying to reproduce the baseline detection results in your paper. What was the training/validation split used? Also, is it the case that demo/demo_cbad_config.json is the same configuration used to achieve your results? Thank you!

opened by jason-vega 0
Speed of Inference on GeForce GTX 1080

My testing based on a variation of demo.py for classification of 7 labels/classes is showing choppy performance on a GPU. Excluding python post-processing and ignoring the first two inferences, I see processing durations like 0.09, 0.089, 0.56, 0.56, 0.079, 0.39, 0.09 ... ; average over 19 images is 0.19sec per image.

I'm surprised by the variance.

At 5/sec it is workable, but could be better. Would tensorflow-serving help by getting python out of the loop? I need to process 1M images per day.

(The GPU is GeForce GTX 1080 and is using 10.8GB of 11GB RAM, only one TF session is used for multiple inferences.)

opened by tralfamadude 1
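
The figures in this issue allow a quick capacity estimate. The sketch below uses only the two numbers stated in the post (0.19 s average per image, 1M images per day). It assumes images are processed one at a time and that every additional worker runs at the same average speed; both are assumptions, not measurements.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

mean_latency_s = 0.19        # average per image reported in the post
target_per_day = 1_000_000   # requirement stated in the post

required_rate = target_per_day / SECONDS_PER_DAY   # images per second needed
measured_rate = 1 / mean_latency_s                 # images per second observed
per_day_single = measured_rate * SECONDS_PER_DAY   # daily throughput of one worker
workers = math.ceil(required_rate / measured_rate)

print(f"required: {required_rate:.2f} images/s")     # required: 11.57 images/s
print(f"measured: {measured_rate:.2f} images/s")     # measured: 5.26 images/s
print(f"one worker: {per_day_single:,.0f} per day")  # one worker: 454,737 per day
print(f"workers needed: {workers}")                  # workers needed: 3
```

Under these assumptions a single GPU at the reported average covers a little under half of the daily target, so either the per-image time has to drop by more than half or the work has to be spread over about three workers.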
Multilabel limitation should be documented

Only 7 labels are supported and this is not documented. Since effort can be expended to prepare training data, finding out this limitation when running train.py is wasteful.

opened by tralfamadude 1

Releases(v0.2)

v0.2(Apr 3, 2018)

Source code(tar.gz)
Source code(zip)
model.zip(116.78 MB)
pages.zip(245.20 MB)
resnet_v1_50.ckpt(97.75 MB)

Owner

Digital Humanities Laboratory

GitHub Repository https://dhlab-epfl.github.com/dhSegment

Fine tuning keras-ocr python package with custom synthetic dataset from scratch

OCR-Pipeline-with-Keras The keras-ocr package generally consists of two parts: a Detector and a Recognizer: Detector is responsible for creating bound

1 Jan 05, 2022

🔎 Like Chardet. 🚀 Package for encoding & language detection. Charset detection.

Charset Detection, for Everyone 👋 The Real First Universal Charset Detector A library that helps you read text from an unknown charset encoding. Moti

332 Dec 31, 2022

Recognizing cropped text in natural images.

ASTER: Attentional Scene Text Recognizer with Flexible Rectification ASTER is an accurate scene text recognizer with flexible rectification mechanism.

681 Jan 02, 2023

A toolbox of scene text detection and recognition

FudanOCR This toolbox contains the implementations of the following papers: Scene Text Telescope: Text-Focused Scene Image Super-Resolution [Chen et a

170 Dec 26, 2022

AdvancedEAST is an algorithm used for Scene image text detect, which is primarily based on EAST, and the significant improvement was also made, which make long text predictions more accurate. https://github.com/huoyijie/raspberrypi-car

AdvancedEAST AdvancedEAST is an algorithm used for Scene image text detect, which is primarily based on EAST:An Efficient and Accurate Scene Text Dete

1.2k Dec 29, 2022

A buffered and threaded wrapper for the OpenCV VideoCapture object. Can speed up video decoding significantly. Supports

A buffered and threaded wrapper for the OpenCV VideoCapture object. Can speed up video decoding significantly. Supports "with"-syntax.

0 Oct 30, 2021

A Python program using the Tkinter graphics library to randomize questions and answers contained in text files

RaffleOfQuestions A simple Python program that uses the Tkinter graphics library to randomize questions and answers contained in files of

1 Dec 16, 2021

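An open-source RPA tool based on image recognition; in theory it can automate all Windows software and web pages

SimpleRPA An open-source RPA tool based on image recognition; in theory it can automate all Windows software and web pages. Introduction: SimpleRPA is an open-source RPA tool (a desktop automation tool) written in Python. By configuring YAML files, users can automate desktop software and simplify tedious, repetitive work, for example operations staff sending messages to users,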

7 Jun 26, 2022

Awesome multilingual OCR toolkits based on PaddlePaddle （practical ultra lightweight OCR system, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices）

English | 简体中文 Introduction PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and a

27.5k Jan 08, 2023

A Python wrapper for Google Tesseract

Python Tesseract Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and "read" the text embedded i

4.6k Jan 06, 2023

A curated list of awesome synthetic data for text location and recognition

awesome-SynthText A curated list of awesome synthetic data for text location and recognition and OCR datasets. Text location SynthText SynthText_Chine

283 Jan 05, 2023

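[Auto] Genshin Impact ⭐ fishing assistant | Automatic reeling and cursor calibration | ✨ You only need to cast the rod, and we will do the rest ✨

Genshin Impact fishing assistant ✨ The author is working hard on refactoring the code and will bring everyone a better script as soon as possible ✨ "You only need to cast the rod, and we will take care of everything." If you find this script useful, please give it a Star ⭐; your Star is the author's biggest motivation to keep updating. Click here to watch the demo video ✨ You are welcome to share your configuration files in Issues ✨ ✨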

261 Jan 02, 2023

In this application, we detect faces from the computer's camera using Python and OpenCV.

opencv_yuz_bulma In this application, we detect faces from the computer's camera using Python and OpenCV. To use the computer's own camera;

6 Apr 16, 2022

Repository collecting all the submodules for the new PyTorch-based OCR System.

OCRopus3 is being replaced by OCRopus4, which is a rewrite using PyTorch 1.7; release should be soonish. Please check github.com/tmbdev/ocropus for up

138 Dec 09, 2022

WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching Code based on our WACV 2022 Accepted Paper: https://arxiv.org/pdf/

13 Dec 17, 2022

Text modding tools for FF7R (Final Fantasy VII Remake)

FF7R_text_mod_tools Subtitle modding tools for FF7R (Final Fantasy VII Remake) There are 3 tools I made. make_dualsub_mod.exe: Merges (or swaps) subti

10 Dec 19, 2022

Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

3.2k Dec 31, 2022

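Use Youdao OCR API to convert your clipboard image to text.

Alfred Clipboard OCR Note: this repository is a Python rewrite based on the logic of oott123/alfred-clipboard-ocr. It switches to the Youdao AI API, which has higher accuracy and effectively avoids problems such as privacy leaks caused by Baidu. In addition, the 50 yuan trial credit that Youdao AI initially provides is, given its pricing, essentially enough for individual users to use it permanently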

6 Sep 19, 2022

Generic framework for historical document processing

dhSegment dhSegment is a tool for Historical Document Processing. Its generic approach allows to segment regions and extract content from different ty

343 Dec 24, 2022

PyQT5 app that colorizes black & white pictures using a CNN (uses a pre-trained model which was made with OpenCV)

About PyQT5 app that colorizes black & white pictures using a CNN (uses a pre-trained model which was made with OpenCV) Colorizor An application for the project Yand

1 Apr 04, 2022
