Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

Last update: Dec 20, 2022

Overview

ocr-fileformat

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Installation
- Docker
- System-wide
Usage
- CLI
- GUI
- API
Transformation
Validation
License

Installation

Docker

You can run the command-line scripts and the web interface in a Docker container; the only requirement is an installed Docker.

To start the web interface on http://localhost:8080:

docker run --rm -it -p 8080:8080 ubma/ocr-fileformat

To run the command line scripts, mount the directory containing your input files into the container's /data directory:

docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto

System-wide

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"

The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-fileformat

In this example the GUI would be available under http://localhost/ocr-fileformat/.

Usage

The project offers two functions, transformation and validation, which can be accessed via command-line scripts (CLI), via a web interface (GUI), or from your own tools (API).

CLI

ocr-transform: Transformation of OCR output between OCR formats
ocr-validate: Validation of OCR output against OCR format schemas

GUI

The web interface is for testing validation and transformations. You can upload a file or select an input file by URL.

API

$PREFIX/share/ocr-fileformat/xslt - XSLT stylesheets
$PREFIX/share/ocr-fileformat/xsd - XSD schemas
$PREFIX/share/ocr-fileformat/script/transform - Transformation scripts
$PREFIX/share/ocr-fileformat/script/validate - Validation scripts

Transformation

Transformation CLI

Usage: ocr-transform [-dhLv] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML file to an hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from ALTO XML (version 2.1) to hOCR with:

ocr-transform alto2.1 hocr sample.alto sample.hocr

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
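
To convert a whole directory, the single-file call can be wrapped in a loop. The following is only a sketch under a few assumptions: the function name convert_dir, the OCR_TRANSFORM override variable, and the .alto/.hocr file extensions are made up for illustration. Because one of the reports on this page shows a transformation exiting with return value 0 without writing a file, the sketch also checks that the output file is non-empty instead of relying on the exit status alone.

```shell
#!/bin/sh
# Hypothetical override: point this at a wrapper (e.g. a Docker call) if needed.
OCR_TRANSFORM="${OCR_TRANSFORM:-ocr-transform}"

# convert_dir <input-dir> <output-dir>
# Converts every *.alto file (ALTO 2.1 assumed) in <input-dir> to hOCR.
convert_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    failed=0
    for src in "$in_dir"/*.alto; do
        [ -e "$src" ] || continue      # the glob matched nothing
        dst="$out_dir/$(basename "$src" .alto).hocr"
        # Argument order follows the usage line:
        # <input-fmt> <output-fmt> <input> <output>
        if ! { "$OCR_TRANSFORM" alto2.1 hocr "$src" "$dst" && [ -s "$dst" ]; }; then
            echo "failed: $src" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file was converted.
    [ "$failed" -eq 0 ]
}
```

A call such as convert_dir ./alto ./hocr would then leave one .hocr file per input file in ./hocr and return a non-zero status if any conversion failed.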

Try ocr-transform -h to get an overview:

Usage: ocr-transform [-dhLv] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

    Options:
        --help    -h     Show this help
        --version -v     Show version
        --debug   -d     Increase debug level by 1, can be repeated
        --list    -L     List transformations

    Transformations:
        abbyy hocr
        abbyy page
        alto2.0 alto3.0
        alto2.0 alto3.1
        alto2.0 hocr
        alto2.1 alto3.0
        alto2.1 alto3.1
        alto2.1 hocr
        alto page
        alto text
        gcv hocr
        gcv page
        hocr alto2.0
        hocr alto2.1
        hocr page
        hocr text
        page alto
        page hocr
        page page2019
        page text
        tei hocr

    Saxon options:
        Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
        Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y
        Use -XYZ:? for details of option XYZ
        Params:
          param=value           Set stylesheet string parameter
          +param=filename       Set stylesheet document parameter
          ?param=expression     Set stylesheet parameter using XPath
          !param=value          Set serialization parameter

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be used directly in your scripts and software. You will need to use an XSLT 2.0 capable stylesheet transformer.
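
As a rough illustration of direct use, a stylesheet can be applied with the Saxon command line. This is a sketch with assumptions: the <from>__<to>.xsl file naming is inferred from the symlinks created during installation, SAXON_JAR is a hypothetical variable that must point at a Saxon HE jar on your system, and the helper names are made up. Some transformations are implemented as scripts rather than stylesheets, so not every format pair will have an .xsl file.

```shell
#!/bin/sh
PREFIX="${PREFIX:-/usr/local}"
# Hypothetical location; adjust to wherever your Saxon HE jar lives.
SAXON_JAR="${SAXON_JAR:-/path/to/saxon-he.jar}"

# xslt_path <from> <to>: print the assumed path of the stylesheet.
xslt_path() {
    printf '%s/share/ocr-fileformat/xslt/%s__%s.xsl\n' "$PREFIX" "$1" "$2"
}

# apply_xslt <from> <to> <input> <output>
apply_xslt() {
    xsl=$(xslt_path "$1" "$2")
    if [ ! -f "$xsl" ]; then
        echo "no stylesheet: $xsl" >&2
        return 1
    fi
    # -s: source document, -xsl: stylesheet, -o: output file
    java -jar "$SAXON_JAR" -s:"$3" -xsl:"$xsl" -o:"$4"
}
```

For example, apply_xslt hocr alto2.0 sample.hocr sample.xml would look for hocr__alto2.0.xsl below $PREFIX and run it on sample.hocr.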

Supported Transformations

From ╲ To	hOCR	ALTO	PAGEXML
hOCR	=	✓	✓
ALTO	✓	=	✓
PAGEXML	✓	✓	=
FineReader	✓	-	✓
Google Cloud Vision	✓	-	✓
TEI	✓	-	-

Validation

Usage: ocr-validate [-dhL] <schema> <file> []

    Options:
        --help    -h     Show this help
        --version -v     Show version
        --debug   -d     Increase debug level by 1, can be repeated
        --list    -L     List available schemas

    Schemas:
        hocr
        alto-1-0 alto-1-1 alto-1-2 alto-1-3 alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1
        abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1
        page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15

Validation CLI

For example, to validate an XML file against the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto

Validation GUI

Select the Validate menu option. Choose a URL and a schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd
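
The installed schemas can also be used with a generic XSD validator. The sketch below uses xmllint from libxml2, which is not the validator ocr-validate itself ships with. The <schema>.xsd file naming and the helper names are assumptions, and schemas that import other schemas by URL may need network access.

```shell
#!/bin/sh
PREFIX="${PREFIX:-/usr/local}"

# xsd_path <schema>: print the assumed path of the schema file.
xsd_path() {
    printf '%s/share/ocr-fileformat/xsd/%s.xsd\n' "$PREFIX" "$1"
}

# validate_with_xmllint <schema> <file>
# Returns 2 if the schema file is missing, otherwise xmllint's exit status.
validate_with_xmllint() {
    xsd=$(xsd_path "$1")
    if [ ! -f "$xsd" ]; then
        echo "no schema: $xsd" >&2
        return 2
    fi
    # --noout suppresses the document dump; only the verdict is reported.
    xmllint --noout --schema "$xsd" "$2"
}
```

For example, validate_with_xmllint alto-3-1 myFile.alto.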

Supported Validation Formats

	hOCR	ALTO	PAGEXML	FineReader	Google Cloud Vision
Validation	✓	✓	✓	✓	-

License

This is free software. You may use it under the terms of the MIT License.

During the installation process several projects are included (in ./vendor). These projects have different licenses:

Saxon HE 9.7, MPL.
ALTOXML schema, "Open Source" for ALTO <= 3.1, CC BY SA 4.0 since ALTO 4.0
PAGE schemas, ?
xsd-validator by Adrian Mouat @amouat, Apache 2.0
ABBYY FineReader XSD, ?
hOCR-to-ALTO by Filip Kriz @filak, MIT
hocr-spec by Konstantin Baierer @kba, MIT
gcv2hocr by Endo Michiaki, CC BY 4.0
format-converters by OCR-D, Apache 2.0
prima-page-converter by PRImA Research Lab, Apache 2.0

Comments

Converting hOCR to Alto

Hi, first thanks for making this tool.

I have questions using the GUI to convert hOCR to Alto XML.

My hOCR file looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
  <head>
    <title>None</title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <meta name='ocr-system' content='gcv2hocr.py' />
    <meta name='ocr-langs' content='unknown' />
    <meta name='ocr-number-of-pages' content='1' />
    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_line ocrx_word ocrp_lang'/>
  </head>
  <body>
    <div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'>
        <div class='ocr_carea' lang='unknown' title='bbox 176 121 1420 2068'>
            <span class='ocr_line' id='line_0' title='bbox 678 121 747 168; baseline 0 -5'>
                <span class='ocrx_word' id='word_0_0' title='bbox 678 121 747 168'>2T</span>
            </span>
            <span class='ocr_line' id='line_1' title='bbox 383 184 572 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_1_0' title='bbox 383 184 572 218'>Especially</span>
            </span>
            <span class='ocr_line' id='line_2' title='bbox 583 184 697 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_2_0' title='bbox 583 184 697 218'>during</span>
            </span>
            <span class='ocr_line' id='line_3' title='bbox 722 188 775 215; baseline 0 -5'>
                <span class='ocrx_word' id='word_3_0' title='bbox 722 188 775 215'>the</span>
            </span>
            <span class='ocr_line' id='line_4' title='bbox 796 186 888 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_4_0' title='bbox 796 186 888 218'>years</span>
            </span>
            <span class='ocr_line' id='line_5' title='bbox 904 184 977 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_5_0' title='bbox 904 184 977 218'>1933</span>
            </span>
            <span class='ocr_line' id='line_6' title='bbox 1040 187 1110 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_6_0' title='bbox 1040 187 1110 218'>1938</span>
            </span>

But the ALTO output from the GUI gives me two xml files, which look like this:

<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
   <Description>
      <MeasurementUnit>pixel</MeasurementUnit>
      <sourceImageInformation>
         <fileName/>
      </sourceImageInformation>
      <OCRProcessing ID="IdOcr">
         <ocrProcessingStep>
            <processingSoftware>
               <softwareName>gcv2hocr.py</softwareName>
               <softwareVersion>gcv2hocr.py</softwareVersion>
            </processingSoftware>
         </ocrProcessingStep>
      </OCRProcessing>
   </Description>
   <Layout>
      <Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
         <PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
            <ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"/>
         </PrintSpace>
      </Page>
   </Layout>
</alto>

and

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">None2TEspeciallyduringtheyears19331938theGermanun-employmentwasfullyremoved.LikemanyothershealsothoughtthatNationlasocialismvouldcauseaneconomicrisejoiningtheSAinApril1937Inforeigncountriestoo,Nationalsocialismwasnotrecognizedinitslasterfectsinthosedays.Imayremindyouofthefactthate.g.LordRothermeredevotedaspecialcopyofthe"DailyMailtotheNSDAPandaman1iaeMrWinstonChurchillwritesinhisreminiscences:"AtthattimeIhadnonationalprejudicesagainstHitler.Iknewbut1ittleofhisopinionoflifeandpastandhisoharacter.TomymindHitlerwasrighttobeaGerman1ovinghiscountry"Nodoubt,thatevenmoresuchorsimilarutterancesofstatesmenareknown.Atthattimemyhusbandcouldnotforeseethatbyhisjoininghewouldpromoteorsupportacriminalaffair.In1937hewasbusyasanassistantfortheknow-ledgeofkinsattheAnthropologicInstituteoftheUnivezaityofVienna.InSept.1937hepassedtothegeneralSS,becausehecouldbebusyasanivestigatorofkins.WaenAustriauasannexed,hecouldjointheGermanPolice.Afberyearsoftroublesanddistressnowhegotasafepoşitionasanofficial.Whenhewascalledouttothefrontier-guard(controlofpassports)onApril1st,1938hismembershiptothegeneralSSwasextinguished.HislatertransfertotheSDandtotheWafen-SS"wasnotvoluntary.DhusmyhusbanddoesnotbelongtotheciroleofthosemembersoftheSSwhomustbecosideredasCriminalsaccordingtothejudgementsofuremberg,becauseonlythosecounttothemwhoweremembersofthe3SstillfterSept.1st,1939.ThelatercompulsoryassimilationofranksintheSDandthe"Waffen-s"isotconsideredasamembershipothe3Saspertherulingpracticeofall"SpruchkammerInthecourseofageneraltraining-planinin1944myhusbandcametotheKRIPOforthreemonthstobeemployedthereforinformetionpurposes.ThenBourmonthsfollowedat.theSIAPOtobetrained1ateroninother1inesotheGeImanPolice.AstherewasalackofmenattheSTAPO,theycausedthepro-longationofhiscommendandinFebr.1945histransfertotheSTAPO.MyhusbandhasseveraltimestriedtoleavetheSTAFOandf1nallyappliedforbeingemployedasavoluateeratthef
ront.A1lhisapplicationswererefused.FurthertrialsWouldbeperhapspunishedasadenialofobedienceoradecompo-sitionof,themilitgry.ref.3)InFebr.andMarcha945asamemberoftheArmedForoesofthethenGermanymyhusbandshotdownanalliedterror-flyereachi.e.anenemyeirforce-manwhohadfiredabwomenandchildrenatBensheim/Germanyinalowflight,andthisonaccouatofadirectmilitaryandthereforebindingorderofhisdirectsuperior.Hewasorderedtodosobytheleaderofhisunit,SS-SourmbannführerandcouscillortothegovernmentGIRKEorbyhesdeputySS-sturmbannführerandcouncillortotheKRIPOHELLENBROICHresp.InFébr.1945Girkeaskedbyphonethecom-petentCommanderoftheSIPOSS-OberführerTRUMMLER,whethertheorderissuedfromBerlinbesti1lvalidbywhichterror-flyersweretobelki1led.TrummleransweredintheaffirmativeandP.t.o.</alto>

I've not worked with ALTO formats before, but I'm thinking it shouldn't look like this? Please let me know what you think, any help would be greatly appreciated!

opened by asor12 21

Release v0.2.0?

I think we should create a new release. I started to draft one on GitHub, see https://github.com/UB-Mannheim/ocr-fileformat/releases . However, I am not sure what has to be done with the release option in the Makefile. Is it enough to increase the version counter? Do you agree that we are now at v0.2.0?

opened by zuphilip 13

Fix conversion from ALTO to PAGE and vice versa

- Fix order of arguments passed
- Remove shell debugging (-x)
- Handle input from STDIN
- Add -convert-to ALTO argument needed for conversion from PAGE to ALTO

opened by stweil 11

Support for google cloud vision 2 hocr by @dinosauria123

Works, but ideally:

- [x] use upstream repo
- [x] delete temporary files
- [ ] fall back to max x/y if width height unspecified
- [ ] maybe port to more flexible language, e.g. python

opened by kba 11

Integrate PRIMA Labs PageConverter

Integrates https://github.com/PRImA-Research-Lab/prima-page-converter. Currently supports ALTO -> PAGE conversion but could be extended (also accepts Google Cloud Vision, hocr, older PAGE versions and FRXML).

@wrznr @maxnth @chreul

opened by kba 10

installation problem under macOS 10.13.6

Thanks for the great tool.

Right now when I run sudo make install I get the following output:

(base) MacBook-Pro:ocr-fileformat$ sudo make install
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C vendor check
# download the dependencies
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C vendor all
mkdir -p xsd
# copy Alto XSD
cd xsd && ln -sf ../vendor/alto-schema/*/*.xsd . && \
		for xsd in *.xsd;do \
			target_xsd=`echo $xsd|sed 's/.//g'|sed 's/-/./'`; \
			if [ ! -e $target_xsd ];then \
				mv -f $xsd $target_xsd; \
			fi; done
# copy PAGE XSD
# copy ABBYY XSD
cd xsd && ln -sf ../vendor/abbyy-schema/*.xsd .
mkdir -p xslt
# symlink hocr<->alto as well as the language codes lookup xml
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/hocr2alto2.0.xsl hocr__alto2.0.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/hocr2alto2.1.xsl hocr__alto2.1.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/alto2hocr.xsl alto2.0__hocr.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/alto2hocr.xsl alto2.1__hocr.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/hocr2text.xsl hocr__text.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/alto2text.xsl alto__text.xsl
cd xslt && ln -sf ../vendor/hOCR-to-ALTO/codes_lookup.xml codes_lookup.xml
cd xslt && ln -sf ../vendor/format-converters/page2hocr.xsl page__hocr.xsl
cd xslt && ln -sf alto2.0__alto3.0.xsl alto2.0__alto3.1.xsl
cd xslt && ln -sf alto2.0__alto3.0.xsl alto2.1__alto3.0.xsl
cd xslt && ln -sf alto2.0__alto3.0.xsl alto2.1__alto3.1.xsl
mkdir -p /usr/local/share/ocr-fileformat
cp -r script xsd xslt vendor lib.sh /usr/local/share/ocr-fileformat
mkdir -p /usr/local/bin
sed '/^SHAREDIR=/c SHAREDIR="/usr/local/share/ocr-fileformat"' bin/ocr-transform.sh > /usr/local/bin/ocr-transform
sed: 1: "/^SHAREDIR=/c SHAREDIR= ...": command c expects \ followed by text
make: *** [install] Error 1

The Docker image runs fine however.

What am I doing wrong?

Thanks again

opened by jtlz2 9
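
The failing step in the log above is the sed c command. BSD sed, as shipped with macOS, expects a backslash and a newline before the replacement text, while GNU sed also accepts the one-line form used here. One portable alternative is a plain substitution, sketched below. The function name is made up, this is not necessarily how the project resolved the issue, and the sketch assumes the directory path contains no |, & or backslash characters.

```shell
#!/bin/sh
# set_sharedir <sharedir> <script>
# Print <script> with its SHAREDIR= line rewritten; works with BSD and GNU sed.
set_sharedir() {
    sed "s|^SHAREDIR=.*|SHAREDIR=\"$1\"|" "$2"
}
```

The install step would then read: set_sharedir /usr/local/share/ocr-fileformat bin/ocr-transform.sh > /usr/local/bin/ocr-transform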

Convert Google Cloud Vision OCR output to hocr.

I have a question.

I am trying to use the Google Cloud Vision API for OCR.

https://cloud.google.com/vision/

The OCR output includes the positions of the recognized text.

I want to convert the Google OCR output to hOCR format. Do you have any ideas?

I have already raised this subject elsewhere. Please see our previous discussion:

https://github.com/tmbdev/hocr-tools/issues/26

opened by dinosauria123 9

New Saxon version 10.2 is out

We can update to the new Saxon version 9.9.1.7, which has been out for a few days:

https://sourceforge.net/projects/saxon/files/Saxon-HE/9.9/

https://sourceforge.net/projects/saxon/files/Saxon-HE/9.9/SaxonHE9-9-1-7J.zip/download

In principle this only requires a commit similar to https://github.com/UB-Mannheim/ocr-fileformat/commit/4faff379843f4923960cbba6cbbd0a741fb4ffe6 but the result should also be tested.

opened by zuphilip 8

Proxy support

When an HTTP proxy is needed, conversion from PAGE to ALTO fails:

# ocrd-fileformat-transform -I OCR-D-GT-PAGE -O ALTO
14:36:13.086 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-GT-PAGE_00000024 (PHYS_0024)
java.net.ConnectException: Connection timed out (Connection timed out)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
        at java.base/java.net.Socket.connect(Socket.java:609)
        at java.base/java.net.Socket.connect(Socket.java:558)
        at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
        at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
        at java.base/java.net.URL.openStream(URL.java:1140)
        at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:53)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
Could not initialise ALTO XML writer
java.lang.NullPointerException
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
14:38:23.306 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.

Unfortunately, with the network setup here, this also means a long wait for the connection error, because packets are simply dropped...

The preferred solution for me would be for ocr-fileformat to parse the somewhat standard http_proxy environment variable and pass the correct parameters to java:

java -Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128 [...other parameters...]

opened by mikegerber 7
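
The parsing proposed above could look roughly like the following. This is a sketch, not the project's implementation: the function name java_proxy_opts is made up, IPv6 literals are not handled, and the fallback port 80 is an assumption. The properties http.proxyHost and http.proxyPort are the standard Java networking properties for plain HTTP; HTTPS requests would need the corresponding https.* properties.

```shell
#!/bin/sh
# java_proxy_opts [<proxy-url>]
# Print Java proxy options derived from the argument or from $http_proxy.
java_proxy_opts() {
    proxy=${1:-${http_proxy:-}}
    [ -n "$proxy" ] || return 0        # no proxy configured: print nothing
    hostport=${proxy#*://}             # drop the scheme, if any
    hostport=${hostport%%/*}           # drop a trailing path
    hostport=${hostport##*@}           # drop user:password@, if present
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=80 ;;                # assumed default when no port is given
    esac
    printf '%s %s\n' "-Dhttp.proxyHost=$host" "-Dhttp.proxyPort=$port"
}
```

A caller would leave the result unquoted so that it splits into two arguments, e.g. java $(java_proxy_opts) followed by the other parameters.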

alto to text: too many spaces

Example alto excerpt:

<TextLine><String CONTENT="Wappen:"/><SP/><String CONTENT="Heimstatt;"/><SP/><String CONTENT="Heimstatt,">... ...

converts to text

Wappen:␣␣Heimstatt;␣␣Heimstatt,␣␣Neipperg,␣␣Gemmingen ... ...

opened by jbarth-ubhd 7
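
Until the stylesheet itself is adjusted, the doubled spaces can be squeezed after the conversion. This is only a workaround sketch with a made-up function name, and it also collapses any runs of spaces that were intended in the source text.

```shell
#!/bin/sh
# squeeze_spaces: read text on stdin and collapse each run of spaces into one.
squeeze_spaces() {
    tr -s ' '
}
```

For example: ocr-transform alto text page.xml | squeeze_spaces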

:arrow_up: Upgrade to new version of hOCR-to-ALTO

This solves #95 and #81. No special features of ALTO 3.0 or ALTO 4.0 are considered in the transformations, but that would be something for upstream anyway.

opened by zuphilip 6

Feature request: Page concatenation during conversion

Transkribus (https://readcoop.eu/transkribus/?sc=Transkribus), which just reached 100 000 users, exports PAGE and ALTO as a single file for every page, and the actual page numbers are not stored in the files. In my workflow ALTO -> hOCR -> dsed I have to edit the page numbers in *.dsed files before using them as valid djvused input (to use the transcription as the hidden text layer in a DjVu document). It would be nice to solve the problem in a general and elegant way.

opened by jsbien 0

[feature request] Support MacOS

The current bash scripts contain code which does not work on MacOS out of the box (incompatible usage of sed, associative arrays, maybe more). Users are forced to install newer versions of bash and sed (which might be undesired) to run it.

Perhaps all bash scripts should be replaced by Python3 scripts. python3 is already used in the code, and using it everywhere might even simplify the code. At least it would be portable. It would even be possible to provide ocr-fileformat in the Python Package Index PyPI.
enhancement

opened by stweil 0

page__text.xsl is not honoring the reading order

page__text.xsl is not honoring the reading order in the PAGE-XML (pc:ReadingOrder), which gives completely false results. For this page, I get this text (shortened):

% docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform page text OCR-D-GT-PAGE/OCR-D-GT-PAGE_00000024.xml | head
               
20
Die
[22.]
[22.]
ein gleies vorgegeben, und ſo gar ſehr viele mahle gegen ae menſlie Mglikeit mit Gewalt for-
ciret worden zu ſeyn, behaupten wi, mithin neb dem Bredekaw, weler (§. 28. 29.)  in aen ſeinen
Auagen wiederſproen, mit der Pœna fal um do gewier zu belegen i, da
ſecund. Farin. Tit. 9. qu. 66. p. m. 320.
die Klage ſo wohl als das Zeugnß vor falſ und erditet mßen gehalten werden.
§. 35) So viel die von der Inquitin
write /dev/stdout: broken pipe

For comparison, dinglehopper-extract gives the correct text:

% dinglehopper-extract OCR-D-GT-PAGE/OCR-D-GT-PAGE_00000024.xml| head
20
rath mit einer Pœna fiſcali angeſehen worden, und ſolche durch des Hrn. Graffen von Königsfeld Vor-
ſpruch, nur aus Gnaden nachgelaſſen erhalten.
Sondern man hat auch dieſen 4. Wochen lang alle Abend bey der Inquiſitin gantz allein gelaſſen.
Binnen welcher gantzer Zeit der Schreiber Bredekaw beſtändig bey Ihme geweſen, und ſich in
der am 13 ten Octobr. a.c. in Judicio gegen ſeinen geweſenen Hrn. introducirter Appellation deſſen Bey-
raths bedienet hat;
§. 33) Dabenebenſt iſt der Schreiber binnen dieſer gantzen Zeit auf freyem Fuß geblieben, und
hat nicht nur durch ſeinen Conſulenten, ſondern auch, weilen der Inquiſitin ſelbſten in Ihrem Gefängnüß
ſo viele Freyheit gelaſſen worden, daß ſie frembden Beſuch von Ihren Anverwandten ohngehindert em-

Image from the ZIP (converted to JPEG), for easier understanding:

OCR-D-IMG_00000024

bug enhancement

opened by mikegerber 6

Web interface in Docker container / Error when uploading document: "Must be either POST with the field 'file'...."

I am running the web service in a Docker container. When trying to upload and process a file, I get the following error: Must be either POST with file field 'file' or GET with param 'url'.

Environment:

- MacOS 11.3.1
- Docker Engine v20.10.0
- ocr-transform v0.4.0

opened by cboulanger 2

Google Cloud Vision to PAGE-XML

It was mentioned before but @cneud just reminded me of https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page . Should not be too hard to integrate and would allow using GCV results in OCR-D/Transkribus/OCR4all.

BTW: Has anyone experience with the Azure Computer Vision API in the context of OCR? As a sign of goodwill in times of Covid-19, they are currently offering a generous free tier including access to the vision API. Would be interesting to compare.

opened by kba 5

Releases(v0.5.0)

v0.5.0(Nov 8, 2022)

What's Changed

⬆️ Update JPageConverter to 1.5.05 by @mikegerber in https://github.com/UB-Mannheim/ocr-fileformat/pull/131

update hocr2alto to include filak/hOCR-to-ALTO#23 by @kba in https://github.com/UB-Mannheim/ocr-fileformat/pull/130

page schemas: use github not primaresearch.org by @kba in https://github.com/UB-Mannheim/ocr-fileformat/pull/132

Page to alto python by @kba in https://github.com/UB-Mannheim/ocr-fileformat/pull/134

[doc][fix] clear README cli links by @M3ssman in https://github.com/UB-Mannheim/ocr-fileformat/pull/141

Add ImageWare MyBib to ALTO conversion by karkraeg, fix #139 by @kba in https://github.com/UB-Mannheim/ocr-fileformat/pull/140

page__alto: process all arguments by @bertsky in https://github.com/UB-Mannheim/ocr-fileformat/pull/142

when converting to PAGE, always use latest schema by @bertsky in https://github.com/UB-Mannheim/ocr-fileformat/pull/146

docker: unlimit POST upload size, #136 by @kba in https://github.com/UB-Mannheim/ocr-fileformat/pull/137

Update Saxon-HE by @stweil in https://github.com/UB-Mannheim/ocr-fileformat/pull/144

Use git submodules by @stweil in https://github.com/UB-Mannheim/ocr-fileformat/pull/148

update page-to-alto by @bertsky in https://github.com/UB-Mannheim/ocr-fileformat/pull/152

page to text: rewrite by @bertsky in https://github.com/UB-Mannheim/ocr-fileformat/pull/151

Update SaxonHE to version 11.2 by @stweil in https://github.com/UB-Mannheim/ocr-fileformat/pull/149

vendor/Makefile: page-to-alto is phony by @bertsky in https://github.com/UB-Mannheim/ocr-fileformat/pull/154

New Contributors

@mikegerber made their first contribution in https://github.com/UB-Mannheim/ocr-fileformat/pull/131

@M3ssman made their first contribution in https://github.com/UB-Mannheim/ocr-fileformat/pull/141

@bertsky made their first contribution in https://github.com/UB-Mannheim/ocr-fileformat/pull/142

Full Changelog: https://github.com/UB-Mannheim/ocr-fileformat/compare/v0.4.0...v0.5.0

v0.4.0(Sep 18, 2020)

Update JPageConverter and saxon9he, drop support for Python 2

v0.3.2(Jul 9, 2020)

Fix error handling for missing wget, unzip or git

v0.3.1(Jun 25, 2020)

Improve error handling for missing wget, unzip or git

v0.3.0(Jan 9, 2020)

Improve PAGE support

Update ALTO support

Add new conversions, e.g. hOCR to TEI, ABBYY to hOCR, PAGE to ALTO, ABBYY / ALTO / GCV / hOCR to PAGE, GCV to hOCR

Add new command line option --version

Fix bugs

ocr-fileformat_0.3.0.tar.gz(6.69 MB)
ocr-fileformat_0.3.0.zip(6.76 MB)

v0.2.3(Dec 11, 2017)

Fixed

Fix download button in web interface #73

Fix https URL in Docker builds #75

Changed

Tab bar above input #72

Example URLs via https

Added

make help

ocr-fileformat_0.2.3.tar.gz(5.01 MB)
ocr-fileformat_0.2.3.zip(5.06 MB)

v0.2.2(Dec 10, 2017)

Support new transformation from Google Cloud Vision format to hOCR

Fix format switching in transform web interface

Produce valid HTML

Use eslint for JS code style checking

Use best practices for Dockerfile

ocr-fileformat_0.2.2.tar.gz(5.01 MB)
ocr-fileformat_0.2.2.zip(5.06 MB)

v0.2.1(Feb 27, 2017)

Docker fixes (busybox/alpine incompatibilities + allow overriding web config) and add documentation for Docker https://github.com/UB-Mannheim/ocr-fileformat/pull/33, https://github.com/UB-Mannheim/ocr-fileformat/pull/45, https://github.com/UB-Mannheim/ocr-fileformat/pull/53

Update URLs to ABBYY schemas, add new PAGE format 2016-07-15 https://github.com/UB-Mannheim/ocr-fileformat/commit/fded289165d557ba016fc83f5fbbf034295313eb

Switch to official filak/hOCR-to-ALTO repo, linking language codes lookup xml https://github.com/UB-Mannheim/ocr-fileformat/pull/48, https://github.com/UB-Mannheim/ocr-fileformat/issues/46, https://github.com/UB-Mannheim/ocr-fileformat/pull/52

ocr-fileformat_0.2.1.tar.gz(4.48 MB)
ocr-fileformat_0.2.1.zip(4.54 MB)

v0.2.0(Sep 13, 2016)

Add option to run arbitrary scripts: In addition to XSD/XSLT, arbitrary executable scripts can be placed in ./script/validate and ./script/transform/, written in Python, bash or compiled C code.

Validation: hocr against hocr-check from tmbdev/hocr-tools

Web interface: Download button for transformation results

Web interface: Support file uploads for transformation and validation

Enable ALTO/hocr to plain text transformations

Code cleanup of the shared shell script library

More details: https://github.com/UB-Mannheim/ocr-fileformat/compare/v0.1.0...v0.2.0

ocr-fileformat_0.2.0.tar.gz(4.48 MB)
ocr-fileformat_0.2.0.zip(4.53 MB)

v0.1.0(Sep 5, 2016)

ocr-fileformat_0.1.0.tar.gz(7.23 MB)
ocr-fileformat_0.1.0.zip(7.29 MB)

v0.0.2(Sep 12, 2016)

Add transformation from alto2 to alto3: alto2.0__alto3.0.xsl. Thanks to @cneud !

Normalize project name and fix some links

Makefile: release goal

More details: https://github.com/UB-Mannheim/ocr-fileformat/compare/v0.0.1...v0.0.2

v0.0.1(May 18, 2016)

Initial commit

Transform hOCR <-> ALTO 2.0/2.1

Validate ALTO 1/2/3, ABBYY 6,8,9,10, PAGE

ocr-fileformat_0.0.1.tar.gz(4.44 MB)
ocr-fileformat_0.0.1.zip(4.48 MB)

Owner

Universitätsbibliothek Mannheim

Mannheim University Library

GitHub Repository https://digi.bib.uni-mannheim.de/ocr-fileformat/
