PAGE XML format collection for document image page content and more

Last update: Nov 14, 2022

PAGE-XML

For an introduction, please see the following publication: http://www.primaresearch.org/publications/ICPR2010_Pletschacher_PAGE

The most actively used XML formats are:

PAGE XML for page content (regions, text lines, words, glyphs, reading order, text content, ...)
PAGE XML for layout analysis evaluation (evaluation profiles, evaluation results, ...)
PAGE XML for document image dewarping (dewarping grids)

All formats are defined by an XML schema, hosted officially on primaresearch.org:

http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd
http://www.primaresearch.org/schema/PAGE/eval/layout/2013-07-15/layouteval.xsd
http://www.primaresearch.org/schema/PAGE/gts/dewarping/2014-08-26/dewarping.xsd

Please see the wiki for more information.

Note: The master branch contains the proposed changes for the next release.

Proposed media type for page content: "application/vnd.prima.page+xml"

Comments

Using the “official” Exif schema as an alternative to extended image metadata?

After playing around with the extended PAGE image metadata fields, @Boenig and I wanted to suggest further additions. However, this would ultimately lead to including a complete image metadata set in PAGE, which might not be desirable. Relying on an already existing XML schema is probably an effective alternative. Luckily, such a schema exists and, even better, it is maintained under the auspices of the W3C: https://www.w3.org/2003/12/exif/

So, what do @chris1010010, @splet and @cneud think about simply including this RDF schema in PAGE XML?

opened by wrznr 7
add semantics to coordinate system
Coordinates are at the heart of stand-off annotation formats. In PAGE-XML, all visible elements must have a CoordsType, which must have a @points. There is even some syntax for that enforced by a regular expression. However, the standard lacks any semantics for the coordinate system whatsoever. There is not even a comment about this, so with luck, at least all implementors guessed consistently.

IMO we need to specify that:

@points always describes (a list of x-y pairs of) absolute pixel coordinates ("absolute" meaning they refer to the root image in PageType/@imageFilename with the upper left corner as 0,0)

Moreover, we should clarify whether:

@points has a topology of

(unordered) sets of points, or

a single (open or closed) path, or

multiple closed paths (and if so, whether orientation is relevant as in e.g. left=inside / right=outside)

@points must obey certain constraints like

are paths allowed to leave the parent element's polygon outline / bounding box, or maybe even the page's bounding box (i.e. become negative, which is currently forbidden by syntax)? And if not:

must they be closed along the parent element's polygon outline / bounding box, or may they stay open when intersecting it?

are paths required to be planar (i.e. have no self-intersections)? And if not:

how does the content area compute,

by union, or

by difference, or

by orientation (left-of-path or right-of-path)?

This is highly relevant for implementors, especially as polygon processing and AlternativeImage processing on multiple hierarchy levels in the presence of skew become common practice, which is currently happening within OCR-D (for showcases see our Tesseract and our Ocropy preprocessing and segmentation wrappers).
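
To make the open questions concrete, here is a minimal Python sketch of how an implementor might read @points today. It assumes the syntax is a space-separated list of "x,y" pairs of non-negative integers (the regular expression below is an illustrative stand-in, not copied from the schema), and it assumes the interpretation proposed above: a single closed path in absolute pixel coordinates with 0,0 at the upper left. The function names are hypothetical.

```python
import re

# Assumed syntax: at least two "x,y" pairs of non-negative integers,
# separated by single spaces. Illustrative only, not the schema's own pattern.
POINTS_RE = re.compile(r"^\d+,\d+( \d+,\d+)+$")


def parse_points(points):
    """Parse a @points string into a list of (x, y) integer tuples."""
    if not POINTS_RE.match(points):
        raise ValueError(f"malformed points string: {points!r}")
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split(" ")]


def signed_area(polygon):
    """Shoelace formula over the implicitly closed path.

    With image coordinates (y grows downward), a positive result means
    the vertices run clockwise as seen on screen.
    """
    total = 0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2


def bounding_box(polygon):
    """Return (x_min, y_min, x_max, y_max) of the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


polygon = parse_points("0,0 100,0 100,50 0,50")
print(signed_area(polygon), bounding_box(polygon))
```

Whether the sign of the area (i.e. the orientation) should carry meaning is exactly one of the points the standard would need to settle.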

(Cf. altoxml/schema#49)
opened by bertsky 4
pagecontent: allow region recursion for Map, too

Add MapRegion to the xsd:choice list within RegionType (as in PageType, and as all the other region types).

(I don't see a reason why this should be excluded – most likely just forgotten.)

opened by bertsky 2
element TextStyle inside element Page

Hi, sometimes it is necessary to define the text style (font, ...) for the whole page, for example if you want to create ground truth only for the fonts used, independent of text, word and glyph regions. Therefore, my suggestion is to allow the element TextStyle within the element Page.

opened by tboenig 2
Deploy 2018 schema
We use the 2018 version in @OCR-D for Ground Truth and expect software partners to produce PAGE based on the 2018 version.

It would help a lot interoperability-wise if the schema location was dereferenceable, i.e. if

http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2018-07-15/pagecontent.xsd

would resolve to the XSD the same way that

http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd

does now.

@chris1010010 Can you create the folder and upload the XSD?

Thank you!
opened by kba 2
How can I represent the skew angle on page level?

While it's pretty neat that I can represent skew on region level, deskewing is typically (i.e. in ABBYY products) a process which is applied on page level only. How can I incorporate skew angles on page level in PAGE XML?

opened by wrznr 1
AlternativeImage for regions, lines, words and glyphs
For OCR-D we have the requirement to allow preprocessing not just on page level but on regions, text lines or words. For example, dewarping of individual text blocks or text lines.

In order to support these use cases, we extended the schema to allow pc:AlternativeImage as an optional first element within

RegionType

TextLineType

WordType

GlyphType

Would you consider incorporating these extensions in the next PAGE release?

Thanks!

CC @wrznr @tboenig @cneud
opened by kba 1
Additional attributes for expressing more details on the image

We are currently setting up the OCR-D framework using PAGE XML. For the image pre-processing step, we'd like to propose a few additions to the image metadata header section. It would be great if they could be integrated into the “official” PAGE XML schema as well.

opened by Boenig 1
clarification on reading order sorting...
require @index to be ascending monotonically (but still across types)

follow-up on https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/15#issuecomment-623605873
opened by bertsky 0
Clarify correct reading order index sorting

as a follow-up on https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/15#issuecomment-623492135

If @index is not required to be contiguous (which I would be happier with), please let me know.
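
For illustration, a consumer that only relies on @index being ascending (not contiguous) could sort group members as in the following Python sketch. The sample document, the region ids and the helper names are made up for this example; the namespace URI is assumed to be the one of the 2019-07-15 page content schema.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the 2019-07-15 page content schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up sample: indexes are ascending once sorted, but not contiguous.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="5" regionRef="r3"/>
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="2" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
  </Page>
</PcGts>"""


def indexed_members(group):
    """All children carrying @index, regardless of element type."""
    return [child for child in group if "index" in child.attrib]


def ordered_region_refs(group):
    """Region references sorted by numeric @index (gaps are tolerated)."""
    members = sorted(indexed_members(group), key=lambda el: int(el.get("index")))
    return [el.get("regionRef") for el in members if el.get("regionRef")]


def is_contiguous(group):
    """True if the indexes are exactly 0, 1, ..., n-1."""
    indexes = sorted(int(el.get("index")) for el in indexed_members(group))
    return indexes == list(range(len(indexes)))


root = ET.fromstring(SAMPLE)
group = root.find(".//pc:OrderedGroup", NS)
print(ordered_region_refs(group), is_contiguous(group))
```

Sorting by the numeric value rather than by document order keeps the result stable whichever of the two rules (contiguous or merely ascending) the schema ends up requiring.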

opened by bertsky 0
using this tool

@kba @stweil @splet @chris1010010 @Boenig Hello! Thanks a lot for this great tool! How can I use it? How can I install it? What kind of commands can I use?

opened by Tailor2019 5
missing @production / @secondaryLanguage
Is there a reason that

there is no @production in PageType (and only there)

there is no @secondaryLanguage in TextLineType (and only there)

or have they been forgotten?
opened by bertsky 1
add Page/@comments

In light of #25 and general consistency, I think there should also be a @comments under PageType (as with all other segment hierarchy types). This is especially useful for descriptors of the @imageFilename (in analogy to AlternativeImage/@comments).

opened by bertsky 0
standard/norm for LanguageSimpleType
In PAGE-XML there's @language / @primaryLanguage of type pc:LanguageSimpleType to identify the natural language of segments. Its documentation refers to ISO 639.x 2016-07-14, which I cannot make sense of. There's 639-1, 639-2 and 639-3, but AFAICT no standard that allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for 2016-07-14. This is problematic because exact 639 mappings are needed for software implementation and interoperability.

Take Norwegian for example:

<enumeration value="Norwegian"/> <enumeration value="Norwegian Bokmål"/> <enumeration value="Norwegian Nynorsk"/>

According to 639 these could be named no/nb/nn or nor/nob/nno, but how do we map that automatically, where do the strings derive from in PAGE-XML?
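
Until the schema documents the derivation, implementations have to maintain such a mapping by hand. A minimal Python sketch of a lookup table, covering only the three Norwegian values quoted above, might look like this; the table and the function name are hypothetical and not part of PAGE-XML:

```python
# Hypothetical hand-maintained table: PAGE enumeration value ->
# (two-letter code, three-letter code), as listed in the text above.
PAGE_TO_ISO639 = {
    "Norwegian": ("no", "nor"),
    "Norwegian Bokmål": ("nb", "nob"),
    "Norwegian Nynorsk": ("nn", "nno"),
}


def to_iso639(page_language, letters=2):
    """Map a PAGE language name to its two- or three-letter code."""
    if letters not in (2, 3):
        raise ValueError("letters must be 2 or 3")
    try:
        codes = PAGE_TO_ISO639[page_language]
    except KeyError:
        raise LookupError(f"no known ISO 639 mapping for {page_language!r}") from None
    return codes[letters - 2]


print(to_iso639("Norwegian Bokmål"), to_iso639("Norwegian Nynorsk", letters=3))
```

Raising on unknown names (instead of guessing) makes gaps in the table visible, which matters as long as the origin of the PAGE strings is unclear.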
opened by bertsky 4
Semantics of textLineOrder and readingDirection
The schema documentation only says this:

readingDirection:

The direction in which text within lines should be read (order of words and characters), in addition to “textLineOrder”.

textLineOrder:

The order of text lines within the block, in addition to “readingDirection”.

Now, the values for both of these are stated in absolute terms (top-to-bottom, bottom-to-top, left-to-right, right-to-left), not relative to XML ordering (straight vs inverse).

So how exactly should they be interpreted?

W.r.t. @orientation: Before or after rotation?

W.r.t. XML ordering: Should elements always be "in order" already, or must they follow some absolute top-down left-right default?

W.r.t. each other: Is it an error if they are not orthogonal?

I have not found a single example anywhere in the repo. I found but 2 examples of @readingDirection="bottom-to-top" in the PRImA Layout Analysis Dataset, namely r13 in 00000408 and r3 in 00000394 – both of which are cases of @orientation=-90°. Is this correct?
opened by bertsky 3
support scale attribute for down/upsampled images
Since AlternativeImage has been introduced on every level of the structural hierarchy, these image files can be used to represent results from image preprocessing (normalization, denoising, binarization, non-text suppression, despeckling, deskewing, dewarping). Some of these operations can and some cannot be represented descriptively – but referencing derived images always helps avoiding repeated computations.

However, there's a difficulty/penalty involved: All coordinates in the PAGE hierarchy refer to the original image (under /PcGts/Page/@imageFilename), whereas derived images (AlternativeImage/@filename under Page or Region or TextLine or Word) necessarily have a different, local/relative coordinate system, which is connected to the global/absolute coordinate system only implicitly.

So if you want to process via derived images, like crop segments further down the hierarchy (translating from their absolute coordinates to the images' relative coordinates) or add further segmentation (translating from new relative coordinates in the images to new absolute coordinates), then you must know the transformation between them.

This could merely be an offset (which could be unambiguously defined as the top left of the bounding box of the element's polygon), which happens after cropping (on the page level or any segmentation below that). But there are certain operations which change coordinates non-trivially:

Deskewing will shift to the center of the element's bounding box, then rotate around that center, increasing the size of the bounding box (to avoid losing content at the corners), and shift back to the (new) top left of the bounding box. Alternatively, larger angles (e.g. multiples of 90°) could be applied by reflection instead of rotation.

Dewarping may change coordinates in any number of ways (3d shear or cubic spline projection, or interpolated raster grid, including as a special case centerline projection).

Rescaling or aspect correction will multiply coordinates by a constant factor.

All those effects are cumulative, i.e. they will compose into a new coordinate transform at each step, and in the order of the operations applied to the image (and its predecessors). This is not always trivial, e.g. cropping before/after deskewing, deskewing on page and then again on region level. It's certainly not rocket science, but (believe me) there are many ways you can get this wrong when you have to implement it.
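
The cumulative nature of these transforms can be sketched with plain 3x3 matrices in homogeneous coordinates. The following Python example is only an illustration under stated assumptions: all function names are hypothetical, the rotation sign convention is an arbitrary choice (a real implementation must match how the image was actually rotated), and dewarping is left out because it is not affine.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(dx, dy):
    return [[1, 0, dx], [0, 1, dy], [0, 0, 1]]


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def rotation(angle_deg):
    # Assumed convention: standard rotation matrix applied to (x, y).
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def compose(*steps):
    """Combine transforms in the order they were applied (first step first)."""
    result = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    for step in steps:
        result = matmul(step, result)
    return result


def deskew(width, height, angle_deg):
    """Rotate about the bounding box center, then shift to the enlarged box."""
    c = abs(math.cos(math.radians(angle_deg)))
    s = abs(math.sin(math.radians(angle_deg)))
    new_w, new_h = width * c + height * s, width * s + height * c
    return compose(translation(-width / 2, -height / 2),
                   rotation(angle_deg),
                   translation(new_w / 2, new_h / 2))


def apply(m, x, y):
    """Map a point through an affine transform."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# Crop at offset (100, 200), then downsample by a factor of 2.
crop_then_scale = compose(translation(-100, -200), scaling(0.5, 0.5))
print(apply(crop_then_scale, 300, 400))
```

Swapping the two steps in the last example gives a different result, which is the ordering pitfall described above.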

Now, for cropping and deskewing, we are in the fortunate situation that – provided the operations applied on the derived image have been carried out in the "correct" way and documented in its @comments – their respective coordinate transform can be reconstructed from the descriptive information (Coords/@points and @orientation).

But for dewarping and rescaling we don't even have any descriptive annotation yet.

For dewarping, maybe the dewarping schema with its /DwGts/Grid/Row/@points is sufficient (although it is unfortunate that this schema is external to the content schema).

But for rescaling, there's nothing at all.

You could ask:

shouldn't we then allow annotating the coordinate transform explicitly?

why do you want to rescale?

1: I'd be happy to see PAGE adopt some representation of affine transformations (basically a 3x3 float array) under AlternativeImage/@coordinate-system. But I would still consider this only a redundant convenience feature.

2: Rescaling is useful under various scenarios:

avoid wasting computation on images with too large pixel density by downsampling them during processing

ensuring a fixed pixel density for operations that expect certain component sizes or distances (e.g. rule-based segmentation tools always assuming 300 DPI)

ensuring a fixed pixel resolution for operations that expect a certain image size (e.g. neural segmentation tools)

ensuring a fixed width/height aspect ratio during processing

Thus, I propose to at least introduce a descriptive annotation for derived images' scale factors:

AlternativeImage/@imageWidth (as in Page/@imageWidth)

AlternativeImage/@imageHeight (as in Page/@imageHeight)

AlternativeImage/@imageXResolution (as in Page/@imageXResolution)

AlternativeImage/@imageYResolution (as in Page/@imageYResolution)

AlternativeImage/@imageResolutionUnit (as in Page/@imageResolutionUnit)

AlternativeImage/@imageXScale (how much is AlternativeImage/@imageXResolution zoomed over Page/@imageXResolution?)

AlternativeImage/@imageYScale (how much is AlternativeImage/@imageYResolution zoomed over Page/@imageYResolution?)

(Of course, the latter 2 are redundant, but pixel density might not be known exactly/reliably and thus omitted / set to zero. In that case, the scale can still describe precisely the factor between the unknown density of the original image and the unknown density of the derived image.)
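
As a sketch of how a consumer might use these proposed attributes, the following Python example picks the scale factor from an explicit scale value when present and falls back to the ratio of pixel densities otherwise, then maps coordinates found in a derived image back to the original image. The function names and the precedence rule are assumptions made for illustration; none of this is defined by PAGE.

```python
def effective_scale(page_resolution, alt_resolution, explicit_scale=None):
    """Scale of the derived image relative to the original image.

    Assumed precedence: an explicit scale wins; otherwise use the ratio of
    pixel densities, provided both are known (non-zero).
    """
    if explicit_scale:
        return float(explicit_scale)
    if page_resolution and alt_resolution:
        return alt_resolution / page_resolution
    raise ValueError("neither a scale nor usable resolutions are available")


def to_absolute(points, x_scale, y_scale, offset=(0, 0)):
    """Map points from a cropped and rescaled derived image to page pixels.

    `offset` is the top left of the cropped area in the original image.
    """
    ox, oy = offset
    return [(round(x / x_scale + ox), round(y / y_scale + oy)) for x, y in points]


# Densities known: a 300 DPI derivative of a 600 DPI original.
scale = effective_scale(600, 300)
# Densities unknown (set to zero), but the scale is annotated explicitly.
same_scale = effective_scale(0, 0, explicit_scale=0.5)
print(scale, same_scale, to_absolute([(100, 100)], scale, scale, offset=(100, 200)))
```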
opened by bertsky 4

Releases (2019-07-15_2)

2019-07-15_2 (Aug 24, 2019)
Removed doubleUnderlined (forgot to remove, superseded by underlineStyle doubleLine)
PAGE_pagecontent_2019-07-15_2.zip (3.62 MB)

2019-07-15 (Jul 16, 2019)
2019 format update
PAGE_pagecontent_2019-07-15.zip (3.62 MB)

2018-07-15 (Jul 19, 2018)
2018 page content format of PAGE XML framework
PAGE_pagecontent_2018-07-15.zip (3.99 MB)

2017-07-15 (Dec 14, 2017)
2017 page content format of PAGE XML framework
PAGE_pagecontent_2017-07-15.zip (3.83 MB)

Owner

PRImA Research Lab

Pattern Recognition and Image Analysis Research Lab
