CDLA: A Chinese document layout analysis (CDLA) dataset

Last update: Dec 28, 2022

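Introduction

CDLA is a Chinese document layout analysis dataset aimed at Chinese literature (academic paper) scenarios. It includes the following 10 labels, listed with their Chinese names and English equivalents: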

正文	标题	图片	图片标题	表格	表格标题	页眉	页脚	注释	公式
Text	Title	Figure	Figure caption	Table	Table caption	Header	Footer	Reference	Equation

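The dataset contains 5,000 training images and 1,000 validation images, stored in the train and val directories respectively. Each image has an annotation file (.json) with the same name.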
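
The same-name convention makes it straightforward to pair each image with its annotation file. The following is a minimal sketch, not part of the dataset tooling: the function name `pair_images_with_annotations` and the set of image suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: images use one of these suffixes (the sample annotation refers to a .jpg file).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(split_dir):
    """Return sorted (image_path, json_path) pairs found in one split directory."""
    split_dir = Path(split_dir)
    pairs = []
    for image_path in sorted(split_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # The annotation file shares the image's name, with a .json suffix.
        json_path = image_path.with_suffix(".json")
        if json_path.exists():
            pairs.append((image_path, json_path))
    return pairs
```

Calling `pair_images_with_annotations("CDLA_dir/train")` would then return one pair per annotated training image, assuming the dataset was unpacked into a directory named `CDLA_dir`.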

Download links

Baidu Cloud download: https://pan.baidu.com/s/1449mhds2ze5JLk-88yKVAA (extraction code: tp0d)
Google Drive download: https://drive.google.com/file/d/14SUsp_TG8OPdK0VthRXBcAbYzIBjSNLm/view?usp=sharing

Annotation format

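We annotated the data with labelme, so the annotation format is the same as labelme's. The most important fields are described here.

"shapes": a list of dicts, where each dict represents one annotated instance.

"label": the category of the instance.

"points": the instance annotation. Because the annotations are polygons, points may contain more than 4 coordinate pairs.

"shape_type": "polygon"

"imagePath": the image path/name

"imageHeight": the image height

"imageWidth": the image width

A complete annotation example: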

{
  "version":"4.5.6",
  "flags":{},
  "shapes":[
    {
      "label":"Title",
      "points":[
        [
          553.1111111111111,
          166.59259259259258
        ],
        [
          553.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          166.59259259259258
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Text",
      "points":[
        [
          250.5925925925925,
          298.0740740740741
        ],
        [
          250.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          410.0740740740741
        ],
        [
          188.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          298.0740740740741
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Footer",
      "points":[
        [
          1033.7407407407406,
          1634.5185185185185
        ],
        [
          1033.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1634.5185185185185
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    }
  ],
  "imagePath":"val_0031.jpg",
  "imageData":null,
  "imageHeight":1754,
  "imageWidth":1240
}
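
As a rough illustration of how these fields can be read, the sketch below parses a labelme-style annotation and reduces each polygon to an axis-aligned bounding box. The helper names (`load_annotation`, `polygon_to_bbox`, `read_shapes`) are hypothetical, and the inline sample is a cut-down copy of the "Footer" shape from the example above.

```python
import json


def load_annotation(json_path):
    """Load one labelme-style annotation file into a dict."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def polygon_to_bbox(points):
    """Return (x_min, y_min, x_max, y_max) enclosing a list of [x, y] points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def read_shapes(annotation):
    """Return a list of (label, bbox) tuples, one per annotated shape."""
    return [(shape["label"], polygon_to_bbox(shape["points"]))
            for shape in annotation["shapes"]]


# Cut-down sample: only the "Footer" shape of the example annotation.
sample = json.loads("""
{
  "shapes": [
    {
      "label": "Footer",
      "points": [
        [1033.7407407407406, 1634.5185185185185],
        [1033.7407407407406, 1646.5185185185185],
        [1052.7407407407406, 1646.5185185185185],
        [1052.7407407407406, 1634.5185185185185]
      ],
      "group_id": null,
      "shape_type": "polygon",
      "flags": {}
    }
  ],
  "imagePath": "val_0031.jpg",
  "imageHeight": 1754,
  "imageWidth": 1240
}
""")

for label, bbox in read_shapes(sample):
    print(label, bbox)
```

For a file on disk, `read_shapes(load_annotation(path))` would do the same. A bounding box discards the polygon's shape, so it is only a coarse summary for non-rectangular regions such as the "Text" instance in the example.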

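Converting to COCO format

Run the following commands:

# train
python3 labelme2coco.py CDLA_dir/train train_save_path --labels labels.txt

# val
python3 labelme2coco.py CDLA_dir/val val_save_path --labels labels.txt

The converted results are saved in the train_save_path and val_save_path directories.

labelme2coco.py is taken from labelme; see the official labelme project for more information.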
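
To give a sense of what the conversion produces, the sketch below maps one labelme shape to a COCO-style annotation record. It is not the code in labelme2coco.py, which may compute the bounding box and area differently (for example from a rasterized mask). The function names, the `CATEGORY_IDS` mapping, and the assumption that category ids follow the label order listed in the introduction are all illustrative.

```python
# Assumption: ids 1..10 follow the label order given in the introduction.
LABELS = ["Text", "Title", "Figure", "Figure caption", "Table",
          "Table caption", "Header", "Footer", "Reference", "Equation"]
CATEGORY_IDS = {name: i for i, name in enumerate(LABELS, start=1)}


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def shape_to_coco(shape, image_id, annotation_id, category_ids=CATEGORY_IDS):
    """Build a COCO-style annotation dict from one labelme polygon shape."""
    points = shape["points"]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_ids[shape["label"]],
        # COCO stores each polygon as a flat [x1, y1, x2, y2, ...] list.
        "segmentation": [[coord for point in points for coord in point]],
        # COCO bounding boxes are [x, y, width, height].
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": polygon_area(points),
        "iscrowd": 0,
    }


# Hypothetical rectangular "Title" shape with round coordinates.
example_shape = {
    "label": "Title",
    "points": [[10, 20], [10, 50], [40, 50], [40, 20]],
    "shape_type": "polygon",
}
record = shape_to_coco(example_shape, image_id=0, annotation_id=0)
print(record)
```

The ids used by the real conversion depend on the contents of labels.txt, so the mapping here should be checked against that file before the records are compared.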

Owner

buptlihang
