Contact Extraction with Question Answering.

Overview

contactsQA

Extraction of contact entities from address blocks and imprints with Extractive Question Answering.

Goal

Input:

Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg

Output:

entities = {
  "city" : "Würzburg",
  "email" : "",
  "fax" : "",
  "firstName" : "Max",
  "lastName" : "Mustermann",
  "mobile" : "",
  "organization" : "",
  "phone" : "",
  "position" : "",
  "street" : "Hauptstraße 123",
  "title" : "Dr.",
  "website" : "",
  "zip" : "97070"
}

Getting started

Creating a dataset

Due to data protection reasons, no dataset is included in this repository. You need to create a dataset in the SQuAD format, see https://huggingface.co/datasets/squad. Create the dataset in the jsonl-format where one line looks like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'firstName',
        'answers': {
            'answer_start': [4],
            'text': ['Max']
        }
    }

Questions with no answers should look like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'phone',
        'answers': {
            'answer_start': [-1],
            'text': ['EMPTY']
        }
    }

Split the dataset into a train-, validation- and test-dataset and save them in a directory with the name crawl, email or expected, like this:

├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl

If you allow unanswerable questions like in SQuAD v2.0, add a -na behind the directory name, like this:

├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl

Training a model

Example command for training and evaluating a dataset inside the crawl-na directory:

python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;

Virtual Environment Setup

Create and activate the environment (the python version and the environment name can vary at will):

$ python3.9 -m venv .env
$ source .env/bin/activate

To install the project's dependencies, activate the virtual environment and simply run (requires poetry):

$ poetry install

Alternatively, use the following:

$ pip install -r requirements.txt

Deactivate the environment:

$ deactivate

Troubleshooting

Common error:

ModuleNotFoundError: No module named 'setuptools'

The solution is to upgrade setuptools and then run poetry install or poetry update afterwards:

pip install --upgrade setuptools
Owner
Jan
Data Scientist (Working student) @snapADDY & Master student at Digital Humanities at Julius-Maximilians-University Würzburg.
Jan
Partially offline multi-language translator built upon Huggingface transformers.

Translate Command-line interface to translation pipelines, powered by Huggingface transformers. This tool can download translation models, and then us

Richard Jarry 8 Oct 25, 2022
Nested Named Entity Recognition for Chinese Biomedical Text

CBio-NAMER CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understand

8 Dec 25, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
Two-stage text summarization with BERT and BART

Two-Stage Text Summarization Description We experiment with a 2-stage summarization model on CNN/DailyMail dataset that combines the ability to filter

Yukai Yang (Alexis) 6 Oct 22, 2022
a chinese segment base on crf

Genius Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。 Feature 支持python2.x、python3.x以及pypy2.x。 支持简单的pinyin分词 支持用户自定义break 支持用户自定义合并词

duanhongyi 237 Nov 04, 2022
Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

COCO LM Pretraining (wip) Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were a

Phil Wang 44 Jul 28, 2022
:P Some basic stuff I'm gonna use for my upcoming Agile Software Development and Devops

reverse-image-search-py bash script.sh img_name.jpg Requirements pip install requests pip install pyshorteners Dry run [ Sudhanva M 3 Dec 18, 2021

Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Text Summarizer This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text. Team Members This mini-project was

1 Nov 16, 2021
结巴中文分词

jieba “结巴”中文分词:做最好的 Python 中文分词组件 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation

Sun Junyi 29.8k Jan 02, 2023
Milaan Parmar / Милан пармар / _米兰 帕尔马 170 Dec 13, 2022
SGMC: Spectral Graph Matrix Completion

SGMC: Spectral Graph Matrix Completion Code for AAAI21 paper "Scalable and Explainable 1-Bit Matrix Completion via Graph Signal Learning". Data Format

Chao Chen 8 Dec 12, 2022
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
🎐 a python library for doing approximate and phonetic matching of strings.

jellyfish Jellyfish is a python library for doing approximate and phonetic matching of strings. Written by James Turk James Turk 1.8k Dec 21, 2022

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
A programming language with logic of Python, and syntax of all languages.

Pytov The idea was to take all well known syntaxes, and combine them into one programming language with many posabilities. Installation Install using

Yuval Rosen 14 Dec 07, 2022
Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization

Line as a Visual Sentence with LineTR This repository contains the inference code, pretrained model, and demo scripts of the following paper. It suppo

SungHo Yoon 158 Dec 27, 2022
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 07, 2023