Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline that attempts to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool
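
To illustrate the idea behind the regular expression pipeline, here is a minimal sketch of pattern-based quote extraction. It is not the code in regex_pipeline/: the pattern, the list of reporting verbs, the function name extract_quotes and the output keys are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the repo's actual rule set):
# a double-quoted span, then a reporting verb, then a capitalised speaker name.
QUOTE_PATTERN = re.compile(
    r'["“](?P<quote>[^"“”]+)["”],?\s+'
    r'(?P<verb>said|says|told|added)\s+'
    r'(?P<speaker>[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)'
)


def extract_quotes(text):
    """Return one dict per matched quote with its reporting verb and speaker."""
    return [
        {
            # Drop a trailing comma that sits inside the quotation marks.
            "quote": m.group("quote").strip().rstrip(","),
            "verb": m.group("verb"),
            "speaker": m.group("speaker"),
        }
        for m in QUOTE_PATTERN.finditer(text)
    ]


sample = '"We are making progress," said Jane Doe. The minister disagreed.'
for record in extract_quotes(sample):
    print(record)
```

A pattern like this only catches the "quote, verb, speaker" ordering and misses indirect speech or speakers referred to by pronoun, which is the kind of gap the trained model and the coreferencing step are meant to address.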

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rule-based coreferencing tool
schema/ – data output schema and example data

Each folder has its own README file with instructions for setting up and running that piece of work.

Owner

Journalism AI collab 2021
