Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool
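
To give a feel for the pattern-matching approach, here is a minimal sketch of a single quote-then-speaker pattern. It is not the code in regex_pipeline/: the pattern, the list of reporting verbs and the function name `extract_quotes` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a double-quoted span, a reporting verb, then a
# capitalised name. Straight and curly quotation marks are both accepted.
QUOTE_THEN_SPEAKER = re.compile(
    r'["“](?P<quote>[^"”]+)["”],?\s+'
    r"(?P<verb>said|says|added|told)\s+"
    r"(?P<speaker>[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)"
)


def extract_quotes(text):
    """Return one dict (quote, verb, speaker) per pattern match in text."""
    results = []
    for match in QUOTE_THEN_SPEAKER.finditer(text):
        results.append(
            {
                # Drop the comma that usually sits inside the closing quote mark.
                "quote": match.group("quote").strip().rstrip(","),
                "verb": match.group("verb"),
                "speaker": match.group("speaker"),
            }
        )
    return results


sample = (
    '"We will publish the findings next week," said Jane Smith. '
    "The minister declined to comment."
)
results = extract_quotes(sample)
```

A single pattern like this misses many common forms (speaker before the quote, quotes split across sentences, pronoun speakers), which is one reason a rule set and a trained model are also part of the project.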

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rules-based coreferencing tool
schema/ – data output schema and example data

Each folder contains a separate README file with instructions to set up and run each piece of work.
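
As a rough illustration of what a rules-based coreferencing step can look like, the sketch below replaces a pronoun speaker with the most recent named speaker. This is an assumed, deliberately naive rule, not the logic in coreference/. The field names `speaker` and `quote` are hypothetical and do not reflect the schema in schema/.

```python
# Hypothetical set of pronouns treated as unresolved speakers.
PRONOUNS = {"he", "she", "they"}


def resolve_speakers(quotes):
    """Replace pronoun speakers with the most recent named speaker."""
    resolved = []
    last_named = None
    for item in quotes:
        speaker = item["speaker"]
        if speaker.lower() in PRONOUNS:
            # Keep the pronoun when no named speaker precedes it.
            speaker = last_named or speaker
        else:
            last_named = speaker
        # Copy the item so the input list is left unchanged.
        resolved.append({**item, "speaker": speaker})
    return resolved


quotes = [
    {"speaker": "Jane Smith", "quote": "We will publish the findings next week"},
    {"speaker": "she", "quote": "The data is still being checked"},
]
resolved = resolve_speakers(quotes)
```

A rule this simple will mislabel quotes whenever two people are mentioned between a name and a pronoun, so it should be read as a starting point only.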
