topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

Overview

NLP Space News Topic Modeling

Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com

Binder Open In Colab nbviewer pre-commit CI CodeQL License: MIT OpenSource Code style: black prs-welcome pyup

Table of Contents

  1. Project Idea
  2. Data acquisition
  3. Analysis
  4. Usage
  5. Project Organization

Project Idea

This project aims to learn topics published in Space news from the Guardian (UK) news publication1.

1: articles were also retrieved from the blog Space.com (web scraping), the New York Times (space news from the science section) and from the Hubble Telescope news archive, but these data sources were not used in analysis

Data acquisition

Primary data source

News articles are retrieved using the official API provided by the Guardian.

Supplementary data sources

Data is also acquired from articles published by the Hubble Telescope, the New York Times (US) and blog publication Space.com

Although these articles were acquired, they were not used in analysis.

Data file creation

  1. Use 1_get_list_of_urls.ipynb
    • programmatically retrieves urls from API or archive of publication
    • retrieves metadata such as date and time, section, sub-section, headline/abstract/short summary, etc.
  2. Use 2_scrape_urls.ipynb
    • scrapes news article text from publication url
  3. Use 3_merge_scraped_and_filter.ipynb
    • merge metadata (1_get_list_of_urls.ipynb) with scraped article text (2_scrape_urls.ipynb)

Analysis

Analysis will be performed using an un-supervised learning model. Details are included in the 8_gensim_coherence_nlp_trials_v3.ipynb notebook in the root directory.

Usage

  1. Clone this repository
    $ git clone
  2. Create Python virtual environment, install packages and launch interactive Python platform
    $ make build
  3. Run notebooks in the following order
    • 3_merge_scraped_and_filter.ipynb (view) (covers data from the Hubble news feed, New York Times and Space.com)
      • merge multiple files of articles text data retrieved from news publications API or archive
      • filter out articles of less than 500 words
      • export to *.csv file for use in unsupervised machine learning models
    • 8_gensim_coherence_nlp_trials_v3.ipynb (view) (does not cover data from the Hubble news feed, New York Times and Space.com)
      • experiments in selecting number of topics using
        • coherence score from built-in coherence model to score Gensim's NMF
        • sklearn's implementation of TFIDF + NMF, using best number of topics found using Gensim's NMF
      • manually reading articles that NMF associates with each topic
    • 9_nlp_workflow.ipynb (view)
      • code-only version of 9_gensim_coherence_nlp_trials_v3.ipynb, with necessary considerations for deployment of topic model

Project Organization

├── .pre-commit-config.yaml       <- configuration file for pre-commit hooks
├── .github
│   ├── workflows
│       └── integrate.yml         <- configuration file for Github Actions
├── LICENSE
├── environment.yml               <- configuration file to create environment to run project on Binder
├── Makefile                      <- Makefile with commands like `make lint` or `make build`
├── README.md                     <- The top-level README for developers using this project.
├── app
│   ├── data                      <- data exported from training topic modeler, for use with API
|   └── tests                     <- Source code for use in API tests
|       ├── test-logs             <- Reports from running unit tests on API
|       └── testing_utils         <- Source code for use in unit tests
|           └── *.py              <- Scripts to use in testing API routes
|       ├── __init__.py           <- Allows Python modules to be imported from testing_utils
|       └── test_api.py           <- Unit tests for API
├── api.py                        <- Defines API routes
├── pytest.ini                    <- Test configuration
├── requirements.txt              <- Packages required to run and test API
├── s*,t*.py                      <- Scripts to use in defining API routes
├── data
│   ├── raw                       <- raw data retrieved from news publication
|   └── processed                 <- merged and filtered data
├── executed-notebooks            <- Notebooks with output.
├── *.ipynb                       <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                    and a short `-` delimited description
├── requirements.txt              <- packages required to execute all Jupyter notebooks interactively (not from CI)
├── setup.py                      <- makes project pip installable (pip install -e .) so `src` can be imported
├── src                           <- Source code for use in this project.
│   ├── __init__.py               <- Makes src a Python module
│   └── *.py                      <- Scripts to use in analysis for pre-processing, training, etc.
├── papermill_runner.py           <- Python functions that execute system shell commands.
└── tox.ini                       <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
edesz
edesz
spaCy plugin for Transformers , Udify, ELmo, etc.

Camphr - spaCy plugin for Transformers, Udify, Elmo, etc. Camphr is a Natural Language Processing library that helps in seamless integration for a wid

342 Nov 21, 2022
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
This is a project built for FALLABOUT2021 event under SRMMIC, This project deals with NLP poetry generation.

FALLABOUT-SRMMIC 21 POETRY-GENERATION HINGLISH DESCRIPTION We have developed a NLP(natural language processing) model which automatically generates a

7 Sep 28, 2021
MMDA - multimodal document analysis

MMDA - multimodal document analysis

AI2 75 Jan 04, 2023
Transformers implementation for Fall 2021 Clinic

Installation Download miniconda3 if not already installed You can check by running typing conda in command prompt. Use conda to create an environment

Aakash Tripathi 1 Oct 28, 2021
T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
precise iris segmentation

PI-DECODER Introduction PI-DECODER, a decoder structure designed for Precise Iris Segmentation and Location. The decoder structure is shown below: Ple

8 Aug 08, 2022
A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end / back-end 임용

94 Dec 08, 2022
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers an

Parv Bhatt 1 Jan 01, 2022
This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Ai-ChatBot-Python A chatbot is an intelligent system which can hold a conversation with a human using natural language in real time. Due to the rise o

1 Oct 30, 2021
Grover is a model for Neural Fake News -- both generation and detectio

Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.

Rowan Zellers 856 Dec 24, 2022
Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

Universal Adversarial Triggers for Attacking and Analyzing NLP This is the official code for the EMNLP 2019 paper, Universal Adversarial Triggers for

Eric Wallace 248 Dec 17, 2022
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP prod

VinAI Research 109 Dec 02, 2022
IMDB film review sentiment classification based on BERT's supervised learning model.

IMDB film review sentiment classification based on BERT's supervised learning model. On the other hand, the model can be extended to other natural language multi-classification tasks.

Paris 1 Apr 17, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search Source Code DOI Data & Models DOI Documentation Latest Release Python Versions License Build Status Static Typing Code Style Securit

The Blue Brain Project 29 Dec 01, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022