NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we use topic modeling and clustering to find out what each article is about, first based on its content and then based only on its title.

The process would be: Load the data. Keep only what we need from the data. Clean the text (e.g., remove stopwords).

Build the bag of words for all documents. Build the bag of words for each document.

Vectorize the data. Run the LDA model. Run the model on all the data and save the output to a DataFrame.

Run the clustering algorithm. Save the data to CSV. Make the charts.
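The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn's `CountVectorizer`, `LatentDirichletAllocation`, and `KMeans` (the proposal also lists gensim and nltk, which could be used instead), made-up example documents, and arbitrary values for the number of topics and clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the 'content' column (illustrative only).
docs = [
    "The team won the match after a late goal in the final minute.",
    "The striker scored twice and the coach praised the team.",
    "The central bank raised interest rates to slow inflation.",
    "Investors sold stocks as inflation and interest rates climbed.",
]

N_TOPICS = 2    # assumed value; the proposal does not state a topic count
N_CLUSTERS = 2  # assumed value; the proposal does not state a cluster count

# Clean and vectorize: lowercase, drop English stopwords, build the bag of words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(docs)

# Run LDA; each row of doc_topics is one document's topic distribution.
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
doc_topics = lda.fit_transform(bow)

# Cluster the documents using their topic distributions as features.
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
clusters = kmeans.fit_predict(doc_topics)

# Save the output to a DataFrame, then serialize it as CSV text.
results = pd.DataFrame(doc_topics, columns=[f"topic_{i}" for i in range(N_TOPICS)])
results["dominant_topic"] = doc_topics.argmax(axis=1)
results["cluster"] = clusters
csv_text = results.to_csv(index=False)  # pass a file path instead to write to disk
print(results)
```

Clustering on the topic distributions rather than on the raw word counts is one possible design choice; the same sketch could be run on the 'title' column to compare title-only results with content-based ones.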

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.

The 'title' is the headline of the news article or essay. The 'content' is the body of the article or essay itself.
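As a small sketch of the column selection described above, the snippet below keeps only 'title' and 'content' from a frame with the 12 listed features. The sample rows are invented for illustration, the helper name `keep_text_columns` is hypothetical, and dropping rows with a missing title or content is an assumption the proposal does not state.

```python
import pandas as pd

# Column names as listed in the proposal.
RAW_COLUMNS = ["id", "title", "author", "date", "content", "year", "month",
               "publication", "category", "digital", "section", "url"]
KEEP_COLUMNS = ["title", "content"]


def keep_text_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only 'title' and 'content'; drop rows where either is missing (assumption)."""
    return raw[KEEP_COLUMNS].dropna().reset_index(drop=True)


# Tiny made-up frame standing in for the real dataset (illustrative values only).
sample = pd.DataFrame(
    [[1, "Markets rally", "A. Writer", "2017-01-01", "Stocks rose sharply today.",
      2017, 1, "Example Times", "business", 1, "markets", "http://example.com/1"],
     [2, None, "B. Writer", "2017-01-02", "No headline here.",
      2017, 1, "Example Times", "news", 1, "world", "http://example.com/2"]],
    columns=RAW_COLUMNS,
)

articles = keep_text_columns(sample)
print(articles)
```

With the real dataset, the same selection could be done at load time with `pd.read_csv(path, usecols=KEEP_COLUMNS)`, where `path` is wherever the downloaded CSV is stored.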

TOOLS

Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, NLTK, Gensim
