NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we attempt to use topic modeling and clustering to determine what each article is about, first based on its content and then based only on its title.

The process would be: load the data, keep only what we need from it, and clean the text (for example, by removing stopwords).
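The cleaning step could look roughly like the sketch below. It is only an illustration: the function name `clean_text` and the tiny `STOPWORDS` set are assumptions made here, and a real run would more likely use a full stopword list such as the English one shipped with nltk.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a full list (for example nltk's English stopwords) would replace it.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_text("The Senate votes on a new budget for 2018!"))
# ['senate', 'votes', 'new', 'budget']
```

Digits and punctuation are discarded by the regular expression, so only lowercase word tokens reach the later steps.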

Build the bag of words for the whole corpus, then build the bag of words for each individual document.
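As a minimal sketch of the two bag-of-words structures, the example below uses only `collections.Counter` from the standard library. The token lists are made-up stand-ins for cleaned articles, and the variable names are assumptions; with gensim, `Dictionary` and `doc2bow` could play the same role.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists standing in for real articles.
docs = [
    ["senate", "votes", "budget"],
    ["team", "wins", "final", "team"],
]

# Bag of words per document: token -> count within that document.
doc_bows = [Counter(tokens) for tokens in docs]

# Bag of words for the whole corpus: token -> count across all documents.
corpus_bow = Counter()
for bow in doc_bows:
    corpus_bow.update(bow)

print(doc_bows[1]["team"])  # 2
print(len(corpus_bow))      # 6 distinct tokens
```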

Vectorize the data. Run the LDA model on all the data and save the output to a dataframe.

Run the clustering algorithm. Save the data to CSV. Make the charts.
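The vectorize, LDA, and clustering steps above can be sketched with Scikit-learn as follows. This is an illustration under assumptions, not the project's actual code: the proposal does not say which LDA implementation or which clustering algorithm is used, so `LatentDirichletAllocation` and `KMeans` are choices made here, and the four short texts, the two topics, and the two clusters are made up.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the real input would be the cleaned 'content' column.
texts = [
    "senate votes on new budget bill",
    "president signs budget bill after senate vote",
    "team wins championship final in overtime",
    "coach praises team after final match",
]

# Vectorize: rows are documents, columns are word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)

# LDA: each row of doc_topics is one document's topic distribution.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster the documents using their topic distributions as features.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topics)

print(doc_topics.shape)  # (4, 2)
print(labels.shape)      # (4,)
```

Clustering on the topic distributions rather than on the raw counts keeps the feature space small. Whether that is the right input for the clustering step on the full dataset is a design choice the proposal leaves open.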

Data

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.

The 'title' is the headline of the news story, article, or essay. The 'content' is the body text itself.
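Keeping only the two features of interest might look like the sketch below. The small in-memory frame is a hypothetical stand-in for the downloaded file, which would normally be read with `pd.read_csv`. Its rows are invented, and only the column names come from the feature list above.

```python
import pandas as pd

# Hypothetical two-row stand-in for the raw data, with a subset of its columns.
raw = pd.DataFrame({
    "id": [1, 2],
    "title": ["Budget passes", "Team wins final"],
    "author": ["A. Writer", "B. Reporter"],
    "content": ["The senate passed the budget.", "The team won the final."],
    "publication": ["Example Times", "Example Post"],
})

# Keep only the two features used in this project and drop incomplete rows.
df = raw[["title", "content"]].dropna().reset_index(drop=True)

print(list(df.columns))  # ['title', 'content']
```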

TOOLS

Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, NLTK, gensim
