NLP_0-project

Group project for MFIN7036. Our goal is to predict firm profitability with text-based competition measures¹. We are a "democratic" and collaborative group of five; our names are listed below according to our initial division of work 😄.

Here is the outline of our project:

Data collection.

@LeiyuanHuo, jyang130, FanFanShark, xdc1999, gaojiamin1116

Based on the file data-WRDS-list.csv, write a web scraper to download every 10-K (HTML format) that the listed companies filed with the SEC between 2010 and 2022 from Historical EDGAR documents, and rename each file data-10K-COMPNAME-Year.html.
Parse the HTML files to extract the Business and MD&A sections.
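The renaming and parsing steps above could look roughly like the sketch below. It is only an illustration under our own assumptions: the helper names (`target_filename`, `html_to_text`, `extract_section`), the heading patterns, and the sample HTML are hypothetical, and the download step is left out because the exact EDGAR request logic is not fixed yet. Because the table of contents usually repeats each "Item" heading, the sketch keeps the longest span between a start heading and the next end heading.

```python
import re
from html.parser import HTMLParser


def target_filename(comp_name: str, year: int) -> str:
    """Build a name following the data-10K-COMPNAME-Year.html convention."""
    # Assumption: drop every character that is not a letter or digit.
    safe = re.sub(r"[^A-Za-z0-9]+", "", comp_name)
    return f"data-10K-{safe}-{year}.html"


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def extract_section(text: str, start_pat: str, end_pat: str) -> str:
    """Return the longest span between a start heading and the next end heading."""
    best = ""
    for m in re.finditer(start_pat, text, flags=re.IGNORECASE):
        end = re.search(end_pat, text[m.end():], flags=re.IGNORECASE)
        if end:
            candidate = text[m.end(): m.end() + end.start()].strip()
            if len(candidate) > len(best):
                best = candidate
    return best


# Hypothetical heading patterns; real filings vary and will need tuning.
BUSINESS = (r"item\s*1\.?\s*business", r"item\s*1a\b|item\s*2\b")
MDA = (r"item\s*7\.?\s*management.s\s+discussion\s+and\s+analysis",
       r"item\s*7a\b|item\s*8\b")

# Made-up miniature filing, including a table of contents.
sample_html = (
    "<html><body>"
    "<p>Item 1. Business</p><p>Item 1A. Risk Factors</p>"
    "<h2>Item 1. Business</h2><p>We make widgets.</p>"
    "<h2>Item 1A. Risk Factors</h2><p>Risks.</p>"
    "<h2>Item 7. Management's Discussion and Analysis</h2><p>Sales grew.</p>"
    "<h2>Item 7A. Quantitative Disclosures</h2>"
    "</body></html>"
)
sample_text = html_to_text(sample_html)
business = extract_section(sample_text, *BUSINESS)
mda = extract_section(sample_text, *MDA)
```

Regular expressions on flattened text are fragile for real 10-Ks, so we expect to revisit the patterns once we see how many filings they miss.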

Text Processing: feature extraction²

Part-of-speech (POS) tagging (one of our two main methods) to extract product names and descriptions. Store these for each company.
Named entity recognition (NER) (the other main method) to extract the names of mentioned competitors. Store these for each company.
Product texts: bag-of-words (BoW) and tf-idf over each company's product(s), which should give us a term-product matrix.
Competitor texts: BoW, since we care about how often each firm is mentioned.
‼️ We also need to incorporate sector and firm size/market power into the competitor texts and re-count.
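As a rough illustration of the BoW and tf-idf step, here is a minimal sketch in plain Python. The firm names and product texts are made up, the tokenizer is deliberately naive, and the tf-idf variant (term share times `log(N / df)`) is just one common choice we assume here; a library vectorizer may weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (a naive assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Map each firm to a Counter of raw term frequencies."""
    return {firm: Counter(tokenize(text)) for firm, text in docs.items()}


def tf_idf(bow):
    """Weight each term by (count / doc length) * log(N / document frequency)."""
    n_docs = len(bow)
    df = Counter()
    for counts in bow.values():
        df.update(counts.keys())  # each firm counts once per term
    weights = {}
    for firm, counts in bow.items():
        total = sum(counts.values())
        weights[firm] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical product descriptions, one per firm.
docs = {
    "FirmA": "cloud storage and cloud computing services",
    "FirmB": "cloud software and database services",
    "FirmC": "electric vehicles and battery storage",
}
bow = bag_of_words(docs)
weights = tf_idf(bow)
```

Terms that appear in every document (here "and") get a weight of zero, which is the behaviour we want for the term-product matrix. For competitor mentions, the raw counts in `bow` are the relevant quantity.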

Text Processing: feature transformation and representation²

Term-product matrix: calculate pairwise cosine similarity scores between products, then use a score threshold to group similar products.
Term-product matrix: apply a clustering method (e.g., K-means) directly to the product vectors.
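The similarity-threshold idea could be sketched as follows. This is an assumption about how the grouping might work, not a settled design: products whose cosine similarity is at or above the threshold are linked, and each connected set of linked products forms a group (single-link grouping via union-find). The product vectors and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_threshold(vectors, threshold):
    """Link products with similarity >= threshold and return connected groups."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical rows of a term-product matrix, stored sparsely.
vectors = {
    "P1": {"cloud": 2.0, "storage": 1.0},
    "P2": {"cloud": 1.0, "storage": 1.0},
    "P3": {"battery": 3.0, "vehicle": 1.0},
}
groups = group_by_threshold(vectors, threshold=0.5)
```

With these toy vectors, P1 and P2 end up in one group and P3 stays alone. Single-link grouping can chain loosely related products together, which is one reason to compare it against the direct clustering alternative.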

Econometric Analysis and Hypothesis Testing²

Multivariate regression: the dependent variable (DV) is profitability (e.g., sales, revenue, Tobin's q); the independent variables (IVs) are our competition measures (one from the similar-product count, one from mentions as a competitor), plus relevant control variables.
Cross-section portfolios: our competition measures are cross-sectional (one value per firm for each year), so we can form long-short portfolios on both measures and examine their effects on stock returns.
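For the portfolio test, a single year's sort might look like the sketch below. The direction of the trade (long the least-exposed firms, short the most-exposed), equal weighting, the leg size, and all numbers are assumptions made only for illustration; the function name `long_short_return` is hypothetical.

```python
from statistics import mean


def long_short_return(measure, next_return, n_per_leg):
    """Equal-weighted return of low-measure firms minus high-measure firms."""
    firms = sorted(measure, key=measure.get)  # ascending competition measure
    low, high = firms[:n_per_leg], firms[-n_per_leg:]
    long_leg = mean(next_return[f] for f in low)
    short_leg = mean(next_return[f] for f in high)
    return long_leg - short_leg


# Made-up competition scores for one year and subsequent returns.
measure = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
next_return = {"A": 0.05, "B": 0.03, "C": 0.02, "D": 0.01, "E": -0.01}
spread = long_short_return(measure, next_return, n_per_leg=2)
```

Repeating this for every year gives a time series of spreads whose average can then be tested against zero.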

¹ Two papers inspired this project: Eisdorfer, A., Froot, K., Ozik, G., & Sadka, R. (2021). Competition Links and Stock Returns. The Review of Financial Studies, 2021-12-20. Hoberg, G., & Phillips, G. (2016). Text-Based Network Industries and Endogenous Product Differentiation. The Journal of Political Economy, 124(5), 1423-1465.
² The text processing steps are based on the MFIN7036 Lecture_Notes and a review paper: Marty, T., Vanstone, B., & Hahn, T. (2020). News media analytics in finance: A survey. Accounting and Finance, 60(2), 1385-1434.
