The code submitted for the Analytics Vidhya Jobathon - February 2022

Overview

Introduction

On February 11th, 2022, Analytics Vidhya conducted a 3-day data science hackathon. The top candidates had the chance to be selected by participating companies for a range of roles in data science, analytics and business intelligence.

The objective of the hackathon was to develop a machine learning approach to predict the engagement between a user and a video. The training data already had an engagement score feature, which was a floating-point number between 0 and 5. This considerably simplified matters, since in recommender systems, calculating the engagement score is often more challenging than predicting it. The challenge, therefore, was to predict this score from a set of user-related and video-related features.

The list of features in the dataset is given below:

Variable           Description
row_id             Unique identifier of the row
user_id            Unique identifier of the user
category_id        Category of the video
video_id           Unique identifier of the video
age                Age of the user
gender             Gender of the user (Male and Female)
profession         Profession of the user (Student, Working Professional, Other)
followers          No. of users following a particular category
views              Total views of the videos present in the particular category
engagement_score   Engagement score of the video for a user

Initial Ideas

Two main approaches were considered:

  • Regression Models: Since the engagement score was a continuous variable, a regression model was a natural candidate. I did not use this method for two reasons:

    1. The lack of features on which to build the model. The features "user_id", "category_id" and "video_id" were discrete features that would need to be encoded. Since each of these features had many unique values, a simple One-Hot Encoding would not work because of the resulting increase in the number of dimensions. I also considered other categorical encoders, like the CatBoost encoder. However, in my experience most categorical encoders only do well when the target is a categorical variable, not a continuous variable as is the case here. This would leave only 5 features on which to build the model (age, gender, profession, followers and views), which did not seem enough. These 5 features also had very little scope for feature engineering, apart from some ideas on combining gender with profession and age with profession.
    2. Traditional regression models have generally not done well as recommender systems. The well-known Netflix Prize showed the advantage that collaborative filtering methods had over simple regression models.
  • Collaborative Filtering: This was the option I eventually went with, since the evidence suggested that Collaborative Filtering does better than regression models in almost all cases. There are many different ways of implementing collaborative filtering, of which I finally decided to use the Matrix Factorization method (SVD).
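
To make the Matrix Factorization idea concrete, here is a minimal sketch of a Funk-style biased factorization trained with stochastic gradient descent. It is an illustration only, not the code used for the submission: the function name `train_mf`, the regularization term `reg`, the bias terms and the toy ratings are all assumptions made for this example.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit a biased matrix factorization on (user, item, score) triples.

    Hypothetical helper for illustration; returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean score
    bu = {u: 0.0 for u in users}                        # user biases
    bi = {i: 0.0 for i in items}                        # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps on the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)

    def predict(u, i):
        # Unknown users or items fall back to the mean plus any known bias.
        est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in p and i in q:
            est += sum(a * b for a, b in zip(p[u], q[i]))
        return est

    return predict

# Toy (user_id, video_id, engagement_score) triples, invented for this example.
toy_ratings = [(1, 10, 4.0), (1, 11, 1.0), (2, 10, 5.0),
               (2, 11, 2.0), (3, 10, 3.5), (3, 12, 0.5)]
predict = train_mf(toy_ratings)
```

Each user and each video gets a small vector of latent factors, and the predicted score is the global mean plus the two biases plus the dot product of the two vectors. The number of factors, the number of epochs and the learning rate are the same three knobs tuned in the next section.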

Final Model

The first few runs of collaborative filtering were not very successful, with a low r2 score on the test set. However, a Grid Search over the 3 main hyperparameters (the number of factors, the number of epochs and the learning rate) soon gave a much better SVD. The final hyperparameters were:

  • Number of Factors: 100
  • Number of Epochs: 500
  • Learning Rate: 0.05
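
A grid search of this kind can be sketched in a few lines. The helper `grid_search`, the candidate values in `param_grid` and the stand-in scoring function `toy_score` are assumptions for illustration; in practice `evaluate` would train the SVD with the given parameters and return a validation r2 score.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the highest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(n_factors, n_epochs, lr):
    # Stand-in for "train the model and score it on validation data".
    # It is NOT a real validation score; it simply peaks at one grid point.
    return -(abs(n_factors - 100) / 100 + abs(n_epochs - 500) / 500 + abs(lr - 0.05) / 0.05)

# Hypothetical candidate values; only the final choices are taken from the text.
param_grid = {"n_factors": [50, 100, 150], "n_epochs": [100, 500], "lr": [0.005, 0.05]}
best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the grid sizes (12 model fits here), so it helps to keep each list short.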

The r2 score on the test set at this point was 0.506, which took me to the top of the leaderboard.
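
For reference, the r2 score (coefficient of determination) compares the model's squared error with that of always predicting the mean. The function below is a plain-Python sketch of the standard definition, written for this example rather than taken from the submission.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than predicting the mean engagement score for every row.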

The next step was to try to improve the model further. I decided to model the errors of the SVD model, so that the predictions of the SVD could be further adjusted by the error estimates. After trying a few different models, I selected the Linear Regression model to predict the errors.

To generate the errors, the SVD model was first trained on a subset of the training set and then used to predict the held-out validation set, where its errors could be measured. The Linear Regression was trained on this validation set. The SVD was trained on 95% of the training set, so the regression was trained on only the remaining 5%. The steps in the process were:

  1. Get the engagement score predictions using the SVD model for the validation set.
  2. Calculate the error.
  3. Train the Linear Regression model on this validation set, using the features "age", "followers", "views", "gender", "profession" and "initial_estimate". The target variable was the error.
  4. Finally, run both models on the actual test set, first the SVD, then the Linear Regression.
  5. The final prediction is the difference between the initial estimates and the weighted error estimates. The error estimates were given a weight of 5%, since that was the proportion of data on which the Linear Regression model was trained.
  6. There could be scenarios of the final prediction going above 5 or below 0. In such cases, adjust the prediction to either 5 or 0.
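
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the submitted code: the helper names are hypothetical, and the error is assumed to be defined as the initial estimate minus the actual score, so that subtracting the weighted error estimate moves the prediction towards the actual value.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.05, seed=0):
    """Shuffle the rows and hold out a fraction for training the error model."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def residuals(initial_estimates, actual_scores):
    """Error of the first-stage model (assumed sign: estimate minus actual)."""
    return [est - act for est, act in zip(initial_estimates, actual_scores)]

def final_prediction(initial_estimate, error_estimate, weight=0.05, low=0.0, high=5.0):
    """Subtract the weighted error estimate, then clip to the valid score range."""
    adjusted = initial_estimate - weight * error_estimate
    return min(high, max(low, adjusted))
```

For example, with an initial estimate of 3.0 and a predicted error of 1.0, the adjusted prediction is 3.0 - 0.05 * 1.0 = 2.95. The clipping in `final_prediction` corresponds to step 6.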

The final r2 score was 0.532, an increase of 0.026 (2.6 points) over the SVD alone.

Ideas for Improvement

There are many ways I feel the model can be further improved. Some of them are:

  1. Choosing the Correct Regression Model to Predict the Error: It was quite unexpected that a weak learner like Linear Regression did better than stronger models like Random Forest and XGBoost. I feel the main reason is that the dataset used to train these regressors was quite small, only 5% of the entire training set. While the linear regression model worked well with such a small dataset, the more complicated models did not.
  2. Setting the Correct Subset for the SVD: After trying a few different values, the SVD subset was set at 95% and the error subset at only 5%. The reason for such a high percentage was that the SVD was the more powerful algorithm and I wanted it to be as accurate as possible. However, this severely compromised the error predictor. Finding a better balance here could improve model performance.
  3. Selecting the Correct Weights for the Final Prediction: The final prediction was the difference between the initial estimate and the weighted error estimate. Further analysis is needed to find the best weights. Ideally, the weights should not be needed at all.
  4. Feature Engineering: The error estimator had no feature engineering at all; in fact, I removed the feature "category_id" as well. Adding new features could help improve the error estimates. However, the benefit would be small, as the error estimate is given a weight of only 5% in the final prediction.
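
As one possible starting point for idea 3, the weight could be chosen by a simple sweep on held-out data instead of being fixed in advance. The sketch below is hypothetical and was not part of the submission; it assumes the same error sign as the final prediction (estimate minus actual), and the candidate weights and sample numbers are invented for illustration.

```python
def best_weight(initial_estimates, error_estimates, actual_scores, candidates):
    """Return the candidate weight with the lowest mean squared error."""
    def mse(weight):
        return sum(
            (est - weight * err - act) ** 2
            for est, err, act in zip(initial_estimates, error_estimates, actual_scores)
        ) / len(actual_scores)
    return min(candidates, key=mse)

# Invented numbers: here the error estimates happen to be exact.
initial = [3.0, 4.0, 2.0]
actual = [2.5, 4.5, 2.0]
errors = [0.5, -0.5, 0.0]
chosen = best_weight(initial, errors, actual, candidates=[0.0, 0.05, 0.5, 1.0])
```

When the error estimates are exact, as in this toy data, the sweep picks a weight of 1, which matches the remark that ideally no down-weighting would be needed. The data used for the sweep should be separate from the data used to train the error model.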