Kroomsa: A search engine for the curious

Last update: Jun 20, 2022

Overview

Kroomsa

Kroomsa is a search engine for the curious: a search algorithm designed to engage users by exposing them to relevant yet interesting content during their session.

Description

The search algorithm implemented on your website greatly influences visitor engagement. A decent implementation can significantly reduce visitors' dependency on standard search engines like Google for every query, thus increasing engagement. Traditional methods look at the terms or phrases in your query and find relevant content through syntactic matching. Kroomsa instead uses semantic matching to find content relevant to your query. A blog post expands on Kroomsa's motivation and its technical aspects.
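To make the distinction concrete, here is a minimal sketch contrasting term overlap with embedding similarity. The three-dimensional vectors are invented for illustration only; Kroomsa derives its vectors from the Universal Sentence Encoder, which is not loaded here, and the function names are hypothetical.

```python
import math

def token_overlap(a: str, b: str) -> float:
    """Syntactic matching: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def cosine(u, v) -> float:
    """Semantic matching: cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings, hand-picked so that paraphrases lie close together.
embeddings = {
    "why is the sky blue": [0.9, 0.1, 0.2],
    "what gives daylight its azure colour": [0.85, 0.15, 0.25],
    "best pizza toppings": [0.1, 0.9, 0.3],
}

query = "why is the sky blue"
# For every other text: (token overlap with the query, cosine similarity).
scores = {
    text: (token_overlap(query, text), cosine(embeddings[query], vec))
    for text, vec in embeddings.items()
    if text != query
}
```

The paraphrase shares no tokens with the query, so a purely syntactic score ranks it no higher than the unrelated text, while the cosine score separates the two clearly.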

Getting Started

Prerequisites

Python 3.6.5
Run the project directory setup from the root directory: python3 ./setup.py
Tensorflow's Universal Sentence Encoder 4
- The model is available at this link. Download the model and extract the zip file in the /vectorizer directory.
MongoDB is used as the database to collate Reddit's submissions. MongoDB can be installed following this link.
PRAW is used to fetch the comments of the Reddit submissions. Scraping requires credentials that authorize the script; obtain them by creating an app associated with a Reddit account by following this link. For reference, you can follow this tutorial written by Shantnu Tiwari.
- Register multiple instances and retrieve their credentials, then add them to /config under the bot_codes parameter in the following format: "client_id client_secret user_agent", as list elements separated by commas.
Docker-compose (For dockerized deployment only): Install the latest version following this link.
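The bot_codes format described in the prerequisites can be illustrated with a short sketch that splits each entry into its three fields. The helper name parse_bot_codes and the placeholder credentials are hypothetical, and allowing spaces inside the user agent is an assumption; the project's own parsing code may differ.

```python
from typing import Dict, List

def parse_bot_codes(bot_codes: List[str]) -> List[Dict[str, str]]:
    """Split each "client_id client_secret user_agent" entry into named fields."""
    parsed = []
    for entry in bot_codes:
        # maxsplit=2 keeps a user agent that itself contains spaces in one piece
        # (an assumption; the source does not say whether spaces are allowed).
        parts = entry.strip().split(maxsplit=2)
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {entry!r}")
        client_id, client_secret, user_agent = parts
        parsed.append(
            {
                "client_id": client_id,
                "client_secret": client_secret,
                "user_agent": user_agent,
            }
        )
    return parsed

# Placeholder credentials, not real ones.
bot_codes = ["id_one secret_one kroomsa-bot-1", "id_two secret_two kroomsa-bot-2"]
bots = parse_bot_codes(bot_codes)
```

The dictionary keys match the keyword arguments that praw.Reddit accepts, so each parsed entry could be passed along as praw.Reddit(**bot) when creating a client.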

Installing

Create a python environment and install the required packages for preprocessing using: python3 -m pip install -r ./preprocess_requirements.txt
Collating a dataset of Reddit submissions
- Scraping posts
  - Pushshift's API is used to fetch Reddit submissions. In the root directory, run: python3 ./pre_processing/scraping/questions/scrape_questions.py. The script scrapes each subreddit sequentially, working back to its inception, and stores the submissions as JSON objects in /pre_processing/scraping/questions/scraped_questions. It then partitions the scraped submissions into as many equal parts as there are registered bot instances.
- Scraping comments
  - After populating the configuration with bot_codes, scrape the comments using the partitioned submission files created while scraping submissions: python3 ./pre_processing/scraping/comments/scrape_comments.py. The script spawns multiple processes that fetch comment streams simultaneously.
- Insertion
  - To insert the submissions and associated comments into MongoDB, run: python3 ./pre_processing/db_insertion/insertion.py.
  - To clean the comments and tag posts that are no longer public, run: python3 ./post_processing/post_processing.py. Besides cleaning, it also adds emojis to each submission object (this behavior is configurable).
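The partitioning step mentioned under "Scraping posts" splits the scraped submissions into one part per registered bot. The sketch below shows one way such a split could work; the function name partition is hypothetical, and giving the leftover items to the first few parts is an assumption, since the source does not say how a remainder is handled.

```python
from typing import List, Sequence

def partition(items: Sequence, n_parts: int) -> List[list]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks absorb one leftover item each.
        size = base + (1 if i < extra else 0)
        parts.append(list(items[start:start + size]))
        start += size
    return parts

# Ten placeholder submission IDs shared among three bots.
submission_ids = [f"post_{i}" for i in range(10)]
parts = partition(submission_ids, 3)
```

Each chunk can then be handed to a separate process, so that every bot fetches comments for its own share of the submissions.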
Creating a FAISS Index
- To create a FAISS index, run the following command: python3 ./index/build_index.py. By default, it creates an exhaustive IDMap, Flat index; the index type is configurable through /config.
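As a rough illustration of what an exhaustive index with an ID map does, the sketch below compares a query against every stored vector and returns caller-supplied IDs instead of insertion positions. It is a plain-Python stand-in, not FAISS's API: the class FlatIdMapIndex is hypothetical, and the use of squared L2 distance is an assumption about the default metric.

```python
from typing import List, Sequence, Tuple

class FlatIdMapIndex:
    """Toy stand-in for an exhaustive index that returns caller-supplied IDs."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._vectors: List[Tuple[int, List[float]]] = []

    def add_with_ids(self, vectors: Sequence[Sequence[float]], ids: Sequence[int]) -> None:
        """Store each vector together with the ID chosen by the caller."""
        if len(vectors) != len(ids):
            raise ValueError("vectors and ids must have the same length")
        for vec, vec_id in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self._vectors.append((vec_id, list(vec)))

    def search(self, query: Sequence[float], k: int) -> List[Tuple[int, float]]:
        """Return the k nearest (id, squared L2 distance) pairs, closest first."""
        scored = [
            (vec_id, sum((q - x) ** 2 for q, x in zip(query, vec)))
            for vec_id, vec in self._vectors
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]

# The IDs could be, for example, database keys of the indexed submissions.
index = FlatIdMapIndex(dim=2)
index.add_with_ids([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], ids=[101, 202, 303])
hits = index.search([0.9, 1.2], k=2)
```

An exhaustive search is exact but scales linearly with the number of stored vectors, which is the usual reason for making the index type configurable.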
Database dump (For dockerized deployment)
- For dockerized deployment, a database dump is required in /mongo_dump. Run the following command from the root directory, substituting your database name (default: red) and collection name (default: questions): mongodump --db database_name --collection collection_name -o ./mongo_dump

Execution

Local deployment (Using Gunicorn)
- Create a python environment and install the required packages using the following command: python3 -m pip install -r ./inference_requirements.txt
- A local instance of Kroomsa can be deployed using the following command: gunicorn -c ./gunicorn_config.py server:app
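The -c flag in the command above points Gunicorn at a Python configuration file. The following is a hypothetical sketch of what such a file might contain; the values are assumptions chosen for illustration, and the project's actual gunicorn_config.py may differ.

```python
# Hypothetical gunicorn_config.py; the project's real file may differ.
import multiprocessing

# Address and port the server listens on.
bind = "0.0.0.0:8000"

# A common starting point suggested in Gunicorn's documentation: (2 x cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before it is killed and restarted.
# A generous value is assumed here because loading a model can be slow.
timeout = 120
```

Gunicorn reads these module-level names as settings, so adjusting the deployment is a matter of editing the file and restarting the server.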
Dockerized demo
- Set demo_mode to True in /config.
- Build images: docker-compose build
- Deploy: docker-compose up

Authors

License

This project is licensed under the Apache License, Version 2.0.

Owner

Wingify
