Build a small, 3 domain internet using Github pages and Wikipedia and construct a crawler to crawl, render, and index.

Overview

TechSEO Crawler

Build a small, 3 domain internet using Github pages and Wikipedia and construct a crawler to crawl, render, and index.

TechSEO Screenshot

Play with the results here: Simple Search Engine

Please Note: The link above is hosted on a small AWS box, so if you have issues loading, try again later.

Slideshare is here: Building a Simple Crawler on a Toy Internet

Description

Web Folder

In order to crawl a small internet of sites, we have to create it. This tool creates 3 small sites from Wikipedia data and hosts them on Github Pages. The sites are not linked to any other site on the internet, but are linked to each other.

Main function

This tool attempts to implement a small ecosystem of 3 websites, along with a simple crawler, renderer, and indexer. While the author did research to construct the repo, it was a design feature to prefer simplicity over complexity. Items that are part of large crawling infrastructures, most notably disparate systems, and highly efficient code and data storage, are not part of this repo. We focus on simple representations of items such that it is more accessible to newer developers.

Parts:

  • PageRank
  • Chrome Headless Rendering
  • Text NLP Normalization
  • Bert Embeddings
  • Robots
  • Duplicate Content Shingling
  • URL Hashing
  • Document Frequency Functions (BM25 and TFIDF)

Made for a presentation at Tech SEO Boost

Getting Started

Get the repo

git clone https://github.com/jroakes/tech-seo-crawler.git

Dependencies

  • Please see the requirements.txt file for a list of dependencies.

It is strongly suggested to do the following, first, in a new, clean environment.

  • May need to install [Microsoft Build Tools] (http://go.microsoft.com/fwlink/?LinkId=691126&fixForIE=.exe.) and upgrade setup tools pip install --upgrade setuptools if you are on Windows.
  • Install PyTorch pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
  • See requirements-libraries.txt file for remaining library requirements. To install the frozen requirements this was developed with, use pip install -r requirements.txt

Install with:

pip install -r requirements.txt

Executing program

  1. Make sure you've created your three sites first. See README file in the web folder. Conversely, if you just want to use the crawler/renderer, you can run with the premade sites and skip to step 3.
  2. After creating your three sites, go to the config file and add the crawler_seed URL. This will be the organization name you created on github.io. For example: myorganization.github.io/
  3. Run streamlit run main.py in the terminal or command prompt. A new Browser window should open.
  4. The tool can also be run interactively with the Run.ipynb notebook in Jupyter.

Sharing

If you want to share your search engine for others to see, you can use Streamlit and Localtunnel.

  1. Install Localtunnel npm install -g localtunnel
  2. Start the tunnel with lt --port 80 --subdomain <create a unique sub-domain name>
  3. Start the Streamlit server with streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress <the unique subdomain from step 2>.localtunnel.me
  4. Navigate to https://<the unique subdomain from step 2>.localtunnel.me in your browser, or share the link with a friend.

Complete example:

In a new terminal:

npm install -g localtunnel
lt --port 80 --subdomain tech-seo-crawler

In another terminal:

cd /tech-seo-crawler/
activate techseo
streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress tech-seo-crawler.localtunnel.me

Troubleshooting

  • When running in streamlit we experienced a few connection closed errors during the Rendering process. If you experience this error just rerun the script by using the top right menu and clicking on rerun in streamlit.

Contributors

Contributors names and contact info

Version History

  • 0.1 - Alpha
    • Initial Release

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Libraries

Topics

Owner
JR Oakes
Hacker, SEO, NC State fan, co-organizer of Raleigh and RTP Meetups, as well as @sengineland author. Tweets are my own.
JR Oakes
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer

CSAW-M This repository contains code for CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer. Source code for tr

Yue Liu 7 Oct 11, 2022
Code for our method RePRI for Few-Shot Segmentation. Paper at http://arxiv.org/abs/2012.06166

Region Proportion Regularized Inference (RePRI) for Few-Shot Segmentation In this repo, we provide the code for our paper : "Few-Shot Segmentation Wit

Malik Boudiaf 138 Dec 12, 2022
Bringing Characters to Life with Computer Brains in Unity

AI4Animation: Deep Learning for Character Control This project explores the opportunities of deep learning for character animation and control as part

Sebastian Starke 5.5k Jan 04, 2023
Automatic packaging of the open-composite libs for OvGME

OvGME Packager for OpenXR – OpenComposite for DCS Note This repository is currently unsupported and needs to be migrated to the upstream OpenComposite

12 Nov 03, 2022
The Official Repository for "Generalized OOD Detection: A Survey"

Generalized Out-of-Distribution Detection: A Survey 1. Overview This repository is with our survey paper: Title: Generalized Out-of-Distribution Detec

Jingkang Yang 338 Jan 03, 2023
Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation

Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation Introduction ACoSP is an online pruning algorithm that compr

Merantix 8 Dec 07, 2022
LogAvgExp - Pytorch Implementation of LogAvgExp

LogAvgExp - Pytorch Implementation of LogAvgExp for Pytorch Install $ pip instal

Phil Wang 31 Oct 14, 2022
Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)

Distributed Deep Learning in Open Collaborations This repository contains the code for the NeurIPS 2021 paper "Distributed Deep Learning in Open Colla

Yandex Research 96 Sep 15, 2022
"Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion"(WWW 2021)

STAR_KGC This repo contains the source code of the paper accepted by WWW'2021. "Structure-Augmented Text Representation Learning for Efficient Knowled

Bo Wang 60 Dec 26, 2022
This is a Keras implementation of a CNN for estimating age, gender and mask from a camera.

face-detector-age-gender This is a Keras implementation of a CNN for estimating age, gender and mask from a camera. Before run face detector app, expr

Devdreamsolution 2 Dec 04, 2021
Official PaddlePaddle implementation of Paint Transformer

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [Paddle Implementation] Update We have optimized the serial inference p

TianweiLin 284 Dec 31, 2022
Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch

Semantic Segmentation Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch Features Applicable to followin

sithu3 530 Jan 05, 2023
Display, filter and search log messages in your terminal

Textualog Display, filter and search logging messages in the terminal. This project is powered by rich and textual. Some of the ideas and code in this

Rik Huygen 24 Dec 10, 2022
Implementing yolov4 target detection and tracking based on nao robot

Implementing yolov4 target detection and tracking based on nao robot

6 Apr 19, 2022
Code to reproduce experiments in the paper "Explainability Requires Interactivity".

Explainability Requires Interactivity This repository contains the code to train all custom models used in the paper Explainability Requires Interacti

Digital Health & Machine Learning 5 Apr 07, 2022
Code for the paper "On the Power of Edge Independent Graph Models"

Edge Independent Graph Models Code for the paper: "On the Power of Edge Independent Graph Models" Sudhanshu Chanpuriya, Cameron Musco, Konstantinos So

Konstantinos Sotiropoulos 0 Oct 26, 2021
A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

ViTGAN: Training GANs with Vision Transformers A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers. Refer

Hong-Jia Chen 127 Dec 23, 2022
An adaptive hierarchical energy management strategy for hybrid electric vehicles

An adaptive hierarchical energy management strategy This project contains the source code of an adaptive hierarchical EMS combining heuristic equivale

19 Dec 13, 2022
Implementation of "Efficient Regional Memory Network for Video Object Segmentation" (Xie et al., CVPR 2021).

RMNet This repository contains the source code for the paper Efficient Regional Memory Network for Video Object Segmentation. Cite this work @inprocee

Haozhe Xie 76 Dec 14, 2022
Image Segmentation and Object Detection in Pytorch

Image Segmentation and Object Detection in Pytorch Pytorch-Segmentation-Detection is a library for image segmentation and object detection with report

Daniil Pakhomov 732 Dec 10, 2022