Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Last update: Nov 05, 2021

Overview

This repository provides two web crawlers: one labels domain names using the McAfee API (https://www.trustedsource.org/sources/index.pl), and the other labels IP reputation using the TALOS API (https://talosintelligence.com/).

Requirements

BeautifulSoup

Usage

The demonstration scripts are used as follows.

To label the categories of a set of domains, put the domain list in 'data/domain_list.txt' and run 'demo_domain_label.py'. The program labels the (1) category (e.g., Malicious Sites- Parked Domain) and (2) risk level (e.g., High Risk) of each domain using the McAfee API, and saves the results in 'res/domain_labels.txt'. If the program keeps printing '-Retry-', stop it and wait a while. When you start it again, it automatically skips the domains that are already labeled and continues with the remaining ones.
To label the reputation of a set of IP addresses, put the IP list in 'data/IP_list.txt' and run 'demo_IP_label.py'. The program labels the (1) email reputation and (2) web reputation of each IP address (each with three levels: Poor, Neutral, and Good) and saves the results in 'res/IP_labels.txt'. If the program keeps printing 'None', stop it and wait a while. When you start it again, it automatically skips the IPs that are already labeled and continues with the remaining ones.
An example domain name list (with 21,820 effective second-level domains) and an example IP list (with 67,751 IP addresses) are given in 'data/examples/example_domain_list.txt' and 'data/examples/example_IP_list.txt', respectively. The corresponding labeled results are saved in 'res/examples/example_domain_labels.txt' and 'res/examples/example_IP_labels.txt', respectively.
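
The stop-and-resume behaviour described above can be sketched roughly as follows. This is not the code in 'demo_domain_label.py' or 'demo_IP_label.py'; it is a minimal illustration under stated assumptions. The helper names (load_done, label_all), the lookup callable, and the tab-separated result layout with the item in the first column are all hypothetical, and the actual format of the result files may differ.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Optional


def load_done(result_path: Path) -> set:
    """Return the items that already have a line in the results file."""
    if not result_path.exists():
        return set()
    with result_path.open(encoding="utf-8") as f:
        # Assumption: one record per line, item first, tab-separated.
        return {line.split("\t", 1)[0].strip() for line in f if line.strip()}


def label_all(
    items: Iterable[str],
    result_path: Path,
    lookup: Callable[[str], Optional[str]],  # hypothetical: returns None on failure
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Label items not yet in result_path; return how many were written."""
    done = load_done(result_path)
    result_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with result_path.open("a", encoding="utf-8") as out:
        for raw in items:
            item = raw.strip()
            if not item or item in done:
                continue  # blank line, or labeled in an earlier run
            label = None
            for attempt in range(max_retries):
                label = lookup(item)
                if label is not None:
                    break
                # Wait longer after each failed attempt.
                sleep(base_delay * 2 ** attempt)
            if label is None:
                # The service keeps refusing: stop here and resume later.
                break
            out.write(f"{item}\t{label}\n")
            out.flush()  # keep progress on disk if the run is interrupted
            done.add(item)
            written += 1
    return written
```

Appending to the result file and flushing after each record means an interrupted run loses at most the item in progress, so the next run can pick up where the previous one stopped.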

If you have questions regarding this repository, you can contact the author via [[email protected]].
