Unja is a fast & light tool for fetching known URLs from Wayback Machine

Related tags

Web Crawlingunja
Overview


unja
Unja

Fetch Known Urls

What's Unja?

Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's Otx it uses a separate thread for each provider to optimize its speed and use Wayback resumption key to divide scan into multiple parts to handle a large scan & it uses direct filters on API to get only filtered data from API to do less work on your system.

Why Unja?

  • Supports Wayback/Common-Crawl/Virus-Total/Otx
  • Automatically handles rate limits and timeouts
  • Export results: text or detailed output with status,mime,length in JSON
  • MultiThreading: separate thread for each provider to fetch data simultaneously
  • Filters: apply filters dirtly on provider to avoid unnecessary data

Installing Unja

You can install Unja with pip as following:

pip3 install unja

or, by downloading this repository and running

python3 setup.py install

Updating Unja

You can update Unja with pip as following:

pip3 install unja -U

Usage

unja -h

This will display help for the tool.

Flag Description Example
-d doimain unja -d ninjhacks.com
--sub Include subdomain unja --sub
-p Providers (wayback commoncrawl otx virustotal) unja -p wayback
--wbf (default : statuscode:200 ~mimetype:html) ninjref --filter statuscode:200
--ccf (default : =status:200 ~mime:.*html) ninjref --filter =status:200
--wbl Wayback results per request (default : 10000) unja --wbl 1000
--otxl Otx results per request (default : 500) unja --otxl 500
-r Amount of retries for http client (default : 3) nnja -r 3
-v Enable verbose mode to show errors nnja -v
-j Enable json mode for detailed output in json format nnja -j
-s Silent mode don't print header nnja -s
--ucci Update CommonCrawl Index nnja --ucci
--vtkey Change VirusTotal Api in config nnja --vtkey

Output Methods

text = ( default ) Output urls only.

json = ( -j ) Output url,status,mime,length in json format it's can help you later filtering result based on those variables.

Filters

Filters directly apply on providers to get only useful filtered data from provider.

Wayback Commoncrawl Description
statuscode:200 =status:200 return only those urls which status code is 200
!statuscode:200 !=status:200 return only non 200 status code
mimetype:text/html mime:text/html return only those url which response type is text/html
!mimetype:text/html !=mime:text/html return only non text/html response type
~mimetype:html ~mime:.*html return all those url which have html word in response type
~original:unja ~url:.*unja return all those url which have unja word in url

Oneliners

Get only urls with parameters & status code 200

unja -s -d target.com --sub -p wayback commoncrawl --wbf 'statuscode:200 ~original:=' --ccf '=status:200 ~url:.*=' | anew | tee output

Looking for open redirects

unja -s -d target.com --sub -p wayback commoncrawl --wbf '~statuscode:30 ~original:=http' --ccf '~status:30 ~url:.*=http' | anew | tee output

Clean result ( Exclude images,css,javascripts,woff & 404)

unja -s -d target.com --sub -p wayback commoncrawl --wbf '!statuscode:404 ~!mimetype:image ~!mimetype:javascript ~!mimetype:css ~!mimetype:woff' --ccf '!=status:404 !~mime:.*image !~mime:.*javascript !~mime:.*css !~mime:.*woff' | anew | tee output

Let me know if you have any other good oneliner ./

You might also like...
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Tool to scan for secret files on HTTP servers

snallygaster Finds file leaks and other security problems on HTTP servers. what? snallygaster is a tool that looks for files accessible on web servers

Goblyn is a Python tool focused to enumeration and capture of website files metadata.
Goblyn is a Python tool focused to enumeration and capture of website files metadata.

Goblyn Metadata Enumeration What's Goblyn? Goblyn is a tool focused to enumeration and capture of website files metadata. How it works? Goblyn will se

A low-code tool that generates python crawler code based on curl or url
A low-code tool that generates python crawler code based on curl or url

KKBA Intruoduction A low-code tool that generates python crawler code based on curl or url Requirement Python = 3.6 Install pip install kkba Usage Co

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.
Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

A tool for scraping and organizing data from NewsBank API searches

nbscraper Overview This simple tool automates the process of copying, pasting, and organizing data from NewsBank API searches. Curerntly, nbscrape onl

👁️ Tool for Data Extraction and Web Requests.
👁️ Tool for Data Extraction and Web Requests.

httpmapper 👁️ Project • Technologies • Installation • How it works • License Project 🚧 For educational purposes. This is a project that I developed,

This tool can be used to extract information from any website

WEB-INFO- This tool can be used to extract information from any website Install Termux and run the command --- $ apt-get update $ apt-get upgrade $ pk

Releases(v0.0.7)
Owner
Sheryar
Cyber Security Enthusiast / Programmer / Gamer / Crypto
Sheryar
Current Antarctic large iceberg positions derived from ASCAT and OSCAT-2

Iceberg Locations Antarctic large iceberg positions derived from ASCAT and OSCAT-2. All data collected here are from the NASA SCP website Overview Thi

Joel Hanson 5 Jul 27, 2022
This repo has the source code for the crawler and data crawled from auto-data.net

This repo contains the source code for crawler and crawled data of cars specifications from autodata. The data has roughly 45k cars

Tô Đức Anh 5 Nov 22, 2022
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper It is a Google Developer Profile Web Scraper which scrapes for specific badges in a user's Google Developer Pro

Hemant Sachdeva 2 Feb 22, 2022
A low-code tool that generates python crawler code based on curl or url

KKBA Intruoduction A low-code tool that generates python crawler code based on curl or url Requirement Python = 3.6 Install pip install kkba Usage Co

8 Sep 20, 2021
tweet random sand cat pictures

sandcatbot setup pip3 install --user -r requirements.txt cp sandcatbot.example.conf sandcatbot.conf vim sandcatbot.conf running the first parameter i

jess 8 Aug 07, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 02, 2023
Python scrapper scrapping torrent website and download new movies Automatically.

torrent-scrapper Python scrapper scrapping torrent website and download new movies Automatically. If you like it Put a ⭐ on this repo 😇 Run this git

Fazil vk 1 Jan 08, 2022
Scraping Thailand COVID-19 data from the DDC's tableau dashboard

Scraping COVID-19 data from DDC Dashboard Scraping Thailand COVID-19 data from the DDC's tableau dashboard. Data is updated at 07:30 and 08:00 daily.

Noppakorn Jiravaranun 5 Jan 04, 2022
a way to scrape a database of all of the isef projects

ISEF Database This is a simple web scraper which gets all of the projects and abstract information from here. My goal for this is for someone to get i

William Kaiser 1 Mar 18, 2022
优化版本的京东茅台抢购神器

优化版本的京东茅台抢购神器

1.8k Mar 18, 2022
simple http & https proxy scraper and checker

simple http & https proxy scraper and checker

Neospace 11 Nov 15, 2021
京东茅台抢购

截止 2021/2/1 日,该项目已无法使用! 京东:约满即止,仅限京东实名认证用户APP端抢购,2月1日10:00开始预约,2月1日12:00开始抢购(京东APP需升级至8.5.6版本及以上) 写在前面 本项目来自 huanghyw - jd_seckill,作者的项目地址我找不到了,找到了再贴上

abee 73 Dec 03, 2022
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

UserGhost411 1 Nov 17, 2022
Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

Guilherme Silva Uchoa 3 Oct 04, 2022
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
热搜榜-python爬虫+正则re+beautifulsoup+xpath

仓库简介 微博热搜榜, 参数wb 百度热搜榜, 参数bd 360热点榜, 参数360 csdn热榜接口, 下方查看 其他热搜待加入 如何使用? 注册vercel fork到你的仓库, 右上角 点击这里完成部署(一键部署) 请求参数 vercel配置好的地址+api?tit=+参数(仓库简介有参数信息

Harry 3 Jul 08, 2022
Works very well and you can ask for the type of image you want the scrapper to collect.

Works very well and you can ask for the type of image you want the scrapper to collect. Also follows a specific urls path depending on keyword selection.

Memo Sim 1 Feb 17, 2022
SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features.

SearchifyX SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features. SearchifyX lets you

28 Dec 20, 2022
Dailyiptvlist.com Scraper With Python

Dailyiptvlist.com scraper Info Made in python Linux only script Script requires to have wget installed Running script Clone repository with: git clone

1 Oct 16, 2021