Unja

Fetch Known URLs
What's Unja?

Unja is a fast & light tool for fetching known URLs from the Wayback Machine, Common Crawl, VirusTotal & AlienVault's OTX. It uses a separate thread for each provider to optimize speed, uses the Wayback resumption key to split a large scan into multiple parts, and applies filters directly on the provider APIs so that only the data you need is downloaded and your system does less work.
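
Under the hood this is essentially a fan-out over provider APIs. Here is a minimal sketch of the per-provider threading idea — the Wayback CDX and OTX URLs below are real public endpoints, but how Unja actually wires them together is an assumption, not its real code:

import requests
from concurrent.futures import ThreadPoolExecutor

# Illustrative provider endpoints; Unja's real internals may differ.
PROVIDERS = {
    "wayback": (
        "http://web.archive.org/cdx/search/cdx"
        "?url=*.{domain}&output=json&collapse=urlkey"
        "&filter=statuscode:200&filter=mimetype:.*html"
    ),
    "otx": "https://otx.alienvault.com/api/v1/indicators/domain/{domain}/url_list",
}

def fetch(name, template, domain):
    # Each provider runs in its own thread; one provider failing
    # should not kill the others, so errors are caught here.
    try:
        resp = requests.get(template.format(domain=domain), timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None

domain = "example.com"
with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
    futures = {n: pool.submit(fetch, n, t, domain) for n, t in PROVIDERS.items()}
    results = {n: f.result() for n, f in futures.items()}

Filtering on the API side (the filter= parameters above) means unwanted records are never transferred, which is where most of the speed comes from.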

Why Unja?

  • Supports Wayback / Common Crawl / VirusTotal / OTX
  • Automatically handles rate limits and timeouts
  • Export results: plain text, or detailed JSON output with status, mime & length
  • Multithreading: a separate thread per provider fetches data simultaneously
  • Filters: applied directly on the provider APIs to avoid downloading unnecessary data

Installing Unja

You can install Unja with pip as follows:

pip3 install unja

or by downloading this repository and running

python3 setup.py install

Updating Unja

You can update Unja with pip as follows:

pip3 install unja -U

Usage

unja -h

This will display help for the tool.

Flag      Description                                                 Example
-d        Domain                                                      unja -d ninjhacks.com
--sub     Include subdomains                                          unja --sub
-p        Providers (wayback commoncrawl otx virustotal)              unja -p wayback
--wbf     Wayback filters (default : statuscode:200 ~mimetype:html)   unja --wbf statuscode:200
--ccf     CommonCrawl filters (default : =status:200 ~mime:.*html)    unja --ccf =status:200
--wbl     Wayback results per request (default : 10000)               unja --wbl 1000
--otxl    OTX results per request (default : 500)                     unja --otxl 500
-r        Amount of retries for HTTP client (default : 3)             unja -r 3
-v        Enable verbose mode to show errors                          unja -v
-j        Enable JSON mode for detailed output in JSON format         unja -j
-s        Silent mode, don't print header                             unja -s
--ucci    Update CommonCrawl index                                    unja --ucci
--vtkey   Change VirusTotal API key in config                         unja --vtkey
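
For example, a typical invocation combining several of these flags (the target domain is a placeholder):

unja -d example.com --sub -p wayback otx -j -s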

Output Methods

text = ( default ) Output URLs only.

json = ( -j ) Output url, status, mime & length in JSON format; this makes it easy to filter results later on those fields.
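
For illustration only — the field names come from the list above, but the exact formatting is an assumption — a single JSON record might look like:

{"url": "https://example.com/?id=1", "status": "200", "mime": "text/html", "length": "1234"}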

Filters

Filters are applied directly on the providers, so only useful, filtered data is returned by the API.

Wayback               Commoncrawl         Description
statuscode:200        =status:200         return only URLs whose status code is 200
!statuscode:200       !=status:200        return only non-200 status codes
mimetype:text/html    mime:text/html      return only URLs whose response type is text/html
!mimetype:text/html   !=mime:text/html    return only non-text/html response types
~mimetype:html        ~mime:.*html        return all URLs whose response type contains "html"
~original:unja        ~url:.*unja         return all URLs that contain "unja"
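
As a rough illustration of what such a filter does on the Common Crawl side, the same expressions can be sent straight to the public index server — the index name CC-MAIN-2023-50 is just an example, and this is a sketch of the provider API, not of Unja's code:

import requests

params = {
    "url": "*.example.com",
    "output": "json",
    # The list value sends two filter= parameters: exactly the
    # "=status:200 ~mime:.*html" expressions from the table above.
    "filter": ["=status:200", "~mime:.*html"],
}
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
    params=params,
    timeout=60,
)
for line in resp.text.splitlines():
    print(line)  # one JSON record per line: url, status, mime, length, ...

Because the index server applies the filter itself, only matching records ever cross the network.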

Oneliners

Get only URLs with parameters & status code 200

unja -s -d target.com --sub -p wayback commoncrawl --wbf 'statuscode:200 ~original:=' --ccf '=status:200 ~url:.*=' | anew | tee output

Looking for open redirects

unja -s -d target.com --sub -p wayback commoncrawl --wbf '~statuscode:30 ~original:=http' --ccf '~status:30 ~url:.*=http' | anew | tee output

Clean results (exclude images, css, javascript, woff & 404s)

unja -s -d target.com --sub -p wayback commoncrawl --wbf '!statuscode:404 ~!mimetype:image ~!mimetype:javascript ~!mimetype:css ~!mimetype:woff' --ccf '!=status:404 !~mime:.*image !~mime:.*javascript !~mime:.*css !~mime:.*woff' | anew | tee output
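
If you exported detailed results with -j, you can also post-filter them offline. A minimal sketch, assuming one JSON object per line with the fields listed under Output Methods (the actual schema may differ):

import json

# Keep only HTML responses from a saved "output" file.
with open("output") as fh:
    for line in fh:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if "html" in str(rec.get("mime", "")):
            print(rec["url"])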

Let me know if you have any other good oneliners.
