UDdup - URLs Deduplication Tool

Overview

The tool takes a list of URLs and removes "duplicate" pages, in the sense of URL patterns that are probably repetitive and point to the same web template.

For example:

https://www.example.com/product/123
https://www.example.com/product/456
https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true

All of the above probably point to the same product "template". Therefore, it should be enough to scan only some of these URLs with our various scanners.

The result of the above after UDdup should be:

https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true
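
The idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not UDdup's actual implementation: the helper names `pattern_key` and `dedupe` are hypothetical, numeric path segments are assumed to be interchangeable IDs, and a URL without query parameters is assumed to be redundant once a parameterized URL with the same pattern exists.

```python
from urllib.parse import parse_qs, urlsplit


def pattern_key(url):
    """Return (host, path pattern), with numeric segments replaced by a placeholder."""
    parts = urlsplit(url)
    segments = ["{id}" if seg.isdigit() else seg for seg in parts.path.split("/")]
    return parts.netloc, "/".join(segments)


def dedupe(urls):
    """Keep one URL per (host, path pattern, set of query parameter names)."""
    groups = {}  # pattern key -> {frozenset of parameter names: first URL seen}
    for url in urls:
        names = frozenset(parse_qs(urlsplit(url).query, keep_blank_values=True))
        groups.setdefault(pattern_key(url), {}).setdefault(names, url)

    result = []
    for by_params in groups.values():
        # Assumption: drop the bare URL when a parameterized variant exists.
        if len(by_params) > 1:
            by_params.pop(frozenset(), None)
        result.extend(by_params.values())
    return result


if __name__ == "__main__":
    sample = [
        "https://www.example.com/product/123",
        "https://www.example.com/product/456",
        "https://www.example.com/product/123?is_prod=false",
        "https://www.example.com/product/222?is_debug=true",
    ]
    for url in dedupe(sample):
        print(url)
```

Because the hostname is part of the grouping key in this sketch, identical paths on different domains are kept as separate entries.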

Why do I need it?

Mostly for a better (automated) reconnaissance process, with less noise (for both the tester and the target).

Examples

Take a look at demo.txt, the raw URLs file, and at demo-results.txt, the output it produces.


Installation

With pip (Recommended)

pip install uddup

Manual (from code)

# Clone the repository.
git clone https://github.com/rotemreiss/uddup.git

# Install the Python requirements.
cd uddup
pip install -r requirements.txt

Usage

uddup -u demo.txt -o ./demo-result.txt

More Usage Options

uddup -h

Short Form    Long Form        Description
-h            --help           Show this help message and exit
-u            --urls           File with a list of urls
-o            --output         Save results to a file
-s            --silent         Print only the result URLs
-fp           --filter-path    Filter paths by a given Regex

Filter Paths by Regex

Allows filtering out URLs by a custom path pattern. For example, to filter out all paths that start with /product, run:

# Single Regex
uddup -u demo.txt -fp "^product"

Input:

https://www.example.com/
https://www.example.com/privacy-policy
https://www.example.com/product/1
https://www.example2.com/product/2
https://www.example3.com/product/4

Output:

https://www.example.com/
https://www.example.com/privacy-policy

Advanced Regex with multiple path filters

uddup -u demo.txt -fp "(^product)|(^category)"
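
A rough Python sketch of what such a path filter could look like is given here for illustration. It is not taken from UDdup's source: the function name `filter_paths` is hypothetical, and the sketch assumes the regex is applied to the URL path without its leading slash (which is what the `^product` example suggests) while the hostname and query string are ignored.

```python
import re
from urllib.parse import urlsplit


def filter_paths(urls, pattern):
    """Return the URLs whose path does NOT match the given regex."""
    regex = re.compile(pattern)
    kept = []
    for url in urls:
        # Assumption: match against the path with the leading "/" removed.
        path = urlsplit(url).path.lstrip("/")
        if not regex.search(path):
            kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [
        "https://www.example.com/",
        "https://www.example.com/privacy-policy",
        "https://www.example.com/product/1",
        "https://www.example2.com/product/2",
        "https://www.example3.com/product/4",
    ]
    for url in filter_paths(sample, "(^product)|(^category)"):
        print(url)
```

Combining several alternatives in one regex, as in the second command, keeps the filter to a single pass over the list.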

Contributing

Feel free to fork the repository and submit pull-requests.


Support

Create new GitHub issue

Want to say thanks? :) Message me on LinkedIn


License

License

Comments
  • cant run uddup

    This tool is great and really useful, but I noticed that the uddup executable script is installed to the /usr/local/bin directory, which sometimes doesn't work; it seems it needs to be in /usr/bin to be executed, and I don't know why. I copied the script to /usr/bin and it worked perfectly. I'm using the Kali Linux subsystem on Windows 11. Sorry if there's a problem with my issue report, it's my first issue report on GitHub :V Thanks.

    question 
    opened by siratsami 2
  • Multiple hostnames (domains) which shares the same patterns conflicts

    I found out that I missed a very basic case like:

    https://www.example.com/product/123
    https://www.example2.com/product/123
    

    This currently results in one URL instead of two:

    https://www.example.com/product/123
    bug 
    opened by rotemreiss 1
  • fix bug with unicode char in urls

    This fixes a problem with URLs with UTF8 chars, e.g:

    echo "http://www.shakedos.com:80/index.php/2010/05/עבודה-עם-שפות-ללא-טבלאות-מוכנות/feed/" > /tmp/urls.txt
    uddup -u /tmp/urls.txt 
    ...
    Traceback (most recent call last):
      File "/usr/local/bin/uddup", line 11, in <module>
        sys.exit(interactive())
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 269, in interactive
        main(args.urls_file, args.output, args.silent, args.filter_path)
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 184, in main
        for url in f:
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 100: ordinal not in range(128)
    
    bug good first issue 
    opened by Shaked 0
  • Support paths filtering by Regex

    Support paths filtering by Regex, as unew does.

    Requirements

    • Support custom regex to be provided by the user

    Known limitations

    • Only the path will be filtered while ignoring the hostname and parameters (may be extended in the future)
    enhancement 
    opened by rotemreiss 0
  • [request] - de-duplicate similar paths

    Hi Rotem,

    Currently, uddup is not able to de-duplicate similar paths like below.

    /users/122/edit
    /users/123/edit
    

    The project https://github.com/ameenmaali/urldedupe, which tries to solve similar problems, is able to de-duplicate them. The only issue is that, since it is written in C++, it requires rebuilding the binary for each new machine.

    -- Regards, @bugbaba

    enhancement 
    opened by bugbaba 1
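
The Unicode traceback reported in the comments above comes from reading the URLs file with the ASCII codec. As a hedged illustration only (this is not the project's actual fix, and `read_urls` is a hypothetical helper), a URLs file can be read with an explicit UTF-8 encoding so that the result does not depend on the system locale:

```python
def read_urls(path):
    """Read non-empty lines from a URLs file, decoding explicitly as UTF-8."""
    # Passing encoding="utf-8" avoids falling back to the locale's default
    # codec (ASCII on some systems), which fails on non-ASCII URLs.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Called as `read_urls("/tmp/urls.txt")`, this returns the URLs as a list of strings, including any that contain non-ASCII characters.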
Releases(v0.9.3)
Owner
Rotem Reiss