Python script for finding duplicate images within a folder.

Last update: Dec 31, 2022

Overview

Duplicate Image Finder (DIF)

Tired of going through all images in a folder and comparing them manually to check if they are duplicates? The Duplicate Image Finder (DIF) for Python automates this task for you!

Description

The DIF searches for images in a specified target folder, compares the images it found and checks whether these are duplicates. The DIF then outputs the image files classified as duplicates and the filenames of the images having the lowest resolution, so you know which of the duplicate images are safe to be deleted. You can then either delete them manually, or let the DIF delete them for you.

Basic Usage

Use the following function to make DIF search for duplicates in the specified folder:

from difPy import compare_images


Comments

run the CLI, how?

Hello,

Call me stupid, but I am trying to run the CLI version of this code. I can run it from a basic script:

from difPy import dif
search = dif("C:/Path/to/Folder/")

and this works. But if I run it as

python dif.py -A "C:/Path/to/Folder_A/"

I get a "no such file or directory" error.

And yes, not very familiar with python (yet)

Kind Regards,

Gerrit Kuilder
question

opened by GerritKuilder 4
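
On the CLI question above: `python dif.py` looks for dif.py in the current working directory, so a likely cause is that the script lives elsewhere (the traceback paths further down this page show an installed copy under ...\site-packages\difPy\dif.py). The following is a minimal sketch, assuming the package is installed and that this module file is the CLI entry point; `locate_module` is a hypothetical helper, not part of difPy.

```python
import importlib.util


def locate_module(name):
    """Return the file path of an importable module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # raised, for example, when the parent package is not installed
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # "difPy.dif" is an assumption based on the traceback paths on this page
    path = locate_module("difPy.dif")
    if path is None:
        print("difPy does not appear to be installed in this environment")
    else:
        print("Try running the CLI with the full path, for example:")
        print(f'python "{path}" -A "C:/Path/to/Folder_A/"')
```

Changing into the folder that contains dif.py before running the command should have the same effect.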
Search results' keys are just names, but sometimes in sub-folders
Hi there! I have a folder like this:

folder/
|- IMG_202201.jpg
|- IMG_202202.jpg
|- subfolder/
|  |- IMG_202203.jpg

and I pass it as the first argument.

I noticed that the difPy.dif() search results give me just the file name, without the subfolder noted anywhere :neutral_face:

This broke my script with FileNotFoundError: [Errno 2] No such file or directory
bug
opened by TheLastGimbus 4
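
The problem described above goes away when results are keyed by full path rather than by bare file name. This is a minimal sketch of that idea, not difPy's implementation; `index_images` and the extension list are illustrative assumptions.

```python
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}  # illustrative list


def index_images(root):
    """Map each image's full path to its bare file name, descending into subfolders."""
    index = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                # the full path is unique even when two folders hold the same file name
                index[os.path.join(folder, name)] = name
    return index
```

Every key can be passed straight to open() or os.remove(), which avoids the FileNotFoundError that a bare name produces for files in subfolders.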
PNGs with transparency are mistakenly counted as duplicate and not rendered properly in GUI compare

Great tool! I learned a lot reading the article you wrote about this as well.

I tested it on some of my files, but found that some PNGs that were just line-art (black line-art on a transparent background) were flagged as duplicates when they were completely different, even on high sensitivity. In fact, the listed MSE is 0.00.

They also did not render properly during the image comparison when running with -d False: both image previews looked like black squares. Note: this does not apply to line-art of a different color on a transparent background, only black.

I am not familiar with how the PNG file format encodes black vs transparent, but I believe that the issue stems from that.

question

opened by SPRCoreDump 4
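
One plausible explanation for the MSE of 0.00 reported above: if the alpha channel is discarded before comparison, and transparent pixels are stored with RGB (0, 0, 0), then black lines and transparent background become indistinguishable. The sketch below illustrates this with plain Python lists. Both conditions are assumptions, and the helper names are hypothetical.

```python
def drop_alpha(pixels):
    """Keep only the R, G, B values of each RGBA pixel (alpha is discarded)."""
    return [p[:3] for p in pixels]


def flatten_on_white(pixels):
    """Alpha-composite each RGBA pixel onto a white background."""
    out = []
    for r, g, b, a in pixels:
        alpha = a / 255
        out.append(tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b)))
    return out


def mse(img_a, img_b):
    """Sum of squared channel differences divided by the number of pixels."""
    total = sum((x - y) ** 2 for pa, pb in zip(img_a, img_b) for x, y in zip(pa, pb))
    return total / len(img_a)


BLACK = (0, 0, 0, 255)  # opaque black line
CLEAR = (0, 0, 0, 0)    # fully transparent, RGB stored as black (assumption)

drawing_a = [BLACK, CLEAR, CLEAR, CLEAR]
drawing_b = [CLEAR, CLEAR, CLEAR, BLACK]

print(mse(drop_alpha(drawing_a), drop_alpha(drawing_b)))              # 0.0
print(mse(flatten_on_white(drawing_a), flatten_on_white(drawing_b)))  # 97537.5
```

Compositing onto a solid background before comparing is one way to make the two drawings differ again. It would also explain why line-art in other colors is unaffected.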

ValueError.

Hi there,

I'm trying to run this code on a folder with more than 80k images:

Traceback (most recent call last):
  File ".\difpy.py", line 3, in <module>
    dif.compare_images("PATH TO FOLDER")
  File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 35, in compare_images
    imgs_matrix = dif.create_imgs_matrix(directory, px_size)
  File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 121, in create_imgs_matrix
    imgs_matrix = np.concatenate((imgs_matrix, img))
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 2 dimension(s)

what am i doing wrong?

Thanks in advance

bug

opened by rqtqp 4
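
The ValueError above says that a 3-dimensional array (height x width x channels) is being joined with a 2-dimensional one, which is what a grayscale image typically decodes to. This is a minimal sketch of normalising both kinds to three channels, using nested lists instead of NumPy so that it runs without dependencies. `ndim` and `ensure_rgb` are hypothetical helpers, and that a grayscale file caused the reported error is an assumption.

```python
def ndim(image):
    """Nesting depth of a rectangular nested list (2 = grayscale, 3 = colour)."""
    depth = 0
    while isinstance(image, (list, tuple)) and image:
        depth += 1
        image = image[0]
    return depth


def ensure_rgb(image):
    """Return a height x width x 3 image, repeating the gray value for 2-D input."""
    if ndim(image) == 2:
        return [[(v, v, v) for v in row] for row in image]
    return image


gray = [[0, 128], [255, 64]]
colour = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9), (10, 11, 12)]]

print(ndim(gray), ndim(ensure_rgb(gray)))  # 2 3
```

With NumPy arrays the same idea is commonly written as np.stack([img] * 3, axis=-1).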

Same duplicate in different keys

We have found that when you use dif within a folder of folders, there may be some unexpected behaviour. In our case, we have a pair of duplicates in one folder, and a third duplicate in another one. This makes it so result will output:

So an element that was detected as a duplicate is later used as a key. We do not know if this is a bug or a feature, but it seems inconsistent with the behavior of not repeating duplicates in later keys. Still, for our use we can just use a set() as a workaround to ignore "duplicates of duplicates".

Nice work on the tool, it has helped us a lot with a nasty database. Thank you, have a nice day!

bug

opened by Fenho 3
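
The set() workaround mentioned above can be taken one step further by merging every entry that shares a file into a single group. This is a minimal sketch, assuming the result dictionary has the {"location": ..., "duplicates": [...]} shape used in the code posted further down this page; `group_duplicates` is a hypothetical helper.

```python
def group_duplicates(result):
    """Merge overlapping result entries into disjoint sets of file paths."""
    groups = []
    for entry in result.values():
        members = {entry["location"], *entry["duplicates"]}
        # absorb every existing group that shares at least one file with this entry
        overlapping = [g for g in groups if g & members]
        for g in overlapping:
            members |= g
            groups.remove(g)
        groups.append(members)
    return groups


result = {
    "a.jpg": {"location": "f1/a.jpg", "duplicates": ["f1/b.jpg", "f2/c.jpg"]},
    "b.jpg": {"location": "f1/b.jpg", "duplicates": ["f2/c.jpg"]},
}
print(group_duplicates(result))  # one group containing all three paths
```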
Erroneous results on particular image set

I've been testing various image sets trying to isolate a bug and I got weird results on this one. There are no duplicates or similar images in this set. Similarity was to high. For example, the first result detected 32 duplicates with many of the files being listed more than once.

difPy output.zip

The image set can be downloaded here since it's too big to post: https://drive.google.com/file/d/1pbl7SttHF-mB35V1Q5ehj6A5wCb68o3B/view?usp=sharing
bug

opened by MarcG2 3
Match Single Image with Read-Only Directory

Dear Developer,

Am a noob but still love programming (have just started) so excuse me if anything below is "obvious" or "incorrectly stated".

I got the gist that this will match all files in the given directory for similarity.

First Point: Is it possible to match an image (file path passed as a parameter) against a directory (folder path passed as a parameter)? That is, instead of matching all images against all images, we would match just one image against all images in a folder.

Second Point: Is the function writing something in the Search folder (like tensor Data or anything)? Am asking to understand if this can work in read-only directory or not. (I tried reading the code but could not figure it out)

Third point: If we have to run / call it multiple times on a large folder then would it be taking long time analyzing all files each time or is it possible to provide / pass a path to file / folder where it can save the analysis to save the time?

Example: (No text in below lines is crossed so please do not ignore if any text is coming crossed. I could not figure out why is it applying this formatting")

Input_file_path = "~/Downloads/image.jpg"  # Any valid image file
Target_Folder_path = "~/A_Readonly_Folder_of_Images"  # A read-only folder with, say, 56,000 (big number?) files to search
Working_File_or_Folder_path = "~/A_File_or_Folder_with_Read_Write_Access"  # A write-enabled file / folder to save analysis data to / load it from
# E.g. if the passed file / folder does not exist, create it and save the analysis data there.
# If it does exist, read it and use it instead of analyzing the target folder again.
# Calling:
dif.compare_image(Input_file_path, Target_Folder_path, Working_File_or_Folder_path)

Please excuse me if am crossing any limits here. I just became curious about this wonderful concept but I know nothing about github and how it works.

Best Regards Ashish
question

opened by ashish128 3
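
The third point above (saving the analysis so that repeated runs do not re-read every file) can be sketched as a small JSON cache kept outside the searched folder. This is an illustration only, not an existing difPy feature. `load_cache`, `analyse_folder` and the `analyse` callback are hypothetical names.

```python
import json
import os


def load_cache(cache_path):
    """Read a previously saved analysis, or start empty if none exists yet."""
    if os.path.isfile(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def analyse_folder(target_folder, cache_path, analyse):
    """Analyse each file once; reuse cached entries whose size and mtime are unchanged.

    `analyse` is any callable mapping a file path to a JSON-serialisable value.
    Only `cache_path` is written to, so `target_folder` may be read-only.
    """
    cache = load_cache(cache_path)
    fresh = {}
    for name in sorted(os.listdir(target_folder)):
        path = os.path.join(target_folder, name)
        if not os.path.isfile(path):
            continue
        stat = os.stat(path)
        stamp = [stat.st_size, stat.st_mtime_ns]
        cached = cache.get(path)
        if cached is not None and cached["stamp"] == stamp:
            fresh[path] = cached  # unchanged file: skip re-analysis
        else:
            fresh[path] = {"stamp": stamp, "data": analyse(path)}
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(fresh, fh)
    return {path: entry["data"] for path, entry in fresh.items()}
```

A single input image could then be analysed once and compared against the returned values, which covers the first point as well.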
[CHANGE REQUEST] replacing 'output directory' with 'move_path'
Hello. First of all, I would like to thank you for creating and maintaining this project. It has certainly helped me find a bunch of duplicate images in my enormous gallery.

I discovered this project 3/4 months ago. I needed a way for difPy.py to move my duplicate images to certain directories, but it was not possible. I edited the source code - which was really easy, having little to no Python experience prior to this.

As I recently wanted to make a pull request, I noticed that this repository had been updated, which meant that I had to update my version as well. Along with the updates, I noticed a new output_directory flag, which was only useful if using this program through the command line. I made my changes and would like to introduce my implementation.

Instead of the (now present) output_directory flag, I added move, silent_move and move_path as parameters to the __init__ function. Here are the details:

Their default values are (of course) false

move and silent_move would be further passed to the _validate_parameters() function

After processing directory_A and directory_B, if move was set to true, the move_path would be validated - checked if it was equal to directory_A and/or directory_B, and it would be further passed to the _process_directory() function

An appropriate prompt for the silent_move parameter

In the _validate_parameters() function, move and delete can not be both true, as well as move and silent_move accepting only boolean values

A _move_imgs() function, similar to _delete_imgs(), with appropriate behavior

-m, --move, -M, --silent_move, -mp and --move-path CLI flags

The currently implemented output_directory flag only works for the CLI, but not for Python scripts, as it is not passed over to the __init__ function. As a result, I have removed the output_directory flag and replaced it with my move implementation. This version takes both the command line and scripts into account.

I would be happy to submit a pull request with my changes if this idea sounds good to you, so you can take a better look at how these changes would be implemented.

Looking forward to collaborating and contributing to this project as much as I can.
new feature out of scope
opened by bojanmilevski 2
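
The move behaviour proposed above could look roughly like the following. This is a sketch of the idea only, not the contributor's code: the function name, its parameters, and the decision to reject a move_path equal to a searched directory are assumptions made for illustration.

```python
import os
import shutil


def move_imgs(lower_quality, move_path, search_dirs):
    """Move files into move_path; refuse if move_path is one of the searched folders."""
    target = os.path.abspath(move_path)
    if any(os.path.abspath(d) == target for d in search_dirs):
        raise ValueError("move_path must differ from the searched directories")
    os.makedirs(target, exist_ok=True)
    moved = []
    for src in sorted(set(lower_quality)):
        dst = os.path.join(target, os.path.basename(src))
        if os.path.exists(dst):
            # avoid overwriting a file that happens to have the same name
            print("Skipped (name already taken):", src)
            continue
        shutil.move(src, dst)
        moved.append(dst)
    return moved
```

Unlike deletion, a move can be undone by hand, which is the main appeal of the request.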
Near duplicate Image detection

Hello, first of all thanks for creating this package. It is a really good package for detecting duplicate images. I have tried it and found that it detects images that are 100% similar, but not images whose similarity is below 100%, even at 99.99%. I have tried adjusting the pixel values and the similarity setting, but it still did not detect them. So, is there a way to detect images with a similarity score below 100% using the difPy package?

I have attached a few images that it was not able to detect. Note: the percentage values I refer to were obtained with the matchTemplate method; the attached images have a similarity of 99%.

question

opened by dhruvbhatnagar9548 2
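
Whether a near-duplicate is reported depends on the MSE limit behind the similarity setting. The sketch below uses the three limits quoted in the docstring further down this page (0.1, 200, 1000) on flat lists of pixel values. Accepting a plain number as the limit is an assumption modelled on the v2.4.2 release note, and the function names are hypothetical.

```python
THRESHOLDS = {"high": 0.1, "normal": 200, "low": 1000}  # values quoted on this page


def mse(img_a, img_b):
    """Mean squared error between two equally sized lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def is_match(img_a, img_b, similarity="normal"):
    """similarity is a named level or, as an assumption here, a numeric MSE limit."""
    limit = THRESHOLDS.get(similarity, similarity)
    return mse(img_a, img_b) < limit


original = [10, 50, 90, 130, 170, 210]
recompressed = [v + 3 for v in original]  # every pixel off by 3

print(mse(original, recompressed))                 # 9.0
print(is_match(original, recompressed, "high"))    # False
print(is_match(original, recompressed, "normal"))  # True
print(is_match(original, recompressed, 5))         # False
```

A pair that looks 99% similar to a template matcher can still have an MSE far above 0.1, so "high" will only ever catch exact copies.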

search in Sub directories

Hi Elise!

Thank you for existing!

My OneDrive duplicated my library about 4 years ago; add to that countless backups from WhatsApp and Messenger. A 550GB mess, yeah, you get the point.

I'm really new to coding and git, so I figured I'll post code instead. It's not clean, but I'm pressed for time studying applied data science and working as a product manager.

I have a few more ideas, but the code below was necessary for me right now :)

The code finds photos in the subdirectories (folder in a folder) of the given file paths. The code I have added is marked with the comments "#added by Kristofer from" and "#added by Kristofer to".

import skimage.color
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import imghdr
import time
import collections
#added kristofer
from pathlib import Path

class dif:

def __init__(self, directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False, silent_del=False):
    """
    directory_A (str)......folder path to search for duplicate/similar images
    directory_B (str)....second folder path to search for duplicate/similar images
    similarity (str)....."normal" = searches for duplicates, recommended setting, MSE < 200
                         "high" = searches for exact duplicates, extremely sensitive to details, MSE < 0.1
                         "low" = searches for similar images, MSE < 1000
    px_size (int)........recommended not to change default value
                         resize images to px_size height x width (in pixels) before being compared
                         the higher the pixel size, the more computational resources and time required
    sort_output (bool)...False = adds the duplicate images to output dictionary in the order they were found
                         True = sorts the duplicate images in the output dictionary alphabetically
    show_output (bool)...False = omits the output and doesn't show found images
                         True = shows duplicate/similar images found in output            
    delete (bool)........! please use with care, as this cannot be undone
                         lower resolution duplicate images that were found are automatically deleted
    silent_del (bool)....! please use with care, as this cannot be undone
                         True = skips the asking for user confirmation when deleting lower resolution duplicate images
                         will only work if "delete" AND "silent_del" are both == True
    
    OUTPUT (set).........a dictionary with the filename of the duplicate images 
                         and a set of lower resolution images of all duplicates
    """
    start_time = time.time()

   
    if directory_B != None:
        # process both directories
        dif._process_directory(directory_A)
        dif._process_directory(directory_B)
    else:
        # process one directory
        dif._process_directory(directory_A)
        directory_B = directory_A

    all_directories_A = [directory_A]
    all_directories_B = [directory_B]

    #added by Kristofer from
    for path in Path(directory_A).iterdir():
        if path.is_dir():
            all_directories_A.append(path)

    for path in Path(directory_B).iterdir():
        if path.is_dir():
            all_directories_B.append(path)
    
    dif._validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del)

    for dif_A in all_directories_A:
        for dif_B in all_directories_B:

            directory_A = str(dif_A)
            directory_B = str(dif_B)
    #added by Kristofer to                    
                   
            if directory_B == directory_A:
                result, lower_quality = dif._search_one_dir(directory_A, 
                                                                similarity, px_size, sort_output, show_output, delete)
            else:
                result, lower_quality = dif._search_two_dirs(directory_A, directory_B, 
                                                                similarity, px_size, sort_output, show_output, delete)
                if len(lower_quality) != len(set(lower_quality)):
                    print("DifPy found that there are duplicates within directory A.")
                    
            if sort_output == True:
                result = collections.OrderedDict(sorted(result.items()))
            
            time_elapsed = np.round(time.time() - start_time, 4)
            
            self.result = result
            self.lower_quality = lower_quality
            self.time_elapsed = time_elapsed
            
            if len(result) == 1:
                images = "image"
            else:
                images = "images"
            print("Found", len(result), images, "with one or more duplicate/similar images in", time_elapsed, "seconds.")
            
            if len(result) != 0:
                if delete:
                    if not silent_del:
                        usr = input("Are you sure you want to delete all lower resolution duplicate images? \nThis cannot be undone. (y/n)")
                        if str(usr) == "y":
                            dif._delete_imgs(set(lower_quality))
                        else:
                            print("Image deletion canceled.")
                    else:
                        dif._delete_imgs(set(lower_quality))

                
        
def _search_one_dir(directory_A, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
    
    img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
    result = {}
    lower_quality = []   
    
    ref = dif._map_similarity(similarity)
    
    # find duplicates/similar images within one folder
    for count_A, imageMatrix_A in enumerate(img_matrices_A):
        for count_B, imageMatrix_B in enumerate(img_matrices_A):
            if count_B != 0 and count_B > count_A and count_A != len(img_matrices_A):      
                rotations = 0
                while rotations <= 3:
                    if rotations != 0:
                        imageMatrix_B = dif._rotate_img(imageMatrix_B)

                    err = dif._mse(imageMatrix_A, imageMatrix_B)
                    if err < ref:
                        if show_output:
                            dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                            dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                               str("..." + directory_A[-35:]) + "/" + filenames_A[count_B])
                        if filenames_A[count_A] in result.keys():
                            result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_A + "/" + filenames_A[count_B]]
                        else:
                            result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                "duplicates" : [directory_A + "/" + filenames_A[count_B]]
                                                               }
                        high, low = dif._check_img_quality(directory_A, directory_A, filenames_A[count_A], filenames_A[count_B])
                        lower_quality.append(low)                         
                        break
                    else:
                        rotations += 1    
    if sort_output == True:
        result = collections.OrderedDict(sorted(result.items()))
    return result, lower_quality            

def _search_two_dirs(directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):

    img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
    img_matrices_B, filenames_B = dif._create_imgs_matrix(directory_B, px_size)
    
    result = {}
    lower_quality = []   
    
    ref = dif._map_similarity(similarity)
        
    # find duplicates/similar images between two folders
    for count_A, imageMatrix_A in enumerate(img_matrices_A):
        for count_B, imageMatrix_B in enumerate(img_matrices_B):
            rotations = 0
            #print(count_A, count_B)
            while rotations <= 3:

                if rotations != 0:
                    imageMatrix_B = dif._rotate_img(imageMatrix_B)
                    
                err = dif._mse(imageMatrix_A, imageMatrix_B)
                #print(err)
                if err < ref:
                    if show_output:
                        dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                        dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                           str("..." + directory_B[-35:]) + "/" + filenames_B[count_B])
                    
                    if filenames_A[count_A] in result.keys():
                        result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_B + "/" + filenames_B[count_B]]
                    else:
                        result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                            "duplicates" : [directory_B + "/" + filenames_B[count_B]]
                                                           }
                    high, low = dif._check_img_quality(directory_A, directory_B, filenames_A[count_A], filenames_B[count_B])
                    lower_quality.append(low)                         
                    break
                else:
                    rotations += 1    
            
    if sort_output == True:
        result = collections.OrderedDict(sorted(result.items()))
    return result, lower_quality

def _process_directory(directory):
    # check if directories are valid
    directory += os.sep
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory: " + directory + " does not exist")
    return directory

def _validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del):
    # validate the parameters of the function
    if sort_output != True and sort_output != False:
        raise ValueError('Invalid value for "sort_output" parameter.')
    if show_output != True and show_output != False:
        raise ValueError('Invalid value for "show_output" parameter.')
    if similarity not in ["low", "normal", "high"]:
        raise ValueError('Invalid value for "similarity" parameter.')
    if px_size < 10 or px_size > 5000:
        raise ValueError('Invalid value for "px_size" parameter.')
    if delete != True and delete != False:
        raise ValueError('Invalid value for "delete" parameter.')   
    if silent_del != True and silent_del != False:
        raise ValueError('Invalid value for "silent_del" parameter.')   

def _create_imgs_matrix(directory, px_size):
    directory = dif._process_directory(directory)
    img_filenames = []
    # create list of all files in directory     
    folder_files = [filename for filename in os.listdir(directory)]

    # create images matrix   
    imgs_matrix = []
    for filename in folder_files:
        path = os.path.join(directory, filename)
        # check if the file is not a folder
        if not os.path.isdir(path):
            try:
                img = cv2.imdecode(np.fromfile(path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
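                if isinstance(img, np.ndarray):
                    # drop a possible alpha channel; skip 2-D grayscale arrays,
                    # where slicing the last axis would crop the image width
                    if img.ndim == 3:
                        img = img[..., 0:3]
                    img = cv2.resize(img, dsize=(px_size, px_size), interpolation=cv2.INTER_CUBIC)
                    
                    if len(img.shape) == 2:
                        img = skimage.color.gray2rgb(img)
                    imgs_matrix.append(img)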
                    img_filenames.append(filename)
            except:
                pass
    return imgs_matrix, img_filenames

def _map_similarity(similarity):
    if similarity == "low":
        ref = 1000
    # search for exact duplicate images, extremely sensitive, MSE < 0.1
    elif similarity == "high":
        ref = 0.1
    # normal, search for duplicates, recommended, MSE < 200
    else:
        ref = 200
    return ref

# Function that calculates the mean squared error (mse) between two image matrices
def _mse(imageA, imageB):
    err = np.sum((imageA.astype("float") - imageB.astype("float")) ** 2)
    err /= float(imageA.shape[0] * imageA.shape[1])
    return err

# Function that plots two compared image files and their mse
def _show_img_figs(imageA, imageB, err):
    fig = plt.figure()
    plt.suptitle("MSE: %.2f" % (err))
    # plot first image
    ax = fig.add_subplot(1, 2, 1)
    plt.imshow(imageA, cmap=plt.cm.gray)
    plt.axis("off")
    # plot second image
    ax = fig.add_subplot(1, 2, 2)
    plt.imshow(imageB, cmap=plt.cm.gray)
    plt.axis("off")
    # show the images
    plt.show()
    
# Function for printing filename info of plotted image files
def _show_file_info(imageA, imageB):
    print("""Duplicate files:\n{} and \n{}
    
    """.format(imageA, imageB))
    
# Function for rotating an image matrix by a 90 degree angle
def _rotate_img(image):
    image = np.rot90(image, k=1, axes=(0, 1))
    return image

# Function for checking the quality of compared images, appends the lower quality image to the list
def _check_img_quality(directoryA, directoryB, imageA, imageB):
    dirA = dif._process_directory(directoryA)
    dirB = dif._process_directory(directoryB)
    size_imgA = os.stat(dirA + imageA).st_size
    size_imgB = os.stat(dirB + imageB).st_size
    if size_imgA >= size_imgB:
        return directoryA + "/" + imageA, directoryB + "/" + imageB
    else:
        return directoryB + "/" + imageB, directoryA + "/" + imageA
    
# Function for deleting the lower quality images that were found after the search    
def _delete_imgs(lower_quality_set):
    deleted = 0
    for file in lower_quality_set:
        print("\nDeletion in progress...", end = "\r")
        try:
            os.remove(file)
            print("Deleted file:", file, end = "\r")
            deleted += 1
        except:
            print("Could not delete file:", file, end = "\r")
    print("\n***\nDeleted", deleted, "images.")

new feature

opened by DeyoSwed 2
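
Two details of the code posted above are worth noting: Path.iterdir() lists only the immediate children of a folder, and self.result is reassigned on every pass of the loop, so only the last folder pair's result is kept. The sketch below shows one way to enumerate folders at any depth and to visit each pair of folders exactly once. It is an illustration under those observations, not the project's implementation, and both helper names are hypothetical.

```python
import os
from itertools import combinations_with_replacement


def all_subdirectories(root):
    """Return root plus every folder nested below it, at any depth."""
    return [folder for folder, _dirs, _files in os.walk(root)]


def folder_pairs(root):
    """Each unordered pair of folders once, including a folder paired with itself."""
    return list(combinations_with_replacement(all_subdirectories(root), 2))
```

A caller would then run the comparison once per pair and merge the per-pair results into a single dictionary instead of overwriting it.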

Local variable 'imgs_matrix' referenced before assignment

Hello,

I get this error while trying to run this simple line from your package (the import works). Some help would be very welcome.

UnboundLocalError: local variable 'imgs_matrix' referenced before assignment

bug

opened by Tesax123 2
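
An UnboundLocalError of this kind arises when a variable is assigned only inside a loop or branch that never runs. For example, a folder with no readable images would leave `imgs_matrix` unassigned. Whether that is what happened in the report above is an assumption. The sketch below reproduces the pattern with hypothetical function names.

```python
def build_matrix_fragile(images):
    for index, img in enumerate(images):
        if index == 0:
            imgs_matrix = [img]   # only bound once the loop body runs
        else:
            imgs_matrix.append(img)
    return imgs_matrix            # fails when `images` is empty


def build_matrix_safe(images):
    imgs_matrix = []              # bound before the loop, so always defined
    for img in images:
        imgs_matrix.append(img)
    return imgs_matrix


try:
    build_matrix_fragile([])
except UnboundLocalError as exc:
    print("fragile version failed:", exc)

print(build_matrix_safe([]))  # []
```

Checking that the target folder actually contains images the library can decode is a reasonable first step.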
Refactoring - Optional Merge
Hi Elise :wave:

First of all, cool idea! I recently needed to compare a large set of images and your approach for comparing them worked pretty well :+1:

That being said, in the current implementation it is rather slow. Comparing larger chunks of images (15000+) takes a while. Moreover, you use a lot of different dependencies where some of them are quite large (e.g. opencv). This makes it difficult to install the tool in specific environments like within a Docker container.

Since I will probably need to compare images again in the future, I thought of improving these issues. This pull request provides the results. Before talking about the changes, let me apologize for the huge pull request. I do not like large pull requests for my own repos and avoid sending them to others as well. However, the dependency changes and especially the multiprocessing required a larger restructuring of your tool. Therefore, I totally understand if you do not want to merge the changes. In this case, I'm fine with maintaining a fork of your repository that provides an alternative implementation. Just decide as you like :)

Here is a brief summary of the changes I made:

Make a clearer cut between CLI and library. The CLI script is now contained in /bin/difpy, while the code in /difPy/difPy.py only contains the library implementation.

Reduce dependencies. The whole technique you describe can be implemented using numpy and Pillow. This makes it possible to create a Docker container running difPy that has only 161MB. Before, with opencv, we were around 1.2GB.

Add multiprocessing. Work can now be distributed between different cores, which should speed up the operation quite a bit for larger image sets.

Add a fast compare option. When image A is similar to image B, one probably does not want to compare B to other images, but is fine with only comparing A with others from here. Sure, this may miss some edge-case duplicates, but in most situations it should be fine and provides a huge speedup for the operation.

Change the command line layout. Feels now more intuitive (at least to me :D)

Change the output format. The output format is still JSON based, but no longer includes much statistical information. The regular end user is probably not that interested in when a comparison took place, but more in the actual comparison result. The new reduced output format should be easier to read / parse.

Add a Dockerfile for building a container running difpy.

As I said, many changes. Just think about whether you want to merge or whether we keep these changes in a separate fork. I'm fine with both approaches :wink:

Best Tobias
new feature
opened by qtc-de 1
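
The "fast compare" idea from the list above can be sketched as follows: once an image has been matched to an earlier one, it is never used as a reference itself. This is an illustration of the described behaviour, not the code from the pull request. `find_duplicates` and `is_duplicate` are hypothetical names.

```python
def find_duplicates(items, is_duplicate, fast=True):
    """Group items by pairwise comparison.

    With fast=True, an item already matched to an earlier one is skipped
    as a reference, which saves comparisons.
    Returns (groups, number_of_comparisons), where groups maps the index of
    a reference item to the indices of its duplicates.
    """
    groups = {}
    matched = set()
    comparisons = 0
    for i, ref in enumerate(items):
        if fast and i in matched:
            continue
        for j in range(i + 1, len(items)):
            comparisons += 1
            if is_duplicate(ref, items[j]):
                groups.setdefault(i, []).append(j)
                matched.add(j)
    return groups, comparisons


items = [1, 1, 1, 5, 5]
same = lambda a, b: a == b
print(find_duplicates(items, same, fast=True))   # ({0: [1, 2], 3: [4]}, 5)
print(find_duplicates(items, same, fast=False))  # ({0: [1, 2], 1: [2], 3: [4]}, 10)
```

The fast variant can miss a duplicate when similarity is not transitive (A matches B and B matches C, but A does not match C), which is the edge case the description mentions.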
Multi-processing

I am currently working on making this project multithreaded, as I have many folders with tens of thousands of images (perhaps 100k+), and would like a slightly faster option.

Opening this as a means of communication. If you have a discord account/email that would work better, as I will likely see that before a github issue comment.
My discord account is thecodingchicken#4835 if you would prefer to reach out there.
new feature

opened by thecodingchicken 3
Multi-threading

Hi! I have a nice AMD CPU with 8 cores. When I'm searching through 2 big folders, it takes a lot of time because only one of the cores is being used.

Dividing the work into multiple threads seems like an obvious task for this library. It would be awesome if you implemented it (or suggested how it could be done, so someone could open a pull request)!
new feature

opened by TheLastGimbus 1
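
Because every pair of images can be compared independently, the work requested in the two issues above splits naturally across workers. This is a minimal sketch using the standard library's concurrent.futures, with images represented as flat pixel lists. All names are hypothetical and this is not how difPy distributes work.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def mse(img_a, img_b):
    """Mean squared error between two equally sized lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def compare_pair(task):
    """Top-level function, so it can also be pickled for a process pool."""
    (name_a, img_a), (name_b, img_b), limit = task
    return (name_a, name_b) if mse(img_a, img_b) < limit else None


def find_duplicates_parallel(images, limit=200, executor_cls=ThreadPoolExecutor, workers=4):
    """images: dict of name -> flat pixel list. Returns the matching name pairs."""
    tasks = [(a, b, limit) for a, b in combinations(sorted(images.items()), 2)]
    with executor_cls(max_workers=workers) as pool:
        results = pool.map(compare_pair, tasks)
        return [pair for pair in results if pair is not None]


images = {"a": [0, 0, 0], "b": [1, 1, 1], "c": [200, 200, 200]}
print(find_duplicates_parallel(images))  # [('a', 'b')]
```

Threads keep the example simple, but pure-Python arithmetic is limited by the interpreter lock. For CPU-bound comparison, ProcessPoolExecutor can be passed as executor_cls, with the calling code placed under an `if __name__ == "__main__":` guard.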
feature request: chunking of source folder

Thank you for your library! Just giving a heads up that I edited one of your previous versions by adding an additional parameter that allows the src folder to be split into n chunks for processing. Scenario: I have image folders that contain over 50000 images in sequential time order.

For me, it is most likely that an image file is going to be a duplicate with other image files added around a similar time frame. Comparing against the entire 50000+ for each image took an enormous amount of time. So, I made it so that I could split the folder into chunks of 5000 (for example) and evaluate in sections. It also allowed me to restart from a position if I had to stop evaluation for some reason. There's a little more that I added to make it more robust (for example, for n+1 chunk would also include some amount of files from the previous chunk so that there would be some degree of overlap). Anyway, this worked out well for me and if you are still adding to this library then I found it to be very useful.

The route I took is not going to be as robust as going through EVERY image each time but in my personal tests, the performance was close enough and the time savings were significant! Cheers,
new feature

opened by ALCarter2 1
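
The chunking with overlap described above can be sketched as a small generator. This is an illustration of the idea, not the contributor's code. `overlapping_chunks` and its parameters are hypothetical.

```python
def overlapping_chunks(files, chunk_size, overlap=0):
    """Yield consecutive slices of `files`; each slice after the first starts
    `overlap` items before the previous slice ended."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(files), step):
        yield files[start:start + chunk_size]
        if start + chunk_size >= len(files):
            break  # the last slice already reached the end of the list


files = list(range(10))
print(list(overlapping_chunks(files, chunk_size=4, overlap=1)))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Sorting the file list by timestamp before chunking keeps images taken close together in the same or adjacent chunks. Recording the index of the last finished chunk allows a run to be resumed.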

Releases(v2.4.5)

v2.4.5(Jan 1, 2023)
Major updates and bug fixes:

Fixed issue #42 where duplicate files in subfolders would be added twice to the search.result output dictionary

@stberg-os implemented the feature to disable recursive search: search within subfolders can now be turned off

Various other minor code updates

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.4...v2.4.5
v2.4.4(Aug 25, 2022)
Major code improvements & fixes

Fixed issue #37 where black and white images would not be correctly decoded.

Fixed issue where command line parameter -s / -similarity would not accept integers as input

Various other fixes in the code

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.3...v2.4.4
v2.4.3(Aug 24, 2022)
Please update to a higher version as a major issue was found in v2.4.3.

Major bug fix

Fixed issue #37 which caused difPy's output to be inaccurate.

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.2...v2.4.3
v2.4.2(Aug 21, 2022)
Please update to a higher version as a major issue was found in v2.4.2.

Bug fixes & minor code improvements

Fixed issue #33 where files with same filename and different folder would be put under the same key in the output results dictionary

Removed sort_output parameter as it became obsolete with the above fix

Support for setting the MSE threshold for comparison directly from the similarity parameter

Implemented handling for issue #32 where CTRL-C would not abort the difPy process when running in a terminal

Various other code improvements

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.1...v2.4.2
v2.4.1(Jul 10, 2022)
Minor code updates and bug fixes

Changed show progress parameter to default True: the progress bar of difPy will be shown by default

Added -Z / -output_directory parameter to CLI interface: allows setting the output folder of the result files

More detailed progress tracking: progress bar is shown when difPy is preparing the files in the target folder(s), and when difPy is comparing the images

Fixed an issue where search in subfolders was imprecise

@ethanmann fixed issue #25

Minor other code adjustments and bug fixes

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4...v2.4.1
v2.4(Jun 30, 2022)
Major new features and code improvements:

Enhancement #12 and #18: added support for search within subfolders

Enhancement #11: added support for usage through CLI interface

Improved path handling of files to be os-independent

Various minor code updates

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.3...v2.4
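As a rough sketch of what subfolder search with OS-independent paths can look like, the example below uses `pathlib` from the standard library. It is an assumption-laden illustration, not difPy's implementation: the function `find_images`, its `recursive` flag, and the `IMAGE_SUFFIXES` set are all hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions for illustration; not difPy's actual list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def find_images(folder, recursive=True):
    """Return image paths under `folder`, sorted for stable output (hypothetical helper)."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Small demonstration on a throwaway folder tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subfolder").mkdir()
    (root / "a.jpg").touch()
    (root / "subfolder" / "b.PNG").touch()
    (root / "notes.txt").touch()  # not an image, should be ignored

    flat = [p.relative_to(root).as_posix() for p in find_images(root, recursive=False)]
    nested = [p.relative_to(root).as_posix() for p in find_images(root)]
```

Building paths with `Path` and the `/` operator avoids hard-coding a path separator, which is one common way to keep path handling portable across operating systems.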
v2.3(Jun 29, 2022)
New features and code improvements:

Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/19: added support for a progress bar to track the process of difPy

Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/20: added support for generation of statistics on the difPy process

Fixed bug #17, which caused a FileNotFoundError when files were moved or deleted while difPy was running

Various updates & improvements to the code

Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.2...v2.3
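One way to tolerate files that disappear during a run, as in bug #17, is to catch `FileNotFoundError` per file and carry on. The sketch below is a generic pattern rather than the fix difPy actually shipped; the helper `read_existing` is hypothetical.

```python
import tempfile
from pathlib import Path


def read_existing(paths):
    """Read each file's bytes, skipping files that vanished after being listed."""
    loaded, skipped = {}, []
    for path in map(Path, paths):
        try:
            loaded[str(path)] = path.read_bytes()
        except FileNotFoundError:
            # The file was moved or deleted mid-run: record it and continue.
            skipped.append(str(path))
    return loaded, skipped


# Demonstration: list two files, delete one, then try to read both.
with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp) / "kept.jpg"
    gone = Path(tmp) / "gone.jpg"
    kept.write_bytes(b"kept")
    gone.write_bytes(b"gone")

    listed = [kept, gone]  # file list built up front
    gone.unlink()          # simulate a deletion while the run is in progress
    loaded, skipped = read_existing(listed)
```

Handling the exception at the point of access, instead of checking for existence beforehand, avoids the race between the check and the read.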
v2.2(Mar 6, 2022)
Minor updates to v2.0:

Various updates & improvements to the code

Support for silent deletion of images

v2.0(Dec 26, 2021)
Major code updates and various new features added:

difPy v2.0 runs 6x faster than previous versions

Support for search within two different folders

Support for sorting of output by filename alphabetically

Optimization and implementation of error handling

Various other code improvements

v1.2(Nov 10, 2021)

Updates & bug fixes to the code.

Fixed the issue where black and white images were not processed correctly.
v1.0.0(Oct 30, 2021)
Various updates to the code.

New features:

Automatically delete the lower resolution duplicate files that were found

Addition of a new similarity-level at which images are compared: now 3 levels can be chosen ("low", "normal" and "high")

Upload as package to PyPI.org
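The idea of named similarity levels can be sketched as a mapping from level to a mean squared error (MSE) threshold, where a lower MSE means more similar images. The changelog confirms that difPy compares by MSE, but everything else here is an assumption: the threshold values, the `<=` comparison, and the helper names `mse` and `is_duplicate` are illustrative only, and images are represented as plain nested lists of grayscale values to keep the example dependency-free.

```python
# Illustrative thresholds only; difPy's real values may differ.
THRESHOLDS = {"low": 1000.0, "normal": 200.0, "high": 0.1}


def mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(
        (a - b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )
    count = sum(len(row) for row in image_a)
    return total / count


def is_duplicate(image_a, image_b, similarity="normal"):
    """Treat two images as duplicates when their MSE is within the level's threshold."""
    return mse(image_a, image_b) <= THRESHOLDS[similarity]


original = [[10, 20], [30, 40]]
noisy = [[12, 20], [30, 38]]  # two pixels differ by 2

error = mse(original, noisy)                       # (4 + 0 + 0 + 4) / 4
normal_match = is_duplicate(original, noisy)       # within the "normal" threshold
strict_match = is_duplicate(original, noisy, "high")  # too different for "high"
```

A stricter level simply lowers the threshold, so near-duplicates that pass at "normal" can be rejected at "high".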
v0.0(Oct 30, 2021)

Initial release.

Owner

Technical Solutions Specialist @ Cisco Systems

GitHub Repository

PwnWiki Telegram database searching bot

pwtgbot PwnWiki Telegram database searching bot. Screenshots How it looks like in the terminal when running How it looks like in Telegram Run Directly

3 Jan 25, 2022

High level Python client for Elasticsearch

Elasticsearch DSL Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch. It is built o

3.6k Dec 30, 2022

Pysolr — Python Solr client

pysolr pysolr is a lightweight Python client for Apache Solr. It provides an interface that queries the server and returns results based on the query.

626 Dec 01, 2022

An open source, non-profit search engine implemented in python

Mwmbl: No ads, no tracking, no cruft, no profit Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and

639 Jan 04, 2023

Super Simple Similarities Service

95 Dec 25, 2022

Google Project: Search and auto-complete sentences within given input text files, manipulating data with complex data-structures.

Auto-Complete Google Project In this project there is an implementation for one feature of Google's search engines - AutoComplete. Autocomplete, or wo

10 Jun 20, 2022

A web search server for ParlAI, including Blenderbot2.

Description A web search server for ParlAI, including Blenderbot2. Querying the server: The server reacting correctly: Uses html2text to strip the mar

119 Jan 06, 2023

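Fetches Pixiv rankings and Pixiv image search, built on the RSSHub reader; a proxy is required when using it

Fetches Pixiv rankings and Pixiv image search, built on the RSSHub reader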

34 Dec 05, 2022

This project is a sample demo of Arxiv search related to AI/ML Papers built using Streamlit, sentence-transformers and Faiss.

49 Oct 30, 2022

esguard provides a Python decorator that waits for processing while monitoring the load of Elasticsearch.

Whoosh indexing capabilities for Flask-SQLAlchemy, Python 3 compatibility fork.

ForFinder is a search tool for folder and files

A real-time tech course finder, created using Elasticsearch, Python, React+Redux, Docker, and Kubernetes.

Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.

A fast, efficient Python package for searching and getting search results with many different search engines

A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

GitScanner is a script to make it easy to search for Exposed Git through an advanced Google search.

Jina allows you to build deep learning-powered search-as-a-service in just minutes

User-friendly, tiny source code searcher written in pure Python.

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema