[Utility][Python] DDFF - Duplicate Directories and Files Finder

This is my old utility I wrote in June 2018. Left almost intact. I still use it. Download.

5-Jan-2020: updated to Python 3.x.

It also works fine under Cygwin.

# DDFF - Duplicate Directories and Files Finder

... by dennis(a)yurichev.com

Written in Python 3.x.

So far, only for Linux, sorry.

Finds both duplicate files and directories.

## For users

Just run it:

	./ddff.py ~

or:

	python ddff.py ~

Multiple directories allowed:

	./ddff.py ~/Music /mnt/external_HDD/Music

Here is an example: the script processed Linux kernel source trees for versions 2.6.26, 2.6.31, 3.10.43, 3.18.37, 3.2.1, 4.11 and 4.1.22.

* dir size=5.9MB
* dir size=3.5MB
* dir size=2.3MB
* dir size=1.8MB
* dir size=1.7MB
* dir size=1.1MB
* file size=1.1MB
* file size=1.0MB


( linux_ddff.txt )

Now you can see which files and directories (larger than 100KB) haven't been modified across several Linux kernel versions.

By default, only files/directories larger than 1MB are dumped.
Modify the LIMIT variable in ddff.py to change this.

## Internals, etc

Files are not compared as a whole; rather, five spots of 4KB of consecutive bytes each are taken from each file and hashed using SHA256.
Hashes are then compared.
File/directory names are, of course, not compared.
If the entropy of these five spots is suspiciously low (less than 7 bits per byte), the whole file is hashed.
Read more about entropy in my ["Reverse Engineering for Beginners"](https://beginners.re/) book.
If you feel paranoid, turn on the "PARANOID" option in the ddff.py file, and full hashes will be calculated for each file (this is just slow, especially for videos).

Rationale: for compressed files, these five 4KB spots seem to be enough, maybe even fewer.
However, in a low-entropy text/executable file, a patched byte (or several) can be located between the spots,
and the files would erroneously be treated as the same.
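
The sampling scheme described above can be sketched roughly as follows. This is only an illustration, not the actual ddff.py code: the even spacing of the spots, the fallback to a full hash for small files, and all names (`shannon_entropy`, `full_hash`, `file_hash`, the constants) are assumptions made for the example.

```python
import hashlib
import math
import os
from collections import Counter

SPOT_SIZE = 4096          # 4KB per spot, as in the text
SPOT_COUNT = 5            # five spots per file, as in the text
ENTROPY_THRESHOLD = 7.0   # bits per byte, as in the text


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of data in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in Counter(data).values())


def full_hash(path: str) -> str:
    """SHA256 of the whole file, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def file_hash(path: str, paranoid: bool = False):
    """Return (kind, hexdigest), where kind is 'full' or 'partial'."""
    size = os.path.getsize(path)
    # Assumption: files too small to hold five separate spots are hashed fully.
    if paranoid or size <= SPOT_SIZE * SPOT_COUNT:
        return "full", full_hash(path)
    # Assumption: spots are spread evenly from the start to the end of the file.
    step = (size - SPOT_SIZE) // (SPOT_COUNT - 1)
    spots = []
    with open(path, "rb") as f:
        for i in range(SPOT_COUNT):
            f.seek(i * step)
            spots.append(f.read(SPOT_SIZE))
    sample = b"".join(spots)
    # Low entropy (text, executables): sampling is unreliable, hash everything.
    if shannon_entropy(sample) < ENTROPY_THRESHOLD:
        return "full", full_hash(path)
    return "partial", hashlib.sha256(sample).hexdigest()
```

Calling `file_hash(path)` on a large compressed file or video would normally return a `'partial'` hash, while a large log file would fall back to `'full'`.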

Directories are compared using Merkle trees;
read my [short example](compare_two_folders.md) of what this is.
Merkle trees are also used in torrents and blockchains.
I.e., a SHA256 hash is also calculated for every directory.
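
A minimal sketch of the Merkle-tree idea for directories: the hash of a directory is derived from the sorted hashes of its children, recursively, and names play no role. How ddff.py actually combines the child hashes is not stated here, so the combination step and the names `sha256_file` and `dir_hash` are assumptions made for the example.

```python
import hashlib
import os


def sha256_file(path: str) -> str:
    """SHA256 of the whole file, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_hash(path: str) -> str:
    """Hash a directory from the sorted hashes of its children; names are ignored."""
    child_hashes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                child_hashes.append(dir_hash(entry.path))      # subtree hash
            elif entry.is_file(follow_symlinks=False):
                child_hashes.append(sha256_file(entry.path))   # leaf hash
    h = hashlib.sha256()
    # Sorting makes the result independent of the listing order.
    for child in sorted(child_hashes):
        h.update(child.encode("ascii"))
    return h.hexdigest()
```

Two directories with the same contents then get the same hash even if their files and subdirectories are named differently.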

File hashes are then stored (serialized) in the ddff.db file
(Python's [pickle library](https://docs.python.org/3/library/pickle.html) is used).
This is a text file; in it you can see the filename, SHA256 hash, file size, and modification time for each file,
 and which hash (full/partial) is stored.
Preserve it, so that DDFF does not need to reread files again.
However, if you reorganize your file structure significantly, you can delete it.
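
A hedged sketch of how such a cache might work. The record layout (a dict keyed by path), the helper names `load_db`, `save_db` and `cached_hash`, and the use of pickle protocol 0 (the human-readable one, chosen here to match the "text file" remark) are all assumptions, not the actual ddff.py format.

```python
import os
import pickle

DB_PATH = "ddff.db"  # file name from the text


def load_db(db_path: str = DB_PATH) -> dict:
    """Load the cache, or start with an empty one if the file does not exist."""
    try:
        with open(db_path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}


def save_db(db: dict, db_path: str = DB_PATH) -> None:
    """Store the cache; protocol 0 is pickle's ASCII, human-readable protocol."""
    with open(db_path, "wb") as f:
        pickle.dump(db, f, protocol=0)


def cached_hash(db: dict, path: str, compute) -> dict:
    """Return the record for path, recomputing only if size or mtime changed.

    compute is a hypothetical callable returning (kind, hexdigest),
    where kind is 'full' or 'partial'.
    """
    st = os.stat(path)
    rec = db.get(path)
    if rec and rec["size"] == st.st_size and rec["mtime"] == st.st_mtime:
        return rec  # cache hit: the file is not read again
    kind, digest = compute(path)
    rec = {"hash": digest, "kind": kind, "size": st.st_size, "mtime": st.st_mtime}
    db[path] = rec
    return rec
```

Because records in this sketch are keyed by path, moving files around turns old entries into dead weight, which is one reason to delete the file after a big reorganization.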

The interface of the script is somewhat user-unfriendly.
I did the script just for myself.
If someone wants to do more, like a GUI, a win32 version, etc., take it and modify it freely.
Or write a new one, using these algorithms.

# How to be sure two folders have exactly the same contents?

Get SHA1 sum of every file in the current tree:

	find . -type f -print0  | xargs -0 sha1sum


	a58c2b2f2b1dad1ca0d8902fe43a9b2c97769d62  ./src.bak/patterns/11_arith_optimizations/mult_shifts_Keil_ARM_O3.s
	0a91f5aa4f8ee143dba531b21404c0968e96f7de  ./src.bak/patterns/11_arith_optimizations/exercises.tex
	2c41af9373f05675351f0b8e550bdba16cbb7d20  ./src.bak/patterns/11_arith_optimizations/div_EN.tex
	da39a3ee5e6b4b0d3255bfef95601890afd80709  ./src.bak/patterns/02_stack/TODO_rework_listings
	50eb41bfd9f5f2e33c9b30ea7d0c30022d10beff  ./src.bak/patterns/02_stack/main_RU.tex
	d12c4a8ebf5602a5fcae304955ff5b0f1968217e  ./src.bak/patterns/02_stack/main_PTBR.tex
	08c3b5df08f266c2972a631a6a3667055d34daa1  ./src.bak/patterns/02_stack/global_args.asm
	da39a3ee5e6b4b0d3255bfef95601890afd80709  ./src.bak/patterns/02_stack/04_alloca/TODO_rework_listings

Sort all the hashes:

	find . -type f -print0  | xargs -0 sha1sum | cut -f 1 -d ' ' | sort


Get SHA1 sum of the list of sorted hashes:

	find . -type f -print0  | xargs -0 sha1sum | cut -f 1 -d ' ' | sort | sha1sum

This is how we get a hash of the whole directory, which can be compared with that of another directory.
Sorting must be performed, because the files of two equal directories can be listed in a different order.
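
For reference, roughly the same computation in Python. This is a sketch and not a drop-in replacement for the pipeline: the function names `sha1_file` and `tree_hash` are made up for the example, and it is only expected to agree with the shell version in ordinary cases (at least one regular file, no unusual file names that make `sha1sum` escape its output lines).

```python
import hashlib
import os


def sha1_file(path: str) -> str:
    """SHA1 of the whole file, read in 1MB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_hash(root: str) -> str:
    """SHA1 of the sorted list of SHA1 sums of all regular files under root."""
    digests = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Like `find -type f`: regular files only, symlinks are skipped.
            if os.path.isfile(full) and not os.path.islink(full):
                digests.append(sha1_file(full))
    # One hash per line, sorted, as `cut | sort` produces.
    listing = "".join(d + "\n" for d in sorted(digests))
    return hashlib.sha1(listing.encode("ascii")).hexdigest()
```

Running `tree_hash(".")` on both sides of a transfer and comparing the two strings gives the same kind of check as the one-liner above.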

This is very close to a Merkle tree; the hash of a torrent is calculated in a similar way.

I use this when transferring my old archives to a remote server, to make sure the whole file tree has been copied correctly over the network.
