Why doesn’t jdupes use bigger partial hashes?
Publishing open source software invites thousands of people to look over your work and find ways to make it better. Some of the suggestions for improving jdupes involve changing the way that files are examined to try to reject non-matches more quickly. One of those suggestions is increasing the size of the partial hash from 4 KiB to a larger power-of-two value. This suggestion is always rejected.
What does the “partial hash” do?
The purpose of a duplicate file finder is to find identical files and do something with them: print a list, delete them, hard link them, and so on. You could simply compare every file’s contents against every other file’s contents, but this becomes an impossibly slow task as the number of files gets larger. The real purpose of a duplicate file finder is to quickly find reasons that files can’t possibly be duplicates before comparing the files directly.
Part of this process involves reading a tiny piece of the beginning of every file and comparing those tiny pieces to see if they’re different. Instead of comparing the pieces directly, it’s faster to compute large numbers called hashes that act as a substitute for the pieces of data, then compare those hashes. The same data will always produce the same hash, while different data will almost always (but not always) produce a different hash. If the hashes are different then the data is different; therefore, the files are different and can’t be a match. These hashes of a small part of the file are called “partial hashes.” Only files whose partial hashes match go on to have their entire contents hashed in the same way.
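The two-stage check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how jdupes is implemented: SHA-256 is a stand-in for whatever hash jdupes actually uses, and the names hash_file, group_by, and candidate_duplicates are made up for this example. A real duplicate finder would still compare the surviving files directly before acting on them.

```python
import hashlib
from collections import defaultdict

PARTIAL_SIZE = 4096  # 4 KiB, the partial hash size discussed in this article


def hash_file(path, limit=None, chunk_size=65536):
    """Hash the first `limit` bytes of a file, or the whole file if limit is None."""
    digest = hashlib.sha256()  # stand-in hash, assumed for illustration only
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(want)
            if not chunk:
                break  # end of file reached
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.digest()


def group_by(paths, key):
    """Group paths by key(path) and keep only groups with two or more members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def candidate_duplicates(paths):
    """Return groups of files whose partial hashes and full hashes both match."""
    result = []
    for partial_group in group_by(paths, lambda p: hash_file(p, PARTIAL_SIZE)):
        # Only files that survive the cheap partial check get a full read.
        result.extend(group_by(partial_group, hash_file))
    return result
```

A file whose first 4 KiB is unique is dropped after a single small read and never read in full, which is where the time savings come from.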
Default partial hash size will not change
The rationale behind choosing a minimum for the partial hash size is surprisingly simple. Nearly all storage media created in the past decade use a sector size of 4,096 bytes (4 KiB, or four binary kilobytes). If you only ask to read 1 byte from a file, the operating system still receives a 4 KiB piece of data to service this request. All modern filesystems also default to 4 KiB as the smallest storage unit; Windows’s FAT, exFAT, and NTFS filesystems call it “cluster size” while most others call it “block size.” The default partial hash size in jdupes is 4 KiB because that’s the smallest size that everything in modern computer storage works with. A partial hash size smaller than 4 KiB would increase the chance of false positives while providing zero performance benefit.
Now that the 4 KiB minimum makes sense, why not make it bigger? A bigger partial hash will reject more non-matching files and avoid a lot of unnecessary work…right? Unfortunately, real-world testing shows that this doesn’t work out as expected.
I tested both sizes on the same data set of already de-duplicated mixed media (images, videos, animated GIFs, HTML and CSS text files). Out of 24K non-matching files, doubling the partial hash size from 4 KiB to 8 KiB only avoided 5 full file reads out of 932, a 0.5% improvement, but required reading an extra 3.7 MiB of data to do so. It’s possible that the 5 skipped files added up to more than 3.7 MiB in size, thus resulting in a performance increase with the larger partial hash size, but it’s not likely given the nature of the data. I’ve run several such tests on several different data sets and found that most data follows this same pattern: if it’s not different in the first 4 KiB block, it’s extremely unlikely to be different in the first 8 KiB, or 16 KiB, or 32 KiB either.
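To put the trade-off in numbers, here is a rough break-even calculation using the figures from this test. It assumes a deliberately simple cost model in which only bytes read matter (seek times and caching are ignored), and the function name break_even_avg_size is made up for this example.

```python
MIB = 1024 * 1024


def break_even_avg_size(extra_partial_bytes, full_reads_avoided):
    """Average size the avoided files must exceed for the bigger partial hash to pay off."""
    return extra_partial_bytes / full_reads_avoided


extra = 3.7 * MIB        # extra data read with 8 KiB partial hashes (from the test above)
avoided = 5              # full file reads avoided by the larger partial hash
total_full_reads = 932   # full file reads with 4 KiB partial hashes

print(f"avoided {avoided / total_full_reads:.1%} of full reads")
# prints: avoided 0.5% of full reads
print(f"break-even average size: {break_even_avg_size(extra, avoided) / MIB:.2f} MiB")
# prints: break-even average size: 0.74 MiB
```

Under this model, the 8 KiB partial hash only comes out ahead if the 5 skipped files average more than about 0.74 MiB each.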
Every doubling of the partial hash size also doubles the amount of data read to generate those hashes. On this data set, going from 8 KiB to 16 KiB adds roughly another 7.4 MiB of reads, going from 16 KiB to 32 KiB adds roughly 14.8 MiB more, and so on. The only scenario where larger partial hashes provide a benefit is when working with large files that have identical 4 KiB blocks of data at the start. It depends entirely on the data you’re working with. For general usage, 4 KiB partial hashing provides the best balance between unnecessary partial-file reads and unnecessary full-file reads.
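The growth in read cost is easy to model. The sketch below uses hypothetical file sizes (not the data set from the test above) and a made-up helper named partial_read_bytes. It also shows that doubling is an upper bound: files smaller than the partial hash size are read whole, so they stop contributing extra reads.

```python
KIB = 1024


def partial_read_bytes(file_sizes, partial_size):
    """Total bytes read to partial-hash every file; short files are read whole."""
    return sum(min(size, partial_size) for size in file_sizes)


# Hypothetical file sizes in bytes, chosen only for illustration.
sizes = [2 * KIB, 6 * KIB, 100 * KIB, 5000 * KIB]

for partial_kib in (4, 8, 16, 32):
    total = partial_read_bytes(sizes, partial_kib * KIB)
    print(f"{partial_kib:>2} KiB partial hash: {total // KIB} KiB read")
# prints:
#  4 KiB partial hash: 14 KiB read
#  8 KiB partial hash: 24 KiB read
# 16 KiB partial hash: 40 KiB read
# 32 KiB partial hash: 72 KiB read
```

The two large files double their share of the reads at every step, while the 2 KiB and 6 KiB files cap out early. A data set dominated by large files pays close to the full doubling each time.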
You will be able to change partial hash size
As stated earlier, data sets do exist where a larger partial hash size can drastically speed things up. Because of this, jdupes has a planned feature that will let you change the partial hash size on the command line, similar to how the I/O chunk size can be increased to improve performance on traditional “spinning rust” hard disk drives. I’ll update this page when that feature goes live.