News: hard link hash cross-copying

It’s faster to not do the same work twice

The next release of jdupes will have a feature that users won’t see but that will make a big difference for people who scan lots of hard-linked files: cross-copied hashes. When jdupes detects that two files being considered for comparison are hard linked and one of those files has already had some hashing work performed, the file with more work done will have its hashes copied to the file with less work done. This can make a big difference in performance since it eliminates extra work.

In jdupes, files are checked in pairs. If four duplicate files (A B C D) are to be checked then six comparisons need to be done (AB, AC, AD, BC, BD, CD). The way that jdupes worked in ancient times would require all four files to be read in their entirety three times each in order to reliably state that all of them are duplicates of one another. One day, the hard link matching optimization was added, making any hard linked files match with no reads at all. If two files (let’s say A and B) were hard linked, they’re literally the same piece of data on disk, thus they’re obvious duplicates without checking any data. However, while this eliminated the two full file reads that would be triggered by the AB candidate pair, this didn’t help at all with the other five pairs. The hard link relationship doesn’t apply to them, so it makes no difference in those matches.

However, there is an incredibly easy optimization that can be applied here: if the AB comparison happens after any other comparison (AC, BD, whatever) that required hashing A or B as part of the process then the hashes from one applies to the other. This is where cross-copied hashes come into play. Let’s say that the AC pair is scanned and found to be duplicate, then the AB pair is scanned. A and C have been both partially and fully hashed but B hasn’t yet. The hard link match would normally mark both as matched and move on, but applying cross-copied hashes before moving on results in A’s hashes being copied to B, so now B has been fully hashed without reading B (which is the same data as A, after all!) That means one less full file read when the BC or BD pairs are checked. This seems small, but in a large data set with lots of hard links it can represent thousands of files that aren’t unnecessarily read twice. This can make the difference between data being blown out of the OS disk caches or not. It’s small but it can really add up.