I have wanted to rewrite jdupes for a very long time. Some of the decisions inherited from fdupes have been a real pain when trying to add features. Unfortunately, every attempt at a comtoreplete rewrite was too daunting of a task, and several major overhauls have been started and subsequently abandoned. One of them has finally stuck, though, and it’s well on the way to becoming worthy of the version 2.0 milestone.
The biggest goal I’ve had is to replace the “file tree” data structure with something more flexible; thus, the size tree was born. There is no way for two files to be identical if their sizes are different. Grouping files into sizes before further work makes a lot of sense! I’ve also been replacing code containing “knowledge” of the actual structures behind file information storage with more generic “get next thing” style interfaces that are more flexible in nature.
One of the long-term goals is to parallelize jdupes. The widespread availability of multi-core processors, solid-state storage, and RAID storage means that a lot of opportunities to speed things up are lost in the current simple single-threaded operation paradigm. Abstracting out things in the right way will make it far easier to split work into multiple threads where it makes sense to do so. A specific case where a ton of performance is being left on the table is when files are being compared that are on different disks; a close second is exploiting the internally parallel nature of modern SSDs to perform parallel I/O on the same SSD.
As of this news post, I’m in the process of writing a “query” interface that will give other code modules a standard way to acquire information about discovered duplicate sets and make better decisions. The file tree information model made it very difficult to add features such as hard linking to the files with the highest hard link count first or intelligently establishing hard links when duplicates exist on different volumes. This rigidity led to a long-standing problem with reproducible builds in Debian that I couldn’t solve very easily. The only real solution to the problem of reproducible results was to nuke the file tree paradigm and the tree-specific code in all of the action modules in favor of a new system that allows for querying and sorting the data properly after duplicate scanning is done.