jdupes: Jody Bruchon’s powerful duplicate file finder

If you’d like to financially support jdupes development, use the links below!


Latest News

I have wanted to rewrite jdupes for a very long time. Some of the decisions inherited from fdupes have been a real pain when trying to add features. Unfortunately, every attempt at a complete rewrite was too daunting a task, and several major overhauls have been started and subsequently abandoned. One of them has finally stuck, though, and it’s well on the way to becoming worthy of the version 2.0 milestone.

The biggest goal I’ve had is to replace the “file tree” data structure with something more flexible; thus, the size tree was born. There is no way for two files to be identical if their sizes are different, so grouping files by size before doing any further work makes a lot of sense! I’ve also been replacing code containing “knowledge” of the actual structures behind file information storage with more generic “get next thing” style interfaces that are more flexible in nature.
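To make the size-first idea concrete, here is a minimal Python sketch. It is not jdupes code (jdupes is written in C and its size tree is a different structure); the function name `group_by_size` is hypothetical, and a plain dictionary stands in for the tree.

```python
import os
from collections import defaultdict


def group_by_size(paths):
    """Bucket paths by file size (hypothetical helper, not a jdupes API).

    Files of different sizes can never be identical, so only files that
    share a bucket need any further comparison.
    """
    buckets = defaultdict(list)
    for path in paths:
        buckets[os.path.getsize(path)].append(path)
    # A bucket holding a single file cannot contain duplicates, so drop it.
    return {size: files for size, files in buckets.items() if len(files) > 1}
```

Only the buckets that survive this step would need hashing or byte-for-byte comparison, which is where the expensive I/O happens.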

One of the long-term goals is to parallelize jdupes. The widespread availability of multi-core processors, solid-state storage, and RAID storage means that a lot of opportunities to speed things up are lost in the current simple single-threaded operation paradigm. Abstracting out things in the right way will make it far easier to split work into multiple threads where it makes sense to do so. A specific case where a ton of performance is being left on the table is when files are being compared that are on different disks; a close second is exploiting the internally parallel nature of modern SSDs to perform parallel I/O on the same SSD.
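As a rough illustration of what splitting the work across threads could look like, the Python sketch below hashes several files concurrently. Everything here is an assumption for illustration: the names `hash_file` and `hash_files_parallel` are hypothetical, SHA-256 is a stand-in for whatever hash jdupes actually uses, and a thread pool is only one of several ways to overlap I/O.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1 << 20):
    """Hash one file in fixed-size chunks (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files_parallel(paths, workers=4):
    """Hash many files at once so reads on different disks can overlap."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its hash.
        return dict(zip(paths, pool.map(hash_file, paths)))
```

With files spread over several disks, or on an SSD that handles concurrent requests well, the reads can proceed in parallel instead of waiting on one another. On a single spinning disk, more workers may just cause extra seeking.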

As of this news post, I’m in the process of writing a “query” interface that will give other code modules a standard way to acquire information about discovered duplicate sets and make better decisions. The file tree information model made it very difficult to add features such as hard linking to the files with the highest hard link count first or intelligently establishing hard links when duplicates exist on different volumes. This rigidity led to a long-standing problem with reproducible builds in Debian that I couldn’t solve very easily. The only real solution to the problem of reproducible results was to nuke the file tree paradigm and the tree-specific code in all of the action modules in favor of a new system that allows for querying and sorting the data properly after duplicate scanning is done.
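The kind of question such a query layer might answer can be sketched in a few lines of Python. This is a guess at the shape of the problem, not the actual jdupes interface: the function name `link_targets_by_volume` is hypothetical, and it relies only on the standard `os.stat` fields `st_dev` and `st_nlink`.

```python
import os
from collections import defaultdict


def link_targets_by_volume(duplicate_set):
    """Order one duplicate set per volume, most-linked file first.

    Hard links cannot cross volumes, so the files are split by device
    before choosing which file the others should be linked to.
    """
    by_device = defaultdict(list)
    for path in duplicate_set:
        st = os.stat(path)
        by_device[st.st_dev].append((st.st_nlink, path))
    plan = {}
    for device, entries in by_device.items():
        # Highest link count first; the path breaks ties so that the
        # result does not depend on directory traversal order.
        entries.sort(key=lambda e: (-e[0], e[1]))
        plan[device] = [path for _, path in entries]
    return plan
```

The first path in each list would be the natural link target on that volume. Sorting with an explicit tie-breaker after scanning is finished is also what makes the output reproducible from run to run.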

No more forced 2FA

jdupes, libjodycode, and all of my other software projects are migrating to Codeberg immediately.

Codeberg is not ideal due to the presence of rules-lawyering-prone language in the “Allowed Content & Usage” section of the Terms of Service, but they are a major improvement over GitHub because they don’t force 2FA and they aren’t run by Microsoft. The GitHub AUP has abusive language similar to the Codeberg TOS, so the ideological toxicity of this new overlord is no different than that of the old one. If any issues ever arise, I’ll just fire up my own Git server and give all the overlords the finger.

Killing GitHub through attrition

Once migration is complete, I’ll be locking the GitHub repos as read-only forever, though I am inclined to delete them outright, since the only power people have over giant corporations is to stop feeding them time, money, content, and other resources. That’s why I destroyed my several years of contributions to Reddit. It’s more convenient to use the offerings of these companies, but using them results in horrible abuses of the users and stifling of freedoms in the long run.