Articles - jdupes

June 16, 2024June 16, 2024Uncategorized

Stop using –long-command-line-options

I’ve developed a pet peeve over time as I’ve seen others show their command line usage of jdupes in issue reports and examples. I’m sick of seeing long options.

The only way in which long options are good is that they are self-documenting. If you see --recurse then you have a pretty good idea what the option does while a simple -r doesn’t tell you anything meaningful. This is helpful to people who have no clue what’s going on or what they’re doing because they don’t need to look up anything. That’s where the value ends.

The case against long options

Rather than subject you to a wordy rant, here’s a concise list of why long options shouldn’t be used for almost everyone that’s not a total newbie:

Too much unnecessary typing. jdupes -rLHZQO1y . takes a lot less time to type than jdupes --recurse --link-hard --hard-links --soft-abort --quick --param-order --one-file-system --hash-db=.
They’re not actually self-documenting. What’s the difference between --link-hard and --hard-links in the previous command? The only way to know is to read the help text, but that’s the same thing you’d do to learn what the single-character options do.
The apparent self-documenting nature is deceptive. It can lead people to not read the help text and understand what things actually do. -Q/--quick is a speed tradeoff has a small but serious risk of data loss. One check of the help text to find out what -Q means will reveal this, but --quick implies faster, so why check?
Higher verbosity is more difficult to work within. Imagine working with the short and long examples above. If you know what a character option does, you can mentally parse the intent of -rLHZQ1Oy . very quickly, but the long option counterpart takes more time to scan and absorb. Editing a command line full of long options is tedious, especially if you don’t know that CTRL+left/right skips entire words at a time. It is also mentally very easy to swap “hard links” and “link hard” since they are nearly identical and cognitive swap errors are not uncommon.
Having both short and long options can double cognitive load. I need to deal with scripts or issue reports and they could use long options instead of short options. Now I have two separate things to memorize instead of one. I actually had to look up several long options for the above examples because I never use them.

I don’t think that long options should completely go away. They have their place, especially if a complex program starts to exhaust the limited set of single-character options available; in fact, this is an issue which jdupes is already brushing up against, with several options getting random characters because there are only two ways to write the letter R or L or S. I think long options are a niche tool that shouldn’t be used for most purposes. hdparm notoriously uses --please-destroy-my-drive as gatekeeping for options with very dangerous consequences and I think that’s fine, but when someone’s using my duplicate scanner tool to find duplicates including hard links and delete them, I’m always going to get a little angry inside when I see

--recurse --hard-links --delete --no-prompt

instead of

-rHdN

February 11, 2024June 16, 2024Uncategorized

Bring back struct dirent->d_namlen

Most UNIX-like operating systems built around the C programming language adhere to two standards: POSIX.1 and the Single UNIX Specification. The specific part we’re interested in for this article is the structure of a directory entry (“struct dirent”) as specified in dirent.h. The only field of a dirent that’s guaranteed to exist is the actual file name of the entry (d_name), but many other optional dirent members exist out there. The inode or serial number of the entry (d_ino, or d_fileno on BSD) is technically an X/Open System Interface (XSI) extension but almost everything in use today provides it. d_off, d_reclen, and d_type are all available on most UNIX-like systems. The variances in what dirent members are supported by each compilation environment are a notable source of compilation failures, portability issues, and general confusion.

What we’re interested in today is a lesser-known dirent member that is available on BSD systems, QNX (under a slightly different name), and not much else: d_namlen, the length of d_name. SUS doesn’t specify it, and neither does POSIX.1 or Android’s Bionic C library. glibc and some other C libraries provide a macro _DIRENT_HAVE_D_NAMLEN to discover that it’s not supported on most systems. Linus Torvalds himself has said that” d_namlen should go away” while “d_reclen actually makes sense.” It’s this claim combined with my recent experience with libjodycode that has motivated this article.

Linux, POSIX, and SUS have all shunned d_namlen. The logic often seen for its rejection is “just do strlen(dirent->d_name) instead because that’s all it is anyway” and the most rationale I’ve seen has come from Linus Torvalds in 1995:

I personally would like to totally forget about “d_namlen”, for a couple of reasons:
– it’s not POSIX. Thus program which uses it is broken.
– SVR4 doesn’t seem to have it, so programs that use it are doubly broken.
– it’s useless. Any broken program which uses it can trivially be altered to use “strlen(dirent->d_name)” instead.
…Compared to d_namlen, d_reclen actually makes sense.

This isn’t good enough for me. Standards like POSIX define what shall be present but generally don’t prohibit providing more. SUSv2 doesn’t have d_off, d_reclen, d_namlen, or d_type. The same is true for POSIX.1-2017. That leaves the third point: “it’s useless.” This is objectively incorrect. I’m here to argue that we should bring back d_namlen and enjoy the improvement in software performance that it can bring to the table.

Rationale for resurrecting d_namlen

The name length is already calculated and known to all OS kernels. It should be obvious that BSD systems already have d_namlen built in at every level since it’s available at user level already; this includes all versions of macOS since 2001. The Linux kernel has the name length flying around under the hood but the getdents() system call API doesn’t include it. Just like Linux, the Windows FindFirstFile() call doesn’t make it available but every Windows Driver Model (WDM) FILE_OBJECT has a FileName member that includes the length in bytes.
Many C programs need the file name length and use strlen() or similar behavior to get it. Examples include jdupes (loaddir.c), rsync (flist.c), BusyBox (libbb), GNU Coreutils (ls.c).
Knowing length ahead of time opens up several optimizations. A great example is using memcpy() instead of strcpy() which can take advantage of SSE and AVX to move the data more quickly. An even better example is simply skipping string length calculations altogether as seen in the current jdupes development branch.

How libjodycode is bringing it back

The original idea to pass file name length from readdir() calls back into applications came to me while working on the Windows side of libjodycode. The Windows Unicode support requires that file names read with e.g. FindFirstFileW() have their lengths counted for allocation operations. Why not pass that completed work into jdupes? Why should jdupes always have to use strlen(dirent->d_name), duplicating the work we’ve already done in libjodycode? Exploring this idea is how I discovered d_namlen and decided to include it in the Windows definition of a libjodycode dirent structure so it could be passed along. Most of the Linux/BSD/macOS side of libjodycode functions as a pass-through; that is, jc_readdir() just calls readdir() and uses the existing struct dirent definitions for whatever system it’s built on. Adding d_namlen would require tons of extra data copying that would hurt far more than having d_namlen would help. Even worse: Linux provides no equivalent, so d_namlen would be calculated with strlen() even if not used later.

Enter jc_get_d_namlen()! This new function included beside jc_readdir() allows a libjodycode program to extract the length of d_name in the most efficient way possible on the platform. On Windows it takes advantage of the d_namlen as provided by JC_DIRENT. On BSD and macOS it uses d_namlen already provided by struct dirent.

Remember that 1995 opinion by Linus about d_reclen making sense and how I said he was wrong? On Linux, structs are padded to 4- or 8-byte boundaries for efficiency, so doing some math against d_reclen only gives you the allocated size of the name, not the actual name length. Fortunately, this still makes it possible to skip over part of the name without checking it. In the absence of d_namlen provided by either the OS or a JC_DIRENT, the d_reclen size is used to calculate a skip count, then perform strlen() only on the last few bytes.

Of course, if d_reclen and d_namlen are both unavailable, jc_get_d_namlen() simply calls strlen() without any other work.

Synthetic benchmarks on Linux non-recursively running jc_readdir() 20,000 times against /usr/include and using write() to print the contents show that the jc_get_d_namlen() code is up to 13% faster than using strlen() directly, with Valgrind showing a similar drop in total CPU instructions executed. The worst performance boost I managed to achieve in all of my benchmarking was 0.7%. BSD/macOS and Windows should see even larger performance improvements since d_namlen is directly available and requires none of the work behind the d_reclen skip. I encourage anyone reading this and writing Linux C programs to steal my d_reclen skip code and see how much of a difference it makes.

The moral of the story is that only a fool duplicates their effort just to end up in the exact same place as the first time around.

December 7, 2023December 7, 2023Uncategorized

News: working toward jdupes 2.0

I have wanted to rewrite jdupes for a very long time. Some of the decisions inherited from fdupes have been a real pain when trying to add features. Unfortunately, every attempt at a comtoreplete rewrite was too daunting of a task, and several major overhauls have been started and subsequently abandoned. One of them has finally stuck, though, and it’s well on the way to becoming worthy of the version 2.0 milestone.

The biggest goal I’ve had is to replace the “file tree” data structure with something more flexible; thus, the size tree was born. There is no way for two files to be identical if their sizes are different. Grouping files into sizes before further work makes a lot of sense! I’ve also been replacing code containing “knowledge” of the actual structures behind file information storage with more generic “get next thing” style interfaces that are more flexible in nature.

One of the long-term goals is to parallelize jdupes. The widespread availability of multi-core processors, solid-state storage, and RAID storage means that a lot of opportunities to speed things up are lost in the current simple single-threaded operation paradigm. Abstracting out things in the right way will make it far easier to split work into multiple threads where it makes sense to do so. A specific case where a ton of performance is being left on the table is when files are being compared that are on different disks; a close second is exploiting the internally parallel nature of modern SSDs to perform parallel I/O on the same SSD.

As of this news post, I’m in the process of writing a “query” interface that will give other code modules a standard way to acquire information about discovered duplicate sets and make better decisions. The file tree information model made it very difficult to add features such as hard linking to the files with the highest hard link count first or intelligently establishing hard links when duplicates exist on different volumes. This rigidity led to a long-standing problem with reproducible builds in Debian that I couldn’t solve very easily. The only real solution to the problem of reproducible results was to nuke the file tree paradigm and the tree-specific code in all of the action modules in favor of a new system that allows for querying and sorting the data properly after duplicate scanning is done.

November 10, 2023November 10, 2023Uncategorized

News: fix for v1.27.3, libjodycode 4 work

I’ve backported a fix in the current development version of jdupes to v1.27.3 which avoids losing the existing hash database on a write failure:

0001-hashdb-backport-safety-fixes-from-master.patch Download

This fix writes out the in-memory hash database to a new temporary file, then removes the existing database and renames the temporary file to the original file name. If anything goes wrong during creation of the new database then the original database is not overwritten or deleted. I ran into this issue because the disk I was deduplicating ran out of disk space and the hash database file was destroyed as a result. I’ve emailed the patch to several jdupes distribution package maintainers but I can’t possibly find everyone. This patch should be applied for anyone who is distribution v1.27.3 anywhere.

I’ve been working on a major update to libjodycode and I’ve converted most of the file management functions called by jdupes that need Windows-specific support into libjodycode wrapper functions. Most of the functions simply pass through the call on non-Windows platforms. I’m hoping to have a few more basic functions ready to go before libjodycode 4 is officially released.

October 5, 2023October 5, 2023News

News: GitHub is OUT, Codeberg is IN

No more forced 2FA

jdupes, libjodycode, and all of my other software projects are migrating to Codeberg immediately.

Codeberg is not ideal due to the presence of rules lawyering-prone language in the “Allowed Content & Usage” section of the Terms of Service, but they are a major improvement over GitHub because they don’t force 2FA and they aren’t run by Microsoft. The GitHub AUP has abusive language similar to the Codeberg TOS so the ideological toxicity of this new overlord is no different than that of the old one. If any issues ever arise I’ll just fire up my own Git server and give all the overlords the finger.

Killing GitHub through attrition

Once migration is complete, I’ll be locking the GitHub repos as read-only forever, though I am inclined to delete them outright since the only power people have over giant corporations is no longer feeding them your time, money, content, and other resources. That’s why I destroyed my several years of contributions to Reddit. It’s more convenient to use the offerings of these companies, but using them results in horrible abuses of the users and stifling of freedoms in the long run.

October 4, 2023November 10, 2023News

News: major libjodycode rework, GitHub exit on hold, hashdb is great

libjodycode work in progress

I am in the middle of a major upgrade to libjodycode that will convert the library from a random collection of code I use across several of my projects into a Windows C Unicode portability library. One of the major disadvantages of compiling against the MinGW runtime is the lack of native Windows Unicode support. The largest part of libjodycode has always been the Windows Unicode support code that transparently converts to and from UTF-16 and handles quirky differences between Windows and UNIX-like platforms. What I’m working on is a conversion from a cobbled-together mess of code I copy-pasted from jdupes into a library that effectively replaces several basic C library calls such as access() and fopen() with versions that work with Windows Unicode without modification.

Unfortunately, while I’m working on libjodycode and all of my programs that interface with it to generalize everything out, you’re guaranteed to encounter problems if you try to use the latest master branches for libjodycode and jdupes. These will continue to be broken without notice until I’m done overhauling libjodycode.

GitHub exit on hold

In the August 28, 2023 news post, I said that I’d be leaving GitHub, meaning jdupes would also be leaving GitHub and moving somewhere else. It’s October 4 and that hasn’t happened yet. The (brief) reasons for this are:

GitHub replaced the insane CAPTCHA for setting up 2FA with something reasonable that actually worked when I tried it, and
I moved my computer consultancy and repair business out of my home and into a real office again–a process that put me behind on work by several weeks.

jdupes, libjodycode, and all my other software projects will still be removed from GitHub in the future, but the timeline is not certain right now. I’ll post another news item when I have a concrete transition date.

Hash database feature is awesome!

Remember that hash database feature I added to jdupes in August 2023? I’ve been using it on lots of different data sets since then and I’m amazed at how much time it saves! One of my residential PC repair customers doesn’t delete the pictures from the phone after phone picture import sessions; I was able to give them a one-line batch file that uses jdupes with the hashdb feature to remove all duplicate photo files in just a few seconds. They can run it immediately after importing photos to clean up after the import rather than changing their existing habits. Despite hashdb missing several basic features, it’s still insanely useful when used carefully.

August 31, 2023March 13, 2024News

News: hard link hash cross-copying

It’s faster to not do the same work twice

The next release of jdupes will have a feature that users won’t see but that will make a big difference for people who scan lots of hard-linked files: cross-copied hashes. When jdupes detects that two files being considered for comparison are hard linked and one of those files has already had some hashing work performed, the file with more work done will have its hashes copied to the file with less work done. This can make a big difference in performance since it eliminates extra work.

In jdupes, files are checked in pairs. If four duplicate files (A B C D) are to be checked then six comparisons need to be done (AB, AC, AD, BC, BD, CD). The way that jdupes worked in ancient times would require all four files to be read in their entirety three times each in order to reliably state that all of them are duplicates of one another. One day, the hard link matching optimization was added, making any hard linked files match with no reads at all. If two files (let’s say A and B) were hard linked, they’re literally the same piece of data on disk, thus they’re obvious duplicates without checking any data. However, while this eliminated the two full file reads that would be triggered by the AB candidate pair, this didn’t help at all with the other five pairs. The hard link relationship doesn’t apply to them, so it makes no difference in those matches.

However, there is an incredibly easy optimization that can be applied here: if the AB comparison happens after any other comparison (AC, BD, whatever) that required hashing A or B as part of the process then the hashes from one applies to the other. This is where cross-copied hashes come into play. Let’s say that the AC pair is scanned and found to be duplicate, then the AB pair is scanned. A and C have been both partially and fully hashed but B hasn’t yet. The hard link match would normally mark both as matched and move on, but applying cross-copied hashes before moving on results in A’s hashes being copied to B, so now B has been fully hashed without reading B (which is the same data as A, after all!) That means one less full file read when the BC or BD pairs are checked. This seems small, but in a large data set with lots of hard links it can represent thousands of files that aren’t unnecessarily read twice. This can make the difference between data being blown out of the OS disk caches or not. It’s small but it can really add up.

August 28, 2023November 10, 2023News

News: hash database feature, cross-copying hard link hashes, jdupes leaving GitHub

Hash database: millions of files scanned in seconds

jdupes version 1.27.3 has been released with a long-awaited and heavily requested feature: a file hash database. While the feature is not as “smart” I’d like it to be, I’ve decided to release it as soon as possible because it really is that big of a deal. It’s probably not going to help you if you’re scanning a folder tree with a hundred files, but the hash database feature makes a massive difference for repeated jdupes scans of large data sets that don’t change much. I’ve personally run tests on a colossal pile of already de-duplicated random media content (mostly web spider-gathered images, videos, and text files–nearly 1.9 million files in total!) and on a Ryzen 5 5600G machine with the -y . option (use a hash database; the period aliases to jdupes _hashdb.txt) the comparisons after the first run finishes in only a few seconds; I tested the same run without the hash database feature and when it was 10% finished and I was tired of waiting, it had spent 71 seconds running unnecessary comparisons. When new files are added it only takes a few seconds to find and delete them.

The major flaw in the hash database feature is in how it handles file and folder paths. If you run jdupes -y . testdir and jdupes -y . ./testdir and jdupes -y . ../currentdir/testdir the database component sees those as completely different paths. This is an obvious flaw, but the workaround is to run jdupes from the same working directory each time and to specify the target path(s) the same way each time. This issue will be fixed in a future release.

Cross-copied hashes: faster scans by not repeating work

Every time jdupes examines a pair of files to see if they could be duplicates, the contents of both files are read and hashes (numbers based on a file’s contents that are used as a “shortcut” to quickly compare that file against other files) of the contents are generated, then they are compared to see if files should be examined even further. An upcoming performance enhancement to jdupes is hash cross-copying. This is where two files being compared during a scan are found to be hard-linked (they look like two separate files to the user but point to the same data and metadata on-disk) and the hashes are copied between the in-memory information about each file. Before this enhancement, two files that are literally the same file would be hashed and compared separately, potentially wasting a lot of time re-reading the same data to calculate the same hashes. While this enhancement doesn’t completely avoid this unnecessary work, it does avoid it after a hard-linked pair of files is detected. This can result in a significant performance boost. It will be included in the next release of jdupes.

Leaving GitHub: forced two-factor authentication is evil

The last news is also the worst. I (Jody Bruchon) intend to leave GitHub entirely by the end of September 2023. The main driver of this is GitHub implementing mandatory two-factor authentication (abbreviated 2FA, typically thought of by most people as “requiring a cell phone to log in”), though there are several other reasons for leaving that have built me up to this point. 2FA is a massive double-edged sword. I have run a PC repair shop for a very long time and I have seen countless customers lose their email and social media accounts forever due to a combination of being railed into 2FA they didn’t want plus losing access to the phone number or the actual phone itself that 2FA was set up to work with.

GitHub two-factor authentication blog post graphic — GitHub’s picture of a phone locking you out of your GitHub account. How appropriate.

The increased security offered by 2FA comes from requiring verification outside of the traditional “give me your password” system of logging in; someone who gets their hands on your password probably won’t also have your unlocked cell phone in their hand when they try to log into your account using that stolen password, and that’s how and why 2FA works in a nutshell. Unfortunately, this also means that your password isn’t enough for YOU to log in, either; if you lose access to your second factor, you lose your account even if you know your password, and that’s how and why 2FA is a bad thing in a nutshell. There are other aspects such as the use of “authenticator apps” on smartphones that pose additional tracking and privacy risks that SMS-code-based 2FA doesn’t, but this isn’t the appropriate place to discuss this topic in such depth.

The general plan is to make this website, jdupes.com, the official website for the jdupes and libjodycode projects, while all my other software projects will be re-homed to the Software page on jodybruchon.com. I haven’t figured out what I’ll do about hosting my Git repositories and an issue tracker yet. Keep checking the jdupes.com front page for updates to see where things land.

August 23, 2023November 10, 2023History of jdupes

The history of jdupes: why it’s called that and where it came from

The short version

The “J” stands for “Jody” and the name “jdupes” means “Jody’s fdupes.“ It was originally a set of patches by Jody Bruchon for fdupes that the fdupes guy rejected. Development continued, the project was renamed, and jdupes has long surpassed fdupes in both features and popularity.

Prehistoric times: fdupes is king

jdupes began as a simple modification to a duplicate file finding utility called fdupes (which literally stands for “find duplicates”) by Adrian Lopez. fdupes has been around since at least 1999 and has been the de facto duplicate finder tool for Linux and BSD systems ever since I can remember. I used fdupes for a long time and was relatively happy with it. If you weren’t using Windows or Mac OS in the 2000s and you needed to find duplicate files, you were using fdupes.

Cracks appear: fdupes is slower than it should be

I routinely work with Linux machines over secure shell (SSH) sessions. If you’re not familiar with SSH, it’s essentially a command prompt over a network connection. One of the disadvantages with SSH connections is that sending text over a network is much slower than sending it to the monitor attached directly to the computer. This isn’t usually a big deal because most tools aren’t spamming you with output as fast as the screen (or SSH connection) can show it. Unfortunately, I noticed one day that fdupes was much faster when I asked for “quiet mode” than when the progress indicators were shown. The files weren’t different between runs and most of the heavy lifting in fdupes should have been in looking at the files. Why would progress indication cause such a big drop in performance? It was clear that something weird was going on, and I was going to get to the bottom of it. I downloaded the fdupes source code, solved the problem, and the baby that would grow into jdupes was born.

The first change I made to fdupes, which was also the birth of jdupes

I discovered the reason that fdupes was so slow over SSH: way too many progress indicator updates. A new progress line was being printed for every single file in every single directory. The function used to do this is fprintf() which performs blocking I/O, meaning it freezes the entire program until it finishes printing everything out. This is extremely fast if you’re on a local console but not nearly as fast over SSH because fprintf() freezes the program until the progress message is transmitted to the SSH client. The simplest way to fix this was to add a counter that only prints a progress update after hundreds of files rather than after every single file. That’s exactly what I did during the last week of December 2014, along with a few other obvious optimizations I had performed along the way.

The fdupes guy rejects my work

Adrian Lopez-Roche is the maintainer of the original fdupes program. He was also the obvious first person to contact to get my work included in fdupes so that other people could benefit from what I’d found and fixed. The patch set I offered up included a switch from MD5 hashes to my “jodyhash” code which brought a further 17% speed increase and reduced the number of total CPU instructions needed for a large test data set by 73%, meaning if the same data took 100 seconds to process before, my patches dropped that to 42 seconds. I threw up a pull request on GitHub and sent him an email.

At first, he seemed receptive to the changes and just wanted them sent as separate patches with some modifications. Despite my outreach and my efforts to comply with his requests, nothing was ever changed. There is not one single fdupes improvement credited to me. Being ignored was infuriating, especially when I’d offered up some very valuable changes. After being ignored for a while I renamed my fork to fdupes-jody and paid little attention to fdupes after that. The only other interaction I had with Adrian was when he emailed me and told me I had to rename my project to not have “fdupes” in the name; my response was to name it after myself as I often do with my software projects: jdupes originally meant “Jody’s fdupes.”

2015: fdupes-jody grows into jdupes

After getting my hands dirty in the fdupes code, I quickly realized that the program was begging for a ton of optimization. I spent a lot of my spare time in 2015 improving fdupes-jody beyond the original patch set. Here are just a few of the changes:

Removing tons of unnecessary code such as one-line functions that look like this: int function (blah) { return other_function(blah) }
Port everything to Windows (MinGW) and Mac OS X
Several improvements to the progress indicator
Add support for hard linking duplicates, not just printing or deleting
Add the --xsize option to exclude files based on their sizes
Add numerically correct sorting; without this e.g. “10” sorts before “2”
Add BTRFS block-level dedupe support
More minor performance improvements than this list allows

By the end of 2015 the program could do a lot of things that fdupes couldn’t, and it did everything faster and better as well. On December 23, 2015, “fdupes-jody 2.2” was officially renamed to jdupes 1.0, leaving the last traces of the now-stagnant fdupes project behind for good.

Healthy competition: dupd and fclones

It’s always good to see how your work compares to others in your field. Looking at the work of others can help you to understand your own better and give you new ideas to work with. There are two competing programs that I’ve looked at: dupd and fclones.

The dupd story is still available on the Jody Bruchon blog and goes into a lot of detail, so I won’t copy it here. I had several productive conversations with the dupd maintainer, Jyri Virikki, and it was fun to measure up our tools and talk about how we might improve them.

The story is very different with fclones. It was written in Rust, heavily multi-threaded, and promised a ton of performance benefits, showing off benchmarks on a highly multi-threaded Xeon system with a big PCIe solid-state drive. I wondered how it would work with my data on my Linux RAID-5 storage server and I had to give it a shot. Maybe I’d be impressed and find some new ideas to adopt. There is no doubt that Piotr Kołaczkowski is both a brilliant programmer and has a lot more time to work on his code than I do, but my initial experience with fclones against a RAID array was beyond terrible. I got so frustrated that I made an unnecessarily angry video about it. Piotr eventually found the video and made changes to disable threaded reads on hard drives, so something good came out of my frustration. I’ve also continued to read his programming blog posts on a regular basis.

Modern jdupes: the new king?

Today you’ll find jdupes in almost every remotely popular Linux distribution: Debian, Ubuntu, Arch Linux, Void Linux, Alpine Linux, and the list goes on. The feature set is very rich and I am still working on adding more features and improving performance to this day. I think the results speak for themselves.

Bonus: jdupes births a library

I created a shared library called libjodycode because I was copy-pasting pieces of jdupes code into several other software projects and it was starting to get annoying maintaining all these slightly different pieces of the same code. If you’re interested, there’s a video I made talking all about how it came to be and how writing a shared library was a surprisingly difficult and frustrating challenge.

July 20, 2023July 20, 2023Explaining The Algorithm

Why doesn’t jdupes use bigger partial hashes?

Publishing open source software invites thousands of people to look over your work and find ways to make it better. Some of the suggestions for improving jdupes involve changing the way that files are examined to try to reject non-matches more quickly. One of those suggestions is increasing the size of the partial hash from 4 KiB to a larger power-of-two value. This suggestion is always rejected.

What does the “partial hash” do?

The purpose of a duplicate file finder is to find identical files and do something with them: print a list, delete them, hard link them, and so on. You could simply compare every file’s contents against every other file’s contents, but this becomes an impossibly slow task as the number of files gets larger. The real purpose of a duplicate file finder is to quickly find reasons that files can’t possibly be duplicates before comparing the files directly.

Part of this process involves reading a tiny piece of the beginning of every file and comparing those tiny pieces to see if they’re different. Instead of comparing the pieces directly, it’s faster to compute large numbers called hashes that act as a substitute for the pieces of data, then compare those hashes. The same data will always produce the same hash while different data will sometimes but not always produce a different hash. If the hashes are different then the data is different; therefore, the files are different and can’t be a match. The same hashing is done on entire files once these hashes of a small part of the file or “partial hashes” have passed this check.

Default partial hash size will not change

The rationale behind choosing a minimum for the partial hash size is surprisingly simple. Nearly all storage media created in the past decade uses a sector size of 4,096 bytes (4 KiB or binary kilobytes). If you only ask to read 1 byte from a file, the operating system still receives a 4 KiB piece of data to service this request. All modern filesystems also default to 4 KiB as the smallest storage unit; Windows’s FAT, exFAT, and NTFS filesystems call it “cluster size” while most others call it “block size.” The default partial hash size in jdupes is 4 KiB because that’s the smallest size that everything in modern computer storage works with. A smaller partial hash size than 4 KiB would increase the chance of false positives while providing zero performance benefit.

Now that the 4 KiB minimum makes sense, why not make it bigger? A bigger partial hash will reject more non-matching files and avoid a lot of unnecessary work…right? Unfortunately, real-world testing shows that this doesn’t work out as expected.

Partial hash: 8 KiB vs. 4 KiB vs. 16 KiB.

I ran tests on the same data set of already de-duplicated mixed media (images, videos, animated GIFs, HTML and CSS text files). Out of 24K non-matching files, doubling the partial hash size from 4 KiB to 8 KiB only avoided 5 full file reads out of 932, a 0.5% improvement, but reading an extra 3.7 MiB of data to do so. It’s possible that the 5 skipped files were greater than 3.7 MiB in size, thus resulting in a performance increase with the larger partial hash size, but it’s not likely given the nature of the data. I’ve run several such tests on several different data sets and found that most data follows this same pattern: if it’s not different in the first 4 KiB block, it’s extremely unlikely to be different in the first 8 KiB, or 16 KiB, or 32 KiB as well.

Every doubling of the partial hash size also doubles the amount of data read to generate those hashes. 16 KiB partial hashes require 7.4 MiB of extra data read from disk, 32 KiB requires 14.8 MiB extra, and so on. The only scenario where larger partial hashes provide a benefit is when working with large files that have identical 4 KiB blocks of data at the start. It depends entirely on the data you’re working with. For general usage, 4 KiB partial hashing provides the best balance between unnecessary partial-file reads and unnecessary full-file reads.

You will be able to change partial hash size

As stated earlier, data sets do exist where a larger partial hash size can drastically speed things up. Because of this, a planned feature exists for jdupes which will add the ability to change partial hash size on the command line, similar to how the I/O chunk size can be increased to improve performance on traditional “spinning rust” hard disk drives. I’ll update this page when that feature goes live.