jdupes: Jody Bruchon’s powerful duplicate file finder

NOTE: I am aware of the HTTPS cert issue. My host doesn’t support Let’s Encrypt and I’m not paying for a new SSL cert. Use HTTP. No files are hosted here, so it’s perfectly safe.

If you’d like to financially support jdupes development, use the links below!


Latest News

jdupes 1.28.0 released

I’ve been working on a major rewrite of jdupes which will eventually become version 2, but it’s been almost a year since a new jdupes version was released and several minor issues have come up since 1.27.3 was released. I’ve created a new v1.x branch and I’ll continue to release backported updates and fixes when possible.

Get jdupes 1.28.0 in the Codebeg repository release area. This release fixes some issues affecting the in-development hash database feature and makes interactive deletion much safer.

I’ve developed a pet peeve over time as I’ve seen others show their command line usage of jdupes in issue reports and examples. I’m sick of seeing long options.

The only way in which long options are good is that they are self-documenting. If you see --recurse then you have a pretty good idea what the option does while a simple -r doesn’t tell you anything meaningful. This is helpful to people who have no clue what’s going on or what they’re doing because they don’t need to look up anything. That’s where the value ends.

The case against long options

Rather than subject you to a wordy rant, here’s a concise list of why long options shouldn’t be used for almost everyone that’s not a total newbie:

  • Too much unnecessary typing. jdupes -rLHZQO1y . takes a lot less time to type than jdupes --recurse --link-hard --hard-links --soft-abort --quick --param-order --one-file-system --hash-db=.
  • They’re not actually self-documenting. What’s the difference between --link-hard and --hard-links in the previous command? The only way to know is to read the help text, but that’s the same thing you’d do to learn what the single-character options do.
  • The apparent self-documenting nature is deceptive. It can lead people to not read the help text and understand what things actually do. -Q/--quick is a speed tradeoff has a small but serious risk of data loss. One check of the help text to find out what -Q means will reveal this, but --quick implies faster, so why check?
  • Higher verbosity is more difficult to work within. Imagine working with the short and long examples above. If you know what a character option does, you can mentally parse the intent of -rLHZQ1Oy . very quickly, but the long option counterpart takes more time to scan and absorb. Editing a command line full of long options is tedious, especially if you don’t know that CTRL+left/right skips entire words at a time. It is also mentally very easy to swap “hard links” and “link hard” since they are nearly identical and cognitive swap errors are not uncommon.
  • Having both short and long options can double cognitive load. I need to deal with scripts or issue reports and they could use long options instead of short options. Now I have two separate things to memorize instead of one. I actually had to look up several long options for the above examples because I never use them.

I don’t think that long options should completely go away. They have their place, especially if a complex program starts to exhaust the limited set of single-character options available; in fact, this is an issue which jdupes is already brushing up against, with several options getting random characters because there are only two ways to write the letter R or L or S. I think long options are a niche tool that shouldn’t be used for most purposes. hdparm notoriously uses --please-destroy-my-drive as gatekeeping for options with very dangerous consequences and I think that’s fine, but when someone’s using my duplicate scanner tool to find duplicates including hard links and delete them, I’m always going to get a little angry inside when I see

--recurse --hard-links --delete --no-prompt

instead of

-rHdN

The “string tax” technical debt

Most UNIX-like operating systems built around the C programming language adhere to two standards: POSIX.1 and the Single UNIX Specification. The specific part we’re interested in for this article is the structure of a directory entry (“struct dirent”) as specified in dirent.h. The only field of a dirent that’s guaranteed to exist is the actual file name of the entry (d_name), but many other optional dirent members exist out there. The inode or serial number of the entry (d_ino, or d_fileno on BSD) is technically an X/Open System Interface (XSI) extension but almost everything in use today provides it. d_off, d_reclen, and d_type are all available on most UNIX-like systems. The variances in what dirent members are supported by each compilation environment are a notable source of compilation failures, portability issues, and general confusion.

What we’re interested in today is a lesser-known dirent member that is available on BSD systems, QNX (under a slightly different name), and not much else: d_namlen, the length of d_name. SUS doesn’t specify it, and neither does POSIX.1 or Android’s Bionic C library. glibc and some other C libraries provide a macro _DIRENT_HAVE_D_NAMLEN to discover that it’s not supported on most systems. Linus Torvalds himself has said that” d_namlen should go away” while “d_reclen actually makes sense.” It’s this claim combined with my recent experience with libjodycode that has motivated this article.

Linux, POSIX, and SUS have all shunned d_namlen. The logic often seen for its rejection is “just do strlen(dirent->d_name) instead because that’s all it is anyway” and the most rationale I’ve seen has come from Linus Torvalds in 1995:

I personally would like to totally forget about “d_namlen”, for a couple of reasons:
– it’s not POSIX. Thus program which uses it is broken.
– SVR4 doesn’t seem to have it, so programs that use it are doubly broken.
– it’s useless. Any broken program which uses it can trivially be altered to use “strlen(dirent->d_name)” instead.
…Compared to d_namlen, d_reclen actually makes sense.

This isn’t good enough for me. Standards like POSIX define what shall be present but generally don’t prohibit providing more. SUSv2 doesn’t have d_off, d_reclen, d_namlen, or d_type. The same is true for POSIX.1-2017. That leaves the third point: “it’s useless.” This is objectively incorrect. I’m here to argue that we should bring back d_namlen and enjoy the improvement in software performance that it can bring to the table.

Rationale for resurrecting d_namlen

How libjodycode is bringing it back

The original idea to pass file name length from readdir() calls back into applications came to me while working on the Windows side of libjodycode. The Windows Unicode support requires that file names read with e.g. FindFirstFileW() have their lengths counted for allocation operations. Why not pass that completed work into jdupes? Why should jdupes always have to use strlen(dirent->d_name), duplicating the work we’ve already done in libjodycode? Exploring this idea is how I discovered d_namlen and decided to include it in the Windows definition of a libjodycode dirent structure so it could be passed along. Most of the Linux/BSD/macOS side of libjodycode functions as a pass-through; that is, jc_readdir() just calls readdir() and uses the existing struct dirent definitions for whatever system it’s built on. Adding d_namlen would require tons of extra data copying that would hurt far more than having d_namlen would help. Even worse: Linux provides no equivalent, so d_namlen would be calculated with strlen() even if not used later.

Enter jc_get_d_namlen()! This new function included beside jc_readdir() allows a libjodycode program to extract the length of d_name in the most efficient way possible on the platform. On Windows it takes advantage of the d_namlen as provided by JC_DIRENT. On BSD and macOS it uses d_namlen already provided by struct dirent.

Remember that 1995 opinion by Linus about d_reclen making sense and how I said he was wrong? On Linux, structs are padded to 4- or 8-byte boundaries for efficiency, so doing some math against d_reclen only gives you the allocated size of the name, not the actual name length. Fortunately, this still makes it possible to skip over part of the name without checking it. In the absence of d_namlen provided by either the OS or a JC_DIRENT, the d_reclen size is used to calculate a skip count, then perform strlen() only on the last few bytes.

Of course, if d_reclen and d_namlen are both unavailable, jc_get_d_namlen() simply calls strlen() without any other work.

Synthetic benchmarks on Linux non-recursively running jc_readdir() 20,000 times against /usr/include and using write() to print the contents show that the jc_get_d_namlen() code is up to 13% faster than using strlen() directly, with Valgrind showing a similar drop in total CPU instructions executed. The worst performance boost I managed to achieve in all of my benchmarking was 0.7%. BSD/macOS and Windows should see even larger performance improvements since d_namlen is directly available and requires none of the work behind the d_reclen skip. I encourage anyone reading this and writing Linux C programs to steal my d_reclen skip code and see how much of a difference it makes.

The moral of the story is that only a fool duplicates their effort just to end up in the exact same place as the first time around.