This is the home page for jdupes, a powerful duplicate file finder. Have a look at my jdupes articles, check out the jdupes GitHub repo, and download the latest release!
NOTE: I am aware of the HTTPS cert issue. My host doesn’t support Let’s Encrypt and I’m not paying for a new SSL cert. Use HTTP. No files are hosted here, so it’s perfectly safe.
If you’d like to financially support jdupes development, use the links below!
Latest News
jdupes 1.28.0 released
I’ve been working on a major rewrite of jdupes which will eventually become version 2, but it’s been almost a year since a new jdupes version was released and several minor issues have come up since 1.27.3 was released. I’ve created a new v1.x branch and I’ll continue to release backported updates and fixes when possible.
Get jdupes 1.28.0 in the Codeberg repository release area. This release fixes some issues affecting the in-development hash database feature and makes interactive deletion much safer.
I’ve developed a pet peeve over time as I’ve seen others show their command line usage of jdupes in issue reports and examples. I’m sick of seeing long options.
The only way in which long options are good is that they are self-documenting. If you see --recurse then you have a pretty good idea what the option does while a simple -r doesn’t tell you anything meaningful. This is helpful to people who have no clue what’s going on or what they’re doing because they don’t need to look up anything. That’s where the value ends.
The case against long options
Rather than subject you to a wordy rant, here’s a concise list of why long options shouldn’t be used by anyone who isn’t a total newbie:
- Too much unnecessary typing. `jdupes -rLHZQO1y .` takes a lot less time to type than `jdupes --recurse --link-hard --hard-links --soft-abort --quick --param-order --one-file-system --hash-db=.`
- They’re not actually self-documenting. What’s the difference between `--link-hard` and `--hard-links` in the previous command? The only way to know is to read the help text, but that’s the same thing you’d do to learn what the single-character options do.
- The apparent self-documenting nature is deceptive. It can lead people to skip the help text and never learn what things actually do. `-Q`/`--quick` is a speed tradeoff that carries a small but serious risk of data loss. One check of the help text to find out what `-Q` means will reveal this, but `--quick` implies nothing more than “faster,” so why check?
- Higher verbosity is more difficult to work with. Imagine working with the short and long examples above. If you know what each character option does, you can mentally parse the intent of `-rLHZQ1Oy .` very quickly, but the long option counterpart takes more time to scan and absorb. Editing a command line full of long options is tedious, especially if you don’t know that CTRL+left/right skips entire words at a time. It is also mentally easy to swap “hard links” and “link hard” since they are nearly identical, and cognitive swap errors are not uncommon.
- Having both short and long options can double cognitive load. I have to deal with scripts and issue reports, and they may use long options instead of short options. Now I have two separate things to memorize instead of one. I actually had to look up several of the long options for the examples above because I never use them.
I don’t think that long options should completely go away. They have their place, especially if a complex program starts to exhaust the limited set of single-character options available; in fact, this is an issue jdupes is already brushing up against, with several options getting seemingly random characters because there are only two ways to write the letter R or L or S (upper and lower case). I think long options are a niche tool that shouldn’t be used for most purposes. hdparm notoriously uses --please-destroy-my-drive as gatekeeping for options with very dangerous consequences, and I think that’s fine. But when someone’s using my duplicate scanner tool to find duplicates including hard links and delete them, I’m always going to get a little angry inside when I see
--recurse --hard-links --delete --no-prompt
instead of
-rHdN
The “string tax” technical debt
Most UNIX-like operating systems built around the C programming language adhere to two standards: POSIX.1 and the Single UNIX Specification. The specific part we’re interested in for this article is the structure of a directory entry (“struct dirent”) as specified in dirent.h. The only field of a dirent that’s guaranteed to exist is the actual file name of the entry (d_name), but many other optional dirent members exist out there. The inode or serial number of the entry (d_ino, or d_fileno on BSD) is technically an X/Open System Interface (XSI) extension but almost everything in use today provides it. d_off, d_reclen, and d_type are all available on most UNIX-like systems. The variances in what dirent members are supported by each compilation environment are a notable source of compilation failures, portability issues, and general confusion.
What we’re interested in today is a lesser-known dirent member that is available on BSD systems, QNX (under a slightly different name), and not much else: d_namlen, the length of d_name. SUS doesn’t specify it, and neither does POSIX.1 or Android’s Bionic C library. glibc and some other C libraries provide a macro, _DIRENT_HAVE_D_NAMLEN, so programs can detect at compile time whether it’s available; on most systems it isn’t. Linus Torvalds himself has said that “d_namlen should go away” while “d_reclen actually makes sense.” It’s this claim combined with my recent experience with libjodycode that motivated this article.
Linux, POSIX, and SUS have all shunned d_namlen. The logic often seen for its rejection is “just do strlen(dirent->d_name) instead because that’s all it is anyway,” and the most detailed rationale I’ve seen came from Linus Torvalds in 1995:
I personally would like to totally forget about “d_namlen”, for a couple of reasons:
– it’s not POSIX. Thus any program which uses it is broken.
– SVR4 doesn’t seem to have it, so programs that use it are doubly broken.
– it’s useless. Any broken program which uses it can trivially be altered to use “strlen(dirent->d_name)” instead.
…Compared to d_namlen, d_reclen actually makes sense.
This isn’t good enough for me. Standards like POSIX define what shall be present but generally don’t prohibit providing more. SUSv2 doesn’t have d_off, d_reclen, d_namlen, or d_type. The same is true for POSIX.1-2017. That leaves the third point: “it’s useless.” This is objectively incorrect. I’m here to argue that we should bring back d_namlen and enjoy the improvement in software performance that it can bring to the table.
Rationale for resurrecting d_namlen
- The name length is already calculated and known to all OS kernels. It should be obvious that BSD systems already have d_namlen built in at every level since it’s available at user level already; this includes all versions of macOS since 2001. The Linux kernel has the name length flying around under the hood but the getdents() system call API doesn’t include it. Just like Linux, the Windows FindFirstFile() call doesn’t make it available but every Windows Driver Model (WDM) FILE_OBJECT has a FileName member that includes the length in bytes.
- Many C programs need the file name length and use strlen() or similar behavior to get it. Examples include jdupes (loaddir.c), rsync (flist.c), BusyBox (libbb), GNU Coreutils (ls.c).
- Knowing length ahead of time opens up several optimizations. A great example is using memcpy() instead of strcpy() which can take advantage of SSE and AVX to move the data more quickly. An even better example is simply skipping string length calculations altogether as seen in the current jdupes development branch.
How libjodycode is bringing it back
The original idea to pass file name length from readdir() calls back into applications came to me while working on the Windows side of libjodycode. The Windows Unicode support requires that file names read with e.g. FindFirstFileW() have their lengths counted for allocation operations. Why not pass that completed work into jdupes? Why should jdupes always have to use strlen(dirent->d_name), duplicating the work we’ve already done in libjodycode? Exploring this idea is how I discovered d_namlen and decided to include it in the Windows definition of a libjodycode dirent structure so it could be passed along. Most of the Linux/BSD/macOS side of libjodycode functions as a pass-through; that is, jc_readdir() just calls readdir() and uses the existing struct dirent definitions for whatever system it’s built on. Adding d_namlen would require tons of extra data copying that would hurt far more than having d_namlen would help. Even worse: Linux provides no equivalent, so d_namlen would be calculated with strlen() even if not used later.
Enter jc_get_d_namlen()! This new function included beside jc_readdir() allows a libjodycode program to extract the length of d_name in the most efficient way possible on the platform. On Windows it takes advantage of the d_namlen as provided by JC_DIRENT. On BSD and macOS it uses d_namlen already provided by struct dirent.
Remember that 1995 opinion by Linus about d_reclen making sense and how I said he was wrong? On Linux, structs are padded to 4- or 8-byte boundaries for efficiency, so doing some math against d_reclen only gives you the allocated size of the name, not the actual name length. Fortunately, this still makes it possible to skip over part of the name without checking it. In the absence of d_namlen provided by either the OS or a JC_DIRENT, the d_reclen size is used to calculate a skip count, then perform strlen() only on the last few bytes.
Of course, if d_reclen and d_namlen are both unavailable, jc_get_d_namlen() simply calls strlen() without any other work.
Synthetic benchmarks on Linux, non-recursively running jc_readdir() 20,000 times against /usr/include and using write() to print the contents, show that the jc_get_d_namlen() code is up to 13% faster than using strlen() directly, with Valgrind showing a similar drop in total CPU instructions executed. The smallest performance boost I managed to achieve in all of my benchmarking was 0.7%. BSD/macOS and Windows should see even larger improvements since d_namlen is directly available and requires none of the work behind the d_reclen skip. I encourage anyone reading this and writing Linux C programs to steal my d_reclen skip code and see how much of a difference it makes.
The moral of the story is that only a fool duplicates their effort just to end up in the exact same place as the first time around.