News: jdupes 1.28.0 released

I’ve been working on a major rewrite of jdupes which will eventually become version 2, but it’s been almost a year since the last release and several minor issues have come up since 1.27.3. I’ve created a new v1.x branch and I’ll continue to release backported updates and fixes when possible.

Get jdupes 1.28.0 in the Codeberg repository release area. This release fixes some issues affecting the in-development hash database feature and makes interactive deletion much safer.

Stop using --long-command-line-options

I’ve developed a pet peeve over time as I’ve seen others show their command line usage of jdupes in issue reports and examples. I’m sick of seeing long options.

The only way in which long options are good is that they are self-documenting. If you see --recurse then you have a pretty good idea what the option does while a simple -r doesn’t tell you anything meaningful. This is helpful to people who have no clue what’s going on or what they’re doing because they don’t need to look up anything. That’s where the value ends.

The case against long options

Rather than subject you to a wordy rant, here’s a concise list of reasons why long options shouldn’t be used by almost anyone who’s not a total newbie:

  • Too much unnecessary typing. jdupes -rLHZQO1y . takes a lot less time to type than jdupes --recurse --link-hard --hard-links --soft-abort --quick --param-order --one-file-system --hash-db=.
  • They’re not actually self-documenting. What’s the difference between --link-hard and --hard-links in the previous command? The only way to know is to read the help text, but that’s the same thing you’d do to learn what the single-character options do.
  • The apparent self-documenting nature is deceptive. It can lead people to skip the help text and never learn what things actually do. -Q/--quick is a speed tradeoff that carries a small but serious risk of data loss. One check of the help text to find out what -Q means will reveal this, but --quick implies nothing more than faster, so why check?
  • Higher verbosity is more difficult to work with. Imagine working with the short and long examples above. If you know what each character option does, you can mentally parse the intent of -rLHZQO1y . very quickly, but the long option counterpart takes more time to scan and absorb. Editing a command line full of long options is tedious, especially if you don’t know that CTRL+left/right skips entire words at a time. It is also mentally very easy to swap “hard links” and “link hard” since they are nearly identical, and cognitive swap errors are not uncommon.
  • Having both short and long options can double cognitive load. When I deal with scripts or issue reports, they may use long options instead of short options, so now I have two separate things to memorize instead of one. I actually had to look up several long options for the above examples because I never use them.

I don’t think that long options should completely go away. They have their place, especially if a complex program starts to exhaust the limited set of single-character options available; in fact, this is an issue which jdupes is already brushing up against, with several options getting random characters because there are only two ways to write the letter R or L or S. I think long options are a niche tool that shouldn’t be used for most purposes. hdparm notoriously uses --please-destroy-my-drive as gatekeeping for options with very dangerous consequences and I think that’s fine, but when someone’s using my duplicate scanner tool to find duplicates including hard links and delete them, I’m always going to get a little angry inside when I see

--recurse --hard-links --delete --no-prompt

instead of

-rHdN

Bring back struct dirent->d_namlen

Most UNIX-like operating systems built around the C programming language adhere to two standards: POSIX.1 and the Single UNIX Specification. The specific part we’re interested in for this article is the structure of a directory entry (“struct dirent”) as specified in dirent.h. The only field of a dirent that’s guaranteed to exist is the actual file name of the entry (d_name), but many other optional dirent members exist out there. The inode or serial number of the entry (d_ino, or d_fileno on BSD) is technically an X/Open System Interface (XSI) extension but almost everything in use today provides it. d_off, d_reclen, and d_type are all available on most UNIX-like systems. The variances in what dirent members are supported by each compilation environment are a notable source of compilation failures, portability issues, and general confusion.

What we’re interested in today is a lesser-known dirent member that is available on BSD systems, QNX (under a slightly different name), and not much else: d_namlen, the length of d_name. SUS doesn’t specify it, and neither does POSIX.1 or Android’s Bionic C library. glibc and some other C libraries provide a macro, _DIRENT_HAVE_D_NAMLEN, that programs can check to discover whether it’s available, and on most systems it isn’t. Linus Torvalds himself has said that “d_namlen should go away” while “d_reclen actually makes sense.” It’s this claim combined with my recent experience with libjodycode that has motivated this article.
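
As a quick illustration, here’s a tiny probe of my own (it relies on the glibc _DIRENT_HAVE_* macro convention, which other C libraries may not follow) that reports which optional dirent members a build environment claims to provide:

    #include <dirent.h>
    #include <stdio.h>

    /* Report which optional dirent members this compilation environment
     * advertises via the glibc-style _DIRENT_HAVE_* macros. Platforms
     * that don't follow this convention (e.g. the BSDs) will print
     * nothing even if the members actually exist. */
    int main(void)
    {
    #ifdef _DIRENT_HAVE_D_NAMLEN
        puts("d_namlen is available");
    #endif
    #ifdef _DIRENT_HAVE_D_RECLEN
        puts("d_reclen is available");
    #endif
    #ifdef _DIRENT_HAVE_D_OFF
        puts("d_off is available");
    #endif
    #ifdef _DIRENT_HAVE_D_TYPE
        puts("d_type is available");
    #endif
        return 0;
    }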

Linux, POSIX, and SUS have all shunned d_namlen. The logic often seen for its rejection is “just do strlen(dirent->d_name) instead because that’s all it is anyway,” and the most thorough rationale I’ve seen came from Linus Torvalds in 1995:

I personally would like to totally forget about “d_namlen”, for a couple of reasons:
– it’s not POSIX. Thus a program which uses it is broken.
– SVR4 doesn’t seem to have it, so programs that use it are doubly broken.
– it’s useless. Any broken program which uses it can trivially be altered to use “strlen(dirent->d_name)” instead.
…Compared to d_namlen, d_reclen actually makes sense.

This isn’t good enough for me. Standards like POSIX define what shall be present but generally don’t prohibit providing more. SUSv2 doesn’t have d_off, d_reclen, d_namlen, or d_type. The same is true for POSIX.1-2017. That leaves the third point: “it’s useless.” This is objectively incorrect. I’m here to argue that we should bring back d_namlen and enjoy the improvement in software performance that it can bring to the table.

Rationale for resurrecting d_namlen

How libjodycode is bringing it back

The original idea to pass file name length from readdir() calls back into applications came to me while working on the Windows side of libjodycode. The Windows Unicode support requires that file names read with e.g. FindFirstFileW() have their lengths counted for allocation operations. Why not pass that completed work along to jdupes? Why should jdupes always have to use strlen(dirent->d_name), duplicating the work we’ve already done in libjodycode? Exploring this idea is how I discovered d_namlen and decided to include it in the Windows definition of a libjodycode dirent structure so it could be passed along. Most of the Linux/BSD/macOS side of libjodycode functions as a pass-through; that is, jc_readdir() just calls readdir() and uses the existing struct dirent definitions for whatever system it’s built on. Adding d_namlen there would mean copying every dirent into a new structure, tons of extra work that would hurt far more than having d_namlen would help. Even worse: Linux provides no equivalent, so d_namlen would have to be calculated with strlen() even if it was never used later.

Enter jc_get_d_namlen()! This new function, included alongside jc_readdir(), allows a libjodycode program to extract the length of d_name in the most efficient way possible on the platform. On Windows it takes advantage of the d_namlen provided by JC_DIRENT. On BSD and macOS it uses the d_namlen already provided by struct dirent.

Remember that 1995 opinion by Linus about d_reclen making sense? It’s not as useful as he implied. On Linux, dirent records are padded to 4- or 8-byte boundaries for efficiency, so doing some math against d_reclen only gives you the allocated size of the name, not the actual name length. Fortunately, that allocated size still makes it possible to skip over most of the name without checking it. In the absence of a d_namlen provided by either the OS or a JC_DIRENT, the d_reclen size is used to calculate a skip count, and strlen() is then performed only on the last few bytes.

Of course, if d_reclen and d_namlen are both unavailable, jc_get_d_namlen() simply calls strlen() without any other work.
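
To make this concrete, here is a minimal sketch of that decision chain. This is my own illustration of the approach rather than the actual libjodycode source, and it assumes the glibc-style _DIRENT_HAVE_* macros plus the 8-byte record padding used by the Linux getdents interface:

    #include <dirent.h>
    #include <stddef.h>
    #include <string.h>

    /* Get the length of de->d_name as cheaply as the platform allows.
     * Sketch only; assumes glibc-style feature macros and 8-byte dirent
     * record padding on the d_reclen path. */
    static size_t get_d_namlen(const struct dirent *de)
    {
    #ifdef _DIRENT_HAVE_D_NAMLEN
        /* The C library already counted the name for us */
        return de->d_namlen;
    #elif defined(_DIRENT_HAVE_D_RECLEN)
        /* d_reclen is the padded record size, so it can't tell us the
         * exact name length, but every byte before the final 8-byte
         * padding block is guaranteed to be a name byte. Skip those and
         * run strlen() only over the last few bytes where the NUL
         * terminator must live. */
        const size_t base = offsetof(struct dirent, d_name);
        size_t skip = 0;

        if (de->d_reclen > base + 8) skip = de->d_reclen - base - 8;
        return skip + strlen(de->d_name + skip);
    #else
        /* No usable metadata at all: fall back to a plain strlen() */
        return strlen(de->d_name);
    #endif
    }

Because each record is only padded out to the next 8-byte boundary, the strlen() in the d_reclen path never has to scan more than 8 bytes no matter how long the file name is.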

Synthetic benchmarks on Linux (non-recursively running jc_readdir() 20,000 times against /usr/include and using write() to print the contents) show that the jc_get_d_namlen() code is up to 13% faster than using strlen() directly, with Valgrind showing a similar drop in total CPU instructions executed. The smallest performance boost I managed to achieve in all of my benchmarking was 0.7%. BSD/macOS and Windows should see even larger improvements since d_namlen is directly available and requires none of the work behind the d_reclen skip. I encourage anyone reading this and writing Linux C programs to steal my d_reclen skip code and see how much of a difference it makes.

The moral of the story is that only a fool duplicates their effort just to end up in the exact same place as the first time around.

News: working toward jdupes 2.0

I have wanted to rewrite jdupes for a very long time. Some of the decisions inherited from fdupes have been a real pain when trying to add features. Unfortunately, every attempt at a complete rewrite was too daunting a task, and several major overhauls have been started and subsequently abandoned. One of them has finally stuck, though, and it’s well on the way to becoming worthy of the version 2.0 milestone.

The biggest goal I’ve had is to replace the “file tree” data structure with something more flexible; thus, the size tree was born. There is no way for two files to be identical if their sizes are different, so grouping files by size before any further work makes a lot of sense! I’ve also been replacing code containing “knowledge” of the actual structures behind file information storage with more generic “get next thing” style interfaces.
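
As a rough sketch of the idea (hypothetical types and names, not the actual jdupes 2.0 code), files can be bucketed into a binary tree keyed on file size so that only same-size files are ever considered together:

    #include <stdlib.h>
    #include <sys/types.h>

    /* One node per unique file size; only files within the same node can
     * possibly be duplicates of each other. Error handling is omitted
     * for brevity. */
    struct size_group {
        off_t size;                       /* file size shared by this group */
        char **paths;                     /* names of files with this size */
        size_t count, alloc;
        struct size_group *left, *right;  /* binary tree links keyed on size */
    };

    static struct size_group *insert_file(struct size_group *node, off_t size, char *path)
    {
        if (node == NULL) {
            /* First file seen with this size: create a new group */
            node = calloc(1, sizeof(*node));
            node->size = size;
        }
        if (size < node->size) { node->left  = insert_file(node->left,  size, path); return node; }
        if (size > node->size) { node->right = insert_file(node->right, size, path); return node; }
        /* Same size: append the file to this group, growing as needed */
        if (node->count == node->alloc) {
            node->alloc = node->alloc ? node->alloc * 2 : 4;
            node->paths = realloc(node->paths, node->alloc * sizeof(char *));
        }
        node->paths[node->count++] = path;
        return node;
    }

After scanning, any group holding more than one path is a candidate set worth hashing and comparing; sizes that occur only once can be discarded without ever reading file contents.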

One of the long-term goals is to parallelize jdupes. The widespread availability of multi-core processors, solid-state storage, and RAID arrays means that a lot of opportunities to speed things up are lost in the current single-threaded operation paradigm. Abstracting things out in the right way will make it far easier to split work into multiple threads where it makes sense to do so. A specific case where a ton of performance is left on the table is comparing files that reside on different disks; a close second is exploiting the internally parallel nature of modern SSDs to perform parallel I/O against a single SSD.

As of this news post, I’m in the process of writing a “query” interface that will give other code modules a standard way to acquire information about discovered duplicate sets and make better decisions. The file tree information model made it very difficult to add features such as hard linking to the files with the highest hard link count first or intelligently establishing hard links when duplicates exist on different volumes. This rigidity led to a long-standing problem with reproducible builds in Debian that I couldn’t solve very easily. The only real solution to the problem of reproducible results was to nuke the file tree paradigm and the tree-specific code in all of the action modules in favor of a new system that allows for querying and sorting the data properly after duplicate scanning is done.

News: fix for v1.27.3, libjodycode 4 work

I’ve backported a fix from the current development version of jdupes to v1.27.3 which avoids losing the existing hash database on a write failure.

The fix writes out the in-memory hash database to a new temporary file, then removes the existing database and renames the temporary file to the original file name. If anything goes wrong during creation of the new database, the original database is not overwritten or deleted. I ran into this issue because the disk I was deduplicating ran out of space and the hash database file was destroyed as a result. I’ve emailed the patch to several jdupes distribution package maintainers, but I can’t possibly find everyone. This patch should be applied by anyone who is distributing v1.27.3 anywhere.
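
In other words, it’s the classic write-then-replace pattern. Here’s a minimal sketch of it (assumed names, not the actual jdupes code):

    #include <stdio.h>

    /* Write the new database to a temporary file first; the existing
     * database is only removed after the replacement has been fully and
     * successfully written, so a failure at any step leaves it intact.
     * Sketch only: real code should pick a collision-safe temp name. */
    static int save_hash_db(const char *path, const unsigned char *data, size_t len)
    {
        char tmp[4096];
        FILE *fp;

        if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp)) return -1;
        fp = fopen(tmp, "wb");
        if (fp == NULL) return -1;
        if (fwrite(data, 1, len, fp) != len) {
            fclose(fp);
            remove(tmp);        /* original database is untouched */
            return -1;
        }
        if (fclose(fp) != 0) {  /* catches deferred write errors (e.g. disk full) */
            remove(tmp);
            return -1;
        }
        remove(path);               /* drop the old database... */
        return rename(tmp, path);   /* ...and move the new one into place */
    }

On POSIX systems a bare rename() would replace the old file atomically on its own, but removing it first also works on Windows, where rename() refuses to overwrite an existing file.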

I’ve been working on a major update to libjodycode and I’ve converted most of the file management functions called by jdupes that need Windows-specific support into libjodycode wrapper functions. Most of the functions simply pass through the call on non-Windows platforms. I’m hoping to have a few more basic functions ready to go before libjodycode 4 is officially released.