Codeberg Has Ideological Cancer

I received an interesting email notification in February. I was tagged in a Codeberg project called “truth” that contained only two things: a bunch of tagged usernames and the words “NIGGER BALLS” as the issue title. Even a child could tell that this is nothing more than classic internet trolling. I’ve watched 4chan on and off since 2006 (thus granting me the glorious title “oldfag”). Because I’ve been exposed for decades to no shortage of the insanity, hostility, and retardation that is the unfiltered, uncensored thoughts of real people in the real world, this was the funniest thing that had happened to me in weeks.

I laughed and laughed some more. For a brief moment, a Git hosting platform became a joke.

The problem? Codeberg’s management are fragile babies that can’t handle the real world without an entire ecosystem of fragile snowflake “safety” filter bubbles protecting them. Unfortunately, this sort of political extremist response to an obvious joke is a severe harbinger of ideological cancer that destroys everything it infects, from movie and video game franchises to major software projects. Yes, that’s right, Drupal played “Code of Conduct” and forced out Larry Garfield for things in his private sex life and his girlfriend being insultingly described as “unable to consent” due to “extreme autism” in July 2017 and subsequently lost 27% of its market share (from 4.8% to 3.5%) by 2019, despite a trend of only losing about 0.2% total market share per year prior.

For now, jdupes and libjodycode will remain on Codeberg, but a new host is needed since the site is clearly operated by extremist ideologues and I have no way of knowing what they will do to my code or my released executables. All it takes is one bad actor to destroy thousands of people’s data based on broad-brush ideological differences.

Where does jdupes “stand politically?”

This question is really just asking “where does Jody Bruchon stand politically?” and the answer is “it doesn’t matter.” I have consistently stated for a very long time that I want everyone to use my software and benefit from it, regardless of who they are or what they think. I share my work for the benefit of everyone. I am strongly opposed to bullshit license initiatives like “ethical source licenses.” There is nothing “open” about attempting to add authoritarian control over the actions of other people to an open source license. My software is available to everyone, even if you want to use it in the manufacture of nuclear weapons. That will never change.

“I’m offended by your use of a racial slur!”

News: jdupes 1.28.0 released

jdupes 1.28.0 released

I’ve been working on a major rewrite of jdupes which will eventually become version 2, but it’s been almost a year since a new jdupes version was released and several minor issues have come up since 1.27.3 was released. I’ve created a new v1.x branch and I’ll continue to release backported updates and fixes when possible.

Get jdupes 1.28.0 in the Codebeg repository release area. This release fixes some issues affecting the in-development hash database feature and makes interactive deletion much safer.

Stop using –long-command-line-options

I’ve developed a pet peeve over time as I’ve seen others show their command line usage of jdupes in issue reports and examples. I’m sick of seeing long options.

The only way in which long options are good is that they are self-documenting. If you see --recurse then you have a pretty good idea what the option does while a simple -r doesn’t tell you anything meaningful. This is helpful to people who have no clue what’s going on or what they’re doing because they don’t need to look up anything. That’s where the value ends.

The case against long options

Rather than subject you to a wordy rant, here’s a concise list of why long options shouldn’t be used for almost everyone that’s not a total newbie:

  • Too much unnecessary typing. jdupes -rLHZQO1y . takes a lot less time to type than jdupes --recurse --link-hard --hard-links --soft-abort --quick --param-order --one-file-system --hash-db=.
  • They’re not actually self-documenting. What’s the difference between --link-hard and --hard-links in the previous command? The only way to know is to read the help text, but that’s the same thing you’d do to learn what the single-character options do.
  • The apparent self-documenting nature is deceptive. It can lead people to not read the help text and understand what things actually do. -Q/--quick is a speed tradeoff has a small but serious risk of data loss. One check of the help text to find out what -Q means will reveal this, but --quick implies faster, so why check?
  • Higher verbosity is more difficult to work within. Imagine working with the short and long examples above. If you know what a character option does, you can mentally parse the intent of -rLHZQ1Oy . very quickly, but the long option counterpart takes more time to scan and absorb. Editing a command line full of long options is tedious, especially if you don’t know that CTRL+left/right skips entire words at a time. It is also mentally very easy to swap “hard links” and “link hard” since they are nearly identical and cognitive swap errors are not uncommon.
  • Having both short and long options can double cognitive load. I need to deal with scripts or issue reports and they could use long options instead of short options. Now I have two separate things to memorize instead of one. I actually had to look up several long options for the above examples because I never use them.

I don’t think that long options should completely go away. They have their place, especially if a complex program starts to exhaust the limited set of single-character options available; in fact, this is an issue which jdupes is already brushing up against, with several options getting random characters because there are only two ways to write the letter R or L or S. I think long options are a niche tool that shouldn’t be used for most purposes. hdparm notoriously uses --please-destroy-my-drive as gatekeeping for options with very dangerous consequences and I think that’s fine, but when someone’s using my duplicate scanner tool to find duplicates including hard links and delete them, I’m always going to get a little angry inside when I see

--recurse --hard-links --delete --no-prompt

instead of

-rHdN

Bring back struct dirent->d_namlen

Most UNIX-like operating systems built around the C programming language adhere to two standards: POSIX.1 and the Single UNIX Specification. The specific part we’re interested in for this article is the structure of a directory entry (“struct dirent”) as specified in dirent.h. The only field of a dirent that’s guaranteed to exist is the actual file name of the entry (d_name), but many other optional dirent members exist out there. The inode or serial number of the entry (d_ino, or d_fileno on BSD) is technically an X/Open System Interface (XSI) extension but almost everything in use today provides it. d_off, d_reclen, and d_type are all available on most UNIX-like systems. The variances in what dirent members are supported by each compilation environment are a notable source of compilation failures, portability issues, and general confusion.

What we’re interested in today is a lesser-known dirent member that is available on BSD systems, QNX (under a slightly different name), and not much else: d_namlen, the length of d_name. SUS doesn’t specify it, and neither does POSIX.1 or Android’s Bionic C library. glibc and some other C libraries provide a macro _DIRENT_HAVE_D_NAMLEN to discover that it’s not supported on most systems. Linus Torvalds himself has said that” d_namlen should go away” while “d_reclen actually makes sense.” It’s this claim combined with my recent experience with libjodycode that has motivated this article.

Linux, POSIX, and SUS have all shunned d_namlen. The logic often seen for its rejection is “just do strlen(dirent->d_name) instead because that’s all it is anyway” and the most rationale I’ve seen has come from Linus Torvalds in 1995:

I personally would like to totally forget about “d_namlen”, for a couple of reasons:
– it’s not POSIX. Thus program which uses it is broken.
– SVR4 doesn’t seem to have it, so programs that use it are doubly broken.
– it’s useless. Any broken program which uses it can trivially be altered to use “strlen(dirent->d_name)” instead.
…Compared to d_namlen, d_reclen actually makes sense.

This isn’t good enough for me. Standards like POSIX define what shall be present but generally don’t prohibit providing more. SUSv2 doesn’t have d_off, d_reclen, d_namlen, or d_type. The same is true for POSIX.1-2017. That leaves the third point: “it’s useless.” This is objectively incorrect. I’m here to argue that we should bring back d_namlen and enjoy the improvement in software performance that it can bring to the table.

Rationale for resurrecting d_namlen

How libjodycode is bringing it back

The original idea to pass file name length from readdir() calls back into applications came to me while working on the Windows side of libjodycode. The Windows Unicode support requires that file names read with e.g. FindFirstFileW() have their lengths counted for allocation operations. Why not pass that completed work into jdupes? Why should jdupes always have to use strlen(dirent->d_name), duplicating the work we’ve already done in libjodycode? Exploring this idea is how I discovered d_namlen and decided to include it in the Windows definition of a libjodycode dirent structure so it could be passed along. Most of the Linux/BSD/macOS side of libjodycode functions as a pass-through; that is, jc_readdir() just calls readdir() and uses the existing struct dirent definitions for whatever system it’s built on. Adding d_namlen would require tons of extra data copying that would hurt far more than having d_namlen would help. Even worse: Linux provides no equivalent, so d_namlen would be calculated with strlen() even if not used later.

Enter jc_get_d_namlen()! This new function included beside jc_readdir() allows a libjodycode program to extract the length of d_name in the most efficient way possible on the platform. On Windows it takes advantage of the d_namlen as provided by JC_DIRENT. On BSD and macOS it uses d_namlen already provided by struct dirent.

Remember that 1995 opinion by Linus about d_reclen making sense and how I said he was wrong? On Linux, structs are padded to 4- or 8-byte boundaries for efficiency, so doing some math against d_reclen only gives you the allocated size of the name, not the actual name length. Fortunately, this still makes it possible to skip over part of the name without checking it. In the absence of d_namlen provided by either the OS or a JC_DIRENT, the d_reclen size is used to calculate a skip count, then perform strlen() only on the last few bytes.

Of course, if d_reclen and d_namlen are both unavailable, jc_get_d_namlen() simply calls strlen() without any other work.

Synthetic benchmarks on Linux non-recursively running jc_readdir() 20,000 times against /usr/include and using write() to print the contents show that the jc_get_d_namlen() code is up to 13% faster than using strlen() directly, with Valgrind showing a similar drop in total CPU instructions executed. The worst performance boost I managed to achieve in all of my benchmarking was 0.7%. BSD/macOS and Windows should see even larger performance improvements since d_namlen is directly available and requires none of the work behind the d_reclen skip. I encourage anyone reading this and writing Linux C programs to steal my d_reclen skip code and see how much of a difference it makes.

The moral of the story is that only a fool duplicates their effort just to end up in the exact same place as the first time around.

News: working toward jdupes 2.0

Working toward jdupes 2.0

I have wanted to rewrite jdupes for a very long time. Some of the decisions inherited from fdupes have been a real pain when trying to add features. Unfortunately, every attempt at a comtoreplete rewrite was too daunting of a task, and several major overhauls have been started and subsequently abandoned. One of them has finally stuck, though, and it’s well on the way to becoming worthy of the version 2.0 milestone.

The biggest goal I’ve had is to replace the “file tree” data structure with something more flexible; thus, the size tree was born. There is no way for two files to be identical if their sizes are different. Grouping files into sizes before further work makes a lot of sense! I’ve also been replacing code containing “knowledge” of the actual structures behind file information storage with more generic “get next thing” style interfaces that are more flexible in nature.

One of the long-term goals is to parallelize jdupes. The widespread availability of multi-core processors, solid-state storage, and RAID storage means that a lot of opportunities to speed things up are lost in the current simple single-threaded operation paradigm. Abstracting out things in the right way will make it far easier to split work into multiple threads where it makes sense to do so. A specific case where a ton of performance is being left on the table is when files are being compared that are on different disks; a close second is exploiting the internally parallel nature of modern SSDs to perform parallel I/O on the same SSD.

As of this news post, I’m in the process of writing a “query” interface that will give other code modules a standard way to acquire information about discovered duplicate sets and make better decisions. The file tree information model made it very difficult to add features such as hard linking to the files with the highest hard link count first or intelligently establishing hard links when duplicates exist on different volumes. This rigidity led to a long-standing problem with reproducible builds in Debian that I couldn’t solve very easily. The only real solution to the problem of reproducible results was to nuke the file tree paradigm and the tree-specific code in all of the action modules in favor of a new system that allows for querying and sorting the data properly after duplicate scanning is done.

News: fix for v1.27.3, libjodycode 4 work

I’ve backported a fix in the current development version of jdupes to v1.27.3 which avoids losing the existing hash database on a write failure:

This fix writes out the in-memory hash database to a new temporary file, then removes the existing database and renames the temporary file to the original file name. If anything goes wrong during creation of the new database then the original database is not overwritten or deleted. I ran into this issue because the disk I was deduplicating ran out of disk space and the hash database file was destroyed as a result. I’ve emailed the patch to several jdupes distribution package maintainers but I can’t possibly find everyone. This patch should be applied for anyone who is distribution v1.27.3 anywhere.

I’ve been working on a major update to libjodycode and I’ve converted most of the file management functions called by jdupes that need Windows-specific support into libjodycode wrapper functions. Most of the functions simply pass through the call on non-Windows platforms. I’m hoping to have a few more basic functions ready to go before libjodycode 4 is officially released.

News: GitHub is OUT, Codeberg is IN

No more forced 2FA

jdupes, libjodycode, and all of my other software projects are migrating to Codeberg immediately.

Codeberg is not ideal due to the presence of rules lawyering-prone language in the “Allowed Content & Usage” section of the Terms of Service, but they are a major improvement over GitHub because they don’t force 2FA and they aren’t run by Microsoft. The GitHub AUP has abusive language similar to the Codeberg TOS so the ideological toxicity of this new overlord is no different than that of the old one. If any issues ever arise I’ll just fire up my own Git server and give all the overlords the finger.

Killing GitHub through attrition

Once migration is complete, I’ll be locking the GitHub repos as read-only forever, though I am inclined to delete them outright since the only power people have over giant corporations is no longer feeding them your time, money, content, and other resources. That’s why I destroyed my several years of contributions to Reddit. It’s more convenient to use the offerings of these companies, but using them results in horrible abuses of the users and stifling of freedoms in the long run.

News: major libjodycode rework, GitHub exit on hold, hashdb is great

libjodycode work in progress

I am in the middle of a major upgrade to libjodycode that will convert the library from a random collection of code I use across several of my projects into a Windows C Unicode portability library. One of the major disadvantages of compiling against the MinGW runtime is the lack of native Windows Unicode support. The largest part of libjodycode has always been the Windows Unicode support code that transparently converts to and from UTF-16 and handles quirky differences between Windows and UNIX-like platforms. What I’m working on is a conversion from a cobbled-together mess of code I copy-pasted from jdupes into a library that effectively replaces several basic C library calls such as access() and fopen() with versions that work with Windows Unicode without modification.

Unfortunately, while I’m working on libjodycode and all of my programs that interface with it to generalize everything out, you’re guaranteed to encounter problems if you try to use the latest master branches for libjodycode and jdupes. These will continue to be broken without notice until I’m done overhauling libjodycode.

GitHub exit on hold

In the August 28, 2023 news post, I said that I’d be leaving GitHub, meaning jdupes would also be leaving GitHub and moving somewhere else. It’s October 4 and that hasn’t happened yet. The (brief) reasons for this are:

  • GitHub replaced the insane CAPTCHA for setting up 2FA with something reasonable that actually worked when I tried it, and
  • I moved my computer consultancy and repair business out of my home and into a real office again–a process that put me behind on work by several weeks.

jdupes, libjodycode, and all my other software projects will still be removed from GitHub in the future, but the timeline is not certain right now. I’ll post another news item when I have a concrete transition date.

Hash database feature is awesome!

Remember that hash database feature I added to jdupes in August 2023? I’ve been using it on lots of different data sets since then and I’m amazed at how much time it saves! One of my residential PC repair customers doesn’t delete the pictures from the phone after phone picture import sessions; I was able to give them a one-line batch file that uses jdupes with the hashdb feature to remove all duplicate photo files in just a few seconds. They can run it immediately after importing photos to clean up after the import rather than changing their existing habits. Despite hashdb missing several basic features, it’s still insanely useful when used carefully.

News: hard link hash cross-copying

It’s faster to not do the same work twice

The next release of jdupes will have a feature that users won’t see but that will make a big difference for people who scan lots of hard-linked files: cross-copied hashes. When jdupes detects that two files being considered for comparison are hard linked and one of those files has already had some hashing work performed, the file with more work done will have its hashes copied to the file with less work done. This can make a big difference in performance since it eliminates extra work.

In jdupes, files are checked in pairs. If four duplicate files (A B C D) are to be checked then six comparisons need to be done (AB, AC, AD, BC, BD, CD). The way that jdupes worked in ancient times would require all four files to be read in their entirety three times each in order to reliably state that all of them are duplicates of one another. One day, the hard link matching optimization was added, making any hard linked files match with no reads at all. If two files (let’s say A and B) were hard linked, they’re literally the same piece of data on disk, thus they’re obvious duplicates without checking any data. However, while this eliminated the two full file reads that would be triggered by the AB candidate pair, this didn’t help at all with the other five pairs. The hard link relationship doesn’t apply to them, so it makes no difference in those matches.

However, there is an incredibly easy optimization that can be applied here: if the AB comparison happens after any other comparison (AC, BD, whatever) that required hashing A or B as part of the process then the hashes from one applies to the other. This is where cross-copied hashes come into play. Let’s say that the AC pair is scanned and found to be duplicate, then the AB pair is scanned. A and C have been both partially and fully hashed but B hasn’t yet. The hard link match would normally mark both as matched and move on, but applying cross-copied hashes before moving on results in A’s hashes being copied to B, so now B has been fully hashed without reading B (which is the same data as A, after all!) That means one less full file read when the BC or BD pairs are checked. This seems small, but in a large data set with lots of hard links it can represent thousands of files that aren’t unnecessarily read twice. This can make the difference between data being blown out of the OS disk caches or not. It’s small but it can really add up.

News: hash database feature, cross-copying hard link hashes, jdupes leaving GitHub

Hash database: millions of files scanned in seconds

jdupes version 1.27.3 has been released with a long-awaited and heavily requested feature: a file hash database. While the feature is not as “smart” I’d like it to be, I’ve decided to release it as soon as possible because it really is that big of a deal. It’s probably not going to help you if you’re scanning a folder tree with a hundred files, but the hash database feature makes a massive difference for repeated jdupes scans of large data sets that don’t change much. I’ve personally run tests on a colossal pile of already de-duplicated random media content (mostly web spider-gathered images, videos, and text files–nearly 1.9 million files in total!) and on a Ryzen 5 5600G machine with the -y . option (use a hash database; the period aliases to jdupes _hashdb.txt) the comparisons after the first run finishes in only a few seconds; I tested the same run without the hash database feature and when it was 10% finished and I was tired of waiting, it had spent 71 seconds running unnecessary comparisons. When new files are added it only takes a few seconds to find and delete them.

The major flaw in the hash database feature is in how it handles file and folder paths. If you run jdupes -y . testdir and jdupes -y . ./testdir and jdupes -y . ../currentdir/testdir the database component sees those as completely different paths. This is an obvious flaw, but the workaround is to run jdupes from the same working directory each time and to specify the target path(s) the same way each time. This issue will be fixed in a future release.

Cross-copied hashes: faster scans by not repeating work

Every time jdupes examines a pair of files to see if they could be duplicates, the contents of both files are read and hashes (numbers based on a file’s contents that are used as a “shortcut” to quickly compare that file against other files) of the contents are generated, then they are compared to see if files should be examined even further. An upcoming performance enhancement to jdupes is hash cross-copying. This is where two files being compared during a scan are found to be hard-linked (they look like two separate files to the user but point to the same data and metadata on-disk) and the hashes are copied between the in-memory information about each file. Before this enhancement, two files that are literally the same file would be hashed and compared separately, potentially wasting a lot of time re-reading the same data to calculate the same hashes. While this enhancement doesn’t completely avoid this unnecessary work, it does avoid it after a hard-linked pair of files is detected. This can result in a significant performance boost. It will be included in the next release of jdupes.

Leaving GitHub: forced two-factor authentication is evil

The last news is also the worst. I (Jody Bruchon) intend to leave GitHub entirely by the end of September 2023. The main driver of this is GitHub implementing mandatory two-factor authentication (abbreviated 2FA, typically thought of by most people as “requiring a cell phone to log in”), though there are several other reasons for leaving that have built me up to this point. 2FA is a massive double-edged sword. I have run a PC repair shop for a very long time and I have seen countless customers lose their email and social media accounts forever due to a combination of being railed into 2FA they didn’t want plus losing access to the phone number or the actual phone itself that 2FA was set up to work with.

GitHub two-factor authentication blog post graphic
GitHub’s picture of a phone locking you out of your GitHub account. How appropriate.

The increased security offered by 2FA comes from requiring verification outside of the traditional “give me your password” system of logging in; someone who gets their hands on your password probably won’t also have your unlocked cell phone in their hand when they try to log into your account using that stolen password, and that’s how and why 2FA works in a nutshell. Unfortunately, this also means that your password isn’t enough for YOU to log in, either; if you lose access to your second factor, you lose your account even if you know your password, and that’s how and why 2FA is a bad thing in a nutshell. There are other aspects such as the use of “authenticator apps” on smartphones that pose additional tracking and privacy risks that SMS-code-based 2FA doesn’t, but this isn’t the appropriate place to discuss this topic in such depth.

The general plan is to make this website, jdupes.com, the official website for the jdupes and libjodycode projects, while all my other software projects will be re-homed to the Software page on jodybruchon.com. I haven’t figured out what I’ll do about hosting my Git repositories and an issue tracker yet. Keep checking the jdupes.com front page for updates to see where things land.