News: GitHub is OUT, Codeberg is IN

No more forced 2FA

jdupes, libjodycode, and all of my other software projects are migrating to Codeberg immediately.

Codeberg is not ideal due to the presence of rules lawyering-prone language in the “Allowed Content & Usage” section of the Terms of Service, but they are a major improvement over GitHub because they don’t force 2FA and they aren’t run by Microsoft. The GitHub AUP has abusive language similar to the Codeberg TOS so the ideological toxicity of this new overlord is no different than that of the old one. If any issues ever arise I’ll just fire up my own Git server and give all the overlords the finger.

Killing GitHub through attrition

Once migration is complete, I’ll be locking the GitHub repos as read-only forever, though I am inclined to delete them outright since the only power people have over giant corporations is no longer feeding them your time, money, content, and other resources. That’s why I destroyed my several years of contributions to Reddit. It’s more convenient to use the offerings of these companies, but using them results in horrible abuses of the users and stifling of freedoms in the long run.

News: major libjodycode rework, GitHub exit on hold, hashdb is great

libjodycode work in progress

I am in the middle of a major upgrade to libjodycode that will convert the library from a random collection of code I use across several of my projects into a Windows C Unicode portability library. One of the major disadvantages of compiling against the MinGW runtime is the lack of native Windows Unicode support. The largest part of libjodycode has always been the Windows Unicode support code that transparently converts to and from UTF-16 and handles quirky differences between Windows and UNIX-like platforms. What I’m working on is a conversion from a cobbled-together mess of code I copy-pasted from jdupes into a library that effectively replaces several basic C library calls such as access() and fopen() with versions that work with Windows Unicode without modification.

Unfortunately, while I’m working on libjodycode and all of my programs that interface with it to generalize everything out, you’re guaranteed to encounter problems if you try to use the latest master branches for libjodycode and jdupes. These will continue to be broken without notice until I’m done overhauling libjodycode.

GitHub exit on hold

In the August 28, 2023 news post, I said that I’d be leaving GitHub, meaning jdupes would also be leaving GitHub and moving somewhere else. It’s October 4 and that hasn’t happened yet. The (brief) reasons for this are:

  • GitHub replaced the insane CAPTCHA for setting up 2FA with something reasonable that actually worked when I tried it, and
  • I moved my computer consultancy and repair business out of my home and into a real office again–a process that put me behind on work by several weeks.

jdupes, libjodycode, and all my other software projects will still be removed from GitHub in the future, but the timeline is not certain right now. I’ll post another news item when I have a concrete transition date.

Hash database feature is awesome!

Remember that hash database feature I added to jdupes in August 2023? I’ve been using it on lots of different data sets since then and I’m amazed at how much time it saves! One of my residential PC repair customers doesn’t delete the pictures from the phone after phone picture import sessions; I was able to give them a one-line batch file that uses jdupes with the hashdb feature to remove all duplicate photo files in just a few seconds. They can run it immediately after importing photos to clean up after the import rather than changing their existing habits. Despite hashdb missing several basic features, it’s still insanely useful when used carefully.

News: hard link hash cross-copying

It’s faster to not do the same work twice

The next release of jdupes will have a feature that users won’t see but that will make a big difference for people who scan lots of hard-linked files: cross-copied hashes. When jdupes detects that two files being considered for comparison are hard linked and one of those files has already had some hashing work performed, the file with more work done will have its hashes copied to the file with less work done. This can make a big difference in performance since it eliminates extra work.

In jdupes, files are checked in pairs. If four duplicate files (A B C D) are to be checked then six comparisons need to be done (AB, AC, AD, BC, BD, CD). The way that jdupes worked in ancient times would require all four files to be read in their entirety three times each in order to reliably state that all of them are duplicates of one another. One day, the hard link matching optimization was added, making any hard linked files match with no reads at all. If two files (let’s say A and B) were hard linked, they’re literally the same piece of data on disk, thus they’re obvious duplicates without checking any data. However, while this eliminated the two full file reads that would be triggered by the AB candidate pair, this didn’t help at all with the other five pairs. The hard link relationship doesn’t apply to them, so it makes no difference in those matches.

However, there is an incredibly easy optimization that can be applied here: if the AB comparison happens after any other comparison (AC, BD, whatever) that required hashing A or B as part of the process then the hashes from one applies to the other. This is where cross-copied hashes come into play. Let’s say that the AC pair is scanned and found to be duplicate, then the AB pair is scanned. A and C have been both partially and fully hashed but B hasn’t yet. The hard link match would normally mark both as matched and move on, but applying cross-copied hashes before moving on results in A’s hashes being copied to B, so now B has been fully hashed without reading B (which is the same data as A, after all!) That means one less full file read when the BC or BD pairs are checked. This seems small, but in a large data set with lots of hard links it can represent thousands of files that aren’t unnecessarily read twice. This can make the difference between data being blown out of the OS disk caches or not. It’s small but it can really add up.

News: hash database feature, cross-copying hard link hashes, jdupes leaving GitHub

Hash database: millions of files scanned in seconds

jdupes version 1.27.3 has been released with a long-awaited and heavily requested feature: a file hash database. While the feature is not as “smart” I’d like it to be, I’ve decided to release it as soon as possible because it really is that big of a deal. It’s probably not going to help you if you’re scanning a folder tree with a hundred files, but the hash database feature makes a massive difference for repeated jdupes scans of large data sets that don’t change much. I’ve personally run tests on a colossal pile of already de-duplicated random media content (mostly web spider-gathered images, videos, and text files–nearly 1.9 million files in total!) and on a Ryzen 5 5600G machine with the -y . option (use a hash database; the period aliases to jdupes _hashdb.txt) the comparisons after the first run finishes in only a few seconds; I tested the same run without the hash database feature and when it was 10% finished and I was tired of waiting, it had spent 71 seconds running unnecessary comparisons. When new files are added it only takes a few seconds to find and delete them.

The major flaw in the hash database feature is in how it handles file and folder paths. If you run jdupes -y . testdir and jdupes -y . ./testdir and jdupes -y . ../currentdir/testdir the database component sees those as completely different paths. This is an obvious flaw, but the workaround is to run jdupes from the same working directory each time and to specify the target path(s) the same way each time. This issue will be fixed in a future release.

Cross-copied hashes: faster scans by not repeating work

Every time jdupes examines a pair of files to see if they could be duplicates, the contents of both files are read and hashes (numbers based on a file’s contents that are used as a “shortcut” to quickly compare that file against other files) of the contents are generated, then they are compared to see if files should be examined even further. An upcoming performance enhancement to jdupes is hash cross-copying. This is where two files being compared during a scan are found to be hard-linked (they look like two separate files to the user but point to the same data and metadata on-disk) and the hashes are copied between the in-memory information about each file. Before this enhancement, two files that are literally the same file would be hashed and compared separately, potentially wasting a lot of time re-reading the same data to calculate the same hashes. While this enhancement doesn’t completely avoid this unnecessary work, it does avoid it after a hard-linked pair of files is detected. This can result in a significant performance boost. It will be included in the next release of jdupes.

Leaving GitHub: forced two-factor authentication is evil

The last news is also the worst. I (Jody Bruchon) intend to leave GitHub entirely by the end of September 2023. The main driver of this is GitHub implementing mandatory two-factor authentication (abbreviated 2FA, typically thought of by most people as “requiring a cell phone to log in”), though there are several other reasons for leaving that have built me up to this point. 2FA is a massive double-edged sword. I have run a PC repair shop for a very long time and I have seen countless customers lose their email and social media accounts forever due to a combination of being railed into 2FA they didn’t want plus losing access to the phone number or the actual phone itself that 2FA was set up to work with.

GitHub two-factor authentication blog post graphic
GitHub’s picture of a phone locking you out of your GitHub account. How appropriate.

The increased security offered by 2FA comes from requiring verification outside of the traditional “give me your password” system of logging in; someone who gets their hands on your password probably won’t also have your unlocked cell phone in their hand when they try to log into your account using that stolen password, and that’s how and why 2FA works in a nutshell. Unfortunately, this also means that your password isn’t enough for YOU to log in, either; if you lose access to your second factor, you lose your account even if you know your password, and that’s how and why 2FA is a bad thing in a nutshell. There are other aspects such as the use of “authenticator apps” on smartphones that pose additional tracking and privacy risks that SMS-code-based 2FA doesn’t, but this isn’t the appropriate place to discuss this topic in such depth.

The general plan is to make this website, jdupes.com, the official website for the jdupes and libjodycode projects, while all my other software projects will be re-homed to the Software page on jodybruchon.com. I haven’t figured out what I’ll do about hosting my Git repositories and an issue tracker yet. Keep checking the jdupes.com front page for updates to see where things land.