News: major libjodycode rework, GitHub exit on hold, hashdb is great

libjodycode work in progress

I am in the middle of a major upgrade to libjodycode that will convert the library from a random collection of code I use across several of my projects into a Windows C Unicode portability library. One of the major disadvantages of compiling against the MinGW runtime is the lack of native Windows Unicode support. The largest part of libjodycode has always been the Windows Unicode support code that transparently converts to and from UTF-16 and handles quirky differences between Windows and UNIX-like platforms. What I’m working on is converting that cobbled-together mess of code I copy-pasted from jdupes into a library that effectively replaces several basic C library calls such as access() and fopen() with versions that handle Windows Unicode transparently, so calling code doesn’t need to be modified.
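
To illustrate the general idea, here is a rough sketch of the kind of wrapper involved; it is not libjodycode’s actual code or API, and the function name is hypothetical. On Windows, the UTF-8 path is converted to UTF-16 and handed to the wide-character version of the call, while on other platforms the wrapper falls straight through to the normal C library function:

    /* Hypothetical sketch of a Unicode-aware access() replacement.
     * This is NOT libjodycode's actual code or API. */
    #ifdef _WIN32
    #include <windows.h>
    #include <io.h>

    int unicode_access(const char *path, int mode)
    {
        wchar_t widepath[MAX_PATH];

        /* Convert the UTF-8 path to UTF-16 for the Windows wide-character API */
        if (MultiByteToWideChar(CP_UTF8, 0, path, -1, widepath, MAX_PATH) == 0)
            return -1;
        return _waccess(widepath, mode);
    }
    #else
    #include <unistd.h>
    /* On non-Windows platforms, just pass straight through */
    #define unicode_access(path, mode) access((path), (mode))
    #endif

The real library has to cover many more calls than this, but the pattern is the same: programs call the wrapper everywhere they would otherwise call the plain C library function.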

Unfortunately, while I’m working on libjodycode and all of my programs that interface with it to generalize everything out, you’re guaranteed to encounter problems if you try to use the latest master branches for libjodycode and jdupes. These will continue to be broken without notice until I’m done overhauling libjodycode.

GitHub exit on hold

In the August 28, 2023 news post, I said that I’d be leaving GitHub, meaning jdupes would also be leaving GitHub and moving somewhere else. It’s October 4 and that hasn’t happened yet. The (brief) reasons for this are:

  • GitHub replaced the insane CAPTCHA for setting up 2FA with something reasonable that actually worked when I tried it, and
  • I moved my computer consultancy and repair business out of my home and into a real office again, a process that put me behind on work by several weeks.

jdupes, libjodycode, and all my other software projects will still be removed from GitHub in the future, but the timeline is not certain right now. I’ll post another news item when I have a concrete transition date.

Hash database feature is awesome!

Remember that hash database feature I added to jdupes in August 2023? I’ve been using it on lots of different data sets since then and I’m amazed at how much time it saves! One of my residential PC repair customers never deletes the pictures from their phone after importing them to the PC; I was able to give them a one-line batch file that uses jdupes with the hashdb feature to remove all duplicate photo files in just a few seconds. They can run it immediately after an import session to clean up the duplicates rather than having to change their existing habits. Despite hashdb missing several basic features, it’s still insanely useful when used carefully.

News: hash database feature, cross-copying hard link hashes, jdupes leaving GitHub

Hash database: millions of files scanned in seconds

jdupes version 1.27.3 has been released with a long-awaited and heavily requested feature: a file hash database. While the feature is not as “smart” as I’d like it to be, I’ve decided to release it as soon as possible because it really is that big of a deal. It probably won’t help if you’re scanning a folder tree with a hundred files, but the hash database makes a massive difference for repeated jdupes scans of large data sets that don’t change much. I’ve personally run tests on a colossal pile of already de-duplicated random media content (mostly web spider-gathered images, videos, and text files; nearly 1.9 million files in total!) on a Ryzen 5 5600G machine. With the -y . option (use a hash database; the period aliases to jdupes_hashdb.txt), the comparisons after the first run finish in only a few seconds. I tested the same run without the hash database feature, and by the time it was 10% finished and I was tired of waiting, it had already spent 71 seconds running unnecessary comparisons. When new files are added, it only takes a few seconds to find and delete the new duplicates.

The major flaw in the hash database feature is in how it handles file and folder paths. If you run jdupes -y . testdir, jdupes -y . ./testdir, and jdupes -y . ../currentdir/testdir, the database component sees those as three completely different paths. This is an obvious flaw, but the workaround is to run jdupes from the same working directory each time and to specify the target path(s) the same way each time. This issue will be fixed in a future release.

Cross-copied hashes: faster scans by not repeating work

Every time jdupes examines a pair of files to see if they could be duplicates, the contents of both files are read and hashes (numbers derived from a file’s contents that are used as a “shortcut” to quickly compare that file against other files) are generated, then the hashes are compared to decide whether the files should be examined even further. An upcoming performance enhancement to jdupes is hash cross-copying: when two files being compared during a scan are found to be hard-linked (they look like two separate files to the user but point to the same data and metadata on-disk), the hashes already computed for one are copied into the in-memory information for the other. Before this enhancement, two files that are literally the same file would be hashed and compared separately, potentially wasting a lot of time re-reading the same data to calculate the same hashes. The enhancement doesn’t completely avoid this unnecessary work, but it does avoid it after a hard-linked pair is detected, which can result in a significant performance boost. It will be included in the next release of jdupes.
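
As a rough illustration of the concept (the structure and field names below are hypothetical, not jdupes’ actual internals), a hard-linked pair can be recognized by matching device and inode numbers from stat(), and an already-computed hash can then be copied from one in-memory file record to the other:

    /* Hypothetical sketch of hash cross-copying between hard-linked files.
     * This is NOT jdupes' actual code; names and fields are made up. */
    #include <stdint.h>
    #include <sys/stat.h>

    struct filerec {
        struct stat st;    /* stat() results, including st_dev and st_ino */
        uint64_t hash;     /* hash of the file contents */
        int hash_valid;    /* nonzero once 'hash' has been computed */
    };

    /* If the two records point at the same on-disk file (a hard-linked pair),
     * share whichever hash has already been computed so the data does not
     * have to be read and hashed a second time. */
    static void cross_copy_hashes(struct filerec *a, struct filerec *b)
    {
        if (a->st.st_dev != b->st.st_dev || a->st.st_ino != b->st.st_ino)
            return;  /* not hard-linked; nothing to share */

        if (a->hash_valid && !b->hash_valid) {
            b->hash = a->hash;
            b->hash_valid = 1;
        } else if (b->hash_valid && !a->hash_valid) {
            a->hash = b->hash;
            a->hash_valid = 1;
        }
    }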

Leaving GitHub: forced two-factor authentication is evil

The last news is also the worst. I (Jody Bruchon) intend to leave GitHub entirely by the end of September 2023. The main driver of this is GitHub implementing mandatory two-factor authentication (abbreviated 2FA, typically thought of by most people as “requiring a cell phone to log in”), though there are several other reasons for leaving that have built up to this point. 2FA is a massive double-edged sword. I have run a PC repair shop for a very long time and I have seen countless customers lose their email and social media accounts forever due to a combination of being railroaded into 2FA they didn’t want plus losing access to the phone number or the actual phone itself that 2FA was set up to work with.

[Image: GitHub two-factor authentication blog post graphic]
GitHub’s picture of a phone locking you out of your GitHub account. How appropriate.

The increased security offered by 2FA comes from requiring verification outside of the traditional “give me your password” system of logging in; someone who gets their hands on your password probably won’t also have your unlocked cell phone in their hand when they try to log into your account using that stolen password, and that’s how and why 2FA works in a nutshell. Unfortunately, this also means that your password isn’t enough for YOU to log in, either; if you lose access to your second factor, you lose your account even if you know your password, and that’s how and why 2FA is a bad thing in a nutshell. There are other aspects such as the use of “authenticator apps” on smartphones that pose additional tracking and privacy risks that SMS-code-based 2FA doesn’t, but this isn’t the appropriate place to discuss this topic in such depth.

The general plan is to make this website, jdupes.com, the official website for the jdupes and libjodycode projects, while all my other software projects will be re-homed to the Software page on jodybruchon.com. I haven’t figured out what I’ll do about hosting my Git repositories and an issue tracker yet. Keep checking the jdupes.com front page for updates to see where things land.