News: hash database feature, cross-copying hard link hashes, jdupes leaving GitHub

Hash database: millions of files scanned in seconds

jdupes version 1.27.3 has been released with a long-awaited and heavily requested feature: a file hash database. While the feature is not as “smart” as I’d like it to be, I’ve decided to release it as soon as possible because it really is that big of a deal. It’s probably not going to help you if you’re scanning a folder tree with a hundred files, but the hash database feature makes a massive difference for repeated jdupes scans of large data sets that don’t change much. I’ve personally run tests on a colossal pile of already de-duplicated random media content (mostly web spider-gathered images, videos, and text files; nearly 1.9 million files in total!) on a Ryzen 5 5600G machine. With the -y . option (use a hash database; the period aliases to jdupes_hashdb.txt), the comparisons after the first run finish in only a few seconds. I tested the same run without the hash database feature, and when it was 10% finished and I was tired of waiting, it had already spent 71 seconds running unnecessary comparisons. When new files are added, it only takes a few seconds to find and delete any duplicates among them.
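To illustrate why a hash database makes repeat scans so cheap, here is a minimal Python sketch of the general idea. It is not jdupes’s code or its database format. The size-and-modification-time check, the in-memory dictionary, the use of SHA-256, and the names file_hash and cached_hash are all assumptions made for the example.

```python
import hashlib
import os


def file_hash(path):
    """Hash a file's full contents (SHA-256 is only a stand-in here)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_hash(path, cache):
    """Return (hash, was_cache_hit) for path.

    Assumption: a stored hash is trusted as long as the file's size and
    modification time still match what was recorded with it.
    """
    st = os.stat(path)
    entry = cache.get(path)  # keyed on the path exactly as it was given
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime_ns:
        return entry["hash"], True  # hit: the file is not read at all
    digest = file_hash(path)  # miss: read and hash the file, then remember it
    cache[path] = {"size": st.st_size, "mtime": st.st_mtime_ns, "hash": digest}
    return digest, False
```

On a second scan of unchanged files, every lookup is a hit and no file contents are read, which is where the time savings would come from in a scheme like this.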

The major flaw in the hash database feature is in how it handles file and folder paths. If you run jdupes -y . testdir, jdupes -y . ./testdir, and jdupes -y . ../currentdir/testdir, the database component sees those as completely different paths even though all three point at the same folder. This is an obvious flaw, but the workaround is to run jdupes from the same working directory each time and to specify the target path(s) the same way each time. This issue will be fixed in a future release.
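The path problem is easy to see in a small Python sketch. The three spellings are different strings, so a database keyed on the raw string treats them as three different entries, while anchoring each one at the working directory and collapsing "." and ".." yields a single key. This only demonstrates the issue. It is not how jdupes will fix it, and the working directory and the canonical helper are hypothetical.

```python
import posixpath


def canonical(path, cwd):
    """Anchor a relative path at cwd and collapse '.' and '..' parts.

    This is pure string manipulation: it does not touch the disk and
    does not resolve symbolic links.
    """
    return posixpath.normpath(posixpath.join(cwd, path))


cwd = "/home/user/currentdir"  # hypothetical working directory
spellings = ["testdir", "./testdir", "../currentdir/testdir"]

raw_keys = set(spellings)                          # three distinct strings
clean_keys = {canonical(p, cwd) for p in spellings}  # one canonical path

print(len(raw_keys), len(clean_keys))
print(clean_keys)
```

Until something like this happens inside jdupes itself, typing the path the same way from the same directory every time has the same effect.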

Cross-copied hashes: faster scans by not repeating work

Every time jdupes examines a pair of files to see if they could be duplicates, the contents of both files are read and hashes (numbers based on a file’s contents that are used as a “shortcut” to quickly compare that file against other files) of the contents are generated, then they are compared to see if files should be examined even further. An upcoming performance enhancement to jdupes is hash cross-copying. This is where two files being compared during a scan are found to be hard-linked (they look like two separate files to the user but point to the same data and metadata on-disk) and the hashes are copied between the in-memory information about each file. Before this enhancement, two files that are literally the same file would be hashed and compared separately, potentially wasting a lot of time re-reading the same data to calculate the same hashes. While this enhancement doesn’t completely avoid this unnecessary work, it does avoid it after a hard-linked pair of files is detected. This can result in a significant performance boost. It will be included in the next release of jdupes.
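Hash cross-copying can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the jdupes implementation. The FileRecord structure, the device and inode check for hard links, and the split into a partial hash and a full hash are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """In-memory information about one scanned file (hypothetical layout)."""
    path: str
    device: int
    inode: int
    partial_hash: Optional[int] = None  # None means "not calculated yet"
    full_hash: Optional[int] = None


def is_hard_linked(a, b):
    """Two names refer to the same file if device and inode both match."""
    return (a.device, a.inode) == (b.device, b.inode)


def cross_copy_hashes(a, b):
    """Share known hashes between a hard-linked pair.

    Returns True if the pair is hard-linked, False otherwise.
    """
    if not is_hard_linked(a, b):
        return False
    for field in ("partial_hash", "full_hash"):
        va, vb = getattr(a, field), getattr(b, field)
        if va is None and vb is not None:
            setattr(a, field, vb)  # a borrows the hash b already has
        elif vb is None and va is not None:
            setattr(b, field, va)  # b borrows the hash a already has
    return True


# Both names point at the same data, but only the first has been hashed.
a = FileRecord("photos/cat.jpg", device=1, inode=42,
               partial_hash=0xABCD, full_hash=0x1234)
b = FileRecord("backup/cat.jpg", device=1, inode=42)
cross_copy_hashes(a, b)
```

After the call, b carries the same hashes as a without its data ever being read, so later comparisons involving b can skip that work.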

Leaving GitHub: forced two-factor authentication is evil

The last news is also the worst. I (Jody Bruchon) intend to leave GitHub entirely by the end of September 2023. The main driver of this is GitHub implementing mandatory two-factor authentication (abbreviated 2FA, thought of by most people as “requiring a cell phone to log in”), though there are several other reasons for leaving that have built up to this point. 2FA is a massive double-edged sword. I have run a PC repair shop for a very long time and I have seen countless customers lose their email and social media accounts forever due to a combination of being railroaded into 2FA they didn’t want plus losing access to the phone number or the actual phone that 2FA was set up to work with.

[Image: GitHub two-factor authentication blog post graphic]
GitHub’s picture of a phone locking you out of your GitHub account. How appropriate.

The increased security offered by 2FA comes from requiring verification outside of the traditional “give me your password” system of logging in; someone who gets their hands on your password probably won’t also have your unlocked cell phone in their hand when they try to log into your account using that stolen password, and that’s how and why 2FA works in a nutshell. Unfortunately, this also means that your password isn’t enough for YOU to log in, either; if you lose access to your second factor, you lose your account even if you know your password, and that’s how and why 2FA is a bad thing in a nutshell. There are other aspects such as the use of “authenticator apps” on smartphones that pose additional tracking and privacy risks that SMS-code-based 2FA doesn’t, but this isn’t the appropriate place to discuss this topic in such depth.

The general plan is to make this website, jdupes.com, the official website for the jdupes and libjodycode projects, while all my other software projects will be re-homed to the Software page on jodybruchon.com. I haven’t figured out what I’ll do about hosting my Git repositories and an issue tracker yet. Keep checking the jdupes.com front page for updates to see where things land.
