Recently Reuters ran a story regarding (i) Amazon’s ebooks business that is part of its highly popular Kindle readers and (ii) reader applications for phone and table devices. One consequence of these offerings is they provide a way for writers to self-publish their own work for free.
Unfortunately whenever the words Internet and free are combined, the result is typically a mountain of worthless content in search of easy fame or a quick buck. Reuters referred to the flood of bogus books as spam, but that term more appropriately refers to unwelcome advertising pushed at the user. This is more of a matter of selling an article of dubious value, or scam.
Some of the scam ebooks are Private Label Rights content, which is the written equivalent of stock photos. Legally, there is no reason why someone cannot turn them into an e-book, or turn them into dozens of ebooks each with a slightly different title and slightly different price and slightly different author. Others are simply theft, taking one person’s inexpensive or free e-book and repackaging it under a different name with the hope of stealing a few sales before being caught. Either way it is hardly in Amazon’s best interest to have its catalog be flooded with drivel, either from the standpoint of the customer or of the legitimate publisher.
Reading of this problem reminded me of a tool we occasionally use in computer forensics – the fuzzy hash.
If the question was simply to remove exact duplicates, the technique to do so is to calculate and keep in the catalog an SHA or MD5 hash value. Chances are they were calculating a hash value anyhow for the purposes of validating the accuracy of multiple copies of the file on various servers. What is great about SHA or MD5 hashes is that, if even a single byte is changed, the entire hash value is different. Unfortunately Amazon’s problem isn’t so much identical files as nearly identical files. The creators of the junk e-book will at least change the title, change the author, and possibly add or remove a paragraph, or some other search and replace. That is more than enough to create a hash value that would be just as different as if it was an entirely different book.
Instead, fuzzy hash is needed. This is a hash calculation where similar files result in similar, or even identical, hashes. Fuzzy hashes are frequently used in indexing biometric data. For example, even though our fingerprints never change, a particular sample of our fingerprint might differ enough from the stored version to not match to computer precision. Thus, there needs to be a little ‘wiggle’ allowed in the matching software. In general, fuzzy hashes are specific to individual data types. For example, to see an excellent example of fuzzy hash indexing of images, visit tineye.com.
Amazon’s ebooks are text files of a specific format. The most common method of fuzzy hashes for text documents is called Context Triggered Piecewise Hashing, which was developed by Andrew Tridgell when researching ways to detect spam email. Trigdell called his technique spamsum. The technique was further implemented as a program including edit distance matching called ssdeep by Jesse Kornbloom.
Essentially the program works as follows. There is a particular string of characters that forms the ‘trigger’. The program reads from the beginning of the file to the first occurrence of the ‘trigger’. It then runs an ordinary hash on that piece of data and takes only the very last byte. It then reads the file to the next trigger, hashes that stretch of data, and then again only takes the very last byte to add to the output. This continues until the end of the file is reached, at which point the remainder is hashed to form the last character of the result. Thus a spamsum hash will not be of a fixed length like conventional hashes, but will be one byte longer than the number of occurrences of the trigger in the file.
So the spamsum calculation provides similar results for similar documents. But how does the computer then evaluate what is ‘similar’? The answer in ssdeep is a familiar computer algorithm called the edit distance. The edit distance can be thought of as the number of keystrokes, either in inserting a character, deleting a character, or overwriting one character with another, that it would take to change the one string to another. So for example:
To get from string to sting would be an edit distance of 1 (removing the r). To get from string to strong would be an edit distance of 1 (replacing the i with o). To get from string to strange would be an edit distance of 2 (replacing the i with an a and adding the e at the end). Performing an edit distance calculation on a text files the size of a book would be an impractical task. Performing it on their fuzzy hash values, though accomplishes the same thing.
The user of ssdeep specifies a matching threshold. For example, a matching threshold of 95 it means that the edit distance is less than 5 percent of the length of the file. This would be a very strong correlation with a very high chance that the two files are nearly identical. Amazon could calculate and save spamsum hashes of all their ebooks. Then, when a new e-book is submitted, the prospective publisher compares its spamsum calculation to those of the rest of the library. A near-match would trigger a hold on the new submission.
The ssdeep program is not used often in computer forensics, but it can be highly valuable in the right situation. For example, there was a case where a sales manager went from one firm to a competitor. It was known that the person inappropriately took with him a copy of thousands of the computer files he had access to on a USB hard disk. We were asked to investigate the extent to which those files were utilized at their new employer for the purpose of determining trade secret damages. A file that was similar to one of the taken documents (such as changing the letterhead, dates, product names and addresses, but otherwise the same) was far more relevant than the exact copies that an MD5 analysis would have given. However there were such a large variety of files that searching for key phrases would have been impractical. In this case the ssdeep program proved efficient and yielded relatively few false positives.