Amazon’s Junk Ebook Problem and a Computer Forensics Approach to Solve It

Recently Reuters ran a story regarding (i) Amazon’s ebooks business that is part of its highly popular Kindle readers and (ii) reader applications for phone and tablet devices.  One consequence of these offerings is that they provide a way for writers to self-publish their own work for free.

Unfortunately, whenever the words Internet and free are combined, the result is typically a mountain of worthless content in search of easy fame or a quick buck.  Reuters referred to the flood of bogus books as spam, but that term more appropriately refers to unwelcome advertising pushed at the user.  This is more a matter of selling an article of dubious value, in other words a scam.

Some of the scam ebooks are Private Label Rights content, which is the written equivalent of stock photos.  Legally, there is no reason why someone cannot turn them into an e-book, or turn them into dozens of ebooks each with a slightly different title and slightly different price and slightly different author.  Others are simply theft, taking one person’s inexpensive or free e-book and repackaging it under a different name with the hope of stealing a few sales before being caught.  Either way it is hardly in Amazon’s best interest to have its catalog be flooded with drivel, either from the standpoint of the customer or of the legitimate publisher.

Reading of this problem reminded me of a tool we occasionally use in computer forensics – the fuzzy hash.

If the question were simply how to remove exact duplicates, the technique would be to calculate and keep in the catalog an SHA or MD5 hash value for each file.  Chances are Amazon is calculating a hash value anyhow for the purpose of validating the accuracy of multiple copies of the file on various servers.  What is great about SHA or MD5 hashes is that, if even a single byte is changed, the entire hash value is different.  Unfortunately, Amazon’s problem isn’t so much identical files as nearly identical files.  The creators of a junk e-book will at least change the title, change the author, and possibly add or remove a paragraph or run some other search and replace.  That is more than enough to create a hash value just as different as if it were an entirely different book.
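As a minimal sketch of that limitation, the Python snippet below hashes a made-up e-book text and a retitled copy with SHA-256 from the standard hashlib module.  The sample text and the helper name file_digest are assumptions for illustration only; nothing here reflects how Amazon actually stores or checks its catalog.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical book contents, used only for this illustration.
original = b"Title: Cooking for One\nAuthor: A. Writer\n\nChapter 1: Eggs..."
retitled = original.replace(b"Cooking for One", b"Cooking for Two")

# A catalog of exact-match hashes catches only byte-for-byte copies.
catalog = {file_digest(original)}

print(file_digest(original) in catalog)   # True: the exact duplicate is caught
print(file_digest(retitled) in catalog)   # False: the retitled copy slips through
```

Changing three characters in the title is enough to make the second lookup fail, which is exactly the gap a fuzzy hash is meant to close.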

Instead, a fuzzy hash is needed.  This is a hash calculation where similar files result in similar, or even identical, hashes.  Fuzzy hashes are frequently used in indexing biometric data.  For example, even though our fingerprints never change, a particular sample of our fingerprint might differ enough from the stored version that the two do not match to computer precision.  Thus, there needs to be a little ‘wiggle’ allowed in the matching software.  In general, fuzzy hashes are specific to individual data types.  For example, to see an excellent example of fuzzy hash indexing of images, visit

Amazon’s ebooks are text files of a specific format.  The most common method of fuzzy hashing for text documents is called Context Triggered Piecewise Hashing, which grew out of work by Andrew Tridgell when researching ways to detect spam email.  Tridgell called his technique spamsum.  Jesse Kornblum later implemented the technique, together with edit distance matching, as a program called ssdeep.

Essentially the program works as follows.  There is a particular string of characters that forms the ‘trigger’.  The program reads from the beginning of the file to the first occurrence of the ‘trigger’.  It then runs an ordinary hash on that piece of data and takes only the very last byte.  It then reads the file to the next trigger, hashes that stretch of data, and then again only takes the very last byte to add to the output.  This continues until the end of the file is reached, at which point the remainder is hashed to form the last character of the result.  Thus a spamsum hash will not be of a fixed length like conventional hashes, but will be one byte longer than the number of occurrences of the trigger in the file.
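The simplified description above can be sketched in a few lines of Python.  This is a toy version of the idea as described in this article, not the real spamsum or ssdeep algorithm: as far as I know, the real programs choose their trigger points with a rolling hash over the data rather than a fixed string, and they adjust the block size to keep the signature short.  The function name piecewise_hash, the use of MD5 for each piece, and the mapping of the last byte onto a 64-character alphabet are all assumptions made for illustration.

```python
import hashlib

# Printable alphabet so each piece contributes one readable character.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def piecewise_hash(data: bytes, trigger: bytes) -> str:
    """Toy piecewise hash: one output character per trigger-delimited piece."""
    signature = []
    # Splitting on the trigger yields (number of occurrences + 1) pieces.
    for piece in data.split(trigger):
        digest = hashlib.md5(piece).digest()
        # Keep only the last byte of the ordinary hash, mapped to a character.
        signature.append(ALPHABET[digest[-1] % len(ALPHABET)])
    return "".join(signature)


book = b"It was a dark night. The rain fell. Nobody came."
copy = b"It was a dark night. The SNOW fell. Nobody came."

sig_book = piecewise_hash(book, b". ")
sig_copy = piecewise_hash(copy, b". ")
print(sig_book, sig_copy)
```

Because only the middle sentence differs, the first and last characters of the two signatures are the same; only the character for the edited piece can change.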

So the spamsum calculation provides similar results for similar documents.  But how does the computer then evaluate what is ‘similar’?  The answer in ssdeep is a familiar computer algorithm called the edit distance.  The edit distance can be thought of as the number of keystrokes, whether inserting a character, deleting a character, or overwriting one character with another, that it would take to change one string into another.  So for example:

To get from string to sting would be an edit distance of 1 (removing the r).  To get from string to strong would be an edit distance of 1 (replacing the i with o).  To get from string to strange would be an edit distance of 2 (replacing the i with an a and adding the e at the end).  Performing an edit distance calculation on text files the size of a book would be an impractical task.  Performing it on their fuzzy hash values, though, accomplishes much the same thing at a fraction of the cost.
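The three examples above can be checked with a short Python sketch of the classic dynamic-programming edit distance (often called the Levenshtein distance).  The function name edit_distance is my own; this is a textbook version for illustration, not the code used inside ssdeep.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # overwrite (or keep)
            ))
        previous = current
    return previous[-1]


print(edit_distance("string", "sting"))    # 1
print(edit_distance("string", "strong"))   # 1
print(edit_distance("string", "strange"))  # 2
```

The work grows with the product of the two string lengths, which is why running it on two short signatures is practical while running it on two full book texts is not.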

The user of ssdeep specifies a matching threshold.  For example, a matching threshold of 95 means, roughly, that the edit distance between the two hashes is less than 5 percent of their length.  This would be a very strong correlation with a very high chance that the two files are nearly identical.  Amazon could calculate and save spamsum hashes of all their ebooks.  Then, when a new e-book is submitted, its spamsum hash would be compared to those of the rest of the library.  A near-match would trigger a hold on the new submission.
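A rough sketch of that workflow in Python follows.  The scoring formula (100 minus the edit distance as a percentage of the longer signature) is a simplification I am assuming for illustration and is not ssdeep's actual scoring; the names match_score and find_near_matches, the catalog titles, and the signature strings are likewise hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_score(sig_a: str, sig_b: str) -> int:
    """Assumed score: 100 for identical signatures, falling as edits accumulate."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return 100
    return 100 - (100 * edit_distance(sig_a, sig_b)) // longest


def find_near_matches(new_sig: str, catalog: dict, threshold: int = 95) -> list:
    """Return catalog titles whose stored signature scores at or above the threshold."""
    return [title for title, sig in catalog.items()
            if match_score(new_sig, sig) >= threshold]


# Hypothetical stored signatures, 40 characters each.
catalog = {
    "Original Book": "A" * 39 + "B",
    "Unrelated Book": "Z" * 40,
}
submission = "A" * 40  # differs from "Original Book" in one character

holds = find_near_matches(submission, catalog)
print(holds)  # ['Original Book']
```

A real catalog would hold millions of signatures, so in practice some form of indexing would be needed to avoid comparing every submission against every stored hash; that is outside the scope of this sketch.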

The ssdeep program is not used often in computer forensics, but it can be highly valuable in the right situation.  For example, there was a case where a sales manager went from one firm to a competitor.  It was known that he had inappropriately copied thousands of the computer files he had access to onto a USB hard disk and taken them with him.  We were asked to investigate the extent to which those files were utilized at his new employer for the purpose of determining trade secret damages.  A file that was similar to one of the taken documents (such as one with the letterhead, dates, product names and addresses changed, but otherwise the same) was far more relevant than the exact copies that an MD5 analysis would have given.  However, there was such a large variety of files that searching for key phrases would have been impractical.  In this case the ssdeep program proved efficient and yielded relatively few false positives.


    • Larry on June 29, 2011 at 4:00 PM

    This idea sounds brilliant. I was wondering if you submitted this idea to Amazon, or if you happen to know whether that company would even have experts who could implement such an idea. The reason I ask is that I’m an author and had been thinking about placing my ebook with Amazon until hearing about the ebook plagiarism problem.


    1. If you were offering a very inexpensive e-book in the 99-cent range, it might be worth occasionally checking for plagiarism, but in general there is enough free or low-cost content that the makers of the junk books are unlikely to want to put money up front.

    • Yheyen on August 25, 2012 at 8:16 AM

    I would like to make some comments about some of your statements to spark the conversation.

    “As a cellphone examiner you often have to use multiple tools during an exam. If you are not, then how are you conducting any validation?”

    Here in my department’s lab we generally validate every phone exam by reviewing the phone’s contents manually and comparing what we see to what is on the report. We have processed many phones with two or more ‘tools’ only to find that both tools reported inaccurate and / or questionable results. Just because two different tools report the same findings does not make them accurate, nor does it ‘validate’ the tool. We have had several occasions where we processed the same make / model phone with the same tool and, during our review, found that one phone’s report was correct while the other phone’s report was wrong. I have come to believe (at least for now) that no tool should be relied upon completely, although I know of other individuals (not in my lab) who are simply running phones through whatever tool seems to work best and then burning the results to a CD without ever reviewing the report.

    As for your process of: dump the file system / obtain hash values – process the phone – dump the file system / obtain hash values and compare the results. Are you serious? On every phone we process? I have never heard of this before and I question this process for a couple of reasons:

    1. To set this up as a policy seems to be inviting problems in court. While I know that there is currently no way to dump the file systems on some phones, it seems that a good defense attorney could throw a lot of ‘mud’ if this was your policy and you processed a phone where the file system could not be dumped. I know it should be as easy as just saying that ‘it is not currently possible to dump a particular phone’s file system’ but it never is.

    2. We frequently process phones where our detectives only want us to obtain specific items from the phone, such as only text messages or only images. While the reasons for this vary, they include search warrant limitations and consent limitations. Often our detectives will only want us to document specific text messages that pertain to their case and nothing more. Why dump the file systems on these phones?

    3. After reading your process, I thought I would try it on a phone that I was currently processing. The phone happened to be a Samsung SCH-R350, and Cellebrite supported a file system dump of this model. I connected the phone and everything went perfectly as the file system dump started. I then waited. I went to lunch, returned and waited some more. After waiting for more than 5 hours, I finally canceled the process. I did some quick math (not my strongest subject) and determined that processing this phone using your method would have taken me at least 10 and more likely 15 – 20 hours. While I realize that not every phone will take this long to process, unless we are trying to obtain deleted data, why?

    While I realize that some smaller departments may only process a few phones each month, here in our lab we will process about 600 phones this year and that number is constantly growing. That is in addition to the more than 200 computers we process. I have a difficult time justifying that type of time expenditure. We here in our lab estimate about 4 hours per phone examination; some phones take less time while others take longer. Doing a little more math for your method, assuming about 10 hours per phone (total processing time including two file system dumps, processing the phone, documentation and report writing, comparing the results to the phone’s display, hashing and reviewing the file system dump results…) equates to 6000 hours per year. Figuring that (without vacations, sick time, holidays, etc.) the average full time worker works 2080 hours per year, I would need more than 3 full time examiners just to process our cell phones.

    Like I stated in the beginning, I greatly appreciate your work in the cellular phone forensics arena and I am constantly learning. Keep up the good work,
    Ritch

    • Niklas on November 20, 2013 at 8:55 PM

    You have impressive information on this site.
