“I thought I just reviewed that document!” is a thought every document reviewer since the dawn of time has had. Duplicate documents, and documents that seem like duplicates but aren’t, can slow down document review, increase costs, and lead to inconsistent document coding—which can be extremely problematic when it occurs during a privilege review. This article explores why document reviewers who claim to see duplicates are right and wrong at the same time, by explaining how deduplication works and the inherent shortcomings of that process. It also offers a solution: two technologies, near-duplicate detection and email threading, which can greatly reduce the number of “duplicate” documents that must be reviewed.

Deduplication sounds simple in theory: remove all of the duplicate documents when you load them into your document review platform. In practice, it isn’t so simple. Are a PDF and a Word document with the exact same text duplicates? (No.) What about a document that is attached to an email and also saved on someone’s hard drive? (Yes.) The answers turn on hash values, the standard method document review platforms use to deduplicate documents. When documents are processed into the platform, a hash value is generated for each one and compared against the hash values of every document already processed; any document whose hash value matches an earlier document’s will not be made available for review. (How hash values are generated is beyond the scope of this article.) The catch is that hash values match only for exact duplicates. The slightest difference between documents, even one that is not perceptible to the reviewer, or the same text saved in different file formats, means the documents will not generate the same hash value. If they aren’t exact duplicates, you’ll be stuck looking at what is practically the same document multiple times.
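The hashing behavior described above can be illustrated with a short sketch. This is a simplified illustration, not any platform’s actual implementation: the file contents below are made-up byte strings standing in for real Word and PDF files, and MD5 is used only because it is a hash commonly seen in e-discovery processing.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the MD5 hash of a file's raw bytes.

    Hashing operates on bytes, not on the text a reviewer sees,
    which is why format differences break deduplication.
    """
    return hashlib.md5(data).hexdigest()


# Identical bytes -> identical hash: the email attachment and the
# copy on the hard drive deduplicate against each other.
attachment = b"Quarterly forecast: revenue up 4%."
hard_drive_copy = b"Quarterly forecast: revenue up 4%."
print(file_hash(attachment) == file_hash(hard_drive_copy))  # True

# Same visible text, different formats -> different bytes, so
# different hashes: the Word file and the PDF are NOT duplicates.
# (These byte strings are stand-ins for real .docx and PDF files.)
word_doc = b"PK\x03\x04 ... Quarterly forecast: revenue up 4%."
pdf_doc = b"%PDF-1.7 ... Quarterly forecast: revenue up 4%."
print(file_hash(word_doc) == file_hash(pdf_doc))  # False
```

Because the comparison happens at the byte level, even an invisible change, such as a different file container or embedded metadata, produces a completely different hash, and the platform treats the documents as unrelated.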