One day last week, in three meetings with three different clients, I heard the same questions raised about the problem of duplicates in document collections. Ironically, the problem is greater in document collections that are paper-source or mixed paper and electronic source.

Purely electronic source document collections are, for all their other problems, easily de-duped, and the trend these days is not just to de-dupe within custodians, but preferably to de-dupe across the entire database. And the really good news is that full-scale de-duping can get rid of a lot more than you might have guessed.

About a year ago, in the August 2009 issue of Law Technology News, Anne Kershaw and Joe Howie (another InsideTech columnist) reported on a study they conducted by surveying 18 e-discovery vendors. Confining the scope strictly to pure de-duping (as opposed to near-duplicate detection, e-mail threading, etc.), they found that de-duping within a single custodian reduced the number of documents by an average of 21.4 percent; if performed across multiple custodians, the average reduction nearly doubled to 38.1 percent.

Yet the vendors indicated that, while they all offered cross-custodian de-duping, only 52 percent of the projects got it; in the remainder, their clients opted for either single-custodian de-duping (41 percent) or none at all (7 percent).

Until a few years ago, for many e-discovery vendors, the machine burden of de-duping across custodians was much greater than doing so within one custodian’s collection. Some vendors charged nothing for de-duping within custodian, but charged extra if done across custodians, to compensate for the extra machine time and effort.

Also, in the then-common linear review paradigm (each custodian’s data kept together and reviewed as a unit) de-duping within custodian only was supported by the prima facie plausible argument that “it’s a more accurate picture” of the data to know who had what, even if it did mean that the same document was going to show up multiple times in different custodians’ collections. The mere fact of it being in Al’s collection as well as Barbara’s and Charlie’s was somehow considered sufficient differentiation to justify keeping all three.

De-duping technology is now much better, so cross-custodian de-duping no longer grinds the system to a near halt. What's more, as this article points out, if you need a report as to which other custodians also had a particular document, just about any vendor or hosting platform can generate one.
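To make the idea concrete, here is a minimal sketch of cross-custodian de-duping that still preserves the "who had what" information. It assumes that a duplicate means byte-for-byte identical content and uses a SHA-256 hash to detect it; the function name, the record layout and the sample documents are all hypothetical, and vendors' actual methods (for example, which e-mail fields they hash) will differ.

```python
import hashlib
from collections import defaultdict

def dedupe_across_custodians(documents):
    """documents: iterable of (custodian, doc_id, content_bytes) tuples.

    Keeps the first copy of each distinct content and records every
    other custodian who held an identical copy.
    """
    first_copy = {}                 # hash -> (custodian, doc_id) of the copy kept
    holders = defaultdict(set)      # hash -> all custodians holding that content
    for custodian, doc_id, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        holders[digest].add(custodian)
        first_copy.setdefault(digest, (custodian, doc_id))
    return [
        {
            "doc_id": doc_id,
            "kept_from": kept,
            "also_held_by": sorted(holders[digest] - {kept}),
        }
        for digest, (kept, doc_id) in first_copy.items()
    ]

# Hypothetical sample data: the same memo sits in three custodians' collections.
docs = [
    ("Al", "A-001", b"Q3 forecast memo"),
    ("Barbara", "B-007", b"Q3 forecast memo"),
    ("Charlie", "C-012", b"Q3 forecast memo"),
    ("Barbara", "B-008", b"Lunch on Friday?"),
]
report = dedupe_across_custodians(docs)
for row in report:
    print(row)
```

Four documents go in, two come out for review, and the report row for the memo still shows that Barbara and Charlie had it as well as Al.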

Articles such as the one by Anne and Joe, and those by other consultants, should reassure lawyers that de-duping across the entire database is not just all right, it’s practically incumbent upon them. As these authors state, with the concurrence of several judges they consulted: “Lawyers who fail to check for duplicates across multiple custodians, instead removing only duplicates from within the records of individual custodians, end up reviewing at least 20% more records on average. Whether or not their document review bills are ever audited, these lawyers are not meeting their ethical obligations to both clients and the justice system.”

But what about the problem of duplicates within paper document collections (yes, these still exist) and mixed paper and electronic collections? Here we face a problem we’ve had since the beginning of litigation support, though we now have some tools to address it that we didn’t have back in the 1990s.

In a paper collection, it’s possible that the same document occurs in box 3, box 14, box 19 and box 24, and the original electronic source file it was printed from may exist in the electronic data collection. What are the means by which these paper duplicates can be identified, thereby eliminating both the wasted time of reviewing the same document repeatedly and the risk of different reviewers making different decisions, for example one saying “relevant” and another saying “not relevant,” or one saying “privileged” and another saying “not privileged”?

If the paper collection has been bibliographically coded first, the effort of coding multiple copies of the same document has already been spent, but at least subjective review time has not yet been wasted on them. At this point, selective sorting on key fields may group together documents that are in fact the same, and a senior reviewer can tag and perhaps move aside those duplicates that should not receive further attention.
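The sorting approach can be sketched as follows. This is only an illustration: the field names (doc_date, author, title), the record IDs and the light normalization are assumptions, and a real coding database will have its own fields and its own inconsistencies. The output is a list of candidate groups for a senior reviewer to confirm, not a list of proven duplicates.

```python
from collections import defaultdict

def group_by_key_fields(records, key_fields=("doc_date", "author", "title")):
    """Group coded records whose key fields match after light normalization.

    Returns only the groups with more than one record, as lists of record IDs.
    """
    groups = defaultdict(list)
    for rec in records:
        # Ignore case and stray spaces so coder inconsistencies do not split a group.
        key = tuple(str(rec.get(field, "")).strip().lower() for field in key_fields)
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical coded records from three different boxes.
records = [
    {"id": "P-0003", "box": 3, "doc_date": "1998-04-02",
     "author": "J. Smith", "title": "Supply Agreement"},
    {"id": "P-0141", "box": 14, "doc_date": "1998-04-02",
     "author": "j. smith ", "title": "Supply agreement"},
    {"id": "P-0190", "box": 19, "doc_date": "1998-05-11",
     "author": "A. Jones", "title": "Board Minutes"},
]
candidates = group_by_key_fields(records)
print(candidates)
```

Here the copies from box 3 and box 14 land in the same group despite the coders' differences in capitalization and spacing, while the board minutes stand alone.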

Another approach might be implemented even prior to bibliographic coding. If the paper documents have been OCR’d after scanning, then by applying near-duplicate detection technology it may be possible to group together those documents that have such a high percentage of similarity that they are likely duplicates, and again a senior reviewer can tag those that should not receive further attention either in the way of bibliographic coding or subjective review. This may save either a great deal or very little, depending on the quantity of duplicates in the collection. As near-duplicate detection software is usually charged on the full volume of documents it is required to “look at,” if it doesn’t come up with many near-duplicate hits, then the saving it has created may not be worth its cost. It would help to know in advance, from general familiarity with the paper collection, whether there are likely to be a large number of duplicate documents.
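For readers curious what "a high percentage of similarity" can mean in practice, here is one simple way to measure it, offered as a sketch rather than as a description of how any commercial product works. It breaks each document's OCR text into overlapping three-word "shingles" and compares the sets with Jaccard similarity. The 0.8 threshold, the function names and the sample texts are all assumptions, and comparing every pair of documents, as this does, would be far too slow for a large collection.

```python
import re
from itertools import combinations

def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def likely_duplicates(ocr_texts, threshold=0.8, k=3):
    """ocr_texts: dict of doc_id -> OCR text. Returns (id1, id2, score) tuples."""
    sets = {doc_id: shingles(text, k) for doc_id, text in ocr_texts.items()}
    pairs = []
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:   # threshold is an assumed cut-off, tune per collection
            pairs.append((id1, id2, round(score, 2)))
    return pairs

# Hypothetical OCR output: P-0141 is the same page as P-0003 with one OCR error.
ocr_texts = {
    "P-0003": "The parties agree that delivery of the goods shall occur "
              "no later than the first day of June at the warehouse",
    "P-0141": "The parties agree that delivery of the goods shall occur "
              "no later than the first day of June at the warehou5e",
    "P-0190": "Minutes of the board meeting held in Scottsdale",
}
pairs = likely_duplicates(ocr_texts)
print(pairs)
```

The two scans of the same page are flagged as likely duplicates despite the OCR error, and the unrelated board minutes are not. Whether a flagged pair really is a duplicate is still a call for the senior reviewer.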

Active in litigation support and e-discovery since the late 1980s, Cliff Shnier is an attorney and electronic discovery consultant who divides his time between his base in Scottsdale, Arizona and Toronto, Ontario. E-mail him at