At some point in many cases, lawyers go from not knowing enough about their cases to being presented with too much information. The cause of this overload is usually the electronic dissemination and replication of information. For example, a worker may send an e-mail to six co-workers, all of whom later have to produce their e-mails in response to discovery requests. Exact copies of that one e-mail, known as duplicates or dupes, may appear in seven different mailboxes and in numerous backup systems. Further, those copies may be forwarded, replied to, or copied and pasted into other e-mails. The resulting e-mails are related but not identical to the original and are sometimes called near dupes. The problem is compounded when the e-mails are printed out.

Without a method to remove true duplicates and to group near dupes or threads of an e-mail conversation for analysis, lawyers and paralegals working on a project are left with an overwhelming feeling that they’ve looked at the same document or e-mail already, and they probably have! Not having a system for dealing with dupes or near dupes wastes tremendous amounts of effort, because the same decisions are made repeatedly for different versions of the same document. It also leads to inconsistent decisions. Further, decisions are made in the face of incomplete facts. For instance, privilege claims are made without knowing all of the people who received copies of a document.

The emergence of systems that deal almost exclusively with electronic discovery can exacerbate the problem by leaving parties without an integrated method of dealing with their paper discovery. While electronic discovery is de-duped and decisions are made for electronic documents, decisions regarding scanned paper documents are made in separate systems without ready reference to the decisions pertaining to the corresponding electronic versions.

A true duplicate of an electronic file is an exact copy of another file. It has exactly the same data.
The only thing differentiating the two duplicate files is that they are found in two different locations. Rather than perform a bit-by-bit comparison of every electronic file against all other files, electronic discovery systems calculate hash values, a form of electronic fingerprint, to pinpoint duplicates. Any difference between two files, however slight, results in different hash values, so files with matching values can be treated as duplicates.

The most common form of de-duping is vertical and involves removing duplicates from within a single custodian’s records. A custodian is usually one person, not an organization. For example, if Ted sends Alice and Bob an e-mail and all three regularly back up their systems at different frequencies, the e-mail may exist in several locations. Vertical de-duping would mean that only one copy would be produced for each of the three custodians, regardless of the number of backups.

With horizontal de-duping, the files in the entire production are evaluated using their hash values and possibly other criteria such as file names and dates. In the above example, instead of producing three copies (one from each custodian), only one would be produced, a considerable saving in both printing and attorney review time. With this approach, there needs to be a procedure so that an indication of the source of each copy of the de-duped record is carried into the final database. Most electronic evidence vendors can perform horizontal de-duping and provide the firm with a file noting the locations of the duplicate files. Then, when the de-duped e-mails or documents are represented in the database, the people reviewing the documents can determine where they were located in the initial production. Otherwise there will be situations where, for instance, one of the intended recipients of an e-mail did not actually receive it, and without the information on sources this would not be apparent in the database.
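The difference between vertical and horizontal de-duping can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not a description of how any vendor's system works: the article does not name a hash algorithm, so SHA-256 is assumed here, and the record layout, the sample data, and the function names `fingerprint`, `vertical_dedupe`, and `horizontal_dedupe` are all hypothetical.

```python
import hashlib
from collections import defaultdict

def fingerprint(data: bytes) -> str:
    """Return a hash value for the file contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical records: (custodian, location, file contents).
records = [
    ("Ted", "sent-items", b"Lunch at noon?"),
    ("Ted", "backup-01", b"Lunch at noon?"),
    ("Alice", "inbox", b"Lunch at noon?"),
    ("Alice", "backup-07", b"Lunch at noon?"),
    ("Bob", "inbox", b"Lunch at noon?"),
    ("Bob", "inbox", b"Lunch at noon? Sure."),  # a reply: different data
]

def vertical_dedupe(records):
    """Keep one copy of each distinct file per custodian."""
    seen = set()
    kept = []
    for custodian, location, data in records:
        key = (custodian, fingerprint(data))
        if key not in seen:
            seen.add(key)
            kept.append((custodian, location, data))
    return kept

def horizontal_dedupe(records):
    """Keep one copy of each distinct file across the whole production,
    and record every place where a copy was found."""
    content = {}
    sources = defaultdict(list)
    for custodian, location, data in records:
        digest = fingerprint(data)
        content.setdefault(digest, data)
        sources[digest].append((custodian, location))
    return content, dict(sources)

per_custodian = vertical_dedupe(records)
unique_files, found_in = horizontal_dedupe(records)
```

In this sketch the `found_in` mapping plays the role of the vendor's file of duplicate locations: a reviewer looking at the single retained copy can still see which custodians held it.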
An e-mail thread is the collection of replies, forwards, and blind carbon copies (BCCs) that results from a single initiating e-mail. The individual e-mails in a thread are sometimes called near dupes because, while they were all triggered by and contain the text of the same originating e-mail, the content added in the course of replying or forwarding causes them to be somewhat different. Because the content of the various replies, forwards, and BCCs is different, hash values cannot also be used to identify e-mails that are near dupes of an originating e-mail. However, it is useful to be able to quickly pull all the related threads together, in both paper and electronic copies.

Linguistic pattern matching can be a tremendously useful tool for identifying near dupes and e-mail threads. Without having to formulate complex searches, users can quickly find records that are linguistically like the e-mail being reviewed and rapidly locate other messages containing the same text. In fact, linguistic pattern matching can even identify text copied from one e-mail and pasted into another e-mail or document.

Of course, with all the emphasis being placed on e-discovery, it can be easy to lose sight of the fact that paper-based records still form an integral part of the discovery process. Absent agreement of the parties or a special pretrial order, parties are obligated to produce both paper and electronic copies of responsive records. A party producing or receiving mixed productions of paper and electronic records will want to be able to quickly locate paper and electronic versions of the same records. Because the text associated with scanned paper records is produced by optical character recognition (OCR), which introduces intermittent errors, hash values cannot be used to identify paper duplicates of electronic records.
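A small sketch can show why hash values fail for near dupes and OCR text while a similarity score still works. The article does not describe how linguistic pattern matching is implemented, so the code below swaps in a much simpler stand-in, Jaccard similarity over character trigrams, purely for illustration. The sample texts and the names `char_ngrams` and `similarity` are hypothetical.

```python
import hashlib

def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams in the text,
    after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' trigram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

original = "Please review the attached contract before Friday's meeting."

# Hypothetical candidates: an OCR'd printout with two recognition errors,
# a reply that quotes the original, and an unrelated message.
candidates = {
    "ocr_scan": "Please rev1ew the attached contract before Fnday's meeting.",
    "reply": "Will do. > Please review the attached contract before Friday's meeting.",
    "unrelated": "The cafeteria will be closed next Tuesday for repairs.",
}

# The hash values differ even though the scan is "the same" document.
same_hash = (
    hashlib.sha256(original.encode()).hexdigest()
    == hashlib.sha256(candidates["ocr_scan"].encode()).hexdigest()
)

# Rank candidates by how closely they match the comparison text.
ranked = sorted(
    candidates,
    key=lambda name: similarity(original, candidates[name]),
    reverse=True,
)
```

Here the only "query" is the comparison text itself, and the candidates come back ordered by closeness, which is the behavior attributed to linguistic pattern matching; the scoring method shown is an assumption and is far cruder than a production tool.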
Traditional search methods such as full-text searching are also apt to be an unwieldy solution for the identification of near dupes and paper duplicates. Intermittent OCR errors in the paper documents typically require users to formulate searches using the expansive "or" connector instead of the more restrictive "and," "near," or "adj" connectors. Even with no OCR errors, full-text searching typically returns many more documents than the user can realistically examine. Also, the results are not sorted according to how closely they match the desired text.

By contrast, linguistic pattern matching offers three advantages: searches are formulated by just specifying the comparison text, searches are not affected by intermittent OCR errors, and search results are ranked according to how closely they match the comparison text, letting users focus on the most relevant results.

Properly handling electronic and paper duplicates and near dupes presents a challenge to parties engaged in large discovery. However, effective use of de-duping and linguistic pattern technology can greatly reduce the costs and burdens associated with such discovery.

Joseph Howie is director of client services for Syngence, LLC, provider of electronic discovery services and linguistic pattern searching technology.

Copyright © 2021 ALM Media Properties, LLC. All Rights Reserved.