The task of collecting, identifying and properly using key documents is the foundation of a successful internal investigation. Yet, this task continuously grows more complicated and costly given the swelling volume of data generated in the ordinary course of business. The pressure on counsel to reduce costs is ever-present, but may be particularly acute in the context of internal investigations, where there is likely no financial upside at an investigation’s conclusion. Fortunately, emerging technologies in the field of electronic discovery now enable lawyers to more quickly and accurately identify the most important documents using a comprehensive and defensible process that is substantially less expensive and potentially more effective than commonly used alternatives.

One technology that is growing in acceptance—and may be particularly well-suited to internal investigations—is predictive coding. Predictive coding is one tool that falls under the umbrella of “technology-assisted review.” Predictive coding combines human guidance with computer-piloted concept searching in order to “train” document review software to recognize relevant documents within a document universe.

Concept searching is fundamentally different from traditional keyword searching. Keyword searching involves associating documents based on specified words or phrases. When performing a traditional keyword search, software receives criteria from a user and returns all documents satisfying those exact criteria. Its ability to pinpoint conceptually similar documents is strictly limited by whether the specified keywords are present in those documents. Conceptually similar documents that do not include the exact keywords will not be detected.

A keyword search does not distinguish between homonyms, nor does it accord greater weight to conceptually significant vocabulary than to insignificant, irrelevant words. Therefore, relying solely on keyword searches to cull a document universe almost always yields many non-responsive documents and often misses highly relevant documents because the search criteria did not include the exact phraseology contained in the documents.
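
The limits of exact keyword matching can be seen in a small illustration. The sketch below is hypothetical: the three memos, the keyword list and the function name `keyword_search` are invented for this example and do not represent any particular review platform.

```python
import re

def keyword_search(documents, keywords):
    """Return the ids of documents containing at least one exact keyword."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for doc_id, text in documents.items():
        # Split the text into lowercase word tokens.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & wanted:
            hits.append(doc_id)
    return hits

# Hypothetical documents, invented for illustration only.
documents = {
    "memo_1": "The bribe was paid to the customs official in cash.",
    "memo_2": "Please deliver the usual gift to our friend at the port.",
    "memo_3": "The museum exhibit covers the history of the bribe in Roman law.",
}

hits = keyword_search(documents, ["bribe", "kickback"])
print(hits)
```

Here the search returns memo_1 and memo_3. It captures memo_3, which merely mentions the word in an unrelated context, and misses memo_2, which may describe the same conduct in different language.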

In contrast, a concept search is a more sophisticated method of searching that does not require the user to identify all potentially relevant keywords from the outset. It is also simpler to execute from the user’s standpoint. Concept searching is designed to find and categorize documents based on concept similarity, not merely on the words they contain. Concept-based searches can identify relevant documents even when individuals use secretive code words to disguise malfeasance, including code words that are unknown to the review team at the inception of the investigation. Concept searches can even identify language features associated with certain emotions, allowing the review team to direct its attention to documents more likely to contain the kind of inflamed rhetoric that is often associated with “hot” documents. Several studies have shown that predictive coding is more accurate than traditional linear review or use of search terms.

Executing an internal investigation using predictive coding begins with generation of a randomly selected, statistically significant seed set of documents. A small number of attorneys, who must be both knowledgeable about the issues in the investigation and experts in using the review software, review and code the seed set of documents, classifying each document as either relevant or irrelevant to the investigation.

The search software then analyzes the documents determined to be relevant and searches for similar documents throughout the document universe. To do so, the software looks at broad patterns of language to determine what the relevant documents have in common conceptually. The software then creates conceptual profiles of both relevant and irrelevant documents, applies these profiles to the rest of the documents in the universe and designates the remaining documents as either presumptively relevant or irrelevant. 
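
The profile-and-designate step described above can be sketched in a few lines. This is a deliberately simplified stand-in, assuming plain word counts and cosine similarity. Commercial review software uses far richer language models, and the seed documents and every function name here are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a document into a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def build_profile(texts):
    """Sum the word counts of all seed documents in one category."""
    profile = Counter()
    for text in texts:
        profile.update(vectorize(text))
    return profile

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def designate(text, relevant_profile, irrelevant_profile):
    """Label a document by whichever profile it resembles more closely."""
    vec = vectorize(text)
    if cosine(vec, relevant_profile) > cosine(vec, irrelevant_profile):
        return "relevant"
    return "irrelevant"

# Hypothetical seed set, already coded by the reviewing attorneys.
relevant_profile = build_profile([
    "wire the payment to the consultant before the tender closes",
    "the consultant wants his payment in cash before the tender",
])
irrelevant_profile = build_profile([
    "lunch menu for the office party on friday",
    "reminder the office parking garage closes friday",
])

first = designate("cash payment for the consultant is ready",
                  relevant_profile, irrelevant_profile)
second = designate("friday party lunch is cancelled",
                   relevant_profile, irrelevant_profile)
print(first, second)
```

In this toy example, the first unreviewed document is designated presumptively relevant and the second presumptively irrelevant, although neither matches any seed document word for word.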

Once this process is complete, the results are validated. If necessary, the team can “retrain” the software for more precise results and then re-run it on the same universe of documents. The rate at which human reviewers overturn the software’s decisions, known as the “overturn rate,” indicates how well the first manually reviewed seed set trained the software.
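
As a simple illustration of the arithmetic, the overturn rate is the share of validated documents on which the reviewer disagreed with the software. The document ids and designations below are hypothetical, as is the function name.

```python
def overturn_rate(software_calls, reviewer_calls):
    """Fraction of validated documents where the reviewer reversed the software."""
    if not software_calls:
        raise ValueError("no documents were validated")
    overturned = sum(
        1 for doc_id, call in software_calls.items()
        if reviewer_calls[doc_id] != call
    )
    return overturned / len(software_calls)

# Hypothetical validation sample: the software's designation for each document.
software_calls = {
    "doc_1": "relevant",
    "doc_2": "irrelevant",
    "doc_3": "relevant",
    "doc_4": "irrelevant",
    "doc_5": "irrelevant",
}
# The reviewer agrees on every document except doc_4.
reviewer_calls = dict(software_calls, doc_4="relevant")

rate = overturn_rate(software_calls, reviewer_calls)
print(f"Overturn rate: {rate:.0%}")  # prints "Overturn rate: 20%"
```

If the rate is higher than the team is comfortable with, the corrected designations can be added to the training set before the software is re-run.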

Predictive coding is gaining increased acceptance in the litigation context as an aid to review, although litigants and courts have yet to fully embrace it. Litigants and judges have in some cases expressed doubts about the defensibility of relying on software to judge the significance of documents. Also, while parties may agree in concept to use this technology, they may have difficulty agreeing how to apply it in a particular case. Parties may not always trust the process implemented by their opponents, and therefore may be reluctant to agree to its use. Moreover, many attorneys remain uncomfortable with the idea of producing documents without having first reviewed them for relevance, privilege and confidentiality. Thus, in a litigation context, predictive coding is—at present—often used as an aid to, rather than a substitute for, traditional linear review.

In contrast, the use of predictive coding in the context of internal investigations often does not raise the same concerns that arise in the context of litigation. In many internal investigations, there is no need to negotiate with or satisfy outside parties regarding the technology’s reliability. Additionally, as internal investigations typically do not require document production, there are no concerns about producing privileged material.

Although predictive coding has been shown to be more accurate and efficient than a traditional linear or keyword search review, concerns regarding its use persist. People are often hesitant to trust the technology to identify the documents they seek, fearing that it will miss important documents that a more traditional process would have captured. They are also hesitant to add up-front technology costs to a project when they are uncertain of the results and concerned that they will still need to review a large number of documents manually.

However, one major benefit of predictive coding is that it can significantly reduce overall cost, because it reduces the number of documents that need to be individually reviewed in order to achieve results comparable to those achieved with traditional techniques. While there is typically some additional up-front cost associated with using the technology, considerable savings result from the decreased need for reviewer time. Furthermore, the mistrust of concept searching and predictive coding can be alleviated by performing quality control checks of statistically significant samples to ensure that the software is yielding the anticipated results.
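
One way such a quality control check might be sized and read is sketched below. It assumes a simple random sample and the standard normal approximation for a proportion, which becomes rough when very few errors are found. The function names and figures are hypothetical, not a prescribed protocol.

```python
import math

def sample_size(margin_of_error, z=1.96, expected_rate=0.5):
    """Documents to sample for a given margin of error.

    z=1.96 corresponds to roughly 95% confidence; expected_rate=0.5 is the
    most conservative assumption when the true error rate is unknown.
    """
    return math.ceil(
        z ** 2 * expected_rate * (1 - expected_rate) / margin_of_error ** 2
    )

def estimate_rate(errors_found, sample_n, z=1.96):
    """Observed error rate in the sample and its approximate margin of error."""
    p = errors_found / sample_n
    margin = z * math.sqrt(p * (1 - p) / sample_n)
    return p, margin

# How many documents to sample for a margin of plus or minus 5 points.
n = sample_size(0.05)
print(n)  # prints 385

# Hypothetical result: reviewers sample 400 documents the software designated
# irrelevant and find that 8 of them are actually relevant.
p, margin = estimate_rate(8, 400)
print(f"Estimated miss rate: {p:.1%} plus or minus {margin:.1%}")
```

In the hypothetical result, about 2 percent of the documents designated irrelevant would be expected to be relevant, give or take roughly 1.4 percentage points. The team can weigh that figure against the cost of reviewing those documents manually.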

Although lawyers may initially be reluctant to embrace this new technology, real-world experience has shown that proper use of predictive coding significantly lowers overall costs and increases accuracy. Moreover, it can be a particularly powerful tool in the context of internal investigations, where a high value is placed on swiftly and efficiently identifying only the most significant documents.