This is the first in a series of articles exploring the application of data analytics to eDiscovery. Webster’s Dictionary defines data analytics as “a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.”  Data analysis has many facets and approaches, encompassing diverse techniques under a variety of names across business, science, and the social sciences, including eDiscovery.

Given that the volume of Electronically Stored Information (ESI) is doubling every two years[i], data analysis is the primary means by which legal professionals can streamline the discovery process to make it affordable, reasonable, and proportional for all parties involved.

Data Analytics for eDiscovery can be divided into three groups: Structured, Conceptual, and Predictive.

First, there is Structured Analytics, in which the structure of the data is evaluated for better organization and sorting.  Examples of structured analytics for eDiscovery include Email Thread detection, Duplicate and Near-Duplicate detection, Language Identification, and Repeated Content detection. Structured analytics deals with organization, textual similarity, and other attributes beyond the conceptual content of the data itself. It is normally based on syntactic approaches that use character associations as the foundation for the analysis.

Conceptual Analytics focuses on the conceptual content of the data.  Approaches such as Clustering, Categorization, Conceptual Search, Keyword Expansion, Themes & Ideas, and Intelligent Folders depend on technology that builds, and then applies, a conceptual index of the data for analysis.  Conceptual analytics is based on semantic approaches, which explore the meaning of the text contained within the data.

Predictive Analytics (a.k.a. “Predictive Coding” or “Technology Assisted Review”) is a workflow in which a human subject-matter expert reviews a small subset of documents to train the system on what the human is looking for, until the system can statistically “predict” how the human would code the rest of the collection.  Once training is complete, the legal team can make informed decisions on how best to approach the collection for review and determine the total cost implications.

The focus of this article series will be on the third group of data analysis: Predictive Analytics.

Predictive analytics can significantly lower costs, dramatically reduce review time, and substantially increase quality in document review. It has proven effective to the point that the judiciary is suggesting (or ordering) that counsel consider Predictive Analytics in their eDiscovery protocols.  Furthermore, the Department of Justice (DOJ) Antitrust Division recently stated that it prefers merging parties use predictive analytics for Hart-Scott-Rodino (HSR) Second Requests.

How We Got Here

Technology has greatly affected the efficiency of document review over the years. Keyword searching — in which a pre-determined set of words is run against all of the documents, and only those documents that “hit” on one or more keywords are reviewed — was a breakthrough technique. And yet, while studies (including the Blair and Maron study) conclude that keyword searching recalls only 20 to 40 percent of relevant documents, keyword search remains the most common approach used today for reducing the number of documents to be reviewed.
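To make the recall figure concrete, here is a minimal sketch of how keyword-search recall is measured. The documents, relevance labels, and search term are all invented for illustration; real measurements come from comparing search hits against a full human review.

```python
# Toy illustration of keyword-search recall.
# Documents, labels, and keywords are hypothetical examples.
docs = {
    1: "merger agreement signed by both parties",
    2: "lunch plans for friday",
    3: "draft of the acquisition term sheet",
    4: "notes on the merger negotiation call",
    5: "quarterly sales figures",
}
relevant = {1, 3, 4}      # ground truth: what a full review would find
keywords = {"merger"}     # the agreed-upon search term

# A document "hits" if it contains any keyword.
hits = {doc_id for doc_id, text in docs.items()
        if any(kw in text.lower() for kw in keywords)}

# Recall = relevant documents retrieved / all relevant documents.
recall = len(hits & relevant) / len(relevant)
print(f"recall = {recall:.0%}")  # doc 3 says "acquisition", so it is missed
```

The miss on document 3 shows why keyword searching underperforms: a relevant document that happens not to use the chosen vocabulary is invisible to the search.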

Keyword search was advanced through methods such as Boolean searches, which allow a set of “but not” terms to disambiguate over-inclusive keywords. For example, a search string might state, “Include (keyword1, keyword2, keyword3, etc.), but not (excludeword1, excludeword2, excludeword3, etc.).” These combinations (or “ontologies”) often included proximity limiters (e.g., Abraham w/2 Lincoln).

Studies have shown that ontologies can improve retrieval recall to a range of 65 to 80 percent. However, given the syntax rules that govern ontologies, counsel had to engage highly skilled linguists to develop productive combinations. This was expensive and time consuming. The application of predictive analytics changes all that by automating this labor-intensive approach.
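The include/exclude/proximity mechanics described above can be sketched in a few lines. This is a simplified illustration, not any vendor's search engine: the function name, parameters, and word-distance interpretation of `w/2` are assumptions made for the example.

```python
import re

def matches(text, include, exclude=(), proximity=None):
    """Hypothetical Boolean search: True if the text contains any
    include term, no exclude term, and (optionally) satisfies a
    (word_a, word_b, max_distance) proximity limiter."""
    words = re.findall(r"\w+", text.lower())
    if exclude and any(term in words for term in exclude):
        return False  # a "but not" term disqualifies the document
    if not any(term in words for term in include):
        return False  # no keyword hit at all
    if proximity:
        a, b, n = proximity
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        # within n words of each other, in either order
        return any(abs(i - j) <= n for i in pos_a for j in pos_b)
    return True

# "Abraham w/2 Lincoln" style query:
matches("Abraham Lincoln gave a speech",
        include=["abraham"], proximity=("abraham", "lincoln", 2))
```

Even this toy version hints at why productive ontologies required skilled linguists: every exclusion term and proximity window had to be chosen by hand, per matter, per custodian vocabulary.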

What is Predictive Analytics?

Predictive Analytics encompasses a variety of techniques from the fields of statistics, data mining and game theory that analyze current and historical facts to make predictions about future events. In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among and between many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision-making for candidate transactions.

Predictive Analytics is used in fields as varied as actuarial science, financial services, insurance, telecommunications, retail, travel, healthcare, and pharmaceuticals.  One of the best-known applications is credit scoring. Scoring models process a customer’s credit history, loan application, customer data, and so on to rank individuals by their likelihood of making future credit payments on time. Predictive analytics is now being applied to the legal discovery process — especially in the area of document review — via Predictive Coding (PC) and Technology Assisted Review (TAR).

How Does Predictive Analytics Work?

Predictive Analytics combines statistics with the efficiencies of a computerized sampling system and a human “expert.” The human interacts with the system by making “yes/no” calls on a question against a series of controlled samples. Questions such as “Is this document responsive?” or “Does it pertain to this specific issue?” or “Is this produced document important for my side of the case?” are asked. The system builds a classification model in the background as it learns from the expert and presents subsequent samples. Eventually the model reaches the point where it can “predict” what the human will mark as “affirmative” in the sample under review. Once it predicts accurately over a series of consecutive samples, the system is considered “statistically stable,” and scores can be applied to the rest of the collection.  Those scores are then used in a variety of subsequent workflows, such as Early Case/Data Assessment, Strategic Review Stratification, Accelerated Document Review, and Review QC/Verification.
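The sample-train-stabilize loop described above can be sketched as follows. This is a toy illustration, not a real predictive-coding engine: the “expert” is simulated by a labeling function, a crude word-overlap model stands in for a proper classification algorithm, and the corpus, sample size, and stability threshold are all invented for the example.

```python
import random

# Hypothetical corpus: every third document concerns a merger (responsive).
random.seed(0)
corpus = [f"doc about {'merger deal' if i % 3 == 0 else 'office party'} {i}"
          for i in range(300)]
expert_label = lambda d: "merger" in d   # simulated human "yes/no" call

responsive_words, nonresponsive_words = set(), set()

def predict(doc):
    """Toy stand-in for a classifier: lean toward 'responsive' when the
    document shares more vocabulary with responsive training examples."""
    words = set(doc.split())
    r = len(words & responsive_words)
    n = len(words & nonresponsive_words)
    return r > 0 and r >= n

pool, stable_rounds = corpus[:], 0
while stable_rounds < 3 and pool:                 # 3 stable rounds in a row
    sample = [pool.pop(random.randrange(len(pool)))
              for _ in range(min(20, len(pool)))]
    # Check agreement BEFORE training on this sample (the "prediction" test).
    agreement = sum(predict(d) == expert_label(d) for d in sample) / len(sample)
    for d in sample:                              # expert codes; model learns
        (responsive_words if expert_label(d) else nonresponsive_words).update(d.split())
    stable_rounds = stable_rounds + 1 if agreement >= 0.95 else 0

# Once stable, apply scores to the entire collection.
scores = [predict(d) for d in corpus]
```

The key structural point the sketch preserves is that agreement is measured on each fresh sample before the model trains on it, so “statistical stability” reflects genuine predictive accuracy rather than memorization.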

Both Predictive Coding and Technology Assisted Review solutions use a form of predictive analytics in their systems.  There are differences between the two, but the ultimate goal is the same: prioritization.  Training the system on what you want it to learn enables workflows to access the most important documents first, as well as allowing the least important documents to be set aside to be sampled for verification.


Predictive Analytics as used in today’s Predictive Coding and Technology Assisted Review solutions goes far beyond keyword searching. It is a powerful tool that uses well-established classification algorithms rather than just discrete keyword searches. Unlike keyword searches, Predictive Analytics takes into account all the words in a document, as well as words to exclude, along with the relationship of the words to one another, to determine what is and what is not likely to be relevant. Predictive Analytics incorporates human intelligence to leverage the results of review across large document populations. It can be used in a variety of workflows at several points along the EDRM lifecycle, including Early Case/Data Assessment, Strategic Review Stratification, Accelerated Document Review, and Review QC/Verification.

In future articles, we will take a deeper look at the differences between Predictive Coding and Technology Assisted Review.

[i] International Data Corporation, “The Digital Universe in 2020: Big Data, Bigger Data Shadows, and Biggest Growth in the Far East,” December 2012.