E-discovery review: The same old way is no longer good enough

The review of potentially relevant content is by far the most expensive part of the discovery process, consuming on average 71 cents of every dollar spent. This cost is driven mainly by the fact that every potentially relevant document has to be manually reviewed by a human so that a decision can be made on whether it is responsive to the case and needs to be turned over, or is privileged. This traditional review process is known as linear review. The volumes involved are large: the average discovery can include three gigabytes of files per custodian, and a gigabyte can contain anywhere between 7,000 and 100,000 pages of documents. To complicate matters, human reviewers can usually read and make a decision on between 45 and 100 pages per hour, so a pre-culled set of documents totaling 10 GB could take a single reviewer, on the most optimistic assumptions (7,000 pages per GB at 100 pages per hour), 700 hours, or approximately 87.5 eight-hour days.
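The review-time estimate above is simple enough to sketch in a few lines. The following is a minimal illustration, not a costing tool: the function name and the 8-hour working day are assumptions chosen to match the 700-hour and 87.5-day figures, and the "worst case" simply combines the other ends of the two ranges quoted in the text.

```python
HOURS_PER_DAY = 8  # assumption: 8-hour reviewer days (700 hours / 87.5 days)


def linear_review_hours(gigabytes, pages_per_gb, pages_per_hour):
    """Hours for a single reviewer to read every page of a pre-culled set."""
    return gigabytes * pages_per_gb / pages_per_hour


# Most optimistic case from the text: fewest pages per GB, fastest reviewer.
best_hours = linear_review_hours(10, 7_000, 100)
best_days = best_hours / HOURS_PER_DAY

# Opposite extreme: most pages per GB, slowest reviewer.
worst_hours = linear_review_hours(10, 100_000, 45)

print(f"best case:  {best_hours:,.0f} hours ({best_days:,.1f} days)")
print(f"worst case: {worst_hours:,.0f} hours")
```

The gap between the two extremes shows how sensitive any linear review estimate is to the pages-per-gigabyte assumption.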

In the last several years, a new, more accurate and much less costly e-discovery review process has been used successfully in many court cases, including several high-profile ones. Predictive coding, a next-generation document review and analysis technology, has the ability to dramatically reduce document review time, measurably lower document review costs and raise accuracy well beyond traditional linear review techniques.

Predictive coding provides a computer-generated judgment, with an explicit confidence score, about the relevance of each document after several iterations of human-managed machine learning. This capability allows counsel to dramatically expedite the actual document review process by automatically determining the relevance of documents, then prioritizing and tagging each as responsive, non-responsive, or privileged. The end result is that human reviewers actually review a much smaller percentage of the corpus.

To calculate predictive coding cost savings and ROI you first need to understand how traditional legal review is currently conducted and its cost.

Comparing traditional linear review to predictive coding 

Traditional linear review relies on creating a “reviewable results set” of potentially responsive documents by running keyword searches across the enterprise data repositories to compile massive sets of documents that must then be read by legally trained professionals to determine relevancy and privilege. As you can imagine, this can be extremely slow, costly and prone to risk. For example, consider an average e-discovery review situation, where 679,349 documents were returned from a keyword search originating from a larger data set of 2,072,282 documents, meaning about 33 percent of the documents remained after culling.

Calculating predictive coding cost savings

In a traditional linear review, all 679,349 post-culled documents would have to be read by humans to determine responsiveness. Figuring an attorney can review and make a decision on 45 to 100 documents per hour, and taking the low end of 45, the number of attorney hours needed to review all 679,349 documents would be approximately 15,096 hours. At a cost of $50 per attorney hour, the total cost of this review would be $754,800.

Using a predictive coding solution for the same example produces an immediate reduction in documents to be reviewed from 679,349 down to 175,018. Predictive coding further reduces the number of documents to be actually read to a small percentage of those 175,018 documents, bringing the total cost of review to $194,464. This reduction in documents to be reviewed produces a savings of $560,336 ($754,800 - $194,464), or about 74 percent.
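The linear and predictive coding figures in this example can be checked with a short sketch. The function and variable names are illustrative assumptions. It also assumes the same 45 documents per hour and $50 per hour apply to both reviews, which is what reproduces the quoted costs; the article appears to round hours down before pricing the linear review, so the unrounded totals below differ from it by a few dollars.

```python
def review_cost(documents, docs_per_hour, hourly_rate):
    """Return (hours, cost) for reading every document once."""
    hours = documents / docs_per_hour
    return hours, hours * hourly_rate


DOCS_PER_HOUR = 45  # low end of the 45-100 range; reproduces ~15,096 hours
HOURLY_RATE = 50    # dollars per attorney hour, from the text

linear_hours, linear_cost = review_cost(679_349, DOCS_PER_HOUR, HOURLY_RATE)
coded_hours, coded_cost = review_cost(175_018, DOCS_PER_HOUR, HOURLY_RATE)

savings = linear_cost - coded_cost
savings_fraction = savings / linear_cost

print(f"linear review:     {linear_hours:,.0f} hours, ${linear_cost:,.0f}")
print(f"predictive coding: {coded_hours:,.0f} hours, ${coded_cost:,.0f}")
print(f"savings:           ${savings:,.0f} ({savings_fraction:.0%})")
```

Swapping in a different review rate or billing rate changes the dollar amounts but not the savings percentage, which depends only on the ratio of the two document counts.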


Another benefit of using predictive coding for document review is the amount of actual review time you can save. The figure below represents three different potential case sizes: small, medium and large. Looking at the relatively small “IP Case”, we see a cost savings of $100,500 and a time savings of 2,010 hours, or 251 days, based on a single reviewer. The medium-sized “Second Request Case” produced a cost savings of $432,300 and 4.25 “single reviewer” years. Lastly, the “Large Tort Case” returned a cost savings of almost $2 million and a time savings of 39,592 hours, or 19.63 single reviewer years.


(Calculations include a 55 documents-per-hour review rate, an hourly reviewer rate of $50 per hour and a predictive coding results set review reduction of 70 percent)

Calculating return on investment (ROI) for predictive coding


ROIs can be calculated by individual case, by annual discovery cost or over longer periods of time. In this example we will calculate the predictive coding ROI over a one-year period:


To collect the data points mentioned above, we need to determine the average number of e-discovery requests the organization acts upon per year, the average number of custodians per case, the total data collected in GB, the estimated cull rate as a percentage, the average number of documents per GB, the estimated number of documents a legal reviewer can review per hour, and finally the average hourly rate legal reviewers will charge. With these basic data points we can begin to estimate the cost of a traditional linear review.


Start with the total number of documents after culling: 400 GB × 40% = 160 GB, and 160 GB × 10,000 documents per GB = 1,600,000 documents. Dividing that by the number of documents a legal reviewer can work through per hour (50) produces 32,000 hours to review 1,600,000 documents. To calculate the cost per discovery event, we simply multiply 32,000 hours by the rate of $55/hour to get a cost of $1,760,000. Finally, to calculate the cost of discovery annually, we multiply the number of discoveries per year by the average discovery cost per event (3 × $1,760,000) to arrive at $5,280,000 annually.
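As a rough sketch, the annual linear review estimate can be expressed as one function of the data points listed earlier. The function and parameter names are assumptions made for illustration; the input values are the ones used in the paragraph above.

```python
def linear_review_estimate(total_gb, percent_kept, docs_per_gb,
                           docs_per_hour, hourly_rate, cases_per_year):
    """Estimate linear review volume and cost from basic discovery inputs."""
    docs = total_gb * percent_kept / 100 * docs_per_gb  # documents after culling
    hours = docs / docs_per_hour                        # reviewer hours per case
    cost_per_case = hours * hourly_rate
    return {
        "docs_after_culling": docs,
        "hours_per_case": hours,
        "cost_per_case": cost_per_case,
        "annual_cost": cost_per_case * cases_per_year,
    }


estimate = linear_review_estimate(
    total_gb=400, percent_kept=40, docs_per_gb=10_000,
    docs_per_hour=50, hourly_rate=55, cases_per_year=3,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

Keeping every input as a named parameter makes it easy to substitute an organization's own averages for the example values.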

The next step in calculating predictive coding ROI is to estimate the cost of discovery after the predictive coding solution is adopted.


Starting with the total documents to review after culling of 1,600,000, we apply a standard predictive coding seed set review and sampling percentage of 30%, meaning human reviewers will only have to read 30% of the 1,600,000 documents, or 480,000 (remember, predictive coding automates much of the review process). Then, using the “Documents/Hour Review Rate” and “Hourly Billing Rate per Reviewer” from Table 1, we can calculate the cost per discovery event after predictive coding: (480,000 / 50) × $55 = $528,000 per discovery event, or $1,584,000 annually.
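The same arithmetic after predictive coding differs only in the share of documents that people actually read. A minimal sketch follows, with hypothetical function and parameter names; the 30% review share, 50 documents per hour, $55 per hour and three discoveries per year are the article's example values.

```python
def cost_after_predictive_coding(docs_after_culling, percent_reviewed,
                                 docs_per_hour, hourly_rate, cases_per_year):
    """Return (docs_read, cost_per_case, annual_cost) for the reduced review."""
    docs_read = docs_after_culling * percent_reviewed / 100  # seed set + sampling
    cost_per_case = docs_read / docs_per_hour * hourly_rate
    return docs_read, cost_per_case, cost_per_case * cases_per_year


docs_read, per_case, per_year = cost_after_predictive_coding(
    docs_after_culling=1_600_000, percent_reviewed=30,
    docs_per_hour=50, hourly_rate=55, cases_per_year=3,
)
print(f"documents read: {docs_read:,.0f}")
print(f"cost per case:  ${per_case:,.0f}")
print(f"annual cost:    ${per_year:,.0f}")
```

Because cost scales linearly with the review share, the 30% assumption is the single most important input to validate against real project data.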

The last step in calculating predictive coding ROI is to plug the calculated numbers into the ROI formula:

ROI = ([Cost of discovery - Cost of discovery after predictive coding solution] - Cost of predictive coding solution) / Cost of predictive coding solution, or ([$5,280,000 - $1,584,000] - $400,000) / $400,000 = 8.24, or 824%
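Written as code, the grouping in the ROI formula becomes explicit: the solution cost is subtracted from the savings before dividing. This is a small sketch with an assumed function name, using the article's example figures.

```python
def roi(cost_before, cost_after, solution_cost):
    """Return on investment: net gain divided by what the solution cost."""
    net_gain = (cost_before - cost_after) - solution_cost
    return net_gain / solution_cost


example_roi = roi(cost_before=5_280_000, cost_after=1_584_000,
                  solution_cost=400_000)
print(f"ROI: {example_roi:.2f} ({example_roi:.0%})")
```

Note that dividing only the solution cost by itself, as a strict left-to-right reading of the unbracketed formula would do, gives a meaningless result; the subtraction has to happen first.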


Looking at this specific predictive coding ROI, 824 percent is wildly positive, so this organization would be remiss not to adopt predictive coding technology to quickly reduce its cost of e-discovery review. In fact, taking these calculations a bit further, the breakeven time for this specific example is about 0.11 years (the total cost of the predictive coding solution divided by the annual cost savings, or $400,000 / $3,696,000). Multiplied by 12, that is about 1.3 months, or a little more than a month to pay for the predictive coding solution from the estimated savings.
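The breakeven estimate can be checked the same way. In this sketch (function name assumed), dividing the solution cost by the annual savings gives roughly 0.11 years, and converting to months gives about 1.3, which agrees with the "a little more than a month" reading.

```python
def payback_years(solution_cost, annual_savings):
    """Years of savings needed to cover the cost of the solution."""
    return solution_cost / annual_savings


annual_savings = 5_280_000 - 1_584_000  # linear cost minus predictive coding cost
years = payback_years(400_000, annual_savings)
months = years * 12

print(f"annual savings: ${annual_savings:,}")
print(f"payback: {years:.3f} years (about {months:.1f} months)")
```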


Return on investment is an often requested but little understood financial measure. Many equate cost savings with ROI, but cost savings are only part of the equation. ROI also takes into account the cost of the solution that produced the savings. ROI lets you compare returns from various investment opportunities to make the best investment decision.

Calculating the ROI of predictive coding is straightforward and will usually produce a largely positive ROI, due mostly to the reduction in documents legally trained professionals must actually review. Another reason to consider predictive coding for legal review is its increased accuracy and consistency. Increased accuracy gives you a more complete data set during early case assessment and greatly reduces your risk of an incomplete discovery response.