Years after Judge Andrew Peck declared it to be “black letter law” in Rio Tinto Plc v. Vale S.A., 2015 U.S. Dist. LEXIS 24996, 8 (S.D.N.Y. March 2, 2015), technology-assisted review has finally entered the mainstream as part of a growing suite of technology-driven e-discovery tools. It is taking a bit longer, however, for practitioners to fully recognize that document review over large data populations is an information retrieval (IR) task.

Why does this matter? For one thing, it is important to understand that the scientific disciplines now marshaled for an effective information retrieval effort (data science, linguistics, and statistics) are essential to the successful conduct of document review. Once document-by-document manual review was augmented—or in some cases supplanted—by technological tools, it became a different exercise altogether, one that relies upon knowledge and expertise to execute properly. This understanding has been affirmed in the recently published ISO standard on e-discovery (see ISO 27050-3), which puts it as follows:

An ESI review…is fundamentally an information retrieval exercise; an effective ESI review will therefore draw upon, as appropriate, the kinds of expertise that are brought to bear in information retrieval science. (emphasis added).

One of these IR competencies—expertise in sampling—is, in fact, at the heart of a successful document review effort. Although it may bring us back to specialized and sometimes arcane terminology that we may not have had occasion to think about since we passed Statistics 101 (e.g., mean, variance, standard deviation, confidence level, confidence interval), sampling, correctly applied, can provide decisive answers to questions that would otherwise cause us to lose sleep.
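To make those terms concrete, here is a minimal sketch, in Python and with purely illustrative numbers, of how a point estimate and 95 percent confidence interval for the prevalence of responsive documents might be computed from a simple random sample. It uses the familiar normal approximation; a real protocol might prefer an exact method for small samples or extreme proportions:

```python
import math

# Hypothetical example: a random sample of 2,000 documents is reviewed,
# and 300 are found responsive. The figures are illustrative only.
n = 2000            # sample size
k = 300             # responsive documents found in the sample
p_hat = k / n       # point estimate of prevalence (the sample mean)

# Standard error: square root of the estimated variance of the
# sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96            # z-score corresponding to a 95% confidence level
lower, upper = p_hat - z * se, p_hat + z * se
print(f"Prevalence estimate: {p_hat:.1%} "
      f"(95% CI: {lower:.1%} to {upper:.1%})")
```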

Applications of Statistical Sampling Outside Document Review

While sampling and statistical estimation can provide decisive empirical answers to questions related to document review (chiefly, but not only, through the provision of statistically sound estimates of how effective a process is), they can also provide actionable answers to data-related questions practitioners may have in other contexts. (See Hedin, B., D. Brassil & A. Jones, “On the Place of Measurement in E-Discovery” (2016), in J. R. Baron, R. C. Losey & M. D. Berman (eds.), Perspectives on Predictive Coding (pp. 375-416)). Large volumes of ESI have not only transformed the nature of the challenges posed by document review; they now present more general challenges to companies seeking efficient ways to extract the most useful information from their data. Such data dilemmas arise more and more frequently as organizations undertake data management, reduction, or automation efforts. Sampling can help.

Here are just a few real-world examples:

Case Study 1: Multiple Email Sources

Situation. In the face of impending litigation, Company A has collected and reviewed a large population of emails from its Exchange servers. The company subsequently realizes that its information archiving technology has created a data repository that could be a source of over 3 million additional in-scope emails. The company believes that emails in the second repository should be largely duplicative of the already-reviewed emails and so do not need to be reviewed. To decide whether to review the second archive, however, the company needs more than a hypothesis; it needs sound empirical guidance.

Solution. At first blush, simply deduplicating the population seems like a good solution. An effective approach requires more than simple deduplication, however: because the two sources of emails have been processed by different technologies, standard deduplication techniques will leave many genuine duplicates unrecognized. Instead, an exercise is designed that uses sampling to achieve statistically reliable results.

Starting from a large sample (approximately 120,000 emails) drawn from the unreviewed archive, the effort proceeds progressively through the following stages:

(1) automated exact deduplication (against the already-reviewed population), as sketched after this list;

(2) automated near-duplicate identification; and then

(3) manual review coupled with manual search, using a second, smaller sample (4,000 emails) drawn from emails not yet identified as duplicative.
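As an illustration of stage (1), the sketch below assumes, hypothetically, that each email's normalized text can be hashed and checked against hashes of the already-reviewed population. Production deduplication tools typically hash a defined set of metadata fields and attachments rather than raw text, so this is a simplified stand-in:

```python
import hashlib

def content_hash(email_text: str) -> str:
    """Hash normalized email text. The normalization here (lowercasing,
    whitespace collapsing) is a hypothetical simplification of what
    production deduplication tools do."""
    normalized = " ".join(email_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical stand-ins for the two email populations.
reviewed_emails = ["Meeting moved to 3pm.", "Q3 numbers attached."]
sampled_emails = ["Meeting moved to 3PM.", "New vendor contract draft."]

reviewed_hashes = {content_hash(e) for e in reviewed_emails}

# Sampled emails whose hashes are not found among the reviewed
# population survive to the next (near-duplicate) stage.
survivors = [e for e in sampled_emails
             if content_hash(e) not in reviewed_hashes]
print(survivors)  # ['New vendor contract draft.']
```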

The results of the exercise provide the company with a statistically sound estimate of the prevalence of non-duplicative responsive material in the second repository. The study’s finding—that, at a high level of statistical confidence, the upper bound on the percentage of non-duplicative responsive documents in the second archive is extremely low (a fraction of a percent)—provides Company A with the empirical grounds it needs to exclude the second archive from further review, saving untold hours and significant expense.
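The kind of upper bound reported here can be computed with a one-sided exact binomial (Clopper-Pearson) interval. Below is a minimal sketch using hypothetical counts and SciPy; the zero-positives case also matches the quick "rule of three" check (roughly 3/n):

```python
from scipy.stats import beta

def upper_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper confidence bound on a
    proportion, given k 'positives' observed in a sample of n."""
    if k >= n:
        return 1.0
    return beta.ppf(confidence, k + 1, n - k)

# Hypothetical result: 0 non-duplicative responsive emails found in
# a 4,000-email manual-review sample like the one described above.
n, k = 4000, 0
print(f"95% upper bound on prevalence: {upper_bound(k, n):.3%}")
# ~0.075%, consistent with the rule-of-three approximation 3/n.
```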

This case study illustrates how a well-designed sampling exercise can efficiently provide a statistically sound answer to a question a company has about its data: large samples are used in the early stages, when less costly automated techniques can be applied, and smaller samples are drawn later, when more costly manual methods are required.

Case Study 2: The Accuracy of Document-Assembly Software

Situation. Company B uses automated document-assembly technology as an aid in preparing notices that the company sends to users of its services. The company wishes to obtain a measure of the accuracy of the information contained in the notices, which is the product both of the document-assembly technology and of the quality control performed by the individual ultimately responsible for preparing and sending each notice. The company would like a measure of the accuracy of the notices both in aggregate and broken down by individual preparer (to see whether there are any meaningful differences among preparers).

Solution. A study is designed whereby:

(1) the population of notices in-scope for the study is defined;

(2) the specific fields of information within notices that are to be assessed for accuracy are identified;

(3) a random sample of notices is drawn from the study population;

(4) the sampled notices are manually reviewed for accuracy against records known to contain the correct information; and

(5) estimates (and associated 95 percent confidence intervals) for the mean accuracy of the notices, both in aggregate and broken down by individual preparer, are obtained (a sketch of this step follows the list).
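Here is a minimal sketch of step (5), assuming hypothetical review outcomes from step (4). With per-preparer samples as small as those shown, a real study would use larger samples and possibly an exact interval rather than the normal approximation:

```python
import math
from collections import defaultdict

def mean_with_ci(outcomes, z=1.96):
    """Mean accuracy and a 95% confidence interval (normal
    approximation). `outcomes` is a list of booleans:
    True = notice judged accurate."""
    n = len(outcomes)
    p = sum(outcomes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical outcomes from step (4): (preparer, accurate?) pairs.
results = [("alice", True), ("alice", True), ("alice", False),
           ("bob", True), ("bob", True), ("bob", True)]

# Aggregate estimate across all sampled notices.
print("aggregate:", mean_with_ci([ok for _, ok in results]))

# Per-preparer estimates, to compare individuals.
by_preparer = defaultdict(list)
for preparer, ok in results:
    by_preparer[preparer].append(ok)
for preparer, outcomes in sorted(by_preparer.items()):
    print(preparer, mean_with_ci(outcomes))
```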

The study’s findings—that, at a high level of statistical confidence, the information in the notices is highly accurate (over 99 percent) and that there is no meaningful variation among individual preparers—provide Company B with assurance that its notice preparation procedures are working as intended.

This case study illustrates how, once a meaningful metric is identified (here, mean accuracy of notices with respect to key fields), a sampling exercise can provide a simple and decisive answer to the question at hand.

Concluding Remarks

Practitioners are increasingly aware of the insights sampling can provide with regard to the effectiveness of their document review procedures. The two case studies we have briefly reviewed illustrate how sampling can also be leveraged to provide efficient but scientifically sound answers to other questions companies or their counsel may have about their data. The potential applications of sampling to solve such problems are myriad.

In a world in which almost all aspects of the practice of law are affected by big data, legal practitioners would be well-advised to be on the lookout for opportunities to take advantage of the power and efficiency that sound sampling protocols offer. More generally, legal practitioners would benefit from the recognition that the law-science collaboration that has helped them meet the challenges of e-discovery can also help them meet other data-related challenges they and their clients face.

Bruce Hedin is principal scientist and Michael Morneault senior practice director at H5.