The efficacy and defensibility of so-called “predictive coding” have been a hot topic in light of two developments. Magistrate Judge Andrew Peck stated that he “has approved of the use of computer-assisted review” in Da Silva Moore v. Publicis Groupe, and Magistrate Judge Nan Nolan is conducting a still-ongoing evidentiary hearing in Kleen Products v. Packaging Corp. of America, in which the plaintiffs seek to force the defendants to start over with this technology after they already used a traditional Boolean methodology.
Judge Peck’s ruling in Da Silva Moore certainly is positive about statistical document sorting technology. These “predictive” workflows leverage new technology that learns about the substantive relationships between documents based on coding decisions made by humans. But the ruling comes from a case in which the parties, at least initially, already agreed to use the technology and were arguing only over the particulars of the protocol to be followed. Moreover, the order does not adopt or approve of any particular protocol, tool or technology. Judge Peck resolved certain disputes about how to proceed with the process initially, but he reserved judgment on whether those initial steps would be sufficient until after those steps are completed. Contrary to what some have suggested, the holding is quite limited. But there is an important takeaway: Resolution of a dispute over how to use statistical document sorting technology and “predictive” workflows looks much like the resolution of a dispute over how to use Boolean technology.
In traditional arguments over Boolean searching, the parties come into court with competing proposals for searches, hit rates, unique hit rates, document counts, and sometimes samplings of documents resulting from the disputed search terms. These quantitative metrics are used in conjunction with qualitative arguments about the probable importance of various proposed search terms to advocate for competing search term lists. The judge exercises his wide discretion in discovery matters, splits the baby and tells the parties to come back if they still have disputes after doing what the court has ordered.
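To make the quantitative side of such a dispute concrete, here is a minimal Python sketch of how hit counts, hit rates and unique hit counts for proposed search terms might be tabulated. The function name `hit_report`, the sample documents and the case-insensitive substring matching are all assumptions made for illustration; a real review platform would run true Boolean queries against an index.

```python
from typing import Dict, List, Set


def hit_report(documents: Dict[str, str], terms: List[str]) -> Dict[str, Dict[str, float]]:
    """Tabulate hits, unique hits and hit rate for each proposed search term.

    Hypothetical helper: substring matching stands in for a Boolean search engine.
    """
    if not documents:
        raise ValueError("documents must not be empty")

    # Map each term to the set of document IDs it hits.
    hits: Dict[str, Set[str]] = {
        term: {doc_id for doc_id, text in documents.items() if term.lower() in text.lower()}
        for term in terms
    }

    report: Dict[str, Dict[str, float]] = {}
    for term, doc_ids in hits.items():
        # Documents hit by any of the other terms.
        others: Set[str] = set().union(*(ids for t, ids in hits.items() if t != term))
        # A unique hit is a document that only this term returns.
        unique = doc_ids - others
        report[term] = {
            "hits": len(doc_ids),
            "unique_hits": len(unique),
            "hit_rate": len(doc_ids) / len(documents),
        }
    return report


# Illustrative data only; not drawn from any actual case.
sample_docs = {
    "d1": "Price increase memo",
    "d2": "Price list for Q3",
    "d3": "Containerboard capacity report",
    "d4": "Lunch plans",
}
for term, stats in hit_report(sample_docs, ["price", "capacity", "memo"]).items():
    print(term, stats)
```

A term with many hits but few unique hits adds little that the other terms do not already capture, which is the kind of point the parties argue when trimming a search term list.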
Likewise, in Da Silva Moore the parties came into court with competing proposals to “stabilize the training of the software” and to create the initial seed set used to train the software. Instead of hit rates and such, the arguments focused on “statistical confidence levels,” how many “iterative rounds” of human coding should be done to adequately teach the algorithm, how many of the documents humans should review and code in each of those rounds, and at what point the algorithm should be trusted to have found substantially all of the important documents that the humans will then review and code. In addition to the quantitative metrics, there were qualitative arguments such as whether the defendant should review all or only some of the documents the computer will return in the final round. Judge Peck exercised his wide discretion in discovery matters, split the baby and told the parties to come back if they still had disputes after doing what the court ordered. So the first hotly contested ruling on the use of statistical document sorting technology and “predictive” workflows looks much like the more familiar disputes over Boolean methods. This should give comfort to new adopters of this technology.
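For readers unfamiliar with what a “statistical confidence level” implies in practice, the following Python sketch shows one common textbook way to size a random sample of documents: the normal-approximation formula for estimating a proportion, with a finite population correction. This is an assumption for illustration only. Nothing here indicates which formula, confidence level or margin of error the parties in Da Silva Moore actually proposed, and the function name `sample_size` and its default values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.02, proportion: float = 0.5) -> int:
    """Documents to sample to estimate a proportion within +/- margin.

    Hypothetical helper using the normal approximation; proportion=0.5 is the
    most conservative assumption about how many documents are responsive.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Finite population correction shrinks the sample for small collections.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Illustrative collection sizes only.
for size in (10_000, 1_000_000):
    print(size, sample_size(size))
```

Under this formula the required sample grows very slowly with the size of the collection but rises sharply as the margin of error tightens, which helps explain why such disputes tend to center on the confidence level and margin rather than on the total document count.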
One concern in this case is that the defendant agreed to allow the plaintiff to review the documents that the reviewers code as not responsive in each iterative round. If disclosing non-responsive documents becomes the price for judicial permission to use these methods, then using them may not be worth it.
Another workflow that leverages statistical document sorting technology is concept clustering, in which a party sorts an entire dataset into concept clusters at the outset. Senior associates review each concept cluster, exclude those that do not promise to be relevant and promote to traditional linear review only the clusters that promise to be relevant. This process may be significantly more defensible because humans perform due diligence on each concept cluster: clicking into the cluster, skimming the metadata fields in the list view and sampling documents as needed. This is analogous to the time-honored process lawyers and paralegals use when reviewing boxes in warehouses. If a box could be skimmed and found to be irrelevant, then it would be set aside without reviewing every single page just to definitively rule out the possibility that something relevant might have been misfiled in that box. Moreover, this clustering workflow can cull 90 percent or more of a large dataset, which is better than many examples that have been touted where a “predictive” workflow was used.
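The arithmetic behind a cull rate is simple, and a short Python sketch can make it explicit. Here each cluster carries a document count and the reviewing attorney's promote-or-exclude decision; the `Cluster` class, the `cull_rate` function and the sample figures are hypothetical and chosen only to illustrate the calculation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    label: str
    doc_count: int
    promote: bool  # attorney's decision after skimming and sampling the cluster


def cull_rate(clusters: List[Cluster]) -> float:
    """Fraction of the dataset excluded from linear review."""
    total = sum(c.doc_count for c in clusters)
    if total == 0:
        raise ValueError("no documents in any cluster")
    culled = sum(c.doc_count for c in clusters if not c.promote)
    return culled / total


# Illustrative figures only; not drawn from any actual matter.
example = [
    Cluster("newsletters and spam", 600_000, promote=False),
    Cluster("HR benefits notices", 300_000, promote=False),
    Cluster("pricing discussions", 100_000, promote=True),
]
print(f"{cull_rate(example):.0%} of documents culled")
```

Recording the decision per cluster, rather than only the final count, also leaves a trail showing which clusters were set aside, which supports the defensibility argument made above.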
Whichever workflow is used, do not be overly defensive about using statistical document sorting technology. It is not only cheaper but, in all likelihood, will also produce better results.