We go through the exercise of electronic discovery to reduce volumes of data into useful trial evidence. The problem is we’ve been using the wrong tools to do it. Traditionally, counsel savvy about e-discovery have interviewed key information custodians with the goal of developing a list of words and phrases (hereinafter, “keywords”) to be applied against the data set. In theory, applying a keyword filter captures most (if not all) of the relevant information while screening out irrelevant information.
The problem with screening data with keywords is that it doesn’t do what we assume it does. Applying a keyword filter to a data set (i.e., a Boolean search) is simultaneously over- and under-inclusive, because language (and human beings’ use of language) is inconsistent and imprecise. Keyword/Boolean searches are over-inclusive in that a simple Boolean search lacks the “intelligence” to differentiate between different meanings of the same word (e.g., you search for documents related to an insect infestation at an apple orchard, and you end up with thousands of documents about Apple computers).
As a result, your screened data set usually contains thousands of documents that are completely irrelevant to the case, so your review costs soar. Scarier still (for us litigator types), keyword/Boolean searches are also under-inclusive by nature, because they assume:
- People use the same words to refer to the same or similar concepts.
- People spell words the same way.
. . . both of which we know are not necessarily true. (Blair & Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, 28 Com. A.C.M. 289 (1985)). For example, attorneys and paralegals involved in a subway accident case used keyword methodology to search a 350,000-page (40,000-document) database. Id. The litigation team believed they had located 75 percent of the relevant documents, but a separate manual review of the documents found that they had identified only 20 percent. Id.
Most of the “missed” internal communications about the subway accident included terms and phrases like the “unfortunate incident,” the “disaster,” the “event,” the “situation,” the “problem” and the “difficulty” and never mentioned the “subway” or the “accident.” Id. Your keyword-screened document set may also be missing close to 80 percent of the documents that are particularly relevant to your case.
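To make the two failure modes concrete, here is a minimal Python sketch of a Boolean AND keyword filter run against a tiny, invented document set. The documents, their relevance labels, and the function names are all hypothetical illustrations, not data from the Blair & Maron study. The sketch measures precision (the share of hits that are actually relevant) and recall (the share of relevant documents that were actually hit).

```python
import re

# Hypothetical mini-corpus invented for illustration: (text, is_relevant).
DOCS = [
    ("Inspection report on the subway accident at the station.", True),
    ("We need to discuss the unfortunate incident before the board meets.", True),
    ("Who is handling press calls about the disaster?", True),
    ("Legal wants every memo on the situation by Friday.", True),
    ("Lunch order: one subway sandwich for the accident-prone intern.", False),
    ("Quarterly budget figures attached.", False),
]


def keyword_filter(docs, keywords):
    """Return indices of documents containing ALL keywords (Boolean AND)."""
    hits = []
    for i, (text, _) in enumerate(docs):
        # Lowercase and split into whole words so matching is case-insensitive.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(k in words for k in keywords):
            hits.append(i)
    return hits


def precision_recall(docs, hits):
    """Precision: relevant hits / all hits. Recall: relevant hits / all relevant."""
    relevant = {i for i, (_, is_relevant) in enumerate(docs) if is_relevant}
    found = relevant & set(hits)
    precision = len(found) / len(hits) if hits else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


hits = keyword_filter(DOCS, ["subway", "accident"])
precision, recall = precision_recall(DOCS, hits)
print(hits, precision, recall)  # [0, 4] 0.5 0.25
```

In this toy set, the filter pulls in the sandwich order (over-inclusive) and misses the three relevant documents that say “incident,” “disaster” and “situation” (under-inclusive), so only one of four relevant documents is found.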
Luckily, there’s a solution to this problem: concept searching. Concept searching software uses algorithms to build a language model unique to each document set. Once the algorithm is applied to the document set, it tells the user which concepts (rather than keywords) to look for. Applying the language algorithm reduces the problem of words with multiple meanings because the software is “intelligent” enough to determine whether the data is relevant based on the context in which a word was used.
Most importantly, however, the software informs the user about additional words and phrases (e.g., “disaster,” “difficulty” and “unfortunate incident” from the example above) that relate to the same concept the user is attempting to explore, thereby yielding a more usable data set.
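Commercial concept searching tools rely on their own (often proprietary) language models, so the following Python sketch is only a simplified stand-in for the general idea, not any vendor’s method. It assumes that two words relate to the same concept when they tend to appear alongside the same other words. Given a seed keyword, it suggests terms that never appear in the same document as the seed but share its context. The document set, the stopword list, and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stopword list and document set, invented for illustration.
STOPWORDS = {"the", "on", "and", "about", "for", "to", "near"}

DOCS = [
    "subway accident injured passengers near the platform",
    "accident report passengers platform derailment",
    "the unfortunate incident injured passengers on the platform",
    "press statement about the disaster and the derailment",
    "quarterly budget review for marketing",
    "marketing budget meeting moved to friday",
]


def tokenize(text):
    """Return the set of lowercase words in a document, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def context_vectors(docs):
    """Map each term to counts of the other terms sharing a document with it."""
    vectors = defaultdict(Counter)
    for text in docs:
        terms = tokenize(text)
        for term in terms:
            for other in terms:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_terms(docs, seed):
    """Rank terms that never co-occur with the seed but share its context."""
    vectors = context_vectors(docs)
    seed_vec = vectors[seed]
    scored = []
    for term, vec in vectors.items():
        # Skip the seed itself and terms a keyword search would already reach.
        if term == seed or term in seed_vec:
            continue
        score = cosine(seed_vec, vec)
        if score > 0:
            scored.append((term, score))
    # Highest score first; ties broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


for term, score in suggest_terms(DOCS, "accident"):
    print(f"{term}: {score:.2f}")
```

On this toy set, a search seeded with “accident” surfaces “incident” and “unfortunate” as the strongest suggestions and “disaster” as a weaker one, while the budget and marketing vocabulary scores zero and is dropped. A real document set would need far more text, and a far better model, before suggestions like these could be trusted.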
Concept searching is not new. E-discovery vendors have been marketing some form of concept searching under the rubric of “clustering” and/or “data analytics” since the early 2000s. Since then, concept searching software has become much more sophisticated and identifies related concepts much more clearly than its predecessors.
Indeed, Herbert Blutenthal of OrcaTec has used his company’s concept searching technology on many data sets and reports that he “always learns things when [he] takes on a new collection.” Even the U.S. Department of Defense has recently licensed a type of clustering/data analytics software to help make sense of the volumes of unclassified information in its data archives.
The moral of this article: Although the cost of concept searching may be high, it has the potential to yield more relevant information than the “keyword”/Boolean method. Before you completely dismiss the idea because of the price tag, you ought to know what you’re giving up.