In the wee hours, a beat cop sees a drunken lawyer crawling around under a streetlight searching for something. The cop asks, “What’s this, now?” The lawyer looks up and says, “I’ve lost my keys.”
They both search for a while, until the cop asks, “Are you sure you lost them here?”
“No, I lost them in the park,” the tipsy lawyer explains, “but the light’s better over here.”
I told that groaner in court, trying to explain why opposing counsel’s insistence that we blindly supply keywords to be run against the email archive of a Fortune 50 insurance company wasn’t a reasonable or cost-effective approach to electronic data discovery. The “Streetlight Effect,” described by David Freedman in his 2010 book “Wrong,” is a species of observational bias where people tend to look for things in the easiest ways. It neatly describes how lawyers approach e-discovery. We look for responsive electronically stored information only where and how it’s easiest, with little consideration of whether our approaches are calculated to find it.
Easy is wonderful when it works; but looking where it’s easy when failure is assured is something no sober-minded counsel should accept and no sensible judge should allow.
Consider the myth of the enterprise search. Counsel within and outside companies, on both sides of the docket, believe that companies can run keyword searches against their myriad silos of data: mail systems, archives, local drives, network shares, portable devices, removable media and databases. They imagine that finding responsive ESI hinges on the ability to incant magic keywords like Harry Potter. Documentum relevantus!
Though data repositories may share common networks, they rarely share common search capabilities or syntax. Repositories that offer keyword search may not support Boolean constructs (queries using “AND,” “OR” and “NOT”), proximity searches (Word1 near Word2), stemming (finding “adjuster,” “adjusting,” “adjusted” and “adjustable”) or fielded searches (restricted to just addressees, subjects, dates or message bodies). Searching databases entails specialized query languages or user privileges. Moreover, different tools extract text and index those extractions in quite different ways, with the upshot that a document found on one system may not be found on another using the same query.
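To see why the same query can succeed on one system and fail on another, here is a minimal Python sketch of two hypothetical search engines: one stems terms before indexing, the other indexes exact tokens only. The engines, the crude stemmer and the document are all invented for illustration.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(word):
    # Crude suffix-stripping stand-in for a real stemmer (illustrative only).
    for suffix in ("ing", "ed", "er", "able", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

doc = "The adjuster was adjusting the claim."

index_a = {naive_stem(t) for t in tokenize(doc)}  # engine with stemming
index_b = set(tokenize(doc))                      # exact-match engine

def search(index, term, stem=False):
    return (naive_stem(term) if stem else term) in index

print(search(index_a, "adjusted", stem=True))  # True:  stem "adjust" matches
print(search(index_b, "adjusted"))             # False: exact token absent
```

Identical query, identical document, opposite results; that is the gap lawyers paper over when they assume “enterprise search” behaves uniformly.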
But the streetlight effect is nowhere more insidious than when litigants use keyword searches against archives, email collections and other sources of indexed ESI.
That Fortune 50 company — call it All City Indemnity — collected a gargantuan volume of email messages and attachments in a process called “message journaling.” Journaling copies every message traversing the system into an archive where the messages are indexed for search. Keyword searches only look at the index, not the messages or attachments; so, if you don’t find it in the index, you won’t find it at all.
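A toy inverted index makes the point concrete: the search consults only what made it into the index at ingestion, so text that was never extracted is invisible. This is an illustrative sketch, not any archive vendor’s actual design.

```python
def build_index(messages):
    """Build a word -> message-ids index at ingestion, as journaling archives do."""
    index = {}
    for msg_id, (body, attachment_text) in messages.items():
        # None means text extraction failed; nothing from that attachment is indexed.
        searchable = body + " " + (attachment_text or "")
        for word in searchable.lower().split():
            index.setdefault(word, set()).add(msg_id)
    return index

messages = {
    # (message body, extracted attachment text) -- invented examples
    "m1": ("roof claim approved", "bid for shingle repair"),
    "m2": ("see attached", None),  # scanned receipt about the roof; no text extracted
}

index = build_index(messages)
print(sorted(index.get("roof", set())))  # ['m1'] -- m2's scanned mention is invisible
```

The archive never lies, exactly; it faithfully reports everything in the index. The trouble is what never got there.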
All City gets sued every day. When a request for production arrives, they run keyword searches against their massive mail archive using a tool we’ll call Truthiness. Hundreds of big companies use Truthiness or software just like it and blithely expect their systems will find all documents containing the keywords. They’re wrong . . . or in denial.
If requesting parties don’t force opponents like All City to face facts, All City and its ilk will keep pretending their tools work better than they do, and requesting parties will keep getting incomplete productions.
A Better Way
To force the epiphany, consider an interrogatory like this:
For each electronic system or index that will be searched to respond to discovery, please state:
A. The rules employed by the system to tokenize data so as to make it searchable.
B. The stop words used when documents, communications or ESI were added to the system or index.
C. The number and nature of documents or communications in the system or index which are not searchable as a consequence of the system or index being unable to extract its full text or metadata.
D. Any limitation in the system or index, or in the search syntax to be employed, tending to limit or impair the effectiveness of keyword, Boolean or proximity search in identifying documents or communications that a reasonable person would understand to be responsive to the search.
A court will permit “discovery about discovery” like this when a party demonstrates why an inadequate index is a genuine problem.
So, let’s explore the rationale behind each inquiry:
A. Tokenization rules. When machines search collections of documents for keywords, they rarely search the documents themselves for matches; instead, they consult an index of words extracted from the documents. Machines cannot read, so the characters in the documents are identified as “words” because their appearance meets certain rules in a process called “tokenization.” Tokenization rules aren’t uniform across systems or software. Many indices simply don’t index short words (e.g., acronyms), and many don’t index single letters or numbers at all.
Tokenization rules also govern such things as the handling of punctuated terms (as in a compound word like “wind-driven”), case (will a search for “roof” also find “Roof”?), diacriticals (will a search for Rene also find René?) and numbers (will a search for “Clause 4.3” work?). Most people simply assume these searches will work. Yet, in many search tools and archives, they don’t work as expected or don’t work at all, unless steps are taken to ensure that they will.
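A short sketch shows two hypothetical tokenizers treating the same sentence differently; neither rule set is taken from any real product.

```python
import re

text = "Wind-driven rain damaged Clause 4.3 of the policy."

# Engine A: splits on any non-alphanumeric character and lowercases (hypothetical).
tokens_a = re.findall(r"[a-z0-9]+", text.lower())
# Engine B: keeps hyphenated compounds as one token, preserves case (hypothetical).
tokens_b = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

print("wind-driven" in tokens_a)  # False: A split it into "wind" and "driven"
print("Wind-driven" in tokens_b)  # True:  B indexed the compound whole
print("4" in tokens_a, "3" in tokens_a)  # True True: "4.3" was split apart in A
```

A searcher who types “wind-driven” or “4.3” gets different answers from each engine, and neither engine warns that the query was mangled.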
B. Stop words. Some common “stop words” or “noise words” are simply excluded from an index when it’s compiled. Searches for stop words fail because the words never appear in the index. Stop words aren’t always trivial omissions. If “all” and “city” are stop words, a search for “All City” will fail to turn up documents containing the company’s own name. Words like side, down, part, problem, necessary, general, goods, needing, opening, possible, well, years and state are examples of common stop words. Computer systems typically employ dozens or hundreds of stop words when they compile indices.
Because users aren’t warned that searches containing stop words fail, they mistakenly assume that there are no responsive documents when there may be thousands. A search for “All City” would miss millions of documents at All City Indemnity (though it’s folly to search a company’s files for the company’s name).
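The failure mode is easy to reproduce. In this sketch (the stop-word list is invented for illustration), the index silently drops the very words a requesting party would search for:

```python
STOP_WORDS = {"all", "city", "the", "of", "and"}  # tiny illustrative list

def index_words(text):
    # Words on the stop list never reach the index, so searches for them fail.
    return {w for w in text.lower().split() if w not in STOP_WORDS}

index = index_words("All City Indemnity denied the claim")
print("all" in index, "city" in index)  # False False: silently dropped
print("indemnity" in index)             # True
```

Nothing in the output hints that two of the query terms were never indexed; the user simply sees no hits and draws the wrong conclusion.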
C. Non-searchable documents. A great many documents are not amenable to text search without special handling. Common examples of non-searchable documents are faxes and scans, as well as .tiff images and some Adobe PDF documents. While no system will be flawless in this regard, it’s important to determine how much of a collection isn’t text-searchable, what’s not searchable and whether the portions of the collection that aren’t searchable are of particular importance to the case.
If All City’s adjusters attached scanned receipts and bids to email messages, the attachments aren’t keyword-searchable absent optical character recognition (OCR).
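Before relying on keyword search, a producing party can measure how much of a collection yields no text at all. The data below is invented; in practice a text-extraction tool reports this per file.

```python
# Triage sketch: count files whose text extraction produced nothing.
# (Hypothetical file names and results, for illustration only.)
attachments = [
    {"name": "policy.docx",  "extracted_text": "coverage terms ..."},
    {"name": "receipt.tiff", "extracted_text": ""},  # scan, no OCR run
    {"name": "bid.pdf",      "extracted_text": ""},  # image-only PDF
    {"name": "notes.txt",    "extracted_text": "adjuster notes"},
]

unsearchable = [a["name"] for a in attachments if not a["extracted_text"]]
share = len(unsearchable) / len(attachments)
print(unsearchable)  # ['receipt.tiff', 'bid.pdf']
print(f"{share:.0%} of attachments are invisible to keyword search")
```

Knowing that figure, and which file types dominate it, tells counsel whether OCR or another workaround is needed before keyword search can be defended as reasonable.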
Other documents may be inherently text-searchable but not made a part of the index because they’re password-protected (i.e., encrypted) or otherwise encoded or compressed in ways that frustrate indexing of their contents. Important documents are often password-protected.
D. Other limitations: If a party or counsel knows that the systems or searches used in e-discovery will fail to perform as expected, they should be obliged to affirmatively disclose such shortcomings. If a party or counsel is uncertain whether systems or searches work as expected, they should be obliged to find out by, e.g., running tests to be reasonably certain.
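One way to “find out by running tests,” sketched here rather than prescribed, is a seeded search: plant control documents containing known terms, run the system’s search, and verify that every seed comes back. The index below is a hypothetical result standing in for the system under test.

```python
def run_search(index, term):
    # Stand-in for the archive's search; consults only the index.
    return index.get(term.lower(), set())

# Control documents seeded into the collection, each containing one known term.
seeds = {"s1": "subrogation", "s2": "wind-driven", "s3": "4.3"}

# Hypothetical index produced by the system under test:
index = {"subrogation": {"s1"}, "wind": {"s2"}, "driven": {"s2"}}

missed = [doc for doc, term in seeds.items() if doc not in run_search(index, term)]
print(missed)  # ['s2', 's3']: the system failed two of three control searches
```

A system that cannot find documents you know are there, containing terms you chose, has disclosed its limitations more candidly than any vendor brochure will.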
No system is perfect, and perfect isn’t the e-discovery standard. Often, we must adapt to the limitations of systems or software. But you have to know what a system can’t do before you can find ways to work around its limitations or set expectations consistent with actual capabilities, not magical thinking and unfounded expectations.