C. The number and nature of documents or communications in the system or index which are not searchable because the system or index is unable to extract their full text or metadata.
D. Any limitation in the system or index, or in the search syntax to be employed, tending to limit or impair the effectiveness of keyword, Boolean or proximity search in identifying documents or communications that a reasonable person would understand to be responsive to the search.
A court will permit "discovery about discovery" like this when a party demonstrates why an inadequate index is a genuine problem. So, let's explore the rationale behind each inquiry:
A. Tokenization Rules. When machines search collections of documents for keywords, they rarely search the documents themselves for matches; instead, they consult an index of words extracted from the documents. Machines cannot read, so strings of characters in the documents are identified as "words" when they satisfy certain rules, in a process called "tokenization." Tokenization rules aren't uniform across systems or software. Many indices simply don't index short words (e.g., acronyms), and many don't index single letters or numbers.
Tokenization rules also govern such things as the handling of punctuated terms (as in a compound word like "wind-driven"), case (will a search for "roof" also find "Roof"?), diacriticals (will a search for "Rene" also find "René"?) and numbers (will a search for "Clause 4.3" work?). Most people simply assume these searches will work. Yet, in many search tools and archives, they don't work as expected, or don't work at all, unless steps are taken to ensure that they will.
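To make the point concrete, here is a minimal sketch in Python of how two sets of tokenization rules can turn the same sentence into different index terms. The `tokenize` function and its options are hypothetical and are not drawn from any particular search product; real indexing engines apply their own rules, which is exactly why the question is worth asking.

```python
import re
import unicodedata

def tokenize(text, min_len=1, fold_case=True, fold_diacritics=True,
             split_hyphens=True):
    """Hypothetical tokenizer for illustration only."""
    if fold_case:
        text = text.lower()
    if fold_diacritics:
        # Decompose accented characters, then drop the combining marks.
        text = "".join(c for c in unicodedata.normalize("NFKD", text)
                       if not unicodedata.combining(c))
    # A "word" is a run of letters or digits; optionally keep hyphenated compounds whole.
    pattern = r"[^\W_]+" if split_hyphens else r"[^\W_]+(?:-[^\W_]+)*"
    # Drop tokens shorter than the minimum indexed length.
    return [t for t in re.findall(pattern, text) if len(t) >= min_len]

# "Ren\u00e9" is "René", written with an escape so the example does not depend on file encoding.
sample = "Ren\u00e9 approved the wind-driven Roof claim under Clause 4.3"

# One system: case-sensitive, keeps accents and hyphens, ignores tokens under 3 characters.
strict = tokenize(sample, min_len=3, fold_case=False,
                  fold_diacritics=False, split_hyphens=False)

# Another system: folds case and accents, splits on hyphens, indexes everything.
lenient = tokenize(sample)

print(strict)
print(lenient)
```

Under the first set of rules, a search for "roof", "Rene", "wind" or "4" finds nothing in this sentence; under the second, all four succeed. Neither behavior is wrong, but a searcher who doesn't know which rules apply cannot tell an empty result from a failed search.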
B. Stop Words. Some common "stop words" or "noise words" are simply excluded from an index when it's compiled. Searches for stop words fail because the words never appear in the index. Stop words aren't always trivial omissions. For example, if "all" and "city" are stop words, a search for "All City" will fail to turn up documents containing the company's own name! Words like side, down, part, problem, necessary, general, goods, needing, opening, possible, well, years and state are examples of common stop words. Computer systems typically employ dozens or hundreds of stop words when they compile indices.
Because users aren't warned that searches containing stop words fail, they mistakenly assume that there are no responsive documents when there may be thousands. A search for "All City" would miss millions of documents at All City Indemnity (though it's folly to search a company's files for the company's name).
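The silent failure is easy to reproduce. The following is a minimal sketch in Python, assuming a toy inverted index and a made-up stop word list; the function names, the list and the two sample documents are all hypothetical. Some real systems instead strip stop words from the query itself, which produces different but equally unannounced results.

```python
# Hypothetical stop word list; real systems ship their own, often much longer.
STOP_WORDS = {"all", "city", "the", "of", "a", "and", "to"}

def build_index(docs, stop_words=STOP_WORDS):
    """Map each indexed word to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            word = word.strip('.,;:!?"()')
            # Stop words are skipped here, so they never reach the index.
            if word and word not in stop_words:
                index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """AND search: a document must contain every query term."""
    terms = query.lower().split()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()

def dropped_terms(query, stop_words=STOP_WORDS):
    """The warning the user never sees: query terms that cannot match."""
    return [t for t in query.lower().split() if t in stop_words]

docs = {
    1: "All City Indemnity denied the roof claim.",
    2: "Adjuster notes: wind-driven rain damaged the roof.",
}
index = build_index(docs)

company_hits = search(index, "All City")    # empty, although document 1 contains the phrase
roof_hits = search(index, "roof")           # both documents
never_indexed = dropped_terms("All City")   # both query terms are stop words

print(company_hits, roof_hits, never_indexed)
```

The search for "All City" returns an empty result with no error and no warning, which is indistinguishable from a collection that truly holds no responsive documents. Only a check like `dropped_terms`, or a look at the stop word list itself, reveals that the query could never have matched.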
C. Non-searchable Documents. A great many documents are not amenable to text search without special handling. Common examples of non-searchable documents are faxes and scans, as well as .tiff images and some Adobe PDF documents. While no system will be flawless in this regard, it's important to determine how much of a collection isn't text-searchable, what's not searchable and whether the portions of the collection that aren't searchable are of particular importance to the case.
If All City's adjusters attached scanned receipts and bids to email messages, the attachments aren't keyword searchable absent optical character recognition.
Other documents may be inherently text-searchable but not made a part of the index because they're password-protected (i.e., encrypted) or otherwise encoded or compressed in ways that frustrate indexing of their contents. Important documents are often password-protected.
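One way to size the problem is to tally the items for which the system extracted no text, grouped by file type and reason. The sketch below, in Python, assumes each item is described by a record with hypothetical `path`, `text` and `encrypted` fields; actual review platforms expose this information in their own ways, typically through exception or indexing-error reports.

```python
from collections import Counter
from pathlib import PurePath

def unsearchable_report(records):
    """Count items with no searchable text, by (extension, reason).

    Each record is a dict with hypothetical keys: 'path', 'text', 'encrypted'.
    """
    reasons = Counter()
    total = 0
    for rec in records:
        total += 1
        ext = PurePath(rec["path"]).suffix.lower() or "(none)"
        if rec.get("encrypted"):
            reasons[(ext, "encrypted")] += 1
        elif not (rec.get("text") or "").strip():
            # Typical of scans and faxes that were never run through OCR.
            reasons[(ext, "no extracted text")] += 1
    return total, reasons

# Made-up sample records for illustration.
records = [
    {"path": "claims/bid_0417.tif", "text": ""},
    {"path": "claims/receipt.PDF", "text": "   "},
    {"path": "claims/estimate.xlsx", "text": None, "encrypted": True},
    {"path": "claims/letter.docx", "text": "Dear policyholder, ..."},
]

total, reasons = unsearchable_report(records)
unsearchable = sum(reasons.values())

print(f"{unsearchable} of {total} items are not text-searchable")
for (ext, reason), count in sorted(reasons.items()):
    print(f"  {ext}: {count} ({reason})")
```

A tally like this answers the "how much" and "what kind" questions; whether the unsearchable items matter to the case still requires looking at what they are.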