In Part 1 of this series, data mining was introduced as a new methodology for how legal departments can deal with big email. When matters arise that require legal teams to quickly and inexpensively uncover key facts, data mining can be leveraged as an alternative to the standard e-discovery process. By finding just the small subset of critical data in a universe of hundreds of millions of emails, legal teams can better form case strategy, respond quickly to internal issues, identify compliance violations proactively, and more.

Starting from the premise that data mining is a better way for legal teams to approach e-discovery at the outset of a matter, this article discusses the principles of using data mining to find facts quickly.

It’s not comprehensive review

Legal teams must first and foremost change their paradigm. There is often a temptation to dive in and start reviewing individual documents rather than spending time on the broad-brush analysis that data mining allows. There are too many documents and too little time to look at every single document for every single detail; if necessary, that will happen later. Counsel must instead treat this as an entirely new and separate approach, and remember that in the beginning the key facts are the goal. Once the important information is known, the team can determine whether a full-scope comprehensive review needs to be conducted (and can then conduct it more efficiently by leveraging the knowledge already gained). In sum: think differently.

Visualize to summarize

Visual analytics can enable quick and painless research work within the document population. Data mining visualizations present groups of data that share similarities and, most importantly, present summaries of those similarities. When a subset of the emails is all about ‘X,’ and ‘X’ is known to be irrelevant, the legal team can quickly dismiss that group of documents. Conversely, if a subset shares similarities about ‘Y’ and ‘Y’ is likely relevant, that subset can be mined more closely, again identifying groups within it that can be ignored, so that the team actually lays eyes on only a tiny fraction of the total documents to determine whether key facts are contained there. By doing this, qualified researchers on the matter can easily know what to ignore and what needs closer examination.

Exclusion is key

The power of data mining lies in enabling the chaff to be identified and ignored at a bulk level. Use the visual analytics, which summarize the contents and trends of the population based on what’s in the documents themselves, to identify large portions of the data that can be ignored with good reason. Then focus on only the small portion of the total that stands a chance of being key. This requires being willing to make big decisions: the goal is to temporarily exclude large sets of documents from consideration. (The exclusion is not permanent: you can always come back to them if needed.)
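To make bulk exclusion concrete, here is a minimal Python sketch. It is an illustration only, not how any particular analytics product works: the toy emails, the stopword list, and the helper names `group_by_top_term` and `exclude` are all assumptions made for the example, and a crude "most common shared word" grouping stands in for real clustering.

```python
from collections import Counter, defaultdict

# Hypothetical toy emails; a real matter would load these from an export.
EMAILS = {
    1: "lunch menu for friday cafeteria",
    2: "cafeteria lunch schedule update",
    3: "contract pricing terms with vendor",
    4: "vendor contract renewal pricing",
    5: "lunch order cafeteria friday",
}

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"for", "with", "the", "a", "and"}


def group_by_top_term(emails):
    """Group each email under the most widely shared word it contains."""
    # Count in how many emails each non-stopword appears.
    doc_counts = Counter(
        word
        for text in emails.values()
        for word in set(text.split())
        if word not in STOPWORDS
    )
    groups = defaultdict(list)
    for doc_id, text in emails.items():
        words = [w for w in text.split() if w not in STOPWORDS]
        if not words:
            groups["(no terms)"].append(doc_id)
            continue
        # Highest document count wins; ties are broken alphabetically.
        top = min(words, key=lambda w: (-doc_counts[w], w))
        groups[top].append(doc_id)
    return groups


def exclude(groups, irrelevant_terms):
    """Split document ids into those kept for review and those set aside."""
    kept, set_aside = [], []
    for term, doc_ids in groups.items():
        target = set_aside if term in irrelevant_terms else kept
        target.extend(doc_ids)
    return sorted(kept), sorted(set_aside)


groups = group_by_top_term(EMAILS)
kept, set_aside = exclude(groups, {"cafeteria"})
print(dict(groups))
print("review:", kept, "set aside:", set_aside)
```

In this toy set, one decision about the "cafeteria" group sets aside three of the five emails at once. The set-aside list is returned rather than deleted, which reflects the point that excluded documents can be revisited later.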

Reuse existing knowledge

Data mining allows the reuse of existing knowledge as a clue toward finding more information. This can be done by using visualizations to identify documents that are similar to, or trend with, documents already known to be important, whether the similarity is content-based or based on other features of the document. Similarly, once counsel knows a key piece of information that matters to the case, the analytics can quickly identify additional content that is related in a meaningful way and help the team understand where else useful data might be found.
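One simple way to picture content-based similarity is cosine similarity over word counts. The sketch below is a hedged illustration under that assumption; the sample texts and the function names `vectorize`, `cosine`, and `rank_by_similarity` are hypothetical, and commercial tools may use quite different measures.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(known_text, candidates):
    """Return candidate ids with non-zero similarity, most similar first."""
    known = vectorize(known_text)
    scored = [
        (cosine(known, vectorize(text)), doc_id)
        for doc_id, text in candidates.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical document already known to be important.
known_email = "project falcon pricing call"

# Hypothetical unreviewed emails.
unreviewed = {
    "a": "falcon pricing call moved to monday",
    "b": "holiday party rsvp",
    "c": "notes from the pricing call",
}

print(rank_by_similarity(known_email, unreviewed))
```

Here the email that shares the most vocabulary with the known document ranks first, and the unrelated one drops out entirely, which is the behavior a reviewer would want when using one known document as a lead.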

Follow the growing web of information

Continuing with the method of leveraging existing knowledge, visual analytics can lead counsel down a path that provides more and more valuable data, growing the base of what is known about a matter. The process is similar to running a case law search: searching for and reading one case leads to identifying 10 more cases to read, and those 10 cases lead to still more similar cases. As the searches iterate and knowledge grows, it eventually becomes clear that most of the key pieces of information have been identified and that it is appropriate to stop the iterative process.
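The iterate-until-nothing-new pattern can be sketched as a small loop. This is a simplified illustration under stated assumptions: "related" is taken to mean sharing at least two words with a newly found document, and the emails and the `expand` function are hypothetical.

```python
# Hypothetical toy emails for the example.
MATTER_EMAILS = {
    1: "falcon pricing memo",
    2: "falcon pricing revised numbers",
    3: "revised numbers sent to board",
    4: "board dinner reservation",
    5: "parking garage closed",
}


def expand(seed_ids, emails, min_shared=2):
    """Grow the known set from seed documents until a round finds nothing new.

    A document joins the set when it shares at least `min_shared` words
    with any document found in the previous round (an assumed, simple
    stand-in for a real relatedness measure).
    """
    known = set(seed_ids)
    frontier = set(seed_ids)
    rounds = 0
    while frontier:
        frontier_terms = [set(emails[i].split()) for i in frontier]
        newly_found = set()
        for doc_id, text in emails.items():
            if doc_id in known:
                continue
            terms = set(text.split())
            if any(len(terms & ft) >= min_shared for ft in frontier_terms):
                newly_found.add(doc_id)
        if newly_found:
            rounds += 1
        known |= newly_found
        frontier = newly_found  # the next round searches only from new leads
    return known, rounds


found, rounds = expand({1}, MATTER_EMAILS)
print(sorted(found), "after", rounds, "productive rounds")
```

Starting from a single known email, each round follows only the newly discovered leads, and the loop ends by itself once a round turns up nothing new. That stopping condition mirrors the point at which a researcher concludes that most of the key information has been found.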

Relying on data mining instead of the standard e-discovery approach when only key facts are needed at the outset can enable legal teams to take a set of millions of emails and very quickly reduce it to only the tens, hundreds, or thousands of emails that matter. This way of bringing information to the surface can often uncover surprising leads as well. In one recent case, a client followed this approach at the outset of a matter and found almost immediately that everyone involved had been using a similar code word in communications about the issue. The visual analytics identified clusters of emails around this particular code word, in volumes that would not have been expected for such an innocuous word, which allowed the team to see exactly what was going on. With that information, the data could then be searched and the subset related to the code word quickly explored.
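The article does not describe how the analytics flagged the code word, so the following is only a hypothetical sketch of one way an innocuous word in unexpected volumes might stand out: comparing how often a word appears in the emails under review against a baseline set of ordinary emails. The sample data, the word "umbrella," and the functions `doc_rate` and `lift` are all invented for illustration.

```python
# Hypothetical baseline of ordinary company email.
BASELINE = [
    "bring an umbrella today",
    "meeting at noon",
    "meeting notes attached",
    "budget review meeting",
    "lunch on friday",
    "quarterly budget draft",
    "meeting moved to monday",
    "travel plans for june",
    "server maintenance tonight",
    "welcome to the team",
]

# Hypothetical emails from the custodians involved in the matter.
MATTER = [
    "the umbrella is ready",
    "need the umbrella before the meeting",
    "umbrella delivered",
    "meeting about the audit",
]


def doc_rate(word, emails):
    """Fraction of emails that contain the word at least once."""
    hits = sum(1 for text in emails if word in text.lower().split())
    return hits / len(emails)


def lift(word, matter_emails, baseline_emails, floor=0.01):
    """How many times more common the word is in the matter than expected.

    `floor` is an assumed minimum baseline rate that avoids dividing by
    zero for words never seen in the baseline.
    """
    expected = max(doc_rate(word, baseline_emails), floor)
    return doc_rate(word, matter_emails) / expected


for word in ("umbrella", "meeting"):
    print(word, round(lift(word, MATTER, BASELINE), 2))
```

In this invented data, "meeting" appears at about its usual rate while "umbrella" appears several times more often than the baseline would suggest, which is the kind of signal that might prompt a closer look at the emails containing it.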

Despite the out-of-control data volumes enterprises are dealing with today, getting to the facts quickly has never been easier given the advancements in analytics technology, which provide a radical improvement over the traditional, time-consuming, and expensive comprehensive review. While effective and sound data deletion and retention policies are critical in addressing the issue of big email, many organizations are unable to tackle information governance in a way that truly reduces volumes, or are left with significant volumes of information even after they do. Considering the inherent challenges of finding any valuable information in millions upon millions of emails, embracing the change in process and adopting data mining is a compelling way to respond to the demand to find key facts quickly.