Data is duplicative by nature, but the way your operation stores and manages data is likely exposing it to unnecessary and costly redundancy. Most organizations handling e-discovery today likely have a cumulative data set anywhere from five to 10 times larger than necessary.
The following are the areas impacted by data sprawl, why it occurs and what you can do about it.
Where ESI Data Gets Duplicated
Redundant data typically shows up in two forms: duplicative original data and export/import duplication.
Duplicative Original Data
It is estimated that more data has been created in the past two years than in all of prior history combined, and the exponential increase in data volumes will continue to impact e-discovery. Duplicative original data is all but guaranteed by the collection process, as discoverable data is harvested broadly for potential relevance. While culling and de-duplication are not novel concepts, much of the duplication is a result of the data management and workflows associated with e-discovery.
As data is ingested for normalization, most standalone systems repeatedly create additional copies of the same data and store them on the designated file systems. If an email is sent to 10 people, 10 copies of that email and any associated attachments will be collected and processed. If any of those attachments were saved locally by any of the recipients, those files will also be duplicative. As a result, a significant amount of duplicative original data is being stored in various places within an organization.
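The duplication described above is typically detected by fingerprinting file content with a cryptographic hash, so that byte-identical copies collected from different custodians collapse into one group. The sketch below is illustrative, not any particular vendor's implementation; the file names and the "10 recipients plus two local saves" scenario are assumptions taken from the example in the text.

```python
import hashlib


def sha256_of(content: bytes) -> str:
    """Fingerprint file content; identical bytes yield identical hashes."""
    return hashlib.sha256(content).hexdigest()


def dedupe(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group collected file paths by content hash. Every path beyond
    the first in a group is a duplicative copy of the same original."""
    groups: dict[str, list[str]] = {}
    for path, content in files.items():
        groups.setdefault(sha256_of(content), []).append(path)
    return groups


# The email-to-10-people scenario: one attachment collected from ten
# custodians, plus two recipients who also saved it locally.
attachment = b"...bytes of quarterly-report.pdf..."
collected = {f"custodian{i}/inbox/report.pdf": attachment for i in range(10)}
collected["custodian3/Documents/report.pdf"] = attachment
collected["custodian7/Desktop/report.pdf"] = attachment

groups = dedupe(collected)
unique_files = len(groups)     # one unique file
stored_copies = len(collected) # twelve stored copies
```

Twelve paths resolve to a single hash: the collection holds one unique file but twelve physical copies, which is exactly the gap de-duplication reporting surfaces.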
If data volumes are not properly taken into consideration while establishing e-discovery protocols, traditional workflows will always result in an increase in the duplication of original data.
Export/Import Duplication
Most organizations rely on multiple applications for handling ESI during various aspects of their discovery processes. Each time an application is used to perform data processing, analysis, review or production, data is exported from the preceding application and imported into the next.
Quantifying the amount of redundant data being created depends on the particular workflow. It is not uncommon for a data set to expand five or six times in a simple two-product system. Now imagine stepping through a workflow that uses three or four products.
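The expansion compounds with each tool in the chain. As a rough model, assume each product in the workflow keeps its own import copy and writes an export copy for the next hand-off; the per-tool copy count is a hypothetical figure for illustration, not a measurement.

```python
def expansion_factor(num_tools: int, copies_per_tool: int = 2) -> int:
    """Rough model of data-set expansion across a multi-product workflow:
    the original, plus an import copy and an export copy per tool.
    copies_per_tool is an illustrative assumption, not a benchmark."""
    return 1 + num_tools * copies_per_tool


two_product = expansion_factor(2)   # 1 original + 4 copies = 5x
four_product = expansion_factor(4)  # 1 original + 8 copies = 9x
```

Under these assumptions a two-product system already holds the data set five times over, consistent with the five-to-six-fold expansion cited above, and each additional product adds two more full copies.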
The result: Significant data sprawl.
If you know or simply suspect that your organization is exposed to data sprawl, taking a closer look at your current e-discovery ecosystem is a good first step. Then double down on targeting the two areas causing the majority of redundant data.
Move to a Single-Instance Solution
Even within a single system, data is often automatically duplicated when stored, creating a considerable amount of redundancy.
Regardless of de-duplication protocols, legacy solutions store every duplicative file so each remains retrievable at any given time. A single-instance solution identifies these duplicative records and stores only a single instance of each rendition associated with the record. In the case of the 10 emails and corresponding attachments above, a single-instance solution will store a single copy of the duplicative email and direct all duplicative records to that one file.
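The mechanism is essentially content-addressed storage: the bytes for each unique rendition are kept once, and every logical record holds only a pointer (the content hash) to them. This is a minimal sketch of the idea, not any specific platform's implementation; the class and record names are hypothetical.

```python
import hashlib


class SingleInstanceStore:
    """Content-addressed store: unique bytes are kept exactly once;
    each logical record stores only the hash of its content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}    # content hash -> stored bytes
        self._records: dict[str, str] = {}    # record id -> content hash

    def add(self, record_id: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)  # bytes stored only once
        self._records[record_id] = digest

    def get(self, record_id: str) -> bytes:
        """Any duplicative record resolves to the single stored copy."""
        return self._blobs[self._records[record_id]]

    @property
    def stored_instances(self) -> int:
        return len(self._blobs)


store = SingleInstanceStore()
email = b"...raw MIME of the message and its attachment..."
for i in range(10):                # the ten recipient copies from above
    store.add(f"msg-{i}", email)

# Ten logical records, one physical instance on disk.
```

All 10 records remain individually retrievable, but the server holds one physical copy, which is the distinction drawn in the next paragraph between merely reporting duplicates and actually storing only unique files.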
Platforms that offer single-instance storage do not merely flag duplicates for reporting; they apply that analysis and maintain only unique files on the server.
Consider an End-to-End Software Solution
Finally, consolidating functionality into a single platform simplifies how you interact with your data, from management through workflow, and can eliminate or significantly reduce the duplication caused by exporting and importing between solutions.
Elie Francis, founder and CEO of ONE Discovery, Inc., is a strategic, data-driven technology professional with more than 15 years' experience in the legal technology industry. After co-founding Driven, Inc. in 2001, Francis spun off ONE Discovery to build a platform with e-discovery professionals' needs and wants at the forefront of development.