“Big Data,” a name for new data-analysis technologies as well as a movement to develop real-world uses for these capabilities, holds big promise. With regard to the practice of law, the impact of these technologies on electronic discovery is likely the first practical application that comes to mind. Managing the burdens of the information explosion, including volumes of data that made manual review impractical, expensive and less effective than necessary, was the last paradigm shift in the practice. With Big Data tools, the focus turns from managing the burden of large amounts of information to leveraging its value. See, e.g., John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” N.Y. Times, March 4, 2011.
Another practical application of Big Data will be to predict the outcome of disputes with a greater level of accuracy and granularity than now possible. In one interesting study, new insights into the U.S. Supreme Court’s jurisprudence were revealed through modeling and animating the cases the Court relied upon in its opinions over time. See Computational Legal Studies, “The Development of Structure in the Citation Network of the United States Supreme Court — Now in HD!.”
The analytical power of Big Data, however, also raises big concerns. For example, outside the practice of law, Big Data techniques have proven effective at suggesting new courses of action to battle illness. Yet there is at least a chance that the results of a study using these techniques could backfire against the study participants by enabling, for example, discrimination against those who are most likely to get sick. In one possible scenario, these results could provide a prospective employer with the information needed to identify potential hires who are most likely to get sick and miss work. See, e.g., Nicholas Bakalar, “What’s a Little Swine Flu Outbreak Among Friends?,” N.Y. Times, Feb. 3, 2011. Accordingly, it is important to question the impact of Big Data. As a start to this conversation, this article addresses several of the resulting privacy concerns.
First, one of the characteristics of Big Data is that the analysis of large data sets sometimes reveals new information that is not just a summation of the individual underlying information. Assuming that the data underlying a particular Big Data project are collected from publicly available sources, do the individuals who provided the underlying data have privacy rights in the new information obtained by analysis? A recent case, U.S. v. Maynard, 615 F.3d 544, 555 (D.C. Cir. 2010), suggests that they may.
In Maynard, the U.S. Court of Appeals for the D.C. Circuit considered whether evidence obtained by police through the warrantless use of a GPS device to track appellant Antoine Jones’ movements for a month was properly admitted. Jones argued that his conviction should be overturned because the use of the GPS device violated the Fourth Amendment prohibition of unreasonable searches. The prosecution countered by arguing that there was no Fourth Amendment violation, as Jones had no reasonable expectation of privacy when traveling on public thoroughfares. Id. at 556 (citing U.S. v. Knotts, 460 U.S. 276, 281 (1983)).
Distinguishing the totality of Jones’ movements from the individual journeys considered in Knotts, the D.C. Circuit said there had been no actual exposure to the public because of the “low likelihood” that “anyone will observe all those movements.” Jones’ movements had not been constructively exposed either, as the “whole reveals more — sometimes a great deal more — than does the sum of its parts.” Id. at 558. Accordingly, Jones had a reasonable expectation of privacy in the sum of his movements even though he had no expectation of privacy in his individual movements. While Jones’ individual journeys on public roads are exposed to the public, the patterns of behavior revealed through analysis of all of these trips together remain private.
Maynard suggests that a person’s reasonable expectation of privacy in a data set arises, at a minimum, when the collection or compilation of the data set would not have been reasonably expected and the results of that collection reveal information specific to those individuals that cannot be discerned from the individual pieces of data constituting the set. Maynard further illustrates that the entity performing the analysis may be of critical importance, as Big Data collection and analysis by a governmental entity could raise constitutional search issues that would likely not be present if the study were conducted by a private actor.
Second, when publicly available information becomes more easily accessible as a result of Big Data, do the data providers retain privacy rights allowing them control over the continued use of the data? If there is a risk of continued harm by the republication, the answer is likely yes.
In Ostergren v. Cuccinelli, 615 F.3d 263 (4th Cir. 2010), the 4th Circuit considered whether a law that was applied to prevent the republication of Social Security numbers violated the First Amendment. The Social Security numbers were legally obtained by Betty Ostergren from publicly available online land records in the hopes of drawing attention to the government’s failure to redact the numbers before it posted the records.
The 4th Circuit, quoting the Supreme Court, noted that “there is a vast difference between the public records that might be found after a diligent search of courthouse files, county archives, and local police stations throughout the country and a computerized summary located in a single clearinghouse of information.” Id. at 284 (quoting U.S. Dep’t of Justice v. Reporters Committee for Freedom of the Press, 489 U.S. 749, 763 (1989)). Although these Social Security numbers were publicly available prior to the republication, “[a]n individual’s interest in controlling the dissemination of information regarding personal matters does not dissolve simply because that information may be available to the public in some form.” Id. at 284 (quoting U.S. Dep’t of Def. v. Fed. Labor Relations Auth., 510 U.S. 487, 500 (1994)).
Consistent with Ostergren, even if a Big Data set is composed of publicly available information, one must consider whether the repeated disclosure of the data creates new opportunity for harm. If so, individuals may have lasting privacy interests in the data.
Finally, could Big Data’s disclosure of once-concealed identities fundamentally affect compliance with privacy regulations? In fact, it already has.
As background, a number of Big Data experiments have shown that anonymized data can sometimes be re-identified. See, e.g., Ethan Zuckerman, “Cynthia Dwork defines Differential Privacy,” Ethan Zuckerman’s blog, Sept. 29, 2010. For example, in 2006, Netflix published 10 million anonymized movie rankings as part of a contest to improve its recommendation system. Surprisingly, two researchers were able to identify the names of some of those whose data were contained in the supposedly anonymized data by comparing this data set with another publicly available data set. Bruce Schneier, “Why ‘Anonymous’ Data Sometimes Isn’t,” Wired, Dec. 13, 2007.
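To make the mechanics of re-identification concrete, the following is a minimal sketch of a linkage comparison between two data sets. It is not the researchers' actual method. All names, titles, ratings and dates are invented, and the helper functions (profile, match_score, link) and thresholds (max_day_gap, min_score) are hypothetical choices made only for illustration.

```python
from collections import defaultdict

# Invented "anonymized" release: (pseudonym, title, rating, day of rating).
anonymized = [
    ("user_17", "Film A", 5, 100),
    ("user_17", "Film B", 2, 130),
    ("user_17", "Film C", 4, 200),
    ("user_42", "Film A", 5, 101),
    ("user_42", "Film D", 3, 50),
    ("user_42", "Film E", 1, 60),
    ("user_99", "Film B", 4, 10),
    ("user_99", "Film C", 4, 12),
]

# Invented public reviews posted under real names elsewhere.
public = [
    ("Alice Example", "Film A", 5, 101),
    ("Alice Example", "Film B", 2, 131),
    ("Bob Example", "Film D", 3, 51),
]


def profile(rows):
    """Group rows into {person: {title: (rating, day)}}."""
    grouped = defaultdict(dict)
    for who, title, rating, day in rows:
        grouped[who][title] = (rating, day)
    return dict(grouped)


def match_score(anon_profile, public_profile, max_day_gap=3):
    """Count titles given the same rating within a few days in both profiles."""
    score = 0
    for title, (rating, day) in public_profile.items():
        if title in anon_profile:
            anon_rating, anon_day = anon_profile[title]
            if anon_rating == rating and abs(anon_day - day) <= max_day_gap:
                score += 1
    return score


def link(anonymized_rows, public_rows, min_score=2):
    """Link a name to a pseudonym only when one candidate clearly stands out."""
    anon = profile(anonymized_rows)
    links = {}
    for name, pub_profile in profile(public_rows).items():
        scores = sorted(
            ((match_score(a, pub_profile), pseudo) for pseudo, a in anon.items()),
            reverse=True,
        )
        if not scores:
            continue
        best_score, best_pseudonym = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        if best_score >= min_score and best_score > runner_up:
            links[name] = best_pseudonym
    return links


links = link(anonymized, public)
print(links)  # {'Alice Example': 'user_17'}
```

In this toy example, two overlapping ratings are enough to tie one real name to a pseudonym, which in turn exposes that person's remaining "anonymous" rating. The second name is not linked because a single overlapping rating falls below the illustrative threshold.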
From a general point of view, these developments could affect the types of data that are considered personally identifiable, forcing a re-evaluation of the scope and application of many privacy rules. For example, the E.U. Data Protection Directive restricts the processing and distribution of “personal data,” which are data that identify a natural person “directly or indirectly.” Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 at ch. I, art. 2(a) and ch. II. With the possibility of de-anonymization, data that would not traditionally be thought of as personally identifiable may indirectly enable identification, and thus could be considered personally identifiable. See Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization,” 57 UCLA L. Rev. 1701 (2010), at 1741.
As another example, the Health Insurance Portability and Accountability Act (HIPAA) provides for the use of de-identified health information, which is exempt from regulation. 45 C.F.R. 164.514. A safe-harbor provision provides that data can be considered de-identified health information if 18 specific identifiers are suppressed or generalized. 45 C.F.R. 164.514(b)(2)(i)(A) – (R). However, de-identified health information could still potentially be used as part of a re-identification effort. See Ohm at 1740-41. Accordingly, HIPAA may actually provide fewer privacy protections than intended.
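The residual risk can be illustrated with a short sketch. It assumes a simplified treatment of only three kinds of fields (a name is suppressed, a zip code is cut to three digits and a birth date is cut to the year) and is not a compliance tool or a full rendering of the 18 identifiers. The records, field names and helper functions (generalize, singled_out) are all invented for illustration.

```python
from collections import Counter

# Invented patient records; no real person or data source is represented.
records = [
    {"name": "Pat Doe", "zip": "02139", "birth_date": "1961-04-12",
     "sex": "F", "diagnosis": "asthma"},
    {"name": "Sam Roe", "zip": "02139", "birth_date": "1961-09-30",
     "sex": "F", "diagnosis": "diabetes"},
    {"name": "Lee Poe", "zip": "02142", "birth_date": "1975-01-05",
     "sex": "M", "diagnosis": "flu"},
]

# Fields that remain after generalization and could be matched to other data.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def generalize(record):
    """Suppress the name and coarsen the zip code and birth date."""
    return {
        "zip3": record["zip"][:3],
        "birth_year": record["birth_date"][:4],
        "sex": record["sex"],
        "diagnosis": record["diagnosis"],
    }


def singled_out(rows, keys=QUASI_IDENTIFIERS):
    """Return rows whose combination of remaining fields is unique in the set."""
    sizes = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


deidentified = [generalize(r) for r in records]
unique_rows = singled_out(deidentified)
print(len(unique_rows))  # 1
```

Even with the name removed, one of the three toy records is the only one with its combination of region, birth year and sex. Anyone holding an outside list with those same attributes could attach a name, and therefore a diagnosis, to that row.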
Unfortunately, this potential for harm already exists. It is perhaps this concern about re-identification that caused a court to recently declare, for the first time, that a zip code is personally identifiable information — an extreme result, leading to a flurry of new litigation. Pineda v. Williams-Sonoma Stores Inc., No. S178241 slip op. (Calif. Feb. 10, 2011).
Changes in technology frequently result in shifts in reasonable expectations of privacy. Just as airplane and satellite photography eroded the reasonable expectation of privacy a person might have had in their backyard, Big Data calls for a re-evaluation of privacy. We must make sure that the cost of these new capabilities is something with which we can all live.
A founding member of the Proskauer Rose electronic discovery task force, Nolan M. Goldberg is senior counsel, intellectual property and technology, in the litigation and dispute-resolution department and a member of the patent law group in New York. Micah W. Miller is an associate in the litigation and dispute-resolution department and intellectual property group in the firm’s Boston office.