The amount of data we are creating as a society has exploded over the last decade. Consider this fact: Each day, we create more than 70 times the amount of information in the Library of Congress. Or this one: Approximately 2.5 billion Internet users generate 2.5 quintillion (2,500,000,000,000,000,000) bytes of data every day. Why are we producing so much data? Because we can.
Bandwidth, computer memory, and computer-processing capabilities have improved exponentially over the last decade. By 2016, it is estimated that the gigabyte equivalent of all movies ever made will cross global IP networks every 3 minutes. The average smartphone now has more computing power than NASA did when Neil Armstrong landed on the moon. At the same time, each one of us is a walking content generator. Our use of the Internet, social media, and mobile devices is creating a tsunami of electronic data. And, as mobile devices get smaller, faster, and more powerful, they will enable us to generate even more bytes of “likes” every year.
While the mere existence of so much data is interesting on a phenomenological level, it is not, in economic terms, worth much. The key development of big data analytics is our growing ability to turn this data into valuable information. In order to understand these analytics, it is helpful to have a little background in the history of data management and analysis.
When companies moved from paper records to electronic records, they needed a system to store, manage, and analyze data — and thus, structured data was born. Structured data is digital information that has been organized into a common and intentionally designed structure or schema. Examples of this include stock-trading data, customer relationship management (CRM) systems, customer sales orders, supply-chain documentation, and inventory data.
All digital data that could not be put in a form that was easily manipulated or analyzed became known as unstructured data. Both humans and machines generate unstructured data. Many of today’s software applications and electronic devices create machine data that users do not even know about. This machine-generated data typically contains information regarding an application’s or device’s status and activity. For example, smart meters automatically send data regarding a household’s electricity usage to a server located at the electric company. Other examples of machine-generated data include search data, network data, and health monitor or medical device data. This machine-to-machine communication is becoming known as “the Internet of Things.”
Compared to machine-generated data, human-generated data is far more difficult to manage and organize. It varies widely in its structure, format, nomenclature, and style. It is also more context dependent than any other data source. Often, it is necessary to understand something about the data’s context in order to understand the data itself. Examples of unstructured human-generated data include emails, text messages, video files, and social media feeds.
As long as there has been digital data, businesses have been trying to analyze it. The relational model was developed at IBM in 1970, and companies such as IBM and Oracle commercialized relational database software over the following decade. A relational database organizes data into tables of rows and columns, allowing the user to categorize, compare, and analyze data. Tables can also be linked together to form several layers of related data. These databases are still the primary means for businesses to analyze data. An Excel spreadsheet, for example, resembles a single table in a relational database.
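As a minimal sketch of these ideas, here is a tiny relational database built with Python's standard sqlite3 module. The table names, customers, and amounts are invented for illustration: two tables hold rows and columns, a foreign key links them, and a join query compares data across the layers.

```python
import sqlite3

# In-memory database with two linked tables: customers and orders.
# All names and figures are illustrative, not from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "total REAL, FOREIGN KEY (customer_id) REFERENCES customers (id))"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 250.0), (2, 1, 75.5), (3, 2, 120.0)])

# Joining the linked tables lets us categorize and compare:
# total order value per customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)  # [('Acme Corp', 325.5), ('Globex', 120.0)]
```

Note that the `CREATE TABLE` statements come first: the structure must exist before any data can be stored, which is exactly the property the next paragraphs discuss.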
One of the key characteristics of a relational database is that it does not analyze data in its native or original form. In other words, relational databases require that data be entered into the database after it has already been analyzed and processed by the user.
The problem is that processing data in order to log it in a database can take a long time and be very expensive. If a company wants to analyze its customer relations data using a relational database, it has to make an investment in IT infrastructure and staffing to build the database, link it to other data systems within the company, and build the interface. Because this process is so resource intensive, companies got in the habit of only saving data that was immediately valuable, cost effective to analyze, or had to be kept for risk-management purposes. It simply wasn’t worth saving data where the cost of preserving, collecting, and structuring the data outweighed its immediate analytical value.
Another limitation of having to process data before it goes into a relational database is that decisions about the design of the database have to be made before any data has been analyzed. In essence, you have to ask the questions first and review the data later. Thus, the value of a relational database depends on whether you’ve asked or anticipated the right questions from the beginning.
What has changed over the last several years is our ability to analyze unstructured data. Thanks to advances in computer technology and computer science, we can now analyze massive amounts of data in real time using analytics that are not limited to relational interfaces. In other words, big data analytics lets us analyze data in its native state without having to force it into the columns and rows of a relational database. In a macro sense, this greatly enhances our ability to analyze more kinds of data much faster. But even more important, because we can review the data in its native state, we can see patterns and relationships that are not limited by prior suppositions, biases, or assumptions. Instead of defining what data is relevant before seeing the data itself, as occurs with a relational database, we can now let the data speak for itself.
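As a minimal sketch of this "schema-on-read" approach, consider a hypothetical log of semi-structured JSON events (the field names and records here are invented). No table is designed in advance; each record is parsed in its native form at analysis time, and the fields vary from record to record without breaking anything.

```python
import json
from collections import Counter

# Hypothetical raw event log: semi-structured JSON lines that were
# never loaded into a predefined table. Fields vary per record.
raw_lines = [
    '{"user": "ana", "action": "like", "device": "phone"}',
    '{"user": "ben", "action": "share"}',
    '{"user": "ana", "action": "like", "device": "tablet"}',
]

# Schema-on-read: parse each record as-is and let patterns emerge,
# instead of deciding up front which columns a database will hold.
events = [json.loads(line) for line in raw_lines]
actions = Counter(e["action"] for e in events)
print(actions.most_common(1))  # [('like', 2)]
```

The second record has no "device" field, and nothing fails; the structure is interpreted when the question is asked, not when the data is stored.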