Big data analytics
People create 2.5 quintillion bytes (2.5 × 10^18) of data, or nearly 2.3 million terabytes of data, every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of data-sets themselves. Both factors create challenges and opportunities for
This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals, to name a few. This data is not only big but is also growing at an increasing rate. The data used in this book, namely Twitter data, is no exception. Twitter was launched on March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days.
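The figures quoted above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original analysis. It assumes that the "2.3 million terabytes" figure uses binary terabytes (2^40 bytes), and that the first billion tweets arrived on May 22, 2009, a date derived here by adding 3 years, 2 months, and 1 day to the launch date rather than one stated in the text. All variable names are illustrative.

```python
from datetime import date

# 2.5 quintillion bytes per day, the figure quoted in the text.
BYTES_PER_DAY = 2.5e18

# A decimal terabyte is 10**12 bytes; a binary terabyte (tebibyte) is 2**40 bytes.
tb_decimal = BYTES_PER_DAY / 10**12   # 2.5 million
tb_binary = BYTES_PER_DAY / 2**40     # roughly 2.27 million, i.e. "nearly 2.3 million"

# Twitter launch date from the text; the end date is an assumption derived
# by adding 3 years, 2 months, and 1 day to the launch date.
launch = date(2006, 3, 21)
first_billion = date(2009, 5, 22)
days_to_first_billion = (first_billion - launch).days

# Average tweets per day in each period.
early_rate = 1e9 / days_to_first_billion   # first billion tweets
current_rate = 1e9 / 2.5                   # "1 billion tweets every 2.5 days"
speedup = current_rate / early_rate

print(f"Decimal terabytes per day: {tb_decimal:,.0f}")
print(f"Binary terabytes per day:  {tb_binary:,.0f}")
print(f"Days to first billion tweets: {days_to_first_billion}")
print(f"Growth in average daily tweet rate: about {speedup:.0f}x")
```

Under these assumptions, the average daily tweet rate implied by the "1 billion tweets every 2.5 days" figure is several hundred times the average rate over Twitter's first three years.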
This may come as a surprise in light of the contemporary excitement
surrounding Big Data. The reason for the large number of small data-sets is that data that is not socially generated and publicly displayed is time-consuming and expensive to collect. As such, academics, businesses, and other organizations with data needs tend to collect only the minimum amount of information necessary to gain purchase on their questions. These data-sets are usually small and focused and are curated by the organizations that use them; they usually do not plan on updating or adding fresh
data to them. The poor management of these data often leads to their misplacement, thereby generating dark data: data that is suspected to exist or ought to exist but is difficult or impossible to find. The problem of dark data is real and prevalent in the myriad of small, locally collected data-sets. The utter lack of central management of data in the tail of the data size distribution invariably causes these sets of data to be forgotten.
Big Data differs substantially from other data not only in its size and velocity, but also in its scope and density. Big Data is large in scope, that is, it is created by everyone and by itself and thus is informative about a wide audience. This characteristic makes it very useful for studying populations, as the inferences we can make generalize to
large groups of people. Compare that with, say, opinions gleaned from a focus group or small survey. These opinions, while highly accurate and easy to obtain, may or may not be reflective of the views of the wider public