Friday, 23 August 2013

Big Data

The above symbolizes just how much data companies have to deal with today. A lot. They're drowning in it. Well, a few are; others are finding ingenious ways to spin data into revenue. The modern deluge of data is a mine of money. Understanding it is key to understanding the modern business world, so read on.

Why Is Big Data Important?

This is a blog about the future of business, so first off, show me the money! $300 billion is the amount that the magic of Big Data could provide the US health industry. €250 billion could be conjured up for the public sector of EU countries, and a chunky 60% increase in profit is in store[1] for retailers who commune with the black magic of Big Data. All these predictions come from those splendid chaps at McKinsey, and illustrate how much money organizations stand to make with Big Data techniques. Whatever Big Data is, it’s evidently worth a lot of money.

Big Data isn't just about making existing business models more profitable; it has also generated fancy new ways of making money. There are companies that aim to use social media data to predict political upheaval. Risky and complicated investments have become feasible for companies to evaluate, and thus to make. One example of a Big Data-backed investment comes from a Seattle-based health insurer, which invested in a voluntary health and wellness scheme for its employees and made money from it. The reduction in health claims alone saved the company more than it spent on the incentives to join. The company also estimates that it gained $1.5 million from increased productivity, based on its employees’ own assessment of the effect the programme had.

Far from a black future of snooping and omniscient organizations knowing all our personal information, I see a bright future with Big Data. Annoying ads will be extinct. Persistently plaguing me with images of how happy people are when they use the premium household surface cleaner is a waste of money for the advertiser and tedious for me. However, I wouldn't mind ogling the new smartphones; those I just might buy. If you have access to enough data about me, you just might be able to send ads my way that don’t make me pine for Soviet Russia. Or what about a smartphone that knows your voice so well that it can tell when you’re upset, and notify your friends?

What Exactly Is Big Data?

Big Data refers to data sets that are too large for conventional storage hardware and analysis software to handle. Today that typically means anywhere between a dozen terabytes and a handful of petabytes. So far, so unexciting: a larger spreadsheet doesn't sound revolutionary. Well, Big Data’s fundamental magic property is that it allows us to see the future. Induction is the leap of logic that leads us from past observation to future prediction. The more past observations (i.e. data) we have, the more reliable our prediction is.[2] The more varied and voluminous our observations, the more nuance and detail we discover in the properties of things and the laws of nature. The bigger the data, the more reliable our predictions of the future. But what’s so special about this amount of bigness? Why am I bombarded with the Big Data buzz now?

The answer can be found in the paper ‘The Unreasonable Effectiveness of Data’. In it, Googlers Peter Norvig, Alon Halevy and Fernando Pereira argue that Big Data has a special significance for the social sciences. Where physics and chemistry have had spectacular success with elegant mathematical formulae that precisely describe how a fundamental physical event will unfold, our attempts to ‘formalize’ the social sciences in the same way have historically failed. The heart of wave mechanics is Schrödinger's one-line equation. A rule book of the grammar of the English language runs to over 1700 pages. Attempting to ape physics is totally the wrong approach, argue Google’s pioneers of machine translation. Teaching the machine the precise rules for which expressions are grammatical, or which are semantic equivalents, isn’t going to work. Giving it vast amounts of data and simple prediction rules will make it a far more endearing conversationalist. To illustrate the scale of this shift, our Googlers tell of the excitement of dealing with the Brown Corpus in their undergraduate years. The Brown Corpus is a set of over a million words, carefully tagged and catalogued. In 2006 Google released a trillion-word corpus with frequency counts for every sequence of up to five words. This was messy data, but vastly more useful for deriving predictions.
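To make "vast amounts of data and simple prediction rules" a little more concrete, here is a toy sketch in Python. It is emphatically not how Google does it: the three-sentence corpus and the function names (count_bigrams, predict_next) are my own inventions for illustration. The rule is about as simple as rules get: count which word follows which, then guess the most frequent follower.

```python
from collections import Counter, defaultdict

def count_bigrams(sentences):
    """Count how often each word follows each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            following[first][second] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus; a real one would hold billions of words.
corpus = [
    "big data is big business",
    "big data is messy",
    "messy data is useful",
]

model = count_bigrams(corpus)
print(predict_next(model, "data"))  # "is" follows "data" in all three sentences
print(predict_next(model, "big"))   # "data" (twice) beats "business" (once)
```

No grammar rules anywhere, just counting. The only way to make this sketch smarter is to feed it more sentences, which is rather the point.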

All this is relevant for business because the domain of the social sciences is where money is made. Atoms don’t buy anything[3]. Imagine how businesses would act if they could predict, with a simple one-line equation, exactly what percentage productivity increase would result from different management styles. Such an equation has not been found, but perhaps with vast amounts of social media data we have something just as good.

How Much Data Are We Talking Here?

The amount of data businesses are hoarding is vast and it’s increasing rapidly, and the rapidity with which it is increasing is increasing, rapidly. According to McKinsey, the amount of global data produced will increase by 40% each year. John Gantz and David Reinsel predict in ‘The Digital Universe’ that the amount of data in the world is currently doubling every two years. 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress: 235 terabytes. So where’s it all coming from? Social media is a big culprit. Every tweet, every photo, every video, every comment and every ‘like’ is being logged. The increased digitization of businesses is another. With computers being used wherever possible in business, every transaction and every shipment is being logged and stored as data. Smartphones, now outselling dumb phones, are another fountain of data. Logging every user’s location ensures that there is a huge amount of data to be used. The internet of things, the increasingly large network of sensors built into the billions of devices we use, is yet another source of data sure to surge. Gartner predicts that we will see 50 billion machine voices added to the internet.
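Those two growth figures sound different, but they are nearly the same claim. A quick back-of-the-envelope check in Python, assuming plain compound growth (my assumption, and the function names are mine; neither source spells out its model this way):

```python
import math

def annual_rate_from_doubling(years_to_double):
    """Yearly growth rate implied by doubling every `years_to_double` years."""
    return 2 ** (1 / years_to_double) - 1

def doubling_time(annual_rate):
    """Years needed to double at a steady compound `annual_rate`."""
    return math.log(2) / math.log(1 + annual_rate)

# Doubling every two years, expressed as a yearly growth rate.
print(f"{annual_rate_from_doubling(2):.1%}")   # 41.4%

# 40% growth a year, expressed as a doubling time.
print(f"{doubling_time(0.40):.2f} years")      # 2.06 years
```

So "40% a year" and "doubling every two years" are two ways of describing roughly the same curve.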

Google's Software Magic

One of the big factors in today’s Big Data revolution is Google’s creation of the MapReduce data processing technique, and Yahoo’s re-creation of it in the Apache Hadoop programme. Before Hadoop, one’s go-to data analysis solution was a relational database. Relational databases employ what is called a schema-on-write model. Before analyzing any of your data you have to create a schema. You then have to go through a load operation, which takes your original data and converts it into that schema. Once your data is in this format, reading it is fast and simple. However, it can only be read in this format. You can only ask it questions that you built into the schema you loaded it with. Asking new questions requires loading the data again with a different schema, and thus costs money, quite a lot of money. Hadoop is a powerful alternative to relational databases and has complementary properties. Hadoop has a ‘schema-on-read’ model, meaning that it can load the data straight away, with no waiting for any schema to be applied. Loading the data with Hadoop is far quicker; reading the data is slower.
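The schema-on-write versus schema-on-read trade-off can be sketched in a few lines of Python. This is only an illustration under my own assumptions: SQLite stands in for the relational database, a list of JSON lines stands in for raw files sitting in Hadoop, and the sales records and the total_by helper are made up.

```python
import json
import sqlite3

# Hypothetical raw event log, one JSON record per line.
raw_lines = [
    '{"buyer": "ann", "item": "phone", "price": 300, "city": "Leeds"}',
    '{"buyer": "bob", "item": "cleaner", "price": 4, "city": "York"}',
    '{"buyer": "ann", "item": "case", "price": 15, "city": "Leeds"}',
]

# Schema-on-write: decide the columns first, then convert the data into them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (buyer TEXT, price REAL)")
for line in raw_lines:
    record = json.loads(line)
    db.execute("INSERT INTO sales VALUES (?, ?)",
               (record["buyer"], record["price"]))

# Questions the schema anticipated are quick and simple to ask...
spend_by_buyer = dict(
    db.execute("SELECT buyer, SUM(price) FROM sales GROUP BY buyer")
)
# ...but "city" was dropped at load time, so asking about cities
# would mean designing a new schema and loading everything again.

# Schema-on-read: keep the raw lines, interpret them when the question arrives.
def total_by(field, lines):
    """Sum prices grouped by any field present in the raw records."""
    totals = {}
    for line in lines:
        record = json.loads(line)  # the parsing cost is paid on every read
        totals[record[field]] = totals.get(record[field], 0) + record["price"]
    return totals

spend_by_city = total_by("city", raw_lines)

print(spend_by_buyer)
print(spend_by_city)
```

The first half pays up front and answers quickly, but only the questions it was built for. The second half pays nothing up front and re-parses the raw data for every question, whatever that question turns out to be.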

The really cool thing about Hadoop is that it functions by distributing the data and the analysis process across numerous different computers, vastly reducing the time necessary to finish the task. This is why far more data is available for cheap analysis today. The Hadoop Distributed File System works by chopping files up into blocks (64 MB by default) and then spreading those blocks across the computers in the cluster, replicating each block a number of times (the default is three). The operation is run over the sum of all these computers, carrying out the right operation on the right block at the right stage of the process. The system is built and optimized for putting files into the file system, getting them out and deleting them, but not for updating existing files. These assumptions are what allow it to scale. The more computers you add to the cluster, the shorter the analysis takes. Even cooler, the code you use for a handful of computers will work just as well on a hundred.
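Here is a rough single-machine sketch in Python of the idea described above: chop a file into blocks, place three copies of each block on different machines, run the same small operation on every block, then merge the results. Everything in it is a simplification of my own. The node names, the tiny two-line block size and the round-robin placement are invented for illustration (real HDFS placement is smarter and rack-aware), and nothing here actually runs in parallel.

```python
from collections import Counter
from functools import reduce

def split_into_blocks(lines, block_size):
    """Chop a 'file' (a list of lines) into fixed-size blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); a stand-in for real placement logic.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

def map_block(block):
    """Map step: count words within a single block."""
    return Counter(word for line in block for word in line.split())

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

# Hypothetical file and cluster.
lines = [
    "big data big money",
    "data data everywhere",
    "big plans",
    "more data",
    "more money",
]
nodes = ["node-a", "node-b", "node-c", "node-d"]

blocks = split_into_blocks(lines, block_size=2)
placement = place_replicas(len(blocks), nodes)

# On a real cluster each map would run on a node that holds the block.
partials = [map_block(block) for block in blocks]
totals = reduce(reduce_counts, partials, Counter())

print(placement)
print(totals.most_common(3))
```

Notice that map_block and reduce_counts never mention how many blocks or nodes exist. That is the property the post is getting at: the same code works whether the blocks live on a handful of computers or a hundred.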

Now that I’ve explained the basics, next Friday I will regale you with a list of the most fantastic uses of Big Data to be found in the corporate world.

If you've found anything I've not got quite right, or something I've missed that really ought to have been included, please leave a comment! Be warned: I will take your silence as your testimony that I am perfect!

[1] forgive me
[2] Quite why this is so remains a mystery.
[3] Well not on their own they don’t.