Big Data
Companies today have to deal with a staggering amount of data. A few are drowning in it; others are finding ingenious ways to spin data into revenue. The modern deluge of data is a mine of money - understanding it is key to understanding the modern business world, so read on.
Why Is Big Data Important?
This is a blog about the future of business, so first off, show me the money! $300 billion is the amount that the magic of Big Data could provide to the US health industry. €250 billion could be conjured up for the public sector of EU countries, and a chunky 60% increase in profit is in store[1] for retailers who convene with the black magic of Big Data. All these predictions are produced by those splendid chaps at McKinsey, to illustrate how much money organizations stand to make with Big Data techniques. Whatever Big Data is, it's evidently worth a lot of money.
Big Data isn't just about making existing business models more profitable; it has generated fancy new ways of making money. There are companies that aim to use social media data to predict political upheaval. Risky and complicated investments are now feasible for companies to evaluate, and thus to make. One example of a Big Data-backed investment is that of a Seattle-based health insurer, which invested in a voluntary wellness and health scheme for its employees and made money from it. The reduction in health claims alone saved the company more than it spent on the incentives to join, and the company estimates a further $1.5 million gain from increased productivity, based on its employees' own estimation of the effect the programme had.
Far from a black future of snooping and omniscient organizations knowing all our personal information, I see a bright future with Big Data. Annoying ads will be extinct. Persistently plaguing me with images of how happy the consumers of the premium household surface cleaner are is a waste of money for the advertiser and tedious for me. However, I wouldn't mind ogling the new smartphones: those I just might buy. If you have access to enough data about me, you just might be able to send ads my way that don't make me pine for Soviet Russia.
Or, what about a smartphone that knows your voice so well that it can tell when
you’re upset, and notify your friends?
What Exactly Is Big Data?
Big Data means data sets that are too large for conventional storage hardware and analysis software to handle. Today that typically means between a dozen terabytes and a handful of petabytes. So far, so unexciting: a larger spreadsheet doesn't sound revolutionary. Well, Big Data's fundamental magic property is to
allow us to see the future. Induction is a leap of logic that leads us from
past observation to future prediction. The more past observations (i.e. data)
we have, the more reliable our prediction is.[2]
The more varied and voluminous our observations, the more nuance and detail we
discover in the properties of things and the laws of nature. The bigger the
data, the more reliable our predictions of the future. But what’s so special
about this amount of bigness? Why am I bombarded with the Big Data buzz now?
The answer can be found in the paper 'The Unreasonable Effectiveness of Data'. In it, Googlers Peter Norvig, Alon Halevy and Fernando Pereira argue that Big Data has a special significance for the social sciences. Where physics and chemistry have had spectacular success with elegant mathematical formulae able to precisely describe how a kind of fundamental physical event will happen, historical attempts to 'formalize' the social sciences have all failed. The heart of wave mechanics is Schrödinger's single one-line equation; a rule book of the grammar of the English language runs to over 1,700 pages. Attempting to ape physics is totally the wrong approach, argue Google's pioneers of machine translation. Attempting to teach the machine the precise rules for which expressions are grammatical or semantic equivalents isn't going to work. Giving it vast amounts of data and simple prediction rules will make it a far more endearing conversationalist. To illustrate the scale of this shift, our Googlers tell of the excitement of dealing with the Brown Corpus in their undergrad years. The Brown Corpus is a set of over a million words, carefully tagged and catalogued. In 2006 Google released a trillion-word corpus with frequency counts for sequences of up to five words. This was messy data, but vastly more useful for deriving predictions.
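The 'simple prediction rules over vast data' idea can be seen in miniature. The sketch below (a toy illustration, not Google's actual system) builds bigram frequency counts from a tiny stand-in corpus and predicts the next word purely by which follower was seen most often:

```python
from collections import Counter, defaultdict

# A toy stand-in for the trillion-word corpus.
corpus = (
    "the cat sat on the mat the cat ate the fish the cat ran"
).split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Predict the word most frequently seen after `word` in the corpus."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" - its most frequent follower
```

No grammar rules anywhere: the prediction quality comes entirely from the counts, and it improves as the corpus grows.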
All this is relevant for business because the domain of the social sciences is where money is made. Atoms don't buy anything[3]. Imagine how businesses would act if they could predict with a simple one-line equation exactly what percentage productivity increase would result from different management styles. Such an equation has not been found, but perhaps with vast amounts of social media data we have something just as good.
How Much Data Are We Talking Here?
The amount of data businesses are hoarding is vast and it's increasing rapidly, and the rapidity with which it is increasing is increasing, rapidly. According to McKinsey, the amount of global data produced will increase by 40% each year, while John Gantz and David Reinsel predict in 'The Digital Universe' that the amount of data in the world is currently doubling every two years. In 15 out of 17 sectors in the US, companies have more data stored per company than the US Library of Congress (235 terabytes). So where's it all coming from? Social media is a big culprit. Every tweet, every photo, every video, every comment and every 'like' is being logged. The increased digitization of businesses is another. With computers being used whenever possible in business, every transaction and transportation is being logged and stored as data. Smartphones, now outselling dumb phones, are another fountain of data, logging every user's location. The internet of things, the increasingly large network of sensors built into the billions of devices we use, is yet another source of data sure to surge: Gartner predicts that we will see 50 billion machine voices added to the internet.
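Those two growth figures are, incidentally, roughly the same claim. A couple of lines of arithmetic show that 40% annual growth compounds to about a doubling every two years:

```python
# McKinsey: global data grows ~40% per year.
# Gantz & Reinsel: the world's data doubles every two years.
annual_growth = 1.40

# Two years of 40% growth is almost exactly a doubling.
print(annual_growth ** 2)   # 1.96

# And a decade of it multiplies your data volume nearly thirty-fold.
print(annual_growth ** 10)  # ~28.9
```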
Google's Software Magic
One of the big factors in today's Big Data revolution is Google's creation of the MapReduce data processing technique, and Yahoo's re-creation of this technique in the Apache Hadoop programme. Before Hadoop, one's go-to data analysis solution was a relational database. Relational databases employ what is called a schema-on-write model. Before analyzing any of your data you have to create a schema, then go through a load operation which takes your original data and converts it into that schema. Once your data is in this format, reading it is fast and simple. However, it can only be read in this format: you can only ask it questions that you built into the schema you loaded it with. Asking new questions requires reloading the data with a different schema, and thus costs money, quite a lot of money. Hadoop is a powerful alternative to relational databases with complementary properties. Hadoop has a 'schema-on-read' model, meaning that it can load the data straight away with no waiting for any schema to be applied. Loading the data with Hadoop is far quicker; reading it is slower.
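The contrast between the two models can be sketched in a few lines. Below, the same invented sales records are handled both ways: the schema-on-write path declares a table and pays a load step before querying, while the schema-on-read path keeps the raw lines and applies the interpretation only at query time (the file format and column names here are made up for illustration):

```python
import sqlite3

# Raw records: date, city, amount - a hypothetical log format.
raw_lines = ["2013-01-05,London,42", "2013-01-06,Paris,17"]

# Schema-on-write (relational database): define the schema up front,
# pay for a load step, then enjoy fast structured queries.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, city TEXT, amount INTEGER)")
for line in raw_lines:
    day, city, amount = line.split(",")
    db.execute("INSERT INTO sales VALUES (?, ?, ?)", (day, city, int(amount)))
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 59

# Schema-on-read (the Hadoop approach): store the raw lines as-is and
# apply whatever parsing the question of the day requires at read time.
total = sum(int(line.split(",")[2]) for line in raw_lines)
print(total)  # 59
```

Asking a new question of the database means altering the schema and reloading; asking a new question of the raw lines just means writing a different read-time expression.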
The really cool thing about Hadoop is that it functions by distributing the data and the analysis process across numerous different computers, vastly reducing the time necessary to finish the task. This is why far more data is available for cheap analysis today. The Hadoop Distributed File System works by chopping up files into blocks (64 MB in size by default) and spreading those blocks throughout the cluster, replicating each block a number of times (the default is three). The operation is run across all of these computers, carrying out the right operation on the right block at the right stage of the process. The system is built and optimized for putting files into the file system, getting them out and deleting them, but not for updating existing files. These assumptions are what make the scalability achievable: the more computers you add to the cluster, the briefer the analysis process will be. Even cooler, the code you use for a handful of computers will work just as well on a hundred.
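The MapReduce pattern itself - a map step applied to each block independently, a shuffle grouping the intermediate results by key, and a reduce step per key - can be sketched in plain Python. This is a single-machine toy, not Hadoop itself, but the structure is why the same code scales: each map call touches only its own block, so blocks can live on different machines.

```python
from collections import defaultdict
from itertools import chain

# Each "block" is a chunk of the input file, processed independently -
# on a real cluster each block would sit on a different machine.
blocks = ["big data is big", "data about data"]

# Map: emit a (word, 1) pair for every word in a block.
def map_block(block):
    return [(word, 1) for word in block.split()]

# Shuffle: group the intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in chain.from_iterable(map_block(b) for b in blocks):
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts["data"])  # 3
```

Hadoop's contribution is doing the shuffle and the scheduling across the cluster for you, so the map and reduce functions you write for two blocks work unchanged for two million.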
Now that I’ve explained the basics, next Friday I
will regale you with a list of the most fantastic uses of Big Data to be found
in the corporate world.
If you've found anything I've not got quite right, or something I've missed that really ought to have been included, please leave a comment! Be warned I will take your silence as your testimony that I am perfect!