Big Data
The above symbolizes just how much data companies have to deal with today: a lot. They're drowning in it. Well, a few are; others are finding ingenious ways to spin data into revenue. The modern deluge of data is a mine of money, and understanding it is key to understanding the modern business world, so read on.
Why Is Big Data Important?
This is a blog about the future of business, so first off, show me the money! $300 billion is the amount that the magic of Big Data could provide the US healthcare sector, according to McKinsey's estimate.
Big Data isn't just about making existing business models more profitable; it has generated fancy new ways of making money. There are companies that aim to use social media data to predict political upheaval. Investments that were once too risky and complicated for companies to evaluate are now feasible to assess, and thus to make. One example of a big-data-backed investment is one made by a Seattle-based company.
Far from a black future of snooping and omniscient organizations knowing all our personal information, I see a bright future with big data. Annoying ads will be extinct. Persistently plaguing me with images of how happy the consumers of the premium household surface cleaner are is a waste of money for the advertiser and tedious for me. However, I wouldn't mind ogling the new smartphones; those I just might buy. If you have access to enough data about me, you just might be able to send ads my way that don't make me pine for Soviet Russia.
What Exactly Is Big Data?
Big Data means data sets that are too large for conventional storage hardware and analysis software to handle. Today that typically means between a dozen terabytes and a handful of petabytes. So far, so unexciting; a larger spreadsheet doesn't sound revolutionary. Well, Big Data's fundamental magic property is that it allows us to see the future. Induction is a leap of logic that leads us from past observation to future prediction. The more past observations (i.e. data) we have, the more reliable our prediction is.[2] The more varied and voluminous our observations, the more nuance and detail we discover in the properties of things and the laws of nature. The bigger the data, the more reliable our predictions of the future. But what's so special about this amount of bigness? Why am I being bombarded with the Big Data buzz now?
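Before we get to the answer, here's a quick sketch of my own (not from any of the sources I cite) of the induction point: feed a dead-simple estimator more observations and it creeps closer to the truth. The "true" click-through rate below is invented purely for illustration.

```python
import random

random.seed(42)
TRUE_RATE = 0.03  # a made-up "true" click-through rate, purely illustrative

for n in (100, 10_000, 1_000_000):
    # Simulate n ad impressions and count the clicks we happen to observe.
    clicks = sum(random.random() < TRUE_RATE for _ in range(n))
    estimate = clicks / n
    print(f"{n:>9} observations -> estimated rate {estimate:.4f} (true rate {TRUE_RATE})")
```

With a hundred observations the estimate bounces around wildly; with a million it is hard to get wrong. That, in miniature, is why bigness matters.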
The answer can be found in the paper 'The Unreasonable Effectiveness of Data'. In it, Googlers Peter Norvig, Alon Halevy and Fernando Pereira argue that Big Data has a special significance for the social sciences. Where physics and chemistry have had spectacular success with elegant mathematical formulae that precisely describe how a given kind of fundamental physical event will happen, the social sciences have not; historically, our attempts to 'formalize' them have all failed. The heart of wave mechanics is Schrödinger's single one-line equation, while a rule book of the grammar of the English language runs to over 1,700 pages. Attempting to ape physics is totally the wrong approach, argue Google's pioneers of machine translation. Attempting to teach the machine the precise rules for which expressions are grammatical or semantically equivalent isn't going to work; giving it vast amounts of data and simple prediction rules will make it a far more endearing conversationalist. To illustrate the scale of this shift, our Googlers tell of the excitement of dealing with the Brown Corpus in their undergrad years. The Brown Corpus is a set of over a million words, carefully tagged and catalogued. In 2006 Google released a trillion-word corpus with frequency counts for all word sequences up to five words long. This was messy data, but vastly more useful for deriving predictions.
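To make the 'simple prediction rules plus mountains of data' idea concrete, here is a toy sketch of my own: count how often each word follows each other word in a corpus, then predict the next word by raw frequency. The ten-word corpus is a stand-in; Google's 2006 release did this same counting over a trillion words.

```python
from collections import Counter, defaultdict

# A stand-in corpus; the real trick is doing this at the scale of the web.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams: how often does each word follow each other word?
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequently observed continuation, if we've seen any."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # -> 'cat' (seen twice, versus 'mat' and 'fish' once each)
```

No grammar rules, no semantics, just counting; the quality of the predictions comes almost entirely from how much text you count over.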
All this is relevant for business because the domain of the social sciences is where money is made. Atoms don't buy anything[3]. Imagine how businesses would act if they could predict, with a simple one-line equation, exactly what percentage productivity increase would result from different management styles. Such an equation has not been found, but perhaps with vast amounts of social media data we have something just as good.
How Much Data Are We Talking Here?
The amount of data businesses are hoarding is vast, it's increasing rapidly, and the rapidity with which it is increasing is itself increasing, rapidly. According to McKinsey, the amount of global data produced will increase by 40% each year. John Gantz and David Reinsel predict in 'The Digital Universe' that the amount of data in the world is currently doubling every two years. In 15 out of 17 sectors in the US, companies have more data stored per company than the US Library of Congress.
Google's Software Magic
One of the big factors in today's Big Data revolution is Google's creation of the MapReduce data-processing technique, and Yahoo's re-creation of it in the Apache Hadoop project. Before Hadoop, one's go-to data analysis solution was a relational database. Relational databases employ what is called a schema-on-write model: before analyzing any of your data you have to create a schema, then go through a load operation which takes your original data and converts it into that schema. Once your data is in this format, reading it is fast and simple. However, it can only be read in this format; you can only ask it the questions that you built into the schema you loaded it with. Asking new questions requires reloading the data with a different schema, and that costs money, quite a lot of money. Hadoop is a powerful alternative to relational databases, with complementary properties. Hadoop has a 'schema-on-read' model, meaning that it can load the data straight away with no waiting for any schema to be applied. Loading the data with Hadoop is far quicker; reading the data is slower.
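To see the contrast in miniature, here is a small sketch of my own, using SQLite as a stand-in for a relational database; the table and fields are invented for illustration. The first half pays the schema-and-load cost up front, the second half keeps the raw text and imposes structure only at the moment of reading, which is the schema-on-read idea.

```python
import sqlite3

raw_log = "2013-01-01,alice,purchase\n2013-01-02,bob,refund"

# Schema-on-write: design the schema first and pay the loading cost up front;
# afterwards you can only ask the questions that schema anticipated.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (day TEXT, user TEXT, action TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)",
               [line.split(",") for line in raw_log.splitlines()])
print(db.execute("SELECT COUNT(*) FROM events WHERE action = 'purchase'").fetchone()[0])

# Schema-on-read: keep the raw text as-is and impose a structure only while
# reading it, so a brand-new question needs no reload.
purchases = sum(1 for line in raw_log.splitlines() if line.split(",")[2] == "purchase")
print(purchases)
```

The point is not speed (SQLite on two rows proves nothing) but flexibility: asking a new question of the raw text never requires reloading everything under a new schema.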
The really cool thing about Hadoop is that it functions by distributing the data and the analysis process across numerous different computers, vastly reducing the time needed to finish the task. This is why far more data is available for cheap analysis today. The Hadoop Distributed File System works by chopping files up into blocks (64 MB by default) and then spreading those blocks across the machines in the cluster, replicating each block a number of times (the default is three). The job is run over the sum of all these computers, carrying out the right operation on the right block at the right stage of the process. The system is built and optimized for putting files in, getting them out and deleting them, but not for updating existing files. These assumptions are what make the scalability achievable: the more computers you add to the cluster, the shorter the analysis becomes. Even cooler, the code you use for a handful of computers will work just as well on a hundred.
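For the curious, here is roughly what a MapReduce-style job looks like in miniature: the classic word count, written as my own Python sketch rather than anything taken from Hadoop itself. Hadoop would run the map step on many machines, each holding a few of those replicated blocks, and then gather the pairs for the reduce step; here both phases are simulated on one machine.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    for word in block.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the counts for each word gathered from all the mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

blocks = ["Big Data is big", "data about data"]  # stand-ins for 64 MB HDFS blocks
all_pairs = (pair for block in blocks for pair in map_phase(block))
print(reduce_phase(all_pairs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The same two functions would describe the job whether there were two blocks or two million, which is exactly the 'handful of computers or a hundred' point above.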
Now that I’ve explained the basics, next Friday I
will regale you with a list of the most fantastic uses of Big Data to be found
in the corporate world.
If you've found anything I've not got quite right, or something I've missed that really ought to have been included, please leave a comment! Be warned I will take your silence as your testimony that I am perfect!