For many database practitioners, Hadoop is turning the tables on the relational database model. The rise of Big Data is driving what some see as a much-needed change in the platforms that process the massive infusions of aggregated raw data.
Take, for example, Cloudera founder and Chief Strategy Officer Mike Olson. His open source company harnesses Apache Hadoop-based software and services to offer a powerful new data platform that enables enterprises and organizations to look at all their data, both structured and unstructured.
In his pre-Hadoop days, Olson worked in the relational database industry. Prior to founding Cloudera in 2009, he was the CEO of Sleepycat Software, maker of an open source embedded database engine. After Sleepycat Software was acquired by Oracle, Olson spent two years there as vice president before moving on.
Along the way, he realized that the relational database field was not producing a lot of innovation — it had completely missed the opportunity to grow with the development of Big Data. Olson does not deny that his investment in relational databases provided his business back then with a comfortable living, but he felt constrained working in that environment.
The relational database industry refined what existed but was not disruptive, he observed. The challenge was limited to satisfying the needs of the biggest customers.
“Big data is going to transform virtually every human endeavor in the coming years,” Olson said earlier this year in a classroom talk at the Stanford Graduate School of Business.
In this interview, LinuxInsider and Olson discuss how the Hadoop platform is allowing data munchers to crunch data in more intense ways.
LinuxInsider: What is it about Hadoop that is revolutionizing how we process Big Data?
It took the consumer Internet to invent a platform that really changed the rules of the game. We believe that the opportunity for Big Data managed by Hadoop infrastructure is every bit as big as the opportunity for relational data. It stands to reason that if you could make tens of billions of dollars a year doing min/max/average on numeric data stored in a table, you should be able to make that kind of money doing behavioral analytics over complex user interactivity data from the Web.
LI: What is different about Big Data that outgrew the tools of relational databases and opened the door to Hadoop?
The business opportunity for managing Big Data is at least as big as the relational market. The Hadoop platform was purpose-built to attack that problem. It handles complex data arranged in new ways in native format using a variety of tools. We were fortunate to be the first folks who saw that opportunity, started a company and got in front of it.
LI: What was the initial response to a new way of managing data?
At that time, nobody else believed that Hadoop was going to be a big deal. Fast-forward five years, and you can see that the rest of the market agrees with us. Certainly there is a lot of activity — lots of venture-backed companies and big-dollar big-enterprise companies staking out the Hadoop space now.
LI: Why have proprietary solutions not kept pace with Big Data-capable databases?
We believe pretty deeply in the open source philosophy. I think Hadoop has been successful for a few reasons. First, it does unlock value in data by doing new analytics in a way that was never possible before. But combine the economics of free to download, free to install and free to use with the faster innovation that the broader open source community creates, compared to any single vendor, and it is just not a fair fight. My money is on the open source community.
LI: What makes open source not a fair fight?
All the smart people work someplace else. No matter how good you are at hiring, there are people in Uzbekistan who are beyond your reach. Add the fact that it is not just me and my company who can figure out the right thing to work on. The entire planet, with lots of innovative minds all over the place, can direct its attention to its own problems and enhance the platform in particular ways to make life better. The aggregate benefit is that it is just not a fair fight. That is not to say that Hadoop is perfect in every possible way. There is much that we as a community need to do to continue to harden it for enterprise use cases and extend it to new work models.
LI: Should consumers fear the rise of Big Data as intrusive and as another tool for vendors to extract money from their pockets?
I think that is a misconception. The way that the platform is being used in general is to make better decisions about existing problems by considering more information. Here is an example: detecting fraud in financial transactions. That has mattered enormously to credit card processors and the banks for a very long time. They have built a lot of very expensive infrastructure for attacking it. But if you can all of a sudden look at the last 10 years of transaction activity in your network in order to recognize patterns of abuse in stuff that would normally be invisible over shorter periods of time, you can see patterns emerge that you can exploit. You do not have to sample any longer. You do not have to look at 10 percent of 1 percent of the data. You can afford to look at everything exhaustively.
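The sampling point can be illustrated with a small Python sketch. This is not anything Olson or Cloudera described. It is a toy example on synthetic data, and the function name, the thresholds, and the card and merchant labels are all hypothetical. It assumes a pattern of abuse that shows up as one small charge a year on the same card, and compares a full scan of 10 years of history with a sample of roughly 0.1 percent.

```python
from collections import defaultdict


def flag_repeated_small_charges(transactions, min_count=5, max_amount=5.00):
    """Return (card, merchant) pairs with at least min_count small charges.

    Hypothetical rule for illustration only: a "small" charge is one at or
    below max_amount, and a pair is flagged once it accumulates min_count.
    """
    counts = defaultdict(int)
    for card, merchant, amount in transactions:
        if amount <= max_amount:
            counts[(card, merchant)] += 1   # tally small charges per pair
    return {pair for pair, n in counts.items() if n >= min_count}


# Synthetic history (assumed data): 120 months, 100 ordinary purchases per
# month, plus one small charge on the same card once a year (10 in total).
history = []
for month in range(120):
    for i in range(100):
        history.append((f"card-{i}", "grocer", 40.00))
    if month % 12 == 0:
        history.append(("card-7", "merchant-x", 1.99))

# Systematic sample: every 1,000th record, roughly 0.1 percent of the data.
sample = history[::1000]

full_scan_hits = flag_repeated_small_charges(history)
sample_hits = flag_repeated_small_charges(sample)

print("full scan:", full_scan_hits)
print("sample:   ", sample_hits)
```

In this toy data set the full scan flags the one card and merchant pair with the yearly charges, while the sample sees too few of those records to flag anything. Real fraud detection uses far richer signals. The sketch only shows why slow patterns can disappear when most of the history is left out.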
LI: So Big Data per se is not the ultimate evildoer aimed at privacy infractions?
I am a little bit skeptical about the conviction that lots and lots of my deepest and darkest personal secrets are being collected. What you buy at the supermarket and the stuff you look at on the Web have all been known. It just has not been possible to aggregate that information in any way, and it has not been able to drive delivery of better-targeted information aimed at your particular interests in the way that it can today.
LI: Then Big Data has an image problem rather than posing a real threat to consumers?
This platform was born to basically optimize content delivery and advertising for the consumer Internet. My conviction is that advertising is only an annoyance as long as it is not really spot on. As soon as it is just what you want, it is not really advertising any more. It is information — and you want it. The ability of content providers and banks and insurance companies for fraud and malfeasance detection, and hospitals to design and deliver better care based on their knowledge of who you are — I want that stuff. I want to see that happen.
LI: What is driving that push for Big Data’s rapid growth?
It is a perfect storm of three major trends that we watched unfold over the last 10 years. First of all, there is just flat-out more data in the world than ever before. Data used to be generated on a human scale. HR would hire somebody, and a record would be created. You would buy a thing at a store, and a purchase transaction would happen. These days, machines talk to machines much more. Data is getting generated at machine scale. For example, where I live I have a smart meter on my house. It is constantly talking back to the utility. Now you can buy smart appliances. There are smart refrigerators and smart dishwashers. So the amount of information streaming around the network is exploding, because there is more and more automation and more sensors. So lots of data is trend one.
LI: What is the second trend pushing Big Data?
Trend two — just by magic, we no longer buy computers the way we used to. It used to be when you wanted someplace to put your data you called a microsystems shop, and they would back up their truck with their most expensive server. You would write them a big fat check, and all of your data would go on that thing. The way we build our data centers today is [with] industry standard inexpensive servers that we can incrementally scale out. So instead of buying one big box, we order racks. And when those racks are not big enough to do what we want them to today, we make a phone call to order some more racks. So we are incrementally able to grow our storage and computing investment.
LI: So lots of data and bigger, better storage lead to what final trend?
This was deus ex machina. It was the miraculous arrival of the Hadoop platform. It was able to catch this data and exploit it in a way that was never possible before. Google had this problem (of having lots of data with no place to store it) before anybody else and invented a platform to attack it and then gave it away. So those three things — all just blind luck — happened simultaneously.
LI: What are the biggest challenges facing the goals of Big Data today?
I am an old database guy. The biggest single challenge we face with the adoption of Hadoop today is the quality of the applications and the tools that run on it. If you look at the status of the relational database industry in the mid-1980s, we were in much the same circumstance. There was this new platform that could store data in new ways and surface it to users via new tools. No one knew those tools. No one knew those languages. There were no applications that ran on it. The tools to exploit it were very thin on the ground. We solved that problem over several decades by training a bunch of SQL programmers, turning people into DBAs and giving them the skills to do that. Today we are doing those same things in Big Data. But most of all, we solved the problems with software. We are seeing an identical transformation now. That is really what is going to drive adoption.