Tokutek's John Partridge: Open Source Is Vested in Big Data
Ask Tokutek CEO John Partridge what makes open source such a snug fit for the database industry and for Big Data, and he'll tell you it comes down to the engineers who use open source and who make the decisions.
"For people who for whatever reason really need to access the latest technology, most purchasing decisions today are made by very capable engineers," Partridge told LinuxInsider. "They have little patience for the need to pay for an evaluation or a series of sales reps at meetings and so forth. They want to get to the good stuff right away.
"So for innovative companies in the database space now, I do not think they have much of a choice" but to use open source, he added.
Partridge sees open source products as a key solution for businesses that need to work with database technology. That need becomes even more critical when the database involves Big Data and relies on cloud storage. For example, open source pioneer Mozilla has been using TokuDB to manage its MySQL-driven Datazilla Data cluster, an open-source system for managing and visualizing performance data.
Tokutek is an open source player in high-performance and agile database storage engine technology. The company was founded seven years ago by a group of mathematics researchers who had a breakthrough in an obscure piece of computer science called indexing technology, according to Partridge. The breakthrough enables databases to make data quickly retrievable. They found a 200x speed increase in database operations that nobody had been able to improve since the 1970s.
To understand the significance of that feat, think of a phone book. Of course, phone books are indexed by last name -- otherwise, you might have to go through the entire phone book just to find the one number you needed. Looking by last names, however, makes the searching task very easy and fast.
Likewise, applications that need to find data very quickly depend on indexing to do it.
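The phone book analogy can be sketched in a few lines of Python. This is only a toy illustration of why a sorted index helps, not a description of Tokutek's indexing technology; the generated names and numbers and the two function names are made up for the example.

```python
from bisect import bisect_left

# Hypothetical phone book: (last_name, number) pairs sorted by last name.
phone_book = sorted((f"Name{i:05d}", f"555-{i:05d}") for i in range(100_000))
last_names = [name for name, _ in phone_book]

def scan_lookup(target):
    """No index: read entries front to back until one matches."""
    for steps, (name, number) in enumerate(phone_book, start=1):
        if name == target:
            return number, steps
    return None, len(phone_book)

def indexed_lookup(target):
    """With an index: binary search over the sorted last names."""
    pos = bisect_left(last_names, target)
    if pos < len(last_names) and last_names[pos] == target:
        return phone_book[pos][1]
    return None

number, steps = scan_lookup("Name99999")
print(number, "found by scanning after", steps, "entries")
print(indexed_lookup("Name99999"), "found by binary search")
```

The scan has to read every entry before it reaches the last name in the book, while the binary search needs only about 17 comparisons for 100,000 entries.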
In this interview, LinuxInsider talks with Partridge about the inherent problems associated with expanding database operations to fit ever-increasing user demands and aging database technology.
LinuxInsider: What gives open source solutions a marketing edge over proprietary and commercial software?
John Partridge: Different customer segments have different appetites for new technology. Some are not going to need the latest and greatest technology. They will be content with using traditional methods for many years to come. I think if you want to interact with your customers quickly and constructively and get the word out there quickly that your stuff is good and it works, the only way to do it is to make it easily accessible to the decision makers, who are engineers.
Engineers are receptive to open source so they can download it and try it out. They can run their own evals and call the developer if they have a problem or question. By and large, they are pretty self-sufficient and very thorough. They put software through very rigorous testing. They look at all the different open source offerings. I think that open source is a good fit for the database space today. This is how the new technology is getting evaluated by the decision makers in an organization.
LI: What caused the rift that split database practitioners into SQL and NoSQL groups?
LI: Aside from that comfort zone, what about technical differences or platform problems?
Partridge: The other reason had to do with performance. People were going nuts over how slowly the database performed as the amount of data it contained got bigger and bigger. MySQL wound up right smack in the middle of this situation. People would build applications that worked great with small databases, but as the data got bigger and bigger, suddenly the performance would fall off a cliff.
The response left few options. People had to buy a lot more RAM or shorten their databases, or cut back on their indexes. If they cut back on their indexes, they lost the advantages of putting all of that data in the database to begin with. That disappointment in performance led to the third reason for going to NoSQL: to try to break down the data processing requirements and solve them without going to a database that uses the relational model behind SQL.
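The index trade-off Partridge describes can be illustrated with a toy table. The sketch below is an assumption-laden illustration, not how MySQL or TokuDB is implemented: `TinyTable`, its column names and the sample rows are all hypothetical. Every extra index adds work to each insert, and dropping one pushes queries on that column back to a full scan.

```python
from bisect import bisect_left, insort

class TinyTable:
    """Toy table: a list of rows plus one sorted index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.index_writes = 0  # counts index updates caused by inserts

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        # Each index must be kept sorted, so every insert touches all of them.
        for col, index in self.indexes.items():
            insort(index, (row[col], row_id))
            self.index_writes += 1

    def find(self, col, value):
        if col in self.indexes:
            index = self.indexes[col]
            pos = bisect_left(index, (value, -1))
            matches = []
            while pos < len(index) and index[pos][0] == value:
                matches.append(self.rows[index[pos][1]])
                pos += 1
            return matches, "index"
        # No index on this column: fall back to reading every row.
        return [r for r in self.rows if r[col] == value], "scan"

rows = [
    {"user": "u1", "page": "/a"},
    {"user": "u2", "page": "/b"},
    {"user": "u1", "page": "/c"},
]
full = TinyTable(["user", "page"])  # two indexes
lean = TinyTable(["user"])          # "cut back" to one index
for row in rows:
    full.insert(row)
    lean.insert(row)

print(full.index_writes, lean.index_writes)
print(lean.find("page", "/b")[1])
```

The leaner table does half the index maintenance per insert, but a lookup by `page` no longer has an index to use.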
LI: It seems that what you describe is not endemic to just open source databases today.
Partridge: The underlying fault was not just MySQL. All the commercial databases run on the same data structure that was developed in the 1970s called the B-tree. The solution was to develop better indexing technology. That brought us full circle.
LI: It also seems that hype over Big Data is causing some confusion. The NoSQL/Hadoop crowd defines Big Data one way. The traditional database vendors define it as something else. Why are these groups so far apart? Or is that a misconception?
Partridge: I don't think it is a misconception at all. Big Data is kind of this umbrella term. If you consider the situation, people latch on to different parts of the Big Data animal. Some have the head; others have the tail. Big Data is a very expansive -- I would say overly broad -- term.
I would say Big Data simply reflects the situation we are in right now. The demands and the means for data processing in the business world completely overwhelm what is available from the traditional set of vendors and the traditional set of products. Because there is such a big gap -- and that gap in some instances is widening -- people say there are a lot of Big Data problems and the technology is not up to snuff. So it is an overly broad term. Maybe we can pare it down to issues that are more recognizable and people will speak more consistently about those subparts of the Big Data problem.
LI: Can you give an example of this?
Partridge: One problem about Big Data is that people want to deal with a bunch of data that is not neatly structured into columns with fields that hold a person's first name, last name, etc. That structured data works well for information that does not change very often. In the world of Big Data, a lot of the data comes from server logs or the navigation history of a particular visitor to your website. Much of it is unstructured and does not fit that structured model well at all. It is harder, therefore, to process. So that is one flavor of Big Data -- the need to digest and manage lots of data that does not fit the traditional structure of a relational database.
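A small example may make the mismatch concrete. The log events below are invented for illustration, including every field name. The sketch counts how many cells would sit empty if such records were forced into one fixed set of columns.

```python
import json

# Hypothetical log events; each one carries a different set of fields.
raw_events = [
    '{"ts": 1, "path": "/home", "visitor": "a1"}',
    '{"ts": 2, "path": "/cart", "visitor": "a1", "items": [3, 7]}',
    '{"ts": 3, "error": "timeout", "upstream": "db-2"}',
]

events = [json.loads(line) for line in raw_events]

# A fixed-column table needs one column for every field that ever appears.
all_columns = sorted({key for event in events for key in event})

# Count the cells that would be empty in such a table.
empty_cells = sum(
    1 for event in events for col in all_columns if col not in event
)

print(all_columns)
print(empty_cells, "of", len(events) * len(all_columns), "cells would be empty")
```

With only three events, nearly half the cells are already empty, and each new kind of event can add more columns.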
LI: What other needs must Big Data meet?
Partridge: Another flavor of Big Data is the accommodation of huge amounts of data, so it is not just the phone book -- it is the phone book that is several feet thick. Plus the fact that you are adding more and more data to that phone book, maybe thousands of entries per second.
Then, as if that were not bad enough, you want to query that database and expect the answer to come back with only a couple of milliseconds of delay. This is a combination of the size of the data and the complexity of the query. The point is that depending on the audience, they are going to care more about one of these flavors than the other. So when you read articles about Big Data, it is more of a reflection of who is speaking to what topic.
LI: Does improving database indexing locally have any impact on that same data if stored in clouds?
Partridge: That issue overlaps with us a little bit. We're talking to service providers who sell database service offerings. They like our technology because it allows them to run essentially the same software but at much lower costs due to our compression, insertion performance and such.
LI: What are some of the trends you see taking place in this industry?
Partridge: I think that the biggest trend is the issue involving how Big Data overwhelms database technology. I think it is going to get worse before it gets better. I also think the appetite for Big Data is growing very rapidly. It is growing faster than breakthroughs in technology are arriving.
One big breakthrough was the arrival of flash memory. Engineers did a great job developing a new kind of storage memory that enables databases to run a lot faster. We think we are bringing a very powerful new technology to the industry with our new indexer that makes databases much more tractable than they were before.
LI: Is there a conceivable limit to database capacity?
Partridge: I think the appetite for data processing will continue to grow at a fantastic rate, and it puts pressure on all of us to bring new technologies to market. There is a huge demand.