Big Data and open source software may be the next great unholy alliance in computing’s current promised land, but open source is a broken business model that needs a better vehicle for supporting projects such as programming suites that build database applications.
So argues Chris Wensel, founder and CTO of Concurrent. Wensel started the company in 2008 to focus on the open-source Cascading Project, a framework that he created to better work with Big Data applications.
Wensel quickly discovered that he was part of a bad news / good news scenario. The bad news was that he began his company too early. The good news is that the Big Data market is now growing.
Big Data is the process of collecting and analyzing huge volumes of structured and unstructured data. The data sets are so large that they defy handling with traditional database and software techniques.
Cascading is an open-source framework (a Java library and runtime) that enables developers to easily build rich, enterprise-grade data analytics and data management applications. These applications can be deployed and managed across a variety of Apache Hadoop computing environments on which database apps typically run.
Cascading was named by InfoWorld as the 2012 Best Open Source Software Winner. Wensel, meanwhile, is also the author of Scale Unlimited Apache Hadoop BootCamp, the first commercial Apache Hadoop training course.
In this interview, LinuxInsider talks to Wensel about the challenges of working with Big Data and the hurdles posed by open source software.
LinuxInsider: What can people do with Cascading for Big Data apps?
Chris Wensel: People are using Cascading for very traditional apps like ETL. More interestingly, machine learning, predictive modeling and recommendation engines are the prevalent technologies. Climate Corporation is predicting weather patterns; Upstream is doing marketing-spend revenue attribution; Twitter does ad-revenue quality and analytics. We even have seen companies doing gene-sequencing alignment stuff.
LI: How has Cascading addressed Big Data programming problems so far?
Wensel: When you are programming, it makes a lot of things just go away when you use Cascading. It does computation; it does scheduling; it does a lot of things. One of the great things about it is that you do not have to take a $50,000 training class to learn all about Hadoop when you can just use Cascading, use your existing Eclipse tools and hook in the monitoring tools. That is not true with any other projects that run on top of Hadoop.
LI: What led you to the solution Cascading provides?
Wensel: Cascading is a Java API for improving the experience and productivity of developers creating applications involving Big Data architecture. It solved a problem in programming that nobody else was dealing with.
I built Cascading to fix the things being done wrong in my previous data-handling jobs. From a developer’s perspective, I was able to build things with Cascading that made it easier to build this class of applications.
LI: How is Cascading related to what Big Data applications programmers do?
Wensel: Cascading is available for any developer. It just has to have its libraries pulled into their programming using standard developer tools. No money changes hands for that access.
What is happening now in the ecosystem is people are doing other applications on Cascading’s open source nodes. For example, Twitter created three new languages based on Cascading and uses them internally. eBay is using some of that stuff. No money is changing hands.
This is all the open source ecosystem at work. Cascading itself is just a development tool. The company created an SQL layer on top of Cascading, and we are working on some other libraries that will help data scientists.
LI: At what point do you start to deal with making money from Cascading as an open source business?
Wensel: We are working on a runtime to help improve the performance and capabilities of these applications, and that will be the thing that we sell. The seed funding has kept us going. In the late summer we will announce a 1.0 or early access to our runtime product. It will primarily be focused on improving the experience of using our open source tool. It is not a repackaging of our open source tools. It will be powered by our open source tool, if you will.
LI: How does open source need to mature to better support the application developers that play in that field?
Wensel: From my perspective, open source is completely broken. The challenge is that I think open source needs to be remodeled. One side of it is with open source you shouldn’t get upset because people are commercializing it. You gave it away free. You put it under the Apache Foundation, and that is why the Foundation is there.
You don’t get mad at FedEx because they are using the road. FedEx does pay taxes that help pay for the road. That makes sense. What we don’t do with open source is tax our users for using our open source so that our projects continue to drive independent of commercial interests. We tax the indirect commercial interests.
LI: How do you propose we change that?
Wensel: I don’t know how to frame that. I know I will get shot for saying that. I honestly wish like a credit foundation I could tax the user based on some mechanism they were comfortable with so I could continue to do my job.
Tax is a horrible word for that. I don’t know what it is. I have also called it licensing or support. Customers then say I don’t need support because there are no bugs.
LI: Are you second guessing your decision to start up a business under the open source model?
Wensel: This probably is not the answer you are looking for. I never wanted a startup, but I love the fact that I have one. I love my employees. I am having a great time. Yet it is not what I originally envisioned. It is what it is. It is marketing, and it is more than that.
LI: Given your views on the dark side of the open source business model, do you operate your business differently than others in your situation?
Wensel: Generally, I don’t take feature requests from folks at other companies. I do have agreements with eBay and Twitter and a few other companies and take some contributions. People run their businesses on Cascading. Typically people will buy their software from places like IBM because IBM is committed to making sure that it is going to work.
Where is the middle ground where I can be open source and people can see the code and learn from it and maybe even contribute to it? More importantly, tell me where it is broken so I can fix it and also earn a living?
LI: So are you saying that open source is not a good business model?
Wensel: Open source sends a message that yes, this code is open source, but there is money behind it, and this company is committed to your success and your business. Meanwhile, there are a lot of companies making really large applications.
I wish there were a couple I could mention. You actually probably have some right in your pocket. They use our stuff to do things that save them millions and millions of dollars. They don’t pay us, and that is okay, but I wrote the application so they could build their application on our stuff and trust it.
LI: So why not be proprietary or fully commercial instead?
Wensel: I want the same clout as Oracle. I just don’t want that same infrastructure as Oracle. I can’t get there like Oracle did because they are ahead of us. So I have to take every means necessary to get there. Open sourcing is a great way to teach people how to write code, see how things work, and get contributions and get people to trust it. It is marketing as well, however. It is a lot of things. Just one thing is missing: It is not a very clean way to make money.
LI: So starting out as a commercial enterprise was not financially possible for you, or you did not want to go that way, or you did not realize that you should not have gone the way you did, or haven’t you worked that out yet?
Wensel: No, no. I knew exactly all of the pitfalls when I started the project. Honestly, the very first version of Cascading was under the GPL license and not the Apache license. The GPL license allowed me to figure out what the commercial opportunities were without actually giving away the farm.
I wasn’t sure if I was going to be gaining for not owning any IP. If I didn’t put it under the Apache Foundation, I would continue to own the copyright. We changed it to the Apache licensing because there were a number of companies that refused to run GPL code no matter what. Whether they were right or wrong, that is just the way the situation was.
I think GPL is fine. I do not think there were any issues with it except that IBM doesn’t like it. A lot of companies are going to follow what IBM recommends, which I totally agree with.
LI: Were there any negative factors you had to accept as a trade-off?
Wensel: Under Apache we still own the copyright but the license is much more permissive. We actually had to give up a revenue stream because of that. We would have earned a little bit of money from the OEM, but it probably was not going to scale the way I wanted.
Also, that is a very high-maintenance revenue stream. So if I made any mistake, it was in thinking that the OEM revenue was cheaper than it was. Everything I did in forming the company was calculated. I formed the company to get the protection, but it was a for-profit, not nonprofit.