This story was originally published on Aug. 7, 2012, and is brought to you today as part of our Best of ECT News series.
Mention big data and the first thing that might come to mind is Hadoop. The open source software framework has recently enjoyed a great deal of popularity among vendors and enterprise users.
However, if it is to really be useful to the enterprise, Hadoop may need to be taken out of open source, argues Brian Christian, chief technology officer of Zettaset.
“The community serves its needs, not the needs of the enterprise,” Christian told LinuxInsider. “Meanwhile, the enterprise has the dollars to drive innovation on its need. In between this chasm are companies like Zettaset that follow and contribute to the community, yet sell features and functionality in our software packages that the community is not focused on.”
One to Many
The Apache Hadoop framework enables distributed processing of large datasets across clusters of computers. It’s designed to scale up from single servers to thousands of machines.
Each computer in a Hadoop cluster offers its own local computation and storage. Hadoop is designed to detect and handle failures at the application layer, so a cluster remains highly available even if some of its computers fail.
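To make the idea of local computation concrete, here is a minimal single-process sketch of the map-and-reduce pattern Hadoop is known for. It is an illustration only, not Hadoop's API: the function names `map_partition` and `reduce_counts` and the sample partitions are hypothetical, and each inner list stands in for the data held on one node.

```python
from collections import Counter
from typing import Iterable

def map_partition(lines: Iterable[str]) -> Counter:
    """Count words in one partition, as a node would do on its local data."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Merge the per-partition counts into one overall result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical data: each inner list represents the lines stored on one node.
partitions = [["big data big"], ["data cluster"]]
result = reduce_counts(map_partition(p) for p in partitions)
```

In a real cluster the map step would run in parallel on the machines that store each partition, and only the small partial results would travel over the network.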
Hadoop changes the economics and dynamics of large-scale computing because it’s scalable, cost-effective, flexible and fault-tolerant, IBM says. New nodes can be added as needed without changing data formats, and commodity servers can be used for massively parallel computing. Hadoop imposes no schema, so it can absorb any type of data from any number of sources. When a node goes down, the system redirects work to another node that holds a copy of the data.
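The failover behavior described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Hadoop's actual scheduler: the block-to-node map `BLOCK_REPLICAS`, the node names, and the functions `pick_node` and `schedule` are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical placement table: block id -> nodes that hold a copy of it.
BLOCK_REPLICAS: Dict[str, List[str]] = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
}

def pick_node(block_id: str, failed: Set[str]) -> Optional[str]:
    """Return the first live node that stores the block, or None if none is left."""
    for node in BLOCK_REPLICAS.get(block_id, []):
        if node not in failed:
            return node
    return None

def schedule(blocks: List[str], failed: Set[str]) -> Dict[str, Optional[str]]:
    """Assign each block's work to a live node that already holds the data."""
    return {block: pick_node(block, failed) for block in blocks}

# With node-a down, work on block-1 moves to the next node holding a copy.
assignments = schedule(["block-1", "block-2"], failed={"node-a"})
```

The point of the sketch is that work follows the data: a task is only rescheduled onto a machine that already stores a copy, so a single failure does not force the data to be moved first.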
Yahoo and Facebook are among the largest users of Hadoop; others include Amazon, eBay, the Federal Reserve Board of Governors, LinkedIn, Microsoft, Netflix, Twitter, HP, IBM and Apple.
“Speaking only for Cloudera, four of the top five commercial banks, four of the top five general retailers, four of the top five entertainment companies and all three of the tier-one telecom carriers run Cloudera,” Charles Zedlewski, vice president of product at Cloudera, a company that provides Hadoop-based software and services, told LinuxInsider. “A number of” large government agencies and small data-driven startups also run Cloudera.
It’s a Commitment
Hadoop is driven by enthusiasts, so “the dedication in personnel, training and machine costs is more than most companies are willing to undertake,” Zettaset’s Christian said. “The fact that Hadoop has basic lack of enterprise services like security, fault tolerance and high availability services also hinders its adoption rate.”
Hadoop “brings great value to organizations, but when people say it is free, it is free like a puppy — it requires some care and feeding,” Steve Wooledge, senior director of marketing at Teradata Aster, told LinuxInsider. “The majority of companies we talk with are looking for ways to take advantage of the scale-out file system in Hadoop to store data but make it easier to access data in Hadoop for analysis without having to hire highly skilled data scientists.”
Teradata Aster has found that Hadoop “requires between five and ten times the hardware required to process data at the same rate as in a database management system (DBMS),” Wooledge continued. Further, Hadoop requires “some new systems administration skills to monitor and maintain which are different from database administrators.”
Even when commodity servers are used, Hadoop adds to data center costs. “You are looking at large amounts of data … [and] you also have to have space for all the indexes that might be created,” David Hill, principal at the Mesabi Group, told LinuxInsider.
“Note that Hadoop spreads out the data,” Hill pointed out. “Note also that data may be stored in multiple data stores. The extra copies also increase the need for storage and thus a large data center footprint.”
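The effect of extra copies on storage can be estimated with simple arithmetic. The sketch below assumes a replication factor of 3, which is the default in Hadoop's file system, and a hypothetical 25 percent allowance for indexes and working space. Neither figure comes from Hill, and the function name `raw_storage_tb` is made up for illustration.

```python
def raw_storage_tb(data_tb: float,
                   replication_factor: int = 3,
                   overhead_fraction: float = 0.25) -> float:
    """Estimate raw disk needed for a dataset once copies and overhead are counted.

    replication_factor: number of copies kept of each block (assumed 3).
    overhead_fraction: extra space for indexes and scratch data (assumed 25%).
    """
    if data_tb < 0 or replication_factor < 1 or overhead_fraction < 0:
        raise ValueError("inputs must be non-negative, with at least one copy")
    return data_tb * replication_factor * (1 + overhead_fraction)

# Under these assumptions, 100 TB of source data needs 375 TB of raw disk.
estimate = raw_storage_tb(100)
```

Changing either assumption shifts the estimate proportionally, which is why the replication setting is one of the first things to check when sizing a cluster.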
The difficulties enterprises experience in implementing Hadoop could just be par for the course.
“Most complex IT solutions require investment in people, machines and time,” Das Kamhout, IT cloud lead at Intel and technical advisor to the Open Data Center Alliance, told LinuxInsider.
“The development community is well aware of [Hadoop’s] shortcomings and advances in all areas are currently in progress, and should be ready for production deployment later this year,” Matt Aslett, a research manager at the 451 Group, told LinuxInsider.
Hadoop supporters “are working to train up a larger pool of Hadoop developers and administrators,” Aslett continued. “Cloudera alone has trained more than 12,000 people.”
Teradata believes in the open source model, the company’s Wooledge said.
The open source community “is not only made up of very capable individual programmers acting on their own, but also talented individuals who work for both large and small vendors who have an interest in making Hadoop more enterprise-ready,” the Mesabi Group’s Hill pointed out.
Further, “the biggest contributors to Apache Hadoop include vendors such as Hortonworks, Cloudera, MapR and IBM, all of which have a vested interest in driving greater enterprise adoption, as well as users such as Yahoo, Facebook and eBay, all of which stand to gain from its improved capabilities,” the 451 Group’s Aslett stated.
Hadoop may not be able to leave the open source community, “but you are seeing companies like MapR that are rewriting whole sections of the community’s code to sell to the enterprises,” Zettaset’s Christian suggested. “Forking the base is what you’re going to see in the future.”