Although it’s struggling against both its giant rival Google and Microsoft’s newly launched search engine Bing, Yahoo is handing over the source code for its version of Hadoop to the community.
Hadoop, a top-level Apache project, is an open source distributed file system and parallel execution environment that lets its users process massive amounts of data.
Yahoo, which uses Hadoop extensively in its Web search and advertising businesses, has been a major contributor to the project.
A Leap of Faith
Yahoo announced the release of its Hadoop source code at the second annual Hadoop Summit, held in Sunnyvale, Calif.
With Microsoft launching Bing so recently, and Google still king of the search mountain, is it wise for Yahoo to release the source code behind its search technology and other properties?
Well, it can’t hurt, at least. “I don’t think this could land Yahoo in more trouble,” Jeff Rogers, founder of Open Source Analysts Group, told LinuxInsider.
Hadoop is a free Java software framework that supports data-intensive distributed applications.
Yahoo, which says Yahoo Search is the world’s largest Hadoop implementation, runs the framework on more than 25,000 of its servers, using it to analyze tens of billions of Web pages and billions of new records every day across multiple petabytes of storage.
Inspired by research papers on Google’s MapReduce and the Google File System, Hadoop replicates data aggressively so that processing can continue even when physical machines go down.
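The MapReduce model those papers describe, and which Hadoop implements at scale, boils down to three phases: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step combines each group. A toy single-process sketch in plain Python (no Hadoop involved; the function names and word-count example are illustrative, not Hadoop APIs):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle_phase(pairs))
print(counts["the"])  # prints 3: one "the" from each document
```

In a real cluster, Hadoop runs many map and reduce tasks in parallel on different machines and moves the intermediate pairs across the network during the shuffle; the aggressive data replication mentioned above is what lets a failed task simply be rerun on another node holding a copy of the input.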
The Tennessee Valley Authority (TVA) uses Hadoop to analyze phasor measurement unit (PMU) data collected throughout the eastern United States. A PMU measures the electrical waves on an electricity grid. The TVA has about 15 TB of PMU data on file and expects that to grow to about 40 TB by the end of next year.
Cloudera’s distribution of Hadoop includes the Yahoo source code, adding credence to Yahoo’s claim that it’s releasing the code to the community to speed up open, collaborative research and development.
The Hadoop distributed file system is part of the Hadoop Core, the flagship sub-project of the Apache Hadoop project.
The project also includes Chukwa, a data collection system for managing large distributed systems; HBase, which provides a scalable, distributed database; Hive, a data warehouse infrastructure; Pig, a high-level data flow language and execution framework for parallel computation; and ZooKeeper, a highly available and reliable coordination system.