Twill on Apache: A New Weave
The Apache Software Foundation accepted the open source project Twill, named "Weave" while hosted on GitHub, after foundation members began voting for its inclusion on Nov. 8. Bringing the Twill project to Apache incubation will ensure that Twill is more accessible to application developers, according to the project's sponsors.
Twill enables programmers to design tailored applications that sit on top of YARN to extract data and simplify Big Data processes. The YARN and Twill combination is one of several application solutions developed to work with large databases on Hadoop.
Twill's mission is to ignite the next generation of Big Data application development by making Hadoop accessible to all developers, according to Andreas Neumann, chief architect at Continuuity, a cloud-based Big Data application platform maker for developers. Twill initially was developed at Continuuity under the "Weave" name.
"We think the broader community of Java developers will greatly benefit from Twill. So, we open-sourced it," Neumann wrote in a company blog post.
Rooted in Closed Source
Continuuity is a Hadoop company focused on the developer. The company develops platforms and tools for building Big Data applications using simple APIs.
Many of its products, such as the application server, are proprietary. However, the application server has numerous open source components within it, noted Continuuity CEO Jon Gray.
YARN is an open source application that allows the Hadoop cluster to turn into a collection of virtual machines. Weave, developed by Continuuity and initially housed on GitHub, is a complementary open source application that uses a programming model similar to Java threads, making it easy to write distributed applications. To remove a conflict with a similarly named project on Apache, called "Weaver," Weave's name changed to Twill when it moved to Apache incubation.
"Weave as a project will eventually die on GitHub. Twill will take over on Apache," Gray told LinuxInsider.
One of the drawbacks to Hadoop 1.0 was a tendency to hog cluster space. Hadoop created clusters with more compute capacity than the user's Big Data needs required. The large cluster mostly remained unused, explained Gray.
Hadoop 2.0 addresses this issue by separating cluster resource management from MapReduce, a data processing paradigm for condensing large volumes of data into useful aggregated results. Hadoop's resource manager, YARN, allows using the Hadoop cluster for any computing needs, including distributed testing, stress-load generation or other types of analysis.
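To make the MapReduce paradigm concrete, here is a minimal single-machine sketch in plain Java. It uses only the standard library, not Hadoop's MapReduce API, and the class and method names are illustrative assumptions. The map step splits each line into words, and the reduce step aggregates a count per word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: a single-JVM imitation of the map and reduce steps.
// A real Hadoop job would distribute both steps across the cluster.
class WordCountSketch {

    // Map: split every line into words.
    // Reduce: count occurrences per word, sorted by word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("yarn twill yarn", "twill hadoop yarn");
        System.out.println(countWords(lines)); // prints {hadoop=1, twill=2, yarn=3}
    }
}
```

The point of the sketch is the shape of the computation: many input records are condensed into a small aggregated result.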
"We are trying to enable developers to build what we think is an oncoming wave, where Hadoop is not just used in your warehouse but becomes a regular part of your active data applications that you are doing," Neumann explained.
Making Hadoop Better
YARN and Hadoop are very low level. They have kernel-type APIs where the onus for use is really on the developer. Hadoop is a very powerful system, but it has focused on doing some things well and not on doing everything, according to Gray.
For example, in order to build an application on top of YARN, you have to write the master process and then the child processes it manages. You also need to create the protocol they use to communicate.
"Communicating with the process on the cluster and managing it if it goes down is left to the application developer to do," said Gray. "To solve that programming situation, Continuuity built a number of applications to run on top of YARN."
Those applications help programmers get around one of the big challenges with Hadoop: data ingestion and adding APIs to make data accessible outside the process, said Gray.
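The bookkeeping Gray describes, in which the developer must watch a process and manage it if it goes down, can be pictured with a small plain-Java sketch. This is an assumption-laden illustration, not YARN or Continuuity code: a local call stands in for a process on the cluster, and the names `MasterSketch` and `superviseWithRestart` are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: local calls stand in for processes running on a cluster.
class MasterSketch {

    // Runs a worker task and relaunches it after a failure, up to maxAttempts.
    // Returns the number of attempts used, or -1 if every attempt failed.
    static int superviseWithRestart(Callable<Void> worker, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                worker.call();
                return attempt; // worker finished cleanly
            } catch (Exception e) {
                // A real master would also have to log, back off and
                // ask the resource manager for a replacement.
                System.out.println("worker failed on attempt " + attempt
                        + ": " + e.getMessage());
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical flaky worker: fails on its first call, then succeeds.
        Callable<Void> flaky = () -> {
            if (calls.incrementAndGet() < 2) {
                throw new IllegalStateException("simulated crash");
            }
            return null;
        };
        System.out.println("attempts used: " + superviseWithRestart(flaky, 3));
    }
}
```

Even this toy version needs a retry policy and failure reporting, which hints at why hand-writing the same logic for real cluster processes is a burden.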
How Twill Works
Twill functions as a scaled-out proxy. Twill is a middleware layer in between YARN and any application on YARN, according to Gray. YARN itself makes no assumptions about what you are building.
"Twill makes some assumptions. Since Twill is Java-specific, it assumes you are building a Java application. YARN is not Java-specific," Gray said.
When you develop a Twill app, Twill handles the YARN APIs and presents a model that resembles the multi-threaded applications familiar to Java developers. That makes it very easy to build multi-process distributed applications in Twill, which is essentially what Continuuity is going after.
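For readers who have not used the Java thread model that Twill is said to resemble, here is the familiar single-JVM pattern in plain Java. It deliberately does not use Twill's API, and the class and method names are illustrative assumptions. In the model described above, a similar unit of work would be handed to the cluster rather than to a local thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a local thread pool stands in for a cluster.
class ThreadModelSketch {

    // Submits one task per instance and collects the results in submission order.
    // The instances argument must be greater than zero.
    static List<String> runInstances(int instances) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < instances; i++) {
                final int id = i;
                futures.add(pool.submit(() -> "instance " + id + " done"));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // wait for each task to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        runInstances(3).forEach(System.out::println);
    }
}
```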
"This is how we will bring familiar paradigms to Hadoop. This will allow developers to more easily build applications to run on top of Hadoop," he said.
Promising Approach, Maybe
The Hadoop architecture needs an easier abstraction vehicle over resource management. YARN is just one such option. The Mesos Project on Apache is another cluster management option, noted Chris Wensel, CTO and founder of Concurrent.
"When I first learned about Weave (now Twill), I was pretty excited about it. It is highly probable that we will see this market create additional ones," Wensel told LinuxInsider. "We need an easier path to create clusters on demand with YARN. Allowing other developers to join the party is extremely important."
One uncertainty is whether enough large companies will drive a widespread demand for YARN and Twill. It is a question of how many people have a large-enough cluster in need of repartitioning. Add to that the need to do this dynamically all the time.
Right now, large companies like Yahoo and Twitter have that need. If nothing else, positioning Twill with YARN and Hadoop in Apache may add more visibility for the related projects.
"We work with some really large banks that are trading off between YARN and some others to see which one they really want. Some of our other customers have both YARN and another choice and use them both simultaneously," Wensel noted.
Significance of Apache Incubation
Apache is not an architecture body that creates best-practice standards, explained Wensel. Apache does not discriminate among the projects it agrees to take. It holds a couple of projects that overlap with each other.
Apache's collection of open source software includes active projects and some that are not very good. Apache's role is to make sure projects are not abandoned and that they move forward. If projects are not being utilized, Apache moves them into what is called the "attic."
"Apache is a body that does the management and administrata around these projects to make sure they remain quality. Apache has done a very good job. It has a really good infrastructure. It is managed very professionally. It is a marketing strategy," said Wensel.
The names Weave, Twill and YARN share a connection that mirrors their functions in Big Data analysis. All three terms come from the construction of fabric: yarn is woven, and twill is a weave pattern. From there it is a short leap to the elements woven together in large Big Data clusters.
"When YARN was introduced in Apache Hadoop, it became possible to deploy new types of distributed processing applications on Hadoop clusters. However, it turns out that writing these kinds of applications can be a complex proposition involving -- amongst other things -- managing [remote procedure calls] between containers, log management and lifecycle management," Tom White, an engineer at Cloudera, told LinuxInsider.
Twill makes it easier to write programs that can take advantage of YARN. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster, including Cloudera's CDH 4, he explained.
Twill is all about the development of Hadoop 2.0 and YARN. Contrast that with Hadoop 1.0, which was all about MapReduce.
"The challenge today with YARN out of the box is that it is really hard on developers to build new applications on top of YARN. So basically they have to learn YARN and fully understand how it works in order to build a new app on it," Gray said.
"The hope is that new users coming to Hadoop and existing developers will be able to create new apps," he added. "This will lead to people putting their Web servers and application servers on top of Hadoop. YARN opened up the possibility. Twill enables developers to seize that opportunity."