Next-Gen Linux File Systems: Change Is the New Constant
Changes impacting storage are taking place at every layer of the network architecture: Disk drives are continuing on a Moore's law-like cost/capacity curve, yet concurrently we are also seeing the growth of solid-state technology to overcome the inherent performance limitations of mechanical disk; virtualization is changing how architects think about storage, as we now have operating systems running entirely in user space within a virtual machine; and applications are choosing HTTP over FC as the preferred storage protocol.
Given the changes taking place today, emerging file systems are employing new methods to address scalability, parallelism and new workload types.
While these changes are impacting storage and it is extremely important to understand where we are today, storage administrators also need to look forward to prepare for future changes. Consider this: Storage capacity is increasing by 40 percent every year, and while 2-TB disk drives are a commodity today, their performance characteristics -- such as seek time and reliability -- have not improved proportionately. Solid state drives will address this, revolutionizing storage within a few years; therefore, file systems will need to support hybrid environments.
The Network Layer
In the network layer, a few critical shifts have taken place that are relevant to this discussion. IDE and SCSI have given way to their serial successors, SATA and SAS, with SAS crossing the 6-Gbps threshold. InfiniBand and 10 GbE have improved network I/O over 100 times via Remote Direct Memory Access (RDMA) and TCP offload engines, and iSCSI is replacing FC SAN (Fibre Channel storage area network). Unlike traditional applications that use protocols such as NFS and CIFS, Internet-scale applications naturally use HTTP(S) to access geographically distributed storage.
Compounding the capacity and network layer issues, the API layer still relies on the 30-year-old POSIX API, which was designed for disk file systems. Its strength as a standard is also its weakness, as innovation is being held back by the requirement to support it. The need for innovation is plain: application developers require new API standards for object storage and key-value pair access to data.
Features such as volume management, global namespace, compression, encryption, clone/snapshot, de-duplication, RAID and remote site replication should be the responsibility of the file system. However, primarily due to the evolutionary design of Linux, these features are implemented outside the file system, which makes them inefficient and complicated.
The Kernel Space vs. User Space Debate
The long-running kernel space versus user space debate centered on performance concerns that are no longer valid. Context switches are no longer an issue because latency elsewhere in the stack and in the network is now the limiting factor. Emerging file systems implemented in user space should be taken seriously. Low-level components like device drivers and the disk file system should reside inside the kernel, but it is time to implement the bulk of the code in user space.
For example, scalable implementations such as Hadoop and GlusterFS are entirely in user space. The Filesystem in Userspace (FUSE) interface makes such implementations possible, and the existence of over 100 FUSE-based file systems, both serious and fun, demonstrates the validity of the approach.
What's Changing and What's New
- File Systems for Direct Attached Storage (DAS). The Linux file system of choice is Ext3, as it is by far the most stable option. However, it is now worth considering Ext4 (with kernel v2.6.31 or higher) because it lifts the file and directory size limits and adds extent-based allocation for efficiently storing large files, fast fsck (file system check) and journal checksumming. Even so, Ext4 is only a stop-gap solution.
The Linux community would welcome ZFS support for Linux, and Btrfs (B-tree file system) brings ZFS-like capabilities to the Linux kernel. Btrfs is more than just a disk file system: it handles software RAID, volume management, cloning/snapshotting and compression, and it allows volumes to grow or shrink dynamically across multiple disks.
Mirrored and striped configurations such as RAID 1+0 (RAID 10) are currently supported, with RAID 5/6 parity protection in development. Btrfs RAID can rebuild a failed array faster than hardware RAID controllers since it only re-stripes used data blocks, and its snapshots take less space and time than device-mapper snapshots. Btrfs is positioned to replace Ext4 as the default file system for Linux. Chris Mason from Oracle is the primary contributor to this project, and its development is independent from ZFS.
- Network File Systems. Introduced in 1995, NFSv3 is by far the most widely adopted NAS protocol, supported by all server operating systems (except Microsoft Windows) and storage vendors. In 2000, NFSv4 introduced incremental enhancements but did not achieve widespread adoption. Things should be very different for NFSv4.1 (pNFS) as it brings a much needed parallel I/O architecture and also adds RDMA support for low-latency high-bandwidth I/O.
This is a major advancement for NFS, allowing users to move to a scale-out architecture. Its design, however, is based on a centralized metadata server that can limit scalability and potentially be a single point of failure. Mainstream adoption is still a few years out, but in the interim there are options that address the scalability issues of NFSv3/4. By layering NFS on top of a proven clustered file system and using virtual IPs and round-robin DNS techniques, one can build a scalable NAS solution.
Examples include Oracle Lustre, IBM GPFS, and GlusterFS. Additionally, Isilon and Panasas have proprietary server side implementations, but the approach is similar. Samba also deserves a brief mention, as Version 4 is a major rewrite with full Active Directory domain controller capability, and a less chatty SMB2 protocol compared to v3. Version 4 is currently in alpha, but will scale across multiple servers using a clustered file system backend and virtual IP/DNS techniques.
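The virtual-IP and round-robin DNS technique described above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the addresses, the server names and the helper functions are all hypothetical. It shows the two halves of the idea. Successive lookups rotate clients across a pool of virtual IPs, and when a server fails its virtual IPs move to the survivors, so clients keep the address they were given.

```python
from itertools import cycle

# Hypothetical virtual IPs and NAS heads; every head exports the same
# clustered file system over NFS.
VIRTUAL_IPS = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
SERVERS = ["nas-a", "nas-b", "nas-c"]


def assign_clients(clients, addresses):
    """Hand each client the next address in rotation, as a round-robin
    DNS server would for successive lookups of one host name."""
    rotation = cycle(addresses)
    return {client: next(rotation) for client in clients}


def place_vips(vips, servers):
    """Initially host each virtual IP on one server, in rotation."""
    hosts = cycle(servers)
    return {vip: next(hosts) for vip in vips}


def fail_over(placement, servers, failed_server):
    """Move the failed server's virtual IPs to the surviving servers.

    Clients are untouched: they keep talking to the same virtual IP."""
    survivors = [s for s in servers if s != failed_server]
    if not survivors:
        raise RuntimeError("no surviving servers to host the virtual IPs")
    hosts = cycle(survivors)
    return {
        vip: (next(hosts) if host == failed_server else host)
        for vip, host in placement.items()
    }


if __name__ == "__main__":
    print(assign_clients(["c1", "c2", "c3", "c4"], VIRTUAL_IPS))
    placement = place_vips(VIRTUAL_IPS, SERVERS)
    print(fail_over(placement, SERVERS, "nas-b"))
```

In a real deployment the rotation is done by the DNS server and the address takeover by cluster software; the sketch only models the bookkeeping.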
- Clustered/Distributed File Systems. Clustered or distributed file systems bring a global namespace spanning multiple storage servers, intelligently spreading I/O and data to overcome the scalability limitations of NFS (and CIFS). Vendors implemented their own internal protocols; however, this is transparent to applications as shared storage is accessed via a standard POSIX interface and volumes appear as large shared storage.
Again Lustre, GPFS and GlusterFS are a few examples. The main architectural difference between these is how metadata is handled. Lustre uses a centralized metadata server, GPFS has a distributed metadata server model, and GlusterFS uses a no-metadata model. Understanding how metadata is handled is a key criterion in decision-making when selecting a clustered file system, but that is the subject of a different article.
Another difference is that Lustre and GPFS are implemented in the kernel, while GlusterFS is in user space. All three implement the key storage stack features mentioned earlier within the file system itself, and all of them scale to multiple petabytes.
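To make the metadata distinction concrete, here is a minimal Python sketch contrasting two of the models: a central lookup table versus computing a file's location from its name. It is an illustration under stated assumptions, not the algorithm used by Lustre, GPFS or GlusterFS. The class, the function and the server names are hypothetical, and the simple modulo hash shown would reshuffle files whenever the server count changes, which production systems take care to avoid.

```python
import hashlib

# Hypothetical storage servers.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]


class CentralMetadataServer:
    """Lookup-table model: one service records where every file lives,
    so every client must ask it before touching data."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.locations = {}  # path -> server name

    def create(self, path):
        # Place the new file on the server currently holding the fewest files.
        counts = {server: 0 for server in self.servers}
        for server in self.locations.values():
            counts[server] += 1
        target = min(self.servers, key=lambda server: counts[server])
        self.locations[path] = target
        return target

    def locate(self, path):
        return self.locations[path]


def locate_by_hash(path, servers):
    """No-lookup model: any client derives the location from the path
    alone, so there is no table to query or keep consistent."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    mds = CentralMetadataServer(SERVERS)
    mds.create("/data/report.csv")
    print("central lookup:", mds.locate("/data/report.csv"))
    print("hash placement:", locate_by_hash("/data/report.csv", SERVERS))
```

The trade-off is visible even at this scale: the table gives full control over placement but is a shared service every client depends on, while the hash needs no shared state but fixes placement by formula.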
- Distributed Object Storage/Key-Value Pair. Many modern applications require simple object API access to storage such as get/put, get/put property, lock/unlock and perhaps a few more APIs. Object access allows memory and disk to be treated as persistent blocks of information via serialization.
Most data processing functionality (e.g., XML parsing, hash tables, text search) has moved to higher-level application libraries. Object storage APIs are also suitable for building "NoSQL" scalable databases.
There is no standard for object storage access today, although WebDAV, Hadoop, Apache Cassandra and Amazon's proprietary S3 follow the model. These are good options when building a new application, but until standardization occurs, POSIX APIs will be a necessary evil and a multiprotocol approach is the way to go.
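Since no standard exists, the shape of such an API is easiest to show with a sketch. The in-memory Python class below is hypothetical and mirrors none of the products named above. It simply implements the operations listed earlier (get/put, get/put property, lock/unlock) and uses serialization so that any in-memory object can be stored as an opaque blob.

```python
import pickle


class ObjectStore:
    """Toy in-memory object store; all names here are illustrative."""

    def __init__(self):
        self._blobs = {}       # key -> serialized object
        self._properties = {}  # key -> {property name: value}
        self._locks = {}       # key -> owner currently holding the lock

    def _check_lock(self, key, owner):
        holder = self._locks.get(key)
        if holder is not None and holder != owner:
            raise PermissionError(f"{key!r} is locked by {holder!r}")

    def put(self, key, obj, owner=None):
        self._check_lock(key, owner)
        self._blobs[key] = pickle.dumps(obj)  # serialize to bytes

    def get(self, key):
        return pickle.loads(self._blobs[key])

    def put_property(self, key, name, value, owner=None):
        if key not in self._blobs:
            raise KeyError(key)
        self._check_lock(key, owner)
        self._properties.setdefault(key, {})[name] = value

    def get_property(self, key, name):
        return self._properties[key][name]

    def lock(self, key, owner):
        # setdefault returns the existing holder if the key is already locked.
        holder = self._locks.setdefault(key, owner)
        if holder != owner:
            raise PermissionError(f"{key!r} is locked by {holder!r}")

    def unlock(self, key, owner):
        if self._locks.get(key) != owner:
            raise PermissionError(f"{owner!r} does not hold the lock on {key!r}")
        del self._locks[key]


if __name__ == "__main__":
    store = ObjectStore()
    store.put("user/42", {"name": "Ada"})
    store.put_property("user/42", "content-type", "application/x-profile")
    store.lock("user/42", "writer-1")
    print(store.get("user/42"), store.get_property("user/42", "content-type"))
```

Note how small the surface is compared with POSIX: there are no directories, offsets or file descriptors, which is much of the appeal for application developers.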
Data storage is expected to grow to over 1,800 exabytes by 2012. To prepare for this massive growth, the key challenges that any organization will need to address are scalability, parallelism, and ensuring the ability to run new workload types.
Emerging Linux file system technologies are worth serious consideration, as they are positioned to solve these issues while at the same time commoditizing and transforming the storage market.
Anand Babu (AB) Periasamy is CTO and cofounder of Gluster, which provides an open source clustered file system that runs on commodity off-the-shelf hardware.