Why OVH is banking on OpenZFS
The OpenZFS Developer Summit took place on October 19-20 in San Francisco. Until now, OpenZFS has only been used by a few IT professionals. However, as it celebrates its tenth anniversary, it is growing ever stronger. In fact, the source files for the film "Gravity" amounted to 12 petabytes and were all stored on ZFS (1), and OpenZFS has been integrated into the latest release of Ubuntu as a native file system. OVH started using OpenZFS in 2009, and the company sent two representatives from its Storage team to the Developer Summit to learn about new developments and to offer the “Live migration with Zmotion” project as a contribution. OVH storage engineers François Lesage and Alexandre Lecuyer provide an overview.
OVH offers a patch to migrate data with no downtime
The OpenZFS project is derived from the ZFS project sponsored by Sun Microsystems (purchased by Oracle in 2009). It is the result of a fork in 2005 and is supported by an independent community consisting mainly of former Sun employees. Apart from Netflix, which uses OpenZFS on its Titan (2) platform, few major IT players claim to use OpenZFS, even though it has now proven reliable.
Traditional storage vendors offer reliable and robust proprietary solutions, which have an undeniable advantage: they reassure users that their data is durable. “The downside,” explains François, “is the price. These types of systems rely on specific hardware, which significantly raises the cost per gigabyte, and they are real black boxes in the sense that it's not possible to know how the code works, nor is it possible to modify the code yourself.” Compatible with standard machines (x86 architecture), OpenZFS drastically reduces the cost of storage. In addition, the code is open, so OVH is able to adapt and improve upon it. “Ten years after the launch of the project, OpenZFS has reached maturity,” continues François. “Its data integrity verification system, which prevents silent file corruption, is among the most effective. Moreover, OpenZFS offers a wealth of features: snapshots, hot and cold storage, etc.”
In 2007, OVH abandoned EXT3 to take advantage of ZFS, and the company became interested in OpenZFS in 2009. The first OVH projects in production under OpenZFS were completed at the end of 2011, and this technology now underlies many OVH services including e-mail, web hosting, Classic VPS, NAS-HA, Dedicated Cloud and Backup Storage. “At the time, the technology was already mature and the expertise we acquired allowed us to effectively compete with proprietary storage systems.”
François and Alexandre went to San Francisco to share this OpenZFS experience. “Our technical presentation covered the subject of data migration under OpenZFS. Our hosting activity requires us to continuously allocate and deallocate storage spaces (zpools), which causes fragmentation issues. As a consequence, we have to regularly perform data migrations. Unfortunately, due to constraints linked to the use of NFS in our infrastructures, OpenZFS doesn’t let us carry out these operations without downtime. We attempted to solve this problem through trial and error.” The patch, written by Alexandre and named “Zmotion”, was reviewed at a community workshop attended by Matt Ahrens, cofounder of the ZFS project. It was proposed for upstream inclusion and is already available on GitHub. “We think that this will interest large companies because data migration is a recurring subject on the community mailing list,” explains Alexandre.
View the “Live Migration with Zmotion” presentation by OVH:
9 contributions which will make OpenZFS even better
“In general, the presentations given by the different contributors were of a very high technical level,” François and Alexandre report. “They promise real progress for OpenZFS, from which users will benefit in the short, medium and long term.”
1 - Compressed ARC (Adaptive Replacement Cache)
OpenZFS operates with a two-level cache system (ARC and L2ARC) in which the most frequently accessed data is stored. The first level of cache draws its resources from RAM, while the second level is hosted on disks, usually SSDs. George Wilson from Delphix is aiming to make it possible to compress and decompress on the fly the data held in RAM, the first level of cache. For example, caching a .txt file would require about a third of the space. As a result, it would become possible to increase cache performance without adding more hardware. Availability in OpenZFS: short term.
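To get a rough feel for the gain, here is a toy sketch in Python. It is not OpenZFS code: the `CompressedCache` class is hypothetical, and zlib merely stands in for whichever compressor a pool actually uses. The sample text is highly repetitive and compresses far better than a typical file, so the printed ratio is an illustration, not a benchmark.

```python
import zlib


class CompressedCache:
    """Toy in-memory cache that keeps its values compressed (illustration only)."""

    def __init__(self):
        self._store = {}

    def put(self, key, data: bytes) -> None:
        # Compress on the way in, so RAM holds the smaller representation.
        self._store[key] = zlib.compress(data)

    def get(self, key) -> bytes:
        # Decompress on the fly when the data is read back.
        return zlib.decompress(self._store[key])

    def stored_bytes(self) -> int:
        return sum(len(v) for v in self._store.values())


text = b"OpenZFS keeps frequently read blocks in RAM. " * 200
cache = CompressedCache()
cache.put("notes.txt", text)

ratio = len(text) / cache.stored_bytes()
print(f"raw={len(text)} B, cached={cache.stored_bytes()} B, ratio={ratio:.1f}x")
```

The trade-off is a little CPU time on each read in exchange for fitting more data in the same amount of RAM.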
2 - Discontiguous caching with ABD
OpenZFS’s own caching system, ARC, duplicates the work of the OS cache, such as the page cache under Linux. This means that a file is actually cached twice: once in the OpenZFS cache and a second time in the OS cache, which wastes RAM. David Chen’s contribution (OSNexus) is intended to allow OpenZFS to use standard OS caching mechanisms. Availability in OpenZFS: mid term.
3 - Persistent L2ARC
The second-level OpenZFS caching mechanism is particularly sophisticated. The choice of cached data (on SSDs in most cases, since production data is hosted on slower disks) is the result of the work of two competing algorithms. This caching mechanism has a weak point: in the event of a reboot, the cache must be rebuilt. This operation can take several hours (up to 24 hours in the case of an OVH shared hosting filer cache) before maximum performance is regained. Saso Kiselkov's contribution (Nexenta) preserves the cache, including its hot data, even after a reboot, thanks to a modification of the format. Availability in OpenZFS: short term on Illumos.
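The idea of a cache that survives a reboot can be sketched as follows. This is a simplified model, not the actual L2ARC format: it assumes the cache is lost because the index describing what sits on the SSD lives only in memory, and it persists that index to a JSON file. The `L2Cache` class, the file name and the block identifiers are all hypothetical.

```python
import json
import os
import tempfile


class L2Cache:
    """Toy second-level cache whose index is saved next to the cached data."""

    def __init__(self, index_path: str):
        self.index_path = index_path
        self.index = {}
        # On start-up, reload the index if a previous run left one behind.
        if os.path.exists(index_path):
            with open(index_path) as f:
                self.index = json.load(f)

    def add(self, block_id: str, device_offset: int) -> None:
        self.index[block_id] = device_offset
        # Persist the index so it is still there after a restart.
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)

    def lookup(self, block_id: str):
        return self.index.get(block_id)


path = os.path.join(tempfile.mkdtemp(), "l2arc_index.json")
before = L2Cache(path)
before.add("block-42", 8192)

after_reboot = L2Cache(path)  # a fresh object stands in for the rebooted host
print(after_reboot.lookup("block-42"))
```

Without the persisted index, the fresh object would start empty and the cache would have to warm up again from scratch.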
4 - Writeback cache
The idea behind Alex Aizman’s contribution (Nexenta) is to put in place a write buffer mechanism that would be separate from the disks dedicated to logs. For example, during write bursts, data is written to an SSD or a PCI card, then transferred asynchronously to the pool of disks. Availability in OpenZFS: unknown.
5 - Compressed Send and Receive
Dan Kimmel’s contribution (Delphix) consists of directly sending a compressed data stream during a backup or migration. This makes it possible to eliminate the decompression step when sending and the recompression step when receiving. Result: bandwidth consumption is reduced, backups complete faster and the CPU load during backup is lower. Availability in OpenZFS: short to medium term.
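The difference between the two approaches can be illustrated with a small Python model. It is a sketch under stated assumptions: blocks are byte strings, zlib stands in for the pool's compressor, and the four functions are hypothetical names, not OpenZFS APIs.

```python
import zlib


def legacy_send(stored_blocks):
    """Sender decompresses every block, so raw bytes cross the network."""
    return [zlib.decompress(b) for b in stored_blocks]


def legacy_receive(wire_blocks):
    """Receiver must compress again before writing to its own pool."""
    return [zlib.compress(b) for b in wire_blocks]


def compressed_send(stored_blocks):
    """Sender ships the blocks exactly as they are stored."""
    return list(stored_blocks)


def compressed_receive(wire_blocks):
    """Receiver writes the blocks as they arrive, with no recompression."""
    return list(wire_blocks)


raw = [(b"block %d " % i) * 500 for i in range(4)]
source_pool = [zlib.compress(b) for b in raw]  # data at rest is compressed

legacy_wire = legacy_send(source_pool)
compressed_wire = compressed_send(source_pool)
legacy_bytes = sum(map(len, legacy_wire))
compressed_bytes = sum(map(len, compressed_wire))
print(f"legacy: {legacy_bytes} B on the wire, compressed: {compressed_bytes} B")
```

Both paths end with the same data on the destination; the compressed path simply skips two CPU-bound steps and moves fewer bytes.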
6 - Resumable ZFS send/receive
This contribution by Matthew Ahrens (Delphix), presented in Paris six months ago at the OpenZFS European Conference, makes it possible to resume a backup with a token in the event of a network failure during a transfer. Today, an interrupted backup has to be restarted from zero. This is already available on Illumos and FreeBSD, and will soon be on Linux.
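The principle of resuming with a token can be sketched in a few lines of Python. This is a toy model, not the OpenZFS stream format: the token here is simply the number of bytes the receiver has safely stored (an assumption made for illustration), and `Receiver` and `send` are hypothetical names.

```python
class Receiver:
    """Toy receiver that remembers how much of the stream it has stored."""

    def __init__(self):
        self.data = bytearray()

    def resume_token(self) -> int:
        # Hypothetical token: the count of bytes safely received so far.
        return len(self.data)

    def receive(self, chunk: bytes) -> None:
        self.data.extend(chunk)


def send(stream: bytes, receiver: Receiver, token: int = 0,
         chunk_size: int = 4, fail_after=None) -> int:
    """Send from byte offset `token`; optionally simulate a network failure.

    Returns the number of chunks sent during this call.
    """
    sent = 0
    for start in range(token, len(stream), chunk_size):
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network failure")
        receiver.receive(stream[start:start + chunk_size])
        sent += 1
    return sent


stream = b"snapshot-data-0123456789"  # 24 bytes, i.e. 6 chunks of 4
rx = Receiver()
try:
    send(stream, rx, fail_after=3)  # the link drops after 3 chunks
except ConnectionError:
    pass

token = rx.resume_token()
remaining_chunks = send(stream, rx, token=token)  # resume, do not restart
print(f"resumed at byte {token}, sent {remaining_chunks} more chunks")
```

In this run the second call only has to send the second half of the stream, which is the whole point on a multi-terabyte backup.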
7 - Parity Declustered RAID-Z/Mirror
This contribution by Isaac Huang (Intel) is based on extensive research in applied mathematics aimed at faster reconstruction of RAID-Z (OpenZFS RAID) in case of disk failure. Nowadays, with large-capacity infrastructure, resilvering a RAID-Z sometimes takes days, and a second disk in the RAID-Z could fail before the reconstruction completes. Should this happen, there is a great risk that data will be lost. By optimizing the distribution of the data blocks on the disks, this project, still at the R&D stage, considerably reduces the time necessary to rebuild a RAID.
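A back-of-the-envelope model shows why spreading the rebuild over many disks helps. The figures used here (an 8 TB disk, 100 MB/s of sustained writes, 20 disks sharing the work) are assumptions chosen for illustration, and the model is idealized: it ignores read bottlenecks, parity computation and production load.

```python
def rebuild_hours(disk_tb: float, write_mb_per_s: float,
                  parallel_writers: int = 1) -> float:
    """Idealized rebuild time: data to rewrite divided by aggregate write speed."""
    megabytes = disk_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    seconds = megabytes / (write_mb_per_s * parallel_writers)
    return seconds / 3600


# Classic rebuild: everything is rewritten onto one replacement disk.
classic = rebuild_hours(8, 100)
# Declustered layout: the rewritten blocks are spread over 20 disks.
declustered = rebuild_hours(8, 100, parallel_writers=20)

print(f"single spare disk: {classic:.1f} h, spread over 20 disks: {declustered:.1f} h")
```

With these assumed numbers the window during which a second failure is dangerous shrinks from roughly 22 hours to just over one hour.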
8 - Dedup Ceiling
Data deduplication has existed for a long time in OpenZFS. In practice it is hardly used, because it can be risky when handling a large number of files. The dedup table is stored in RAM and, when it becomes too large, the response time of the deduplication system increases dramatically. Saso Kiselkov's contribution (Nexenta) is intended to store the dedup table on a dedicated disk, with the ability to predefine its maximum size. Result: deduplication of files finally becomes usable! The impact on OVH shared web hosting could be considerable. For example, the core files of WordPress are replicated millions of times, so deduplication would provide very significant space savings. Availability in OpenZFS: mid term.
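A capped dedup table can be modelled in a few lines of Python. This is a sketch, not the proposed implementation: the `DedupStore` class is hypothetical, SHA-256 is used as the block checksum for convenience, and what happens once the ceiling is reached is an assumption (here, new blocks are simply stored without being tracked).

```python
import hashlib


class DedupStore:
    """Toy block store with a dedup table limited to `max_entries` checksums."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.table = {}          # checksum -> reference count
        self.blocks_written = 0  # blocks that actually consumed space

    def write(self, block: bytes) -> bool:
        """Store a block; return True if it was deduplicated."""
        key = hashlib.sha256(block).digest()
        if key in self.table:
            self.table[key] += 1  # known block: only bump the refcount
            return True
        if len(self.table) < self.max_entries:
            self.table[key] = 1   # track it while there is room
        # Assumption: past the ceiling, blocks are written but not tracked.
        self.blocks_written += 1
        return False


store = DedupStore(max_entries=2)
wp_core = b"identical WordPress core file"
for _ in range(1000):
    store.write(wp_core)          # 1000 sites, one physical copy

store.write(b"site A upload")     # fills the second and last table slot
store.write(b"site B upload")     # table is full: stored, not tracked
store.write(b"site B upload")     # so this duplicate is written again

print(f"{store.blocks_written} blocks written, {len(store.table)} table entries")
```

The table never grows beyond its ceiling, so lookups stay predictable, at the cost of missing some duplicates once it is full.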
9 - SPA Metadata Allocation Classes
Don Brady's contribution (Intel) is an opportunity to remember that the heart of OpenZFS is an object store, made usable as a file system and relying on metadata objects to reconstruct a tree structure. Some objects are thus stored files while others are metadata (UNIX, ACL, etc.), and these objects are indiscriminately mixed in the ZFS pool. The goal of Don Brady's work is to make it possible to organize objects based on their nature and, for example, to assign metadata objects to SSDs in order to increase the pool's performance. Furthermore, it proposes to associate devices (SATA, SSD, NVMe, disks…) with certain types of metadata. Availability in OpenZFS: unknown.
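Routing objects to devices according to their nature can be sketched as a simple policy lookup. Everything in this Python example is hypothetical: the `POLICY` mapping, the `allocate` function and the object names are made up for illustration and do not reflect the actual OpenZFS allocator.

```python
from collections import defaultdict

# Hypothetical policy: which device class serves which kind of object.
POLICY = {"metadata": "ssd", "data": "hdd"}


def allocate(objects, policy=POLICY, default="hdd"):
    """Group object names by the device class chosen for their kind."""
    placement = defaultdict(list)
    for name, kind in objects:
        # Unknown kinds fall back to the default device class.
        placement[policy.get(kind, default)].append(name)
    return dict(placement)


objects = [
    ("photo.jpg", "data"),
    ("photo.jpg ACL", "metadata"),
    ("dir entry /home", "metadata"),
    ("backup.tar", "data"),
]
placement = allocate(objects)
print(placement)
```

Small, frequently read metadata ends up on the fast devices, while bulk file contents stay on cheaper disks.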