Survey of Distributed File Systems
Table of Contents
1. Coda
The Coda distributed file system is a state-of-the-art experimental file system developed in the group of M. Satyanarayanan at Carnegie Mellon University. Numerous people contributed to Coda, which now incorporates many features not found in other systems:
- Mobile Computing
  - disconnected operation for mobile clients
  - reintegration of data from disconnected clients
  - bandwidth adaptation
- Failure Resilience
  - read/write replication servers
  - resolution of server/server conflicts
  - handles network failures which partition the servers
  - handles disconnection of clients
- Performance and scalability
  - client-side persistent caching of files, directories, and attributes for high performance
  - write-back caching
  - Kerberos-like authentication
  - access control lists (ACLs)
- Well-defined semantics of sharing
- Freely available source code
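The disconnected-operation and reintegration features above can be sketched in miniature. The following toy client (all class and method names are invented for illustration; this is not Coda's actual API) caches files locally, logs updates made while disconnected, and replays the log against the server when the connection returns.

```python
# Illustrative sketch of disconnected operation, not Coda's real implementation.

class Server:
    """A trivial stand-in for a file server."""
    def __init__(self):
        self.files = {}

    def write(self, name, data):
        self.files[name] = data

class DisconnectableClient:
    """Client with a local cache and a replay log for offline updates."""
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cache = {}   # client-side persistent cache
        self.log = []     # updates made while disconnected, awaiting reintegration

    def disconnect(self):
        self.connected = False

    def write(self, name, data):
        self.cache[name] = data          # always update the local cache
        if self.connected:
            self.server.write(name, data)
        else:
            self.log.append((name, data))  # defer until reintegration

    def reconnect(self):
        self.connected = True
        for name, data in self.log:      # reintegrate queued updates
            self.server.write(name, data)
        self.log.clear()
```

While disconnected, reads and writes are served entirely from the cache; reintegration simply replays the logged updates (real Coda additionally has to detect and resolve conflicting updates, which this sketch omits).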
Coda was originally implemented on Mach 2.6 and has recently been ported to Linux, NetBSD, and FreeBSD. Michael Callahan ported a large portion of Coda to Windows 95, and we are studying Windows NT to understand the feasibility of porting Coda to NT. Currently, our efforts are focused on the ports and on making the system more robust. A few new features are being implemented (write-back caching and cells, for example), and in several areas components of Coda are being reorganized.
- Braam, P. J. The Coda Distributed File System. Linux Journal, #50, June 1998.
- Satyanarayanan, M. Fundamental Challenges in Mobile Computing. Fifteenth ACM Symposium on Principles of Distributed Computing, May 1996.
- Satyanarayanan, M. Mobile Information Access. IEEE Personal Communications, Vol. 3, No. 1, February 1996.
- Noble, B., Satyanarayanan, M. A Research Status Report on Adaptation for Mobile Data Access. SIGMOD Record, Vol. 24, No. 4, December 1995.
- Satyanarayanan, M. Scalable, Secure, and Highly Available Distributed File Access. IEEE Computer, Vol. 23, No. 5, May 1990.
- Satyanarayanan, M. Coda: A Highly Available File System for a Distributed Workstation Environment. Proceedings of the Second IEEE Workshop on Workstation Operating Systems, Sep. 1989, Pacific Grove, CA.
- Satyanarayanan, M. Autonomy or Interdependence in Distributed Systems? Third ACM SIGOPS European Workshop, Sep. 1988, Cambridge, England.
- Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A Resilient Distributed File System. IEEE Workshop on Workstation Operating Systems, Nov. 1987.
2. Distributed File System (Microsoft)
DFS allows administrators to group shared folders located on different servers by transparently connecting them to one or more DFS namespaces. A DFS namespace is a virtual view of shared folders in an organization. Using the DFS tools, an administrator selects which shared folders to present in the namespace, designs the hierarchy in which those folders appear, and determines the names that the shared folders show in the namespace. When a user views the namespace, the folders appear to reside on a single, high-capacity hard disk. Users can navigate the namespace without needing to know the server names or shared folders hosting the data. DFS also provides other benefits, including the following:
- Simplified data migration
  DFS simplifies the process of moving data from one file server to another.
- Increased availability of file server data
  In the event of a server failure, DFS refers client computers to the next available server, so users can always access shared folders without interruption.
- Load sharing
  DFS provides a degree of load sharing by mapping a given logical name to shared folders on multiple file servers.
- Security integration
  Administrators do not need to configure additional security for DFS namespaces, because file and folder security is enforced by the existing NTFS file system and shared folder permissions on each target.
The following terms are used to describe the basic components of DFS:
- DFS namespace
  A virtual view of shared folders on different servers as provided by DFS. A DFS namespace consists of a root and many links and targets. The namespace starts with a root that maps to one or more root targets. Below the root are links that map to their link targets.
- DFS link
  A component in a DFS path that lies below the root and maps to one or more link targets.
- DFS path
  Any Universal Naming Convention (UNC) path that starts with a DFS root.
- DFS root
  The starting point of the DFS namespace. The root is often used to refer to the namespace as a whole. A root maps to one or more root targets, each of which corresponds to a shared folder on a separate server. The DFS root must reside on an NTFS volume. A DFS root has one of the following formats: \\ServerName\RootName or
- domain-based DFS namespace
  A DFS namespace that has configuration information stored in Active Directory. The path to access the root or a link starts with the host domain name. A domain-based DFS root can have multiple root targets, which offers fault tolerance and load sharing.
- link referral
  A type of referral that contains a list of link targets for a particular link.
- link target
  The mapping destination of a link. A link target can be any UNC path. For example, a link target could be a shared folder or another DFS path.
- referral
  A list of targets, transparent to the user, which a DFS client receives from DFS when the user is accessing a root or a link in the DFS namespace. The referral information is cached on the DFS client for a time period specified in the DFS configuration.
- root referral
  A type of referral that contains a list of root targets for a particular root.
- root target
  A physical server that hosts a DFS namespace. A domain-based DFS root can have multiple root targets, whereas a stand-alone DFS root can only have one root target. Root targets are also called root servers.
- stand-alone DFS namespace
  A DFS namespace whose configuration information is stored locally in the registry of the root server. The path to access the root or a link starts with the root server name. A stand-alone DFS root has only one root target. Stand-alone roots are not fault tolerant; when the root target is unavailable, the entire DFS namespace is inaccessible. You can make stand-alone DFS roots fault tolerant by creating them on server clusters.
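The terminology above can be tied together with a small sketch of DFS-style name resolution. All names here (classes, server and share names) are invented for illustration, not Microsoft's API: a namespace maps its root and links to lists of targets, a referral is simply such a target list, and the client walks the list and fails over to the next reachable target.

```python
# Hypothetical sketch of DFS-style name resolution (invented names, not an API).

class DfsNamespace:
    def __init__(self, root, root_targets):
        self.root = root                    # e.g. r"\\Contoso\Public"
        self.root_targets = list(root_targets)
        self.links = {}                     # link name -> list of link targets

    def add_link(self, link, targets):
        self.links[link] = list(targets)

    def referral(self, path):
        """Return the target list (referral) for a root or root\\link path."""
        if path == self.root:
            return self.root_targets        # root referral
        prefix = self.root + "\\"
        if path.startswith(prefix):
            link = path[len(prefix):]
            if link in self.links:
                return self.links[link]     # link referral
        raise KeyError(path)

def resolve(namespace, path, available):
    """Pick the first target in the referral that is currently reachable."""
    for target in namespace.referral(path):
        if target in available:
            return target
    raise OSError("no available target for " + path)
```

A domain-based root would simply list several root targets, so the same failover loop in `resolve` gives the increased availability and load sharing described above.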
3. Fraunhofer Parallel File System (FhGFS)
The Fraunhofer Parallel File System (FhGFS) is the new parallel file system from the Fraunhofer Competence Center for High Performance Computing. FhGFS is written from scratch and incorporates results from our experience with existing systems. FhGFS is a fully POSIX-compliant, scalable file system with features such as:
- Distributed metadata:
  Although parallel file systems usually distribute the file contents over multiple storage nodes, the metadata is often bound to single nodes. This leads to performance bottlenecks and limited fault tolerance. FhGFS distributes the metadata across all the available storage nodes in a special way that keeps the lookup time at a minimum.
- Easy installation:
  FhGFS requires no kernel patches, is able to connect storage nodes and servers with zero configuration, and allows you to add more clients and storage nodes to the running system whenever you want.
- Support for high-performance technologies:
  FhGFS is built on a scalable multithreaded architecture with native InfiniBand support. Storage nodes can serve InfiniBand and Ethernet clients at the same time and automatically switch to a redundant connection path in case any of them fails.
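The distributed-metadata idea can be illustrated with a simple hash-based placement sketch. This shows the general technique, not FhGFS's actual algorithm (the function name is invented): because ownership is computed from the path, any node can locate a path's metadata server in constant time without consulting a central table, which is what removes the single-node bottleneck.

```python
import hashlib

def metadata_node(path, nodes):
    """Deterministically assign a directory path to one metadata node.

    Hashing the path spreads metadata across all nodes while keeping
    lookup O(1): no central metadata server needs to be asked.
    """
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]
```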
4. Lustre
Lustre is an object-based, distributed file system, generally used for large scale cluster computing. The name Lustre is a blend of the words Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.
Lustre file systems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their data centers. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors.
A Lustre file system has three major functional units:
- A single metadata target (MDT) per filesystem that stores metadata, such as filenames, directories, permissions, and file layout, on the metadata server (MDS).
- One or more object storage targets (OSTs) that store file data on one or more object storage servers (OSSes). Depending on the server's hardware, an OSS typically serves between two and eight targets, each target a local disk filesystem up to 8 terabytes (TB) in size. The capacity of a Lustre file system is the sum of the capacities provided by the targets.
- Client(s) that access and use the data. Lustre presents all clients with standard POSIX semantics and concurrent read and write access to the files in the filesystem.
The MDT, OST, and client can be on the same node or on different nodes, but in typical installations, these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
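The offset-to-object mapping the LOV performs can be sketched as round-robin striping. This is the general technique rather than Lustre's exact code, and the function and parameter names are invented: a file is cut into fixed-size stripes, stripe *i* lands on OST object *i mod stripe_count*, and a byte range decomposes into per-object pieces that can be read or written in parallel.

```python
# Illustrative round-robin striping in the spirit of Lustre's LOV
# (function and parameter names invented).

def map_extent(offset, size, stripe_size, stripe_count):
    """Decompose a file byte range into (stripe_index, object_offset, length)
    pieces, one per contiguous chunk on a single OST object."""
    pieces = []
    end = offset + size
    while offset < end:
        stripe = offset // stripe_size           # global stripe number
        ost = stripe % stripe_count              # which OST object it lands on
        within = offset % stripe_size            # offset inside this stripe
        length = min(stripe_size - within, end - offset)
        obj_off = (stripe // stripe_count) * stripe_size + within
        pieces.append((ost, obj_off, length))
        offset += length
    return pieces
```

Because consecutive stripes live on different OSTs, a large request turns into several pieces the client can issue concurrently, which is why aggregate bandwidth grows with the number of OSTs.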
Clients do not directly modify the objects on the OST filesystems, but instead delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and risk filesystem corruption from misbehaving or defective clients.
Features and Benefits
Lustre's unprecedented scalability, bulletproof reliability, and proven performance help you meet the uptime requirements of your most demanding business and national-security applications.
- Unparalleled scalability
- Proven reliability
- High performance
- Multi-platform support
- Lustre File System: High-Performance Storage Architecture and Scalable Cluster File System.
5. Parallel Virtual File System (PVFS, PVFS2)
The Parallel Virtual File System (PVFS) is an Open Source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for
use in large scale cluster computing.
PVFS focuses on high-performance access to large data sets. It consists of a server process and a client library, both of which are written entirely as user-level code. A Linux kernel module and a pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides high-performance access via the Message Passing Interface (MPI).
PVFS is being jointly developed by the Parallel Architecture Research Laboratory at Clemson University, the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center.
PVFS development has been funded by NASA Goddard Space Flight Center, DOE Argonne National Laboratory, NSF PACI and HECURA programs, and other government and private agencies.
PVFS brings state-of-the-art parallel I/O concepts to production parallel systems. It is designed to scale to petabytes of storage and provide access rates of hundreds of GB/s. Its main features are:
- High performance
  PVFS is designed to provide high performance for parallel applications, where concurrent, large IO and many file accesses are common. PVFS provides dynamic distribution of IO and metadata, avoiding single points of contention, and allowing for scaling to high-end terascale and petascale systems.
- Stateless design
  PVFS relaxes the POSIX consistency semantics where necessary to improve stability and performance. Cluster file systems that enforce POSIX consistency require stateful clients with locking subsystems, reducing the stability of the system in the face of failures. These systems can be difficult to maintain due to the overhead of lock management. PVFS clients are stateless, allowing for effortless failure recovery and easy integration with industry-standard high-availability systems. Read more on the PVFS consistency model and setting up PVFS for high availability.
- Optimized MPI-IO support
PVFS is designed to support a number of access models, from collective I/O to
independent I/O as well as non-contiguous and structured access patterns. PVFS
provides an object-based, stateless client interface, leading to optimizations for
metadata operations within MPI-IO. See the ROMIO site for further information.
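The noncontiguous and structured access patterns mentioned above can be pictured with a small sketch. The helper names are invented for illustration and this is not PVFS or ROMIO code: a strided pattern (as an MPI-IO derived datatype would describe) is flattened into a list of (offset, length) extents, and adjacent extents are coalesced so that contiguous patterns collapse into one large request.

```python
# Invented helpers illustrating noncontiguous (strided) I/O descriptions.

def strided_extents(start, block_len, stride, count):
    """One (offset, length) extent per block of a strided access pattern."""
    return [(start + i * stride, block_len) for i in range(count)]

def coalesce(extents):
    """Merge adjacent extents so contiguous patterns become one request."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # Extend the previous extent instead of issuing a new request.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged
```

Shipping such an extent list to the server in one operation, rather than one request per block, is the kind of optimization that makes noncontiguous MPI-IO access efficient.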
- Hardware Independence
PVFS operates on a wide variety of systems, including IA32, IA64, Opteron, PowerPC,
Alpha, and MIPS. It is easily integrated into local or shared storage configurations, and
provides support for high-end networking fabrics, such as Infiniband or Myrinet.
- Painless Deployment
PVFS was designed to be easy to deploy and manage. It builds out of the box on a wide variety of Linux distributions, and has only a few required dependencies. PVFS builds and runs directly on your Linux installation; it does not require kernel patches or specific kernel versions.
- Research Platform
PVFS is developed by a multi-institution team of parallel I/O, networking, and storage experts. It embodies the expertise of designers who have worked for over a decade in the field of parallel I/O. PVFS continues to be used as a platform for active research in the parallel I/O field; the code is designed to be easy to augment for research purposes. Review the latest papers on research using PVFS.
- Ames Laboratory Demonstrates Ultra-Fast PVFS Transport. HPC Wire, 2007.
- Resilient PVFS, Yes It Is Possible. ClusterMonkey (clustermonkey.net), 2005.
- Avery Ching, Robert Ross, Wei-keng Liao, Lee Ward, Alok Choudhary. Noncontiguous locking techniques for parallel file systems. Proceedings of Supercomputing, 2007.
- Ananth Devulapalli, Pete Wyckoff. File creation strategies in a distributed metadata file system. Proceedings of IPDPS '07, 2007.
- Weikuan K. Yu, Shuang Liang, Dhabaleswar K. Panda. High performance support of parallel virtual file system (PVFS2) over Quadrics. International Conference on Supercomputing (ICS '05), 2005.
- Dean Hildebrand, Peter Honeyman. Exporting Storage Systems in a Scalable Manner with pNFS. 13th NASA Goddard Conference on Mass Storage Systems and Technologies.
6. Ceph
Ceph is a distributed network file system designed to provide excellent performance, reliability, and scalability. Ceph fills two significant gaps in the array of currently available file systems:
- Robust, open-source distributed storage — Ceph is released under the terms of the LGPL, which means it is free software (as in speech and beer). Ceph will provide a variety of key features that are generally lacking from existing open-source file systems, including seamless scalability (the ability to simply add disks to expand volumes), intelligent load balancing, and efficient, easy-to-use snapshot functionality.
- Scalability — Ceph is built from the ground up to seamlessly and gracefully scale from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file, or write to the same directory — usage scenarios that bring typical enterprise storage systems to their knees.
Here are some of the key features that make Ceph different from existing file systems that
you may have worked with:
- Seamless scaling — A Ceph filesystem can be seamlessly expanded by simply adding storage nodes (OSDs). However, unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data. This effectively utilizes all available resources (disk bandwidth and spindles) and avoids data hot spots (e.g., active data residing primarily on old disks while newer disks sit empty and idle).
- Strong reliability and fast recovery — All data in Ceph is replicated across multiple OSDs.
If any OSD fails, data is automatically re-replicated to other devices. However, unlike
typical RAID systems, the replicas for data on each disk are spread out among a large
number of other disks, and when a disk fails, the replacement replicas are also
distributed across many disks. This allows recovery to proceed in parallel (with dozens
of disks copying to dozens of other disks), removing the need for explicit “spare” disks
(which are effectively wasted until they are needed) and preventing a single disk from
becoming a “RAID rebuild” bottleneck.
- Adaptive MDS — The Ceph metadata server (MDS) is designed to dynamically adapt its
behavior to the current workload. As the size and popularity of the file system
hierarchy changes over time, that hierarchy is dynamically redistributed among
available metadata servers in order to balance load and most effectively use server
resources. (In contrast, current file systems force system administrators to carve their
data set into static “volumes” and assign volumes to servers. Volume sizes and
workloads inevitably shift over time, forcing administrators to constantly shuffle data
between servers or manually allocate new resources where they are currently needed.)
Similarly, if thousands of clients suddenly access a single file or directory, that metadata
is dynamically replicated across multiple servers to distribute the workload.
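The placement and recovery behavior described above can be approximated with a toy hash-based placement function. This is only a stand-in for Ceph's actual CRUSH algorithm (the function and names are invented): it shows how replica locations can be *computed* by any client rather than looked up in a table, and how a failed device's replicas re-map across many surviving OSDs instead of onto a single spare.

```python
# Toy deterministic placement, a simplified stand-in for CRUSH (not the real
# algorithm; names invented for illustration).

import hashlib

def place(obj, osds, replicas=3):
    """Choose `replicas` distinct OSDs for an object, deterministically."""
    chosen = []
    want = min(replicas, len(osds))
    for attempt in range(1000):          # bounded retry on hash collisions
        if len(chosen) == want:
            break
        h = hashlib.sha1(f"{obj}:{attempt}".encode()).digest()
        osd = osds[int.from_bytes(h[:4], "big") % len(osds)]
        if osd not in chosen:            # replicas must land on distinct OSDs
            chosen.append(osd)
    return chosen
```

When an OSD fails, recomputing placement over the surviving OSDs yields new homes for its replicas spread across many devices, so recovery proceeds in parallel (real CRUSH additionally minimizes how much unrelated data moves, which this toy version does not).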
- Sage A. Weil. Ceph: Reliable, Scalable, and High-Performance Distributed Storage. Ph.D. thesis, University of California, Santa Cruz, December 2007.
- Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn. RADOS: A Fast, Scalable, and Reliable Storage Service for Petabyte-scale Storage Clusters. Petascale Data Storage Workshop SC07, November 2007.
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI '06), November 2006.
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. Proceedings of SC '06, November 2006.
- Sage Weil, Kristal Pollack, Scott A. Brandt, Ethan L. Miller. Dynamic Metadata Management for Petabyte-Scale File Systems. Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), November 2004.
- Qin Xin, Ethan L. Miller, Thomas Schwarz. Evaluation of Distributed Recovery in Large-Scale Storage Systems. Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC 2004), June 2004, pages 172-181.
7. dCache
The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods. Depending on the Persistency Model, dCache provides methods for exchanging data with backend (tertiary) storage systems as well as space management, pool attraction, dataset replication, hot-spot determination, and recovery from disk or node failures. Connected to a tertiary storage system, the cache simulates unlimited direct-access storage space. Data exchanges to and from the underlying HSM are performed automatically and invisibly to the user. Filesystem namespace operations may be performed through a standard nfs(2) interface.
- supports requesting data from a tertiary storage system
  A tertiary storage system typically uses a robotic tape system, where data is stored on a tape from a library of available tapes, which must be loaded and unloaded using a tape robot. Tertiary storage systems typically have a higher initial cost, but can be extended cheaply by adding additional tapes. This results in tertiary storage systems being popular where large amounts of data must be read.
- provides many transfer protocols (allowing users to read and write data)
- hot-spot data migration
  In this process, dCache will detect when a few files are being requested very often. If this happens, dCache can make duplicate copies of the popular files on other computers. This allows the load to be spread across multiple machines, so increasing performance.
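The hot-spot migration just described can be sketched as a simple threshold policy. All names here are invented for illustration and this is not dCache's actual pool-manager logic: a manager counts requests per file and, once a file crosses the threshold, duplicates it onto a pool that does not yet hold it, so subsequent requests can be spread across more machines.

```python
# Invented sketch of threshold-based hot-spot replication (not dCache code).

class PoolManager:
    def __init__(self, pools, threshold=3):
        self.pools = {p: set() for p in pools}   # pool name -> files it holds
        self.hits = {}                           # file -> request count
        self.threshold = threshold

    def add(self, pool, name):
        self.pools[pool].add(name)

    def request(self, name):
        """Record a request and return the pools currently holding the file."""
        self.hits[name] = self.hits.get(name, 0) + 1
        holders = [p for p, files in self.pools.items() if name in files]
        if self.hits[name] >= self.threshold and len(holders) < len(self.pools):
            # Hot file: duplicate it onto a pool that does not hold it yet.
            target = next(p for p in self.pools if name not in self.pools[p])
            self.pools[target].add(name)
        return holders
```

A real implementation would also age the counters and weigh pool load and free space when choosing the target pool; the sketch keeps only the detect-and-duplicate core.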