Survey of distributed file system

By Samuel Turner,2014-10-25 23:03
8 views 0
Survey of distributed file system

    Survey of distributed file system

    1. Coda

    Table of Contents

    ; Introduction

    ; Publications


     The Coda distributed file system is a state of the art experimental file system developed in the group of M. Satyanarayanan at Carnegie Mellon University. Numerous people contributed to Coda which now incorporates many features not found in other systems: Mobile Computing

    ; disconnected operation for mobile clients

    o reintegration of data from disconnected clients

    o bandwidth adaptation

    ; Failure Resilience

    o read/write replication servers

    o resolution of server/server conflicts

    o handles of network failures which partition the servers

    o handles disconnection of clients client

    ; Performance and scalability

    o client side persistent caching of files, directories and attributes for high


    o write back caching

    ; Security

    o kerberos like authentication

    o access control lists (ACL's)

    ; Well defined semantics of sharing

    ; Freely available source code

    Coda was originally implemented on Mach 2.6 and has recently been ported to Linux, NetBSD and FreeBSD.Michael Callahan ported a large portion of Coda to Windows 95, and we are studying Windows NT to understand the feasibility of porting Coda to NT Currently, our efforts are on ports and on making the system more robust. A few new features arc being implemented (write-back caching and cells for example), and in several areas, components of Coda are being reorganized.


    ; Braam, P. J. The Coda Distributed File System. Linux Journal, #50 June 1998

    ; Satyanarayanan, M. Fundamental Challenges in Mobile Computing. Fifteenth

    ACM Symposium on Principles of Distributed Computing May 1996,

    Philadelphia, PA

    ; Satyanarayanan, M. Mobile Information Access. IEEE Personal

    Communications, Vol. 3, No. 1, February 1996

    ; Noble, B., Satyanarayanan, M. A Research Status Report on Adaptation for

    Mobile Data Access. SIGMOD Record, Vol. 24, No. 4, December 1995.

    ; Satyanarayanan, M. Scalable, Secure, and Highly Available Distributed File

    Access. IEEE ComputerMay 1990, Vol. 23, No. 5

    ; Satyanarayanan, M. Coda: A Highly Available File System for a Distributed

    Workstation Environment. Proceedings of the Second IEEE Workshop on

    Workstation Operating Systems Sep. 1989, Pacific Grove, CA

    ; Satyanarayanan, M. Autonomy or Interdependence in Distributed Systems?

    Third ACM SIGOPS European Workshop Sep. 1988, Cambridge, England

    ; Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A Resilient Distributed File

    System. IEEE Workshop on Workstation Operating Systems, Nov. 1987,

    Cambridge, MA

2. Distributed File System(Microsoft)

    Table of Contents

    ; Introduction

    ; DFS Terminology


    DFS allows administrators to group shared folders located on different servers by transparently connecting them to one or more DFS namespaces. A DFS namespace is a virtual view of shared folders in an organization. Using the DFS tools, an administrator selects which shared folders to present in the namespace, designs the hierarchy in which those folders appear, and determines the names that the shared folders show in the namespace. When a user views the namespace, the folders appear to reside on a single, high-capacity hard disk. Users can navigate the namespace without needing to know the server names or shared folders hosting the data. DFS also provides other benefits, including the following:

    ; Simplified data migration

    DFS simplifies the process of moving data from one file server to another.

    ; Increased availability of file server data

    in the event of a server failure, DFS refers client computers to the next available server,

    so users can always access shared folders without interruption

; Load sharing

    DFS provides a degree of load sharing by mapping a given logical name to shared folders

    on multiple file servers.

    ; Security integration

    Administrators do not need to configure additional security for DFS namespaces

    because file and folder security is enforced by existing the NTFS file system and shared

    folder permissions on each target.

    DFS Terminology

     The following terms are used to describe the basic components of DFS:

    ; DFS namespace

    A virtual view of shared folders on different servers as provided by DFS. A DFS

    namespace consists of a root and many links and targets. The namespace starts with a

    root that maps to one or more root targets. Below the root are links that map to their

    own targets.

    ; DFS link

    A component in a DFS path that lies below the root and maps to one or more link


    ; DFS path

    Any Universal Naming Convention (UNC) path that starts with a DFS root. ; DFS root

    The starting point of the DFS namespace. The root is often used to refer to the

    namespace as a whole. A root maps to one or more root targets, each of which

    corresponds to a shared folder on a separate server. The DFS root must reside on an

    NTFS volume. A DFS root has one of the following formats: \\ServerName\RootName or


    ; domain-based DFS namespace

    A DFS namespace that has configuration information stored in Active Directory. The

    path to access the root or a link starts with the host domain name. A domain-based DFS

    root can have multiple root targets, which offers fault tolerance and load sharing. ; link referral

    A type of referral that contains a list of link targets for a particular link. ; link target

    The mapping destination of a link. A link target can be any UNC path. For example, a link

    target could be a shared folder or another DFS path.

    ; Referral

    A list of targets, transparent to the user, which a DFS client receives from DFS when the

    user is accessing a root or a link in the DFS namespace. The referral information is

    cached on the DFS client for a time period specified in the DFS configuration. ; root referral

    A type of referral that contains a list of root targets for a particular root. ; root target

    A physical server that hosts a DFS namespace. A domain-based DFS root can have

    multiple root targets, whereas a stand-alone DFS root can only have one root target.

    Root targets are also called root servers.

    ; stand-alone DFS namespace

    A DFS namespace whose configuration information is stored locally in the registry of the

    root server. The path to access the root or a link starts with the root server name. A

    stand-alone DFS root has only one root target. Stand-alone roots are not fault tolerant;

    when the root target is unavailable, the entire DFS namespace is inaccessible. You can

    make stand-alone DFS roots fault tolerant by creating them on server clusters.

3. Fraunhofer Parallel file System(FhGFS)

    Table of Contents

    ; Introduction


     Fraunhofer Parallel file System(FhGFS) is the new parallel File System from the Fraunhofer Competence Center for High Performance Computing. FhGfs is written from scratch and incorporate results from our experience with existing systems. FhGfs is a fully POSIX compliant, scalable file system with nice features like:

    ; Distributed metadata:

    Although parallel file systems usually distribute the file contents over multiple storage

    nodes, the metadata is often bound to single nodes. This leads to performance

    bottlenecks and limited fault tolerance. FhGFS distributes the metadata across all the

    available storage nodes in a special way that keeps the lookup time at a minimum.

    ; Easy installation:

    FhGFS requires no kernel patches, is able to connect storage nodes and servers with

    zero-config and allows you to add more clients and storage nodes to the running system

    whenever you want it.

    ; Support for high performance technologies:

    FhGFS is built on a scalable multithreaded architecture with native InfiniBand support.

    Storage nodes can serve InfiniBand and Ethernet clients at the same time and

    automatically switches to a redundant connection path in case any of them fails.

4. Lustre

    Table of Contents

    ; Introduction

    ; Architecture

    ; Features and Benefits

    ; Publications


     Lustre is an object-based, distributed file system, generally used for large scale cluster computing. The name Lustre is a blend of the words Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.

    Lustre file systems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their data centers. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors.


     A Lustre file system has three major functional units:

    ; A single metadata target (MDT) per filesystem that stores metadata, such as filenames,

    directories, permissions, and file layout, on the metadata server (MDS)

    ; One or more object storage targets (OSTs) that store file data on one or more object

    storage servers (OSSes). Depending on the server’s hardware, an OSS typically serves

    between two and eight targets, each target a local disk filesystem up to 8 terabytes (TBs)

    in size. The capacity of a Lustre file system is the sum of the capacities provided by the


    ; Client(s) that access and use the data. Lustre presents all clients with standard POSIX

    semantics and concurrent read and write access to the files in the filesystem.

    The MDT, OST, and client can be on the same node or on different nodes, but in typical installations, these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

    The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

    An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.

    When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or

    writer operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.

    Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and risk filesystem corruption from misbehaving/defective clients. Features and Benefits

    Lustre's unprecedented scalability, bulletproof reliability, and proven performance help you meet the uptime requirements of your most demanding business and national-security applications.

    Key Benefits

    ; Unparalleled scalability

    ; Proven reliability

    ; High performance

    ; Multi-platform support

    ; Cost-effective


    ; Lustre File System: High-Performance Storage Architecture and Scalable Cluster File


5. Parallel Virtual File System(PVFS,PVFS2)

    Table of Contents

    ; Introduction

    ; Features

    ; Publications


    The Parallel Virtual File System (PVFS) is an Open Source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for

use in large scale cluster computing.

    PVFS focuses on high performance access to large data sets. It consists of a server process and a client library, both of which are written entirely of user-level code. A Linux kernel module and pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides for high performance access via the message passing interface (MPI).

    PVFS is being jointly developed between The Parallel Architecture Research Laboratory at Clemson University and the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center.

    PVFS development has been funded by NASA Goddard Space Flight Center, DOE Argonne National Laboratory, NSF PACI and HECURA programs, and other government and private agencies.


    PVFS brings state-of-the-art parallel I/O concepts to production parallel systems. It is designed to scale to petabytes of storage and provide access rates at 100s of GB/s. Its main features are:

    ; Performance

    PVFS is designed to provide high performance for parallel applications, where

    concurrent, large IO and many file accesses are common. PVFS provides dynamic

    distribution of IO and metadata, avoiding single points of contention, and allowing for

    scaling to high-end terascale and petascale systems.

    ; Reliability

    PVFS relaxes the POSIX consistency semantics where necessary to improve on stability

    and performance. Cluster file systems that enforce POSIX consistency require stateful

    clients with locking subsystems, reducing the stability of the system in the face of

    failures. These systems can be difficult to maintain due to overhead of lock


    PVFS clients are stateless allowing for effortless failure recovery and easy integration

    with industry standard high-availability systems. Read more on the PVFS consistency

    model and setting up PVFS for high-availability.

    ; Optimized MPI-IO support

    PVFS is designed to support a number of access models, from collective I/O to

    independent I/O as well as non-contiguous and structured access patterns. PVFS

    provides an object-based, stateless client interface, leading to optimizations for

    metadata operations within MPI-IO. See the ROMIO site for further information.

    ; Hardware Independence

    PVFS operates on a wide variety of systems, including IA32, IA64, Opteron, PowerPC,

    Alpha, and MIPS. It is easily integrated into local or shared storage configurations, and

    provides support for high-end networking fabrics, such as Infiniband or Myrinet.

    ; Painless Deployment

    PVFS was designed to be easy to deploy and manage. It builds out-of-the-box on a

    wide-variety of Linux distributions, and has only a few required dependencies. PVFS

    builds and runs directly on your Linux installation, it does not require kernel patches or

    specific kernel versions.

    ; Research Platform

    PVFS is developed by a multi-institution team of parallel I/O, networking and storage

    experts. It embodies the expertise of designers who have worked for over a decade in

    the field of parallel I/O. PVFS continues to be used as a platform for active research in

    the parallel I/O field, the code is designed to be easy to augment for research purposes.

    Review the latest papers on research using PVFS.


    ; Ames Laboratory Demonstrates Ultra-Fast PVFS Transport. HPC Wire, 2007

    ; Resilient PVFS, Yes It Is Possible. ClusterMonkey (, 2005

    ; Avery Ching, Robert Ross, Wei-keng Liao, Lee Ward, Alok Choudhary. Noncontiguous

    locking techniques for parallel file systems. Proceedings of Supercomputing, 2007

    ; Ananth Devulapalli, Pete Wyckoff. File creation strategies in a distributed metadata file

    system. Proceedings of IPDPS'07, 2007

    ; Weikuan K. Yu, Shuang Liang, Dhabaleswar K. Panda. High performance support of

    parallel virtual file system (PVFS2) over Quadrics. International Conference on

    Supercomputing (ICS '05), 2005

    ; Dean Hildebrand, Peter Honeyman. Exporting Storage Systems in a Scalable Manner

    with pNFS. 13th NASA Goddard Conference on Mass Storage Systems and Technologies,


    6. Ceph

    Table of Contents

    ; Introduction

    ; Features

    ; Publications


    Ceph is a distributed network file system designed to provide excellent performance, reliability, and scalability. Ceph fills two significant gaps in the array of currently available file systems:

    ; Robust, open-source distributed storage Ceph is released under the terms of the

    LGPL, which means it is free software (as in speech and beer). Ceph will provide a

    variety of key features that are generally lacking from existing open-source file systems,

    including seamless scalability (the ability to simply add disks to expand volumes),

    intelligent load balancing, and efficient, easy to use snapshot functionality.

    ; Scalability Ceph is built from the ground up to seamlessly and gracefully scale from

    gigabytes to petabytes and beyond. Scalability is considered in terms of workload as

    well as total storage. Ceph is designed to handle workloads in which tens thousands of

    clients or more simultaneously access the same file, or write to the same

    directoryusage scenarios that bring typical enterprise storage systems to their knees.


    Here are some of the key features that make Ceph different from existing file systems that

    you may have worked with:

    ; Seamless scaling A Ceph filesystem can be seamlessly expanded by simply adding

    storage nodes (OSDs). However, unlike most existing file systems, Ceph proactively

    migrates data onto new devices in order to maintain a balanced distribution of data.

    This effectively utilizes all available resources (disk bandwidth and spindles) and avoids

    data hot spots (e.g., active data residing primarly on old disks while newer disks sit

    empty and idle).

    ; Strong reliability and fast recovery All data in Ceph is replicated across multiple OSDs.

    If any OSD fails, data is automatically re-replicated to other devices. However, unlike

    typical RAID systems, the replicas for data on each disk are spread out among a large

    number of other disks, and when a disk fails, the replacement replicas are also

    distributed across many disks. This allows recovery to proceed in parallel (with dozens

    of disks copying to dozens of other disks), removing the need for explicit “spare” disks

    (which are effectively wasted until they are needed) and preventing a single disk from

    becoming a “RAID rebuild” bottleneck.

    ; Adaptive MDS The Ceph metadata server (MDS) is designed to dynamically adapt its

    behavior to the current workload. As the size and popularity of the file system

    hierarchy changes over time, that hierarchy is dynamically redistributed among

    available metadata servers in order to balance load and most effectively use server

    resources. (In contrast, current file systems force system administrators to carve their

    data set into static “volumes” and assign volumes to servers. Volume sizes and

    workloads inevitably shift over time, forcing administrators to constantly shuffle data

    between servers or manually allocate new resources where they are currently needed.)

    Similarly, if thousands of clients suddenly access a single file or directory, that metadata

    is dynamically replicated across multiple servers to distribute the workload.


    ; Sage A. Weil. Ceph: Reliable, Scalable, and High-Performance Distributed Storage.

    Ph.D. thesis, University of California, Santa Cruz, December, 2007.

    ; Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn. RADOS: A Fast,

    Scalable, and Reliable Storage Service for Petabyte-scale Storage Clusters. Petascale

    Data Storage Workshop SC07, November, 2007.

    ; Sage Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn, Ceph: A

    Scalable, High-Performance Distributed File System, Proceedings of the 7th Conference

    on Operating Systems Design and Implementation (OSDI ‘06), November 2006.

    ; Sage Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn, CRUSH: Controlled,

    Scalable, Decentralized Placement of Replicated Data, Proceedings of SC ‘06, November


    ; Sage Weil, Kristal Pollack, Scott A. Brandt, Ethan L. Miller, Dynamic Metadata

    Management for Petabyte-Scale File Systems, Proceedings of the 2004 ACM/IEEE

    Conference on Supercomputing (SC ‘04), November 2004

    ; Qin Xin, Ethan L. Miller, Thomas Schwarz, Evaluation of Distributed Recovery in

    Large-Scale Storage Systems, Proceedings of the 13th IEEE International Symposium on

    High Performance Distributed Computing (HPDC 2004), June 2004, pages 172-181.

7. dCache

    Table of Contents

    ; Introduction

    ; Features

    ; Publications


    The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods. Depending on the Persistency Model, dCache provides methods for exchanging data with backend (tertiary) Storage Systems as well as space management, pool attraction, dataset replication, hot spot determination and recovery from disk or node failures. Connected to a tertiary storage system, the cache simulates unlimited direct access storage space. Data exchanges to and from the underlying HSM are performed automatically and invisibly to the user. Filesystem namespace operations may be performed through a standard nfs(2) interface.


    ; supports requesting data from a tertiary storage system

    A tertiary storage system typically uses a robotic tape system, where data is stored on a

    tape from a library of available tapes, which must be loaded and unloaded using a tape

    robot. Tertiary storage systems typically have a higher initial cost, but can be extended

    cheaply by added additional tapes. This results in tertiary storage systems being

    popular where large amounts of data must be read.

    ; provides many transfer protocols (allowing users to read and write to data)

    ; hot-spot data migration

    In this process, dCache will detect when a few file are being requested very often. If

    this happens, dCache can make duplicate copies of the popular files on other

    computers. This allows the load to be spread across multiple machines, so increasing

Report this document

For any questions or suggestions please email