NFS vs CephFS vs GlusterFS: The Shared Storage Decision Matrix You Actually Need

Every infrastructure engineer eventually hits the same wall: you need storage that multiple servers can read and write simultaneously, and suddenly you’re drowning in acronyms. NFS? "Too simple." CephFS? "Too complex." GlusterFS? "Isn’t that dead?"

None of those takes are accurate. Each of these solutions occupies a distinct sweet spot, and picking the wrong one doesn’t just hurt performance — it creates operational nightmares that follow you for years. I’ve run all three in production across Kubernetes clusters, bare-metal HPC setups, and media pipelines, so this isn’t a sanitized vendor comparison. It’s the article I wish existed when I was evaluating them.

The Three Contenders at a Glance

Before we go deep, here’s the honest one-liner for each:

NFS — The veteran. Simple, fast, universally supported. Single point of failure unless you add HA around it. Scales vertically, not horizontally.

GlusterFS (GitHub: gluster/glusterfs) — The middle ground. Peer-to-peer, no metadata server, easy to understand. Genuinely struggles with small-file workloads and has had spotty community momentum since Red Hat’s involvement changed.

CephFS (GitHub: ceph/ceph) — The beast. A full distributed storage platform that happens to expose a POSIX filesystem. Operationally complex, but the only option that genuinely scales horizontally without architectural compromises.


Architecture: Where the Differences Actually Live

NFS

NFS is a client-server protocol. One machine exports a directory, other machines mount it. That’s it. The kernel NFS server (nfsd) is rock-solid and well-optimized after decades of production use.

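The critical version split: NFSv3 is stateless, which keeps crash recovery simple, but it relies on separate rpcbind, mountd, and lock-manager (NLM) services that make it awkward to firewall. NFSv4 is stateful, runs over a single TCP port (2049), folds locking into the protocol, and adds Kerberos integration plus compound operations that cut round trips on high-latency links. NFSv4.1 adds sessions and pNFS (parallel NFS); NFSv4.2 adds server-side copy and sparse file support.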

The fundamental limit: a single NFS server is a vertical scaling problem. You can throw faster SSDs and more RAM at it, but you cannot spread the metadata load or the I/O across nodes. An external clustering layer like Pacemaker/Corosync buys failover, not scale-out; distributing load means switching to NFS-Ganesha with a clustered backend.

GlusterFS

GlusterFS eliminates the dedicated metadata server by embedding metadata in file storage itself using a hashing algorithm (DHT — Distributed Hash Table). Every brick (a directory exported from a node) participates equally. There’s no single point of failure at the metadata level.
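To make the idea concrete, here is a minimal sketch of hash-range placement under a simplified model: the 32-bit hash space is split evenly across bricks, and a file's name alone decides which brick holds it. Real GlusterFS uses its own hash function and stores per-directory layout ranges in extended attributes; `zlib.crc32`, `build_layout`, and `brick_for` are illustrative stand-ins, not Gluster APIs.

```python
import zlib

HASH_SPACE = 2**32  # the sketch assumes a 32-bit hash space


def build_layout(bricks):
    """Split the hash space into one contiguous range per brick."""
    step = HASH_SPACE // len(bricks)
    layout = []
    for i, brick in enumerate(bricks):
        start = i * step
        # The last brick absorbs the remainder so the ranges cover everything.
        end = HASH_SPACE - 1 if i == len(bricks) - 1 else start + step - 1
        layout.append((start, end, brick))
    return layout


def brick_for(filename, layout):
    """Pick the brick whose range contains the hash of the file name."""
    h = zlib.crc32(filename.encode("utf-8"))  # stand-in hash (assumption)
    for start, end, brick in layout:
        if start <= h <= end:
            return brick
    raise ValueError("layout does not cover the hash space")


layout = build_layout(["node1:/bricks/gv0", "node2:/bricks/gv0", "node3:/bricks/gv0"])
for name in ["video-0001.mp4", "video-0002.mp4", "backup.tar"]:
    print(name, "->", brick_for(name, layout))
```

Any client holding the same layout computes the same answer, which is why placement needs no metadata server.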

Volumes are composed of bricks in different topologies:

  • Distributed — files spread across bricks by hash (no redundancy)
  • Replicated — each file written to N bricks (HA, at cost of capacity)
  • Dispersed — erasure coding, similar to RAID 6
  • Distributed-Replicated — the most common production choice
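The capacity cost of each topology follows from simple arithmetic. The sketch below assumes equally sized bricks; `usable_tb` and its parameters are hypothetical names used for illustration only.

```python
def usable_tb(brick_tb, bricks, mode, replica=1, redundancy=0):
    """Approximate usable capacity in TB for equally sized bricks.

    mode: "distributed", "replicated" (also covers distributed-replicated),
          or "dispersed" (erasure coding).
    replica: copies kept of each file on replicated volumes.
    redundancy: number of bricks' worth of parity on a dispersed volume.
    """
    raw = brick_tb * bricks
    if mode == "distributed":
        return raw
    if mode == "replicated":
        if replica < 1 or bricks % replica:
            raise ValueError("brick count must be a multiple of replica")
        return raw / replica
    if mode == "dispersed":
        # Dispersed volumes need more than twice as many bricks as redundancy.
        if not 0 < 2 * redundancy < bricks:
            raise ValueError("redundancy must be >= 1 and less than half the bricks")
        return brick_tb * (bricks - redundancy)
    raise ValueError(f"unknown mode: {mode}")


# Six 4 TB bricks under each topology
print(usable_tb(4, 6, "distributed"))              # 24
print(usable_tb(4, 6, "replicated", replica=3))    # 8.0
print(usable_tb(4, 6, "dispersed", redundancy=2))  # 16
```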

The DHT approach is elegant for large files but punishes small-file workloads hard. The hash itself is computed on the client, but creating 10,000 small files still means 10,000 separate lookup-and-create round trips to the bricks, and the translator stack GlusterFS uses adds measurable latency per operation.

CephFS

Ceph is a different animal entirely. The foundation is RADOS — a distributed object store that handles replication, failure detection, and recovery autonomously. On top of RADOS you can run:

  • RBD (block devices, used by Kubernetes for PVCs)
  • RGW (S3/Swift-compatible object storage)
  • CephFS (POSIX filesystem)

CephFS uses one or more Metadata Server daemons (MDS) that manage the directory hierarchy and file metadata. The actual file data goes directly into RADOS objects. This separation means you can scale metadata and data independently — something neither NFS nor GlusterFS can do.
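A sketch of how a byte offset in a file maps to a RADOS object, assuming the default file layout (4 MiB objects, no striping across objects) and the usual `<inode-hex>.<object-index>` naming of CephFS data objects. The inode number is made up for illustration, and custom layouts change the arithmetic.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default CephFS object size (assumed layout)


def rados_object_for(inode, offset, object_size=OBJECT_SIZE):
    """Return (object name, offset inside that object) for a file byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    index, inner = divmod(offset, object_size)
    # Data objects are named by inode number (hex) plus an 8-digit hex index.
    return f"{inode:x}.{index:08x}", inner


inode = 0x10000000000  # hypothetical inode number
print(rados_object_for(inode, 0))                # ('10000000000.00000000', 0)
print(rados_object_for(inode, 9 * 1024 * 1024))  # ('10000000000.00000002', 1048576)
```

The MDS hands out the inode and layout; the client then reads and writes these objects on the OSDs directly, which is what keeps bulk data off the metadata path.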

MDS can run in active-active mode for horizontal metadata scaling, or active-standby for HA. The CRUSH algorithm determines where data lands, and it’s deterministic: clients compute placement from the cluster map themselves, so no central lookup service is required.
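The property that matters is that placement is a pure function of the object name and the cluster map. The sketch below illustrates that property with rendezvous (highest-random-weight) hashing, a much simpler stand-in and not the CRUSH algorithm itself: it ignores weights, failure domains, and placement groups. `place` and the OSD names are illustrative.

```python
import hashlib


def place(obj_name, osds, replicas=3):
    """Choose `replicas` OSDs for an object by ranking per-OSD hash scores."""
    if replicas > len(osds):
        raise ValueError("not enough OSDs for the requested replica count")

    def score(osd):
        digest = hashlib.sha256(f"{obj_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Highest scores win, regardless of the order the OSDs are listed in.
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
first = place("10000000000.00000002", osds)
second = place("10000000000.00000002", list(reversed(osds)))
print(first)
print(first == second)  # same inputs, same placement, no lookup service
```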


Benchmark Reality Check

These numbers come from a reproducible test environment: three VM nodes, each with 4 cores, 16 GB RAM, and NVMe SSDs (Samsung 970 Pro, ~3,500 MB/s sequential). Network: 10 GbE with jumbo frames enabled. Tests run with fio and mdtest. No tuning heroics: default configs with reasonable mount options.

Sequential I/O (1 client, single large file)

| Filesystem            | Seq Read (MB/s) | Seq Write (MB/s) |
|-----------------------|-----------------|------------------|
| NFS v4.2              | 1,850           | 1,420            |
| CephFS (3 OSDs)       | 1,610           | 1,280            |
| GlusterFS (3-replica) | 1,540           | 890              |

NFS wins here because it’s doing the least work. One server, one destination, minimal protocol overhead. CephFS comes close because RADOS handles large I/O efficiently. GlusterFS write performance drops because every write goes to 3 replicas synchronously before acknowledging.

Random 4K IOPS (SSD backend, 4 jobs, queue depth 32)

| Filesystem            | Rand Read (IOPS) | Rand Write (IOPS) |
|-----------------------|------------------|-------------------|
| NFS v4.2              | 48,200           | 31,500            |
| CephFS (3 OSDs)       | 92,000           | 58,000            |
| GlusterFS (3-replica) | 21,000           | 9,400             |

This is where architecture matters. CephFS distributes 4K random I/O across all OSDs in parallel. NFS is limited to what one server can handle. GlusterFS gets crushed because its translator stack adds per-operation overhead, and replication across 3 nodes synchronously kills random write latency.
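For reproducibility, the random-read parameters stated above (4K blocks, 4 jobs, queue depth 32) could be written as a fio job file. The helper below only renders the file; `ioengine=libaio`, `direct=1`, the file size, the runtime, and the target directory are assumptions, not settings taken from the benchmark.

```python
def fio_job(name, rw, directory, bs="4k", numjobs=4, iodepth=32,
            size="4g", runtime=60):
    """Render a fio job file for one test case as a string."""
    options = {
        "ioengine": "libaio",   # assumption: Linux async I/O engine
        "direct": 1,            # assumption: bypass the client page cache
        "directory": directory,
        "rw": rw,
        "bs": bs,
        "numjobs": numjobs,
        "iodepth": iodepth,
        "size": size,           # assumption: per-job file size
        "runtime": runtime,     # assumption: seconds per run
    }
    flags = ["time_based", "group_reporting"]  # value-less fio options
    lines = [f"[{name}]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    lines += flags
    return "\n".join(lines) + "\n"


print(fio_job("rand-read-4k", "randread", "/mnt/shared"))
```

Save the output to a file and pass it to `fio` on the client; swapping `rw` to `randwrite` gives the write case.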

Small File Metadata (mdtest: create/stat/delete 100,000 files)

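| Filesystem            | Create (files/s) | Stat (files/s) | Delete (files/s) |
|-----------------------|------------------|----------------|------------------|
| NFS v4.2              | 12,400           | 38,000         | 14,200           |
| CephFS (1 MDS)        | 8,900            | 29,000         | 11,400           |
| CephFS (3 active MDS) | 22,100           | 71,000         | 24,800           |
| GlusterFS (3-replica) | 1,200            | 3,100          | 1,600            |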

GlusterFS getting smashed on metadata is not a benchmark anomaly — it’s the DHT tax. Every metadata operation goes through the full translator stack and often requires consulting multiple bricks. For workloads with lots of small files (AI training datasets, Git repos, package caches), GlusterFS is genuinely the wrong tool.

CephFS with multiple active MDS is the clear winner here at scale, but single-MDS CephFS is slower than NFS. The MDS is a serialization point until you scale it out.
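The scaling claim can be checked against the table directly. A short calculation, using only the figures above, of the speedup from one to three active MDS daemons and the resulting per-daemon efficiency:

```python
# files/s from the mdtest table
single_mds = {"create": 8_900, "stat": 29_000, "delete": 11_400}
three_mds = {"create": 22_100, "stat": 71_000, "delete": 24_800}
nfs = {"create": 12_400, "stat": 38_000, "delete": 14_200}

for op in single_mds:
    speedup = three_mds[op] / single_mds[op]
    efficiency = speedup / 3   # 1.0 would be perfect linear scaling
    vs_nfs = three_mds[op] / nfs[op]
    print(f"{op:<7} speedup {speedup:.2f}x  "
          f"efficiency {efficiency:.0%}  vs NFS {vs_nfs:.2f}x")
```

In this run, scaling is sublinear (roughly 73 to 83 percent efficiency per daemon), so three active MDS daemons buy about 2.2x to 2.5x rather than 3x.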


Configuration Examples

NFS v4.2 Export (server-side /etc/exports)

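# /etc/exports: production-style NFS exports
# Replace 10.0.1.0/24 with your actual client subnet.
# fsid=0 is deliberately left out: it would turn /data/shared into the NFSv4
# pseudo-root and change the path that clients have to mount.
# no_root_squash lets root on a client act as root here; drop it unless you need it.

/data/shared  10.0.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/data/media   10.0.1.0/24(ro,sync,no_subtree_check,anonuid=1000,anongid=1000)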

Client mount in /etc/fstab:

# NFSv4.2, hard mount, 1 MiB read/write transfer size
# (the old intr option is dropped: current kernels ignore it)
nfs-server:/data/shared  /mnt/shared  nfs  vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev  0 0

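The hard mount option is non-negotiable for anything except read-only media. With soft, a server hiccup turns into I/O errors once the retries run out, and applications that ignore those errors corrupt data silently. As for rsize=1048576 and wsize=1048576: current Linux clients usually negotiate up to the server maximum on their own, but older clients and some servers settle on much smaller values that leave serious bandwidth on the table, so set them explicitly and confirm the effective values with nfsstat -m.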
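One way to verify what a client actually ended up with is to read the mount options back from `/proc/mounts`. A minimal sketch; `nfs_mount_report` is a hypothetical helper, and the sample line is illustrative rather than captured from a real system.

```python
def nfs_mount_report(mounts_text):
    """Summarize rsize/wsize and hard/soft for each NFS entry in /proc/mounts text."""
    report = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mountpoint, options = fields[1], fields[3].split(",")
        kv = dict(opt.split("=", 1) for opt in options if "=" in opt)
        report[mountpoint] = {
            "rsize": int(kv.get("rsize", 0)),
            "wsize": int(kv.get("wsize", 0)),
            "soft": "soft" in options,  # True means timeouts surface as I/O errors
        }
    return report


sample = (
    "nfs-server:/data/shared /mnt/shared nfs4 "
    "rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,proto=tcp,"
    "timeo=600,retrans=2,sec=sys 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)
print(nfs_mount_report(sample))

# On a real client:
# with open("/proc/mounts") as f:
#     print(nfs_mount_report(f.read()))
```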

GlusterFS Volume Creation

# From node1, once GlusterFS is installed on all six nodes: build the trusted pool
for n in node2 node3 node4 node5 node6; do gluster peer probe "$n"; done

# Create a distributed-replicated volume with arbiters:
# 2 distribute subvolumes × (2 data bricks + 1 arbiter) = 6 bricks.
# Every third brick in the list is the arbiter of its replica set.
# Adjust brick paths to wherever your dedicated storage is
gluster volume create gv0 replica 3 arbiter 1 \
  node1:/bricks/gv0 node2:/bricks/gv0 node3:/bricks/gv0 \
  node4:/bricks/gv0 node5:/bricks/gv0 node6:/bricks/gv0

gluster volume start gv0

# Production tuning — enable client-side caching, increase read-ahead
gluster volume set gv0 performance.cache-size 512MB
gluster volume set gv0 performance.read-ahead on
gluster volume set gv0 performance.io-thread-count 32
gluster volume set gv0 network.ping-timeout 10
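Brick order on the `gluster volume create` command line matters: consecutive bricks are grouped into replica sets. The sketch below groups a brick list the same way and flags any set with two bricks on one host, which would defeat the redundancy. `replica_sets` and `unsafe_sets` are hypothetical helper names.

```python
def replica_sets(bricks, replica):
    """Group bricks into replica sets in command-line order."""
    if replica < 1 or len(bricks) % replica:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def unsafe_sets(sets):
    """Return the sets in which one host appears more than once."""
    bad = []
    for group in sets:
        hosts = [brick.split(":", 1)[0] for brick in group]
        if len(set(hosts)) < len(hosts):
            bad.append(group)
    return bad


good = [f"node{i}:/bricks/gv0" for i in range(1, 7)]
print(replica_sets(good, 3))
print(unsafe_sets(replica_sets(good, 3)))  # []

# Two bricks per host, listed host by host: both copies land on one machine.
risky = ["node1:/bricks/a", "node1:/bricks/b", "node2:/bricks/a", "node2:/bricks/b"]
print(unsafe_sets(replica_sets(risky, 2)))
```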

CephFS Setup Skeleton (cephadm)

# Bootstrap a Ceph cluster with cephadm (single command)
cephadm bootstrap --mon-ip 10.0.1.10 --initial-dashboard-user admin

# Add OSDs on each node (Ceph discovers available disks automatically)
ceph orch apply osd --all-available-devices

# Create a CephFS filesystem with separate data and metadata pools
ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 32
ceph fs new cephfs cephfs_metadata cephfs_data

# Set replication factor (adjust to your node count)
ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_metadata size 3

# Scale MDS for active-active (N-1 active, 1 standby minimum)
ceph orch apply mds cephfs --placement="3 node1 node2 node3"
ceph fs set cephfs max_mds 2
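The placement-group counts used here (64 and 32) are starting values. A commonly cited rule of thumb targets roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two; the sketch below encodes that heuristic. The target of 100 and the helper name are assumptions, and recent Ceph releases can adjust `pg_num` on their own through the PG autoscaler, so treat the result as a sanity check only.

```python
def suggested_pg_total(osds, replicas=3, target_per_osd=100):
    """Rule-of-thumb PG budget across all pools, rounded up to a power of two."""
    if osds < 1 or replicas < 1:
        raise ValueError("osds and replicas must be positive")
    raw = osds * target_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


print(suggested_pg_total(3))   # 3 OSDs, 3 replicas: raw 100 -> 128
print(suggested_pg_total(12))  # raw 400 -> 512
```

For a three-OSD cluster this gives a budget of 128 PGs across all pools, so 64 + 32 fits within it.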

Client mount with kernel CephFS:

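# Export the key for the mount helper. get-key already prints the base64 key,
# so write it as-is (encoding it again breaks authentication).
# For production, create a restricted client with `ceph fs authorize` instead of using client.admin.
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret

# Mount
mount -t ceph 10.0.1.10:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=cephfs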

Decision Matrix

| Criteria               | NFS            | GlusterFS               | CephFS                  |
|------------------------|----------------|-------------------------|-------------------------|
| Setup complexity       | Low            | Medium                  | High                    |
| Operational overhead   | Low            | Medium                  | High                    |
| Sequential throughput  | Excellent      | Good                    | Excellent               |
| Random IOPS (scale)    | Limited        | Poor                    | Excellent               |
| Small file / metadata  | Good           | Poor                    | Excellent (multi-MDS)   |
| Horizontal scalability | None           | Medium                  | Full                    |
| Built-in HA            | No (external)  | Yes                     | Yes                     |
| Kubernetes CSI support | NFS CSI driver | Heketi (legacy) / CSI   | Rook-Ceph (first class) |
| POSIX compliance       | Full           | Mostly (xattrs limited) | Full                    |
| Minimum nodes          | 1              | 3                       | 3                       |
| Good for object storage| No             | No                      | Yes (RGW)               |

Pick NFS when: you have one or two dedicated storage servers, your workload is straightforward, and you value operational simplicity over everything else. NFS with a properly configured server (NVMe, tuned kernel parameters) is genuinely hard to beat in raw single-server throughput. For a homelab, a small team’s shared data, or a CI artifact cache, NFS is the right call 9 times out of 10.

Pick GlusterFS when: you need peer-to-peer redundancy across 3-6 nodes, your files are large (video, backups, databases with few open files), and you want something more HA than NFS without the operational burden of Ceph. Be honest about your small-file situation before committing.

Pick CephFS when: you’re running Kubernetes and want a unified storage platform (block + filesystem + object from one cluster), you have more than 5 nodes, or your workload demands horizontal scalability that no other option here can provide. Ceph with Rook is the current best-in-class Kubernetes storage story.


Gotchas

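NFS: The soft-mount and "stale file handle" traps. If the NFS server reboots while clients are mid-write on soft mounts, the writes fail with I/O errors that many applications ignore, and you end up with silently corrupted files. Always use hard mounts for writable shares. Stale file handles are a separate failure: they appear when a handle no longer resolves on the server, for example because the export’s filesystem ID changed across a reboot or failover. Test your NFS server reboot behavior before prod, because systemd service ordering has bitten many.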

NFS: /etc/exports and IP wildcard matching. NFS access control is IP-based. If your clients are behind a NAT or use dynamic IPs, your access control is weaker than you think. In multi-tenant environments this is a real exposure. Layer Kerberos (sec=krb5p) or network-level isolation on top.

GlusterFS: The self-heal thundering herd. When you bring a node back after it’s been offline for a while, GlusterFS’s self-heal daemon starts comparing files across bricks. On large volumes, this creates a prolonged period of elevated I/O that can saturate network and disk. Don’t rush node replacements on large GlusterFS volumes without planning the recovery window.

GlusterFS: Split-brain. With 2-replica volumes (common for homelab setups), a network partition where both nodes can write independently creates split-brain that GlusterFS can’t auto-resolve. You’ll end up with .glusterfs/indices/xattrop/ entries piling up and manual intervention required. Always use 3+ replicas or arbiters.
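The arithmetic behind that advice: a side of a partition is only safe to write to if it holds a strict majority of the voters. A minimal sketch (helper names are illustrative) showing why two voters cannot survive a split while three can:

```python
def majority(voters):
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def sides_with_quorum(partition_sizes):
    """Given the size of each side of a network partition, list sides holding quorum."""
    total = sum(partition_sizes)
    return [size for size in partition_sizes if size >= majority(total)]


print(sides_with_quorum([1, 1]))  # []  -> 2 replicas: neither side has quorum
print(sides_with_quorum([2, 1]))  # [2] -> 3 voters (or 2 + arbiter): one side keeps writing
```

With two replicas and no quorum enforcement, both sides of the `[1, 1]` split keep accepting writes, and that is the split-brain.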

CephFS: Monitor clock skew. Ceph expects NTP-synchronized clocks on every node, and the monitors complain once they drift more than 0.05 seconds apart (the default mon_clock_drift_allowed). Exceed that and you get clock-skew health warnings, unstable monitor quorum, and symptoms that look like random I/O errors. This has burned many people who set up Ceph in VMs without ensuring the VM host time sync is working. Use chrony and verify with chronyc tracking on every node before trusting the cluster with production data.
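The check can be scripted by parsing the "System time" line of `chronyc tracking`. A minimal sketch; the sample text is illustrative, and a node's offset from NTP is only a proxy for the monitor-to-monitor skew that Ceph itself measures.

```python
import re

SYSTEM_TIME = re.compile(
    r"System time\s*:\s*([0-9.]+) seconds (slow|fast) of NTP time"
)


def system_offset_seconds(tracking_output):
    """Extract the signed system clock offset from `chronyc tracking` output."""
    match = SYSTEM_TIME.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found")
    magnitude = float(match.group(1))
    return -magnitude if match.group(2) == "slow" else magnitude


sample = """\
Stratum         : 3
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
"""
offset = system_offset_seconds(sample)
print(offset, abs(offset) < 0.05)

# On a real node, feed it the command output instead:
# import subprocess
# out = subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout
```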

CephFS: MDS journal replay on crash. If an MDS crashes with dirty journal entries, the standby MDS takes over but has to replay the journal first. On a busy filesystem with a large journal, this can take minutes. During this window, filesystem operations block. Size your MDS journal (mds_log_max_segments) and monitor MDS lag.

All three: Network is the real bottleneck. Every shared storage solution here saturates before disk does on a modern setup. If you’re running 1 GbE and wondering why performance is bad — it’s the network. Jumbo frames (MTU 9000) alone can give you 15-20% improvement on NFS and GlusterFS for large sequential transfers.
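A back-of-the-envelope calculation shows how far apart link and disk speeds are, and how much of the jumbo-frame gain comes from header overhead alone. It assumes IPv4 and TCP without options and standard Ethernet framing; the 3,500 MB/s figure is the sequential rate quoted for the test SSDs.

```python
def link_mb_per_s(gbit):
    """Raw line rate of an Ethernet link in MB/s (decimal megabytes)."""
    return gbit * 1000 / 8


def tcp_wire_efficiency(mtu):
    """Share of on-wire bytes that is TCP payload (IPv4, no header options)."""
    payload = mtu - 20 - 20          # minus IPv4 and TCP headers
    on_wire = mtu + 14 + 4 + 8 + 12  # Ethernet header, FCS, preamble, inter-frame gap
    return payload / on_wire


nvme_mb_per_s = 3500
for gbit in (1, 10):
    rate = link_mb_per_s(gbit)
    print(f"{gbit:>2} GbE: {rate:.0f} MB/s, {rate / nvme_mb_per_s:.0%} of one NVMe drive")

gain = tcp_wire_efficiency(9000) / tcp_wire_efficiency(1500) - 1
print(f"wire-efficiency gain from jumbo frames: {gain:.1%}")
```

Header overhead accounts for only a few percent; any larger gain has to come from handling fewer packets per second (CPU and interrupt load), which this sketch does not model.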


Production-Ready Hardening

NFS with HA using Pacemaker

For serious NFS deployments that can’t afford a single point of failure, NFS-Ganesha plus Pacemaker/Corosync is the path. Ganesha replaces the kernel NFS server and supports clustered configurations with DRBD or a shared SAN backend.

# Ganesha config for a clustered export (/etc/ganesha/ganesha.conf)
NFS_CORE_PARAM {
    NFS_Port = 2049;
    MNT_Port = 20048;
    NLM_Port = 32803;
    Bind_addr = 10.0.1.50;  # floating VIP, managed by Pacemaker
}

EXPORT {
    Export_Id = 1;
    Path = /data/shared;
    Pseudo = /shared;
    Protocols = 4;
    Transports = TCP;
    Access_Type = RW;
    Squash = No_root_squash;
    SecType = sys;
    FSAL {
        Name = VFS;
    }
}

CephFS with Rook on Kubernetes

Rook-Ceph is the cleanest way to run Ceph in Kubernetes. It handles OSD provisioning, MDS deployment, and CSI driver registration automatically.

# rook-ceph filesystem definition
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: cephfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: data0
      replicated:
        size: 3
  metadataServer:
    activeCount: 2          # active-active MDS for metadata throughput
    activeStandby: true     # always a hot standby
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2000m"
        memory: "4Gi"

The matching StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs-data0
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner   # needed for allowVolumeExpansion
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
reclaimPolicy: Retain        # never Delete in production — you'll regret it
allowVolumeExpansion: true

The Honest Verdict

Most tutorials hedge their recommendations to death. Here’s mine without the hedging:

If you’re running Kubernetes at any meaningful scale — use Ceph via Rook. The day-1 pain is real, but day-365 operations are dramatically better than anything else. You get block, filesystem, and object from one control plane, the CSI integration is first-class, and Rook handles most of the operational toil.

If you’re running bare-metal workloads (HPC, media servers, shared data for a team) with 1-3 dedicated storage nodes — use NFS. Tune it properly, put a UPS on the storage server, monitor disk health, and call it done. Adding GlusterFS or Ceph for a three-node homelab is engineering for its own sake, not for the workload.

GlusterFS occupies a narrowing middle ground. It was the right answer circa 2016 when Ceph was harder to operate and NFS HA was more painful. Today, Rook-Ceph has eaten most of GlusterFS’s use cases from above, and simple NFS handles the small end. GlusterFS still earns its place in shops that have invested in it and understand it well — greenfield is a harder sell.

The worst outcome isn’t picking the "wrong" option — it’s under-provisioning network, using soft NFS mounts, running 2-replica GlusterFS in production, or deploying Ceph without monitoring. The filesystem matters less than the operating discipline you apply to it.
