I/O Report – ZFS Pools on xenon

Date: 2026-02-27
Host: xenon (Ubuntu 24.04.4 LTS)
Tool: fio 3.36, ioengine=libaio, direct=1 (O_DIRECT, bypasses page cache)
Duration: 30 seconds per test


Pools Under Test

Pool   Backing   Layout                        Disks                          Raw Size  Used  Benchmark Path
-----  --------  ----------------------------  -----------------------------  --------  ----  ---------------------
nvme   NVMe SSD  stripe (2 drives)             2x 932 GB SK Hynix P31         1.81 TiB  84%   /pool/nvme/benchmark
data0  HDD       3x mirror (6 drives) + L2ARC  6x 8 TB WD + 120 GB SSD cache  21.8 TiB  91%   /pool/data0/benchmark
data1  HDD       raidz1 (4 drives)             4x 14 TB WD                    50.9 TiB  77%   /pool/data1/benchmark
data2  HDD       stripe (2 drives)             2x 14 TB WD                    25.4 TiB  12%   /pool/data2/benchmark

Note: nvme and data2 are striped (no redundancy). data0 has mirror redundancy and an SSD read cache (L2ARC). data1 has single-parity raidz1.


Cross-Pool Summary

Test             Pattern       Block Size  Jobs  Depth  nvme (84%)  data0 (91%)  data1 (77%)  data2 (12%)
---------------  ------------  ----------  ----  -----  ----------  -----------  -----------  -----------
Seq. Write       write         1 MiB       1     1      317 MiB/s   127 MiB/s    187 MiB/s    240 MiB/s
Seq. Read        read          1 MiB       1     1      1837 MiB/s  88.8 MiB/s   240 MiB/s    262 MiB/s
Rand Read IOPS   randread      4 KiB       4     32     21,500      419          12,700       11,800
Rand Write IOPS  randwrite     4 KiB       4     32     48,300      8,803        15,600       16,300
Mixed Read BW    randrw 70/30  256 KiB     4     16     693 MiB/s   37.2 MiB/s   37 MiB/s     46 MiB/s
Mixed Write BW   randrw 70/30  256 KiB     4     16     300 MiB/s   16.8 MiB/s   17 MiB/s     20 MiB/s
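The five workloads in the summary can be written down as fio command lines. The following is a minimal sketch, not the exact invocation used for this report: the job names and the `--size` value are assumptions, while the pattern, block size, job count, queue depth, ioengine, direct flag, and 30-second runtime come from the report header and the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FioTest:
    name: str          # hypothetical job name, chosen for this sketch
    rw: str            # fio access pattern
    bs: str            # block size
    numjobs: int
    iodepth: int
    rwmixread: Optional[int] = None  # read percentage, mixed tests only


# The five workloads from the cross-pool summary.
TESTS = [
    FioTest("seq-write", "write", "1M", 1, 1),
    FioTest("seq-read", "read", "1M", 1, 1),
    FioTest("rand-read", "randread", "4k", 4, 32),
    FioTest("rand-write", "randwrite", "4k", 4, 32),
    FioTest("mixed", "randrw", "256k", 4, 16, rwmixread=70),
]


def fio_argv(test: FioTest, directory: str, size: str = "4G") -> List[str]:
    """Build one fio command line; `size` is an assumed per-job file size."""
    argv = [
        "fio",
        f"--name={test.name}",
        f"--directory={directory}",
        f"--rw={test.rw}",
        f"--bs={test.bs}",
        f"--numjobs={test.numjobs}",
        f"--iodepth={test.iodepth}",
        "--ioengine=libaio",
        "--direct=1",
        "--runtime=30",
        "--time_based",
        f"--size={size}",
        "--group_reporting",
    ]
    if test.rwmixread is not None:
        argv.append(f"--rwmixread={test.rwmixread}")
    return argv


if __name__ == "__main__":
    for t in TESTS:
        print(" ".join(fio_argv(t, "/pool/data2/benchmark")))
```

Printing the commands rather than running them keeps the sketch safe to execute anywhere; the argument lists can be handed to `subprocess.run` on a host where fio is installed.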


Pool: nvme (NVMe SSD stripe, 84% used)

Results

Test                             Bandwidth                IOPS                 Avg Latency  Total I/O
-------------------------------  -----------------------  -------------------  -----------  -------------------------
Sequential Write (1 MiB, 1 job)  317 MiB/s (333 MB/s)     317                  3.15 ms      9,522 MiB
Sequential Read (1 MiB, 1 job)   1837 MiB/s (1926 MB/s)   1837                 0.54 ms      53.8 GiB
Random Read (4 KiB, 4j x 32d)    84.0 MiB/s (88.1 MB/s)   21,500               5.95 ms      2,520 MiB
Random Write (4 KiB, 4j x 32d)   189 MiB/s (198 MB/s)     48,300               2.65 ms      5,659 MiB
Mixed R/W (256 KiB, 4j x 16d)    R: 693 / W: 300 MiB/s    R: 2,770 / W: 1,198  ~15.8 ms     R: 20.3 GiB + W: 8,987 MiB
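The columns of a results table are tied together by simple arithmetic, which gives a quick sanity check on any row. The sketch below assumes that bandwidth equals IOPS times block size, that 1 MiB is 1.048576 MB, and that average latency follows Little's law (outstanding I/Os = IOPS x latency) with the queue kept full at jobs x depth. The function names are illustrative only.

```python
KIB_PER_MIB = 1024
MB_PER_MIB = 1.048576  # 2**20 bytes expressed in units of 10**6 bytes


def iops_to_mib_s(iops: float, block_kib: float) -> float:
    """Bandwidth in MiB/s implied by an IOPS figure and a block size in KiB."""
    return iops * block_kib / KIB_PER_MIB


def mib_to_mb(mib: float) -> float:
    """Convert MiB (or MiB/s) to MB (or MB/s)."""
    return mib * MB_PER_MIB


def expected_latency_ms(iops: float, jobs: int, depth: int) -> float:
    """Average latency in ms from Little's law, assuming a full queue."""
    return jobs * depth / iops * 1000.0


if __name__ == "__main__":
    # nvme pool, random read and random write rows (4 KiB, 4 jobs x depth 32)
    for label, iops in (("random read", 21_500), ("random write", 48_300)):
        bw = iops_to_mib_s(iops, 4)
        lat = expected_latency_ms(iops, 4, 32)
        print(f"{label}: {bw:.1f} MiB/s ({mib_to_mb(bw):.1f} MB/s), {lat:.2f} ms")
```

For the nvme random-read row this reproduces the reported 84.0 MiB/s and 5.95 ms from the IOPS figure alone. A row where the derived and reported values disagree noticeably is worth rechecking against the raw fio output.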

Latency Percentiles

Test              p50       p95       p99
----------------  --------  --------  --------
Sequential Write  3.15 ms (avg), max 10.8 ms
Sequential Read   0.54 ms (avg), max 20.5 ms
Random Read       4.1 ms    10.0 ms   14.1 ms
Random Write      1.9 ms    5.7 ms    9.2 ms
Mixed R (avg)     ~15.8 ms
Mixed W (avg)     ~16.7 ms

Observations

  • NVMe-backed and faster than every HDD pool in every metric, by 1.3x (sequential write) to 15x (mixed) relative to data2, the best HDD pool.

  • Sequential read throughput (1.8 GiB/s) and mixed workload numbers make this pool suitable for high-throughput pclean workloads.

  • Random IOPS are 1.8x (read) and 3x (write) those of data2, and about 51x data0's random-read IOPS (21,500 vs. 419).

  • Utilization (84%) is approaching the caution zone. If the pool reaches >90%, the fragmentation effects seen on data0 could appear; keep free space above ~15–20% where possible.


Pool: data0 (HDD 3x mirror + L2ARC, 91% used)

Results

Test                             Bandwidth                IOPS                 Avg Latency  Total I/O
-------------------------------  -----------------------  -------------------  -----------  -------------------------
Sequential Write (1 MiB, 1 job)  127 MiB/s (134 MB/s)     127                  7.8 ms       3,824 MiB
Sequential Read (1 MiB, 1 job)   88.8 MiB/s (93.2 MB/s)   88                   11.2 ms      2,666 MiB
Random Read (4 KiB, 4j x 32d)    1.64 MiB/s (1.72 MB/s)   419                  283 ms       49.2 MiB
Random Write (4 KiB, 4j x 32d)   34.4 MiB/s (36.1 MB/s)   8,803                14.1 ms      1,032 MiB
Mixed R/W (256 KiB, 4j x 16d)    R: 37.2 / W: 16.8 MiB/s  R: 148 / W: 67       ~274 ms      R: 1,116 MiB + W: 505 MiB

Latency Percentiles

Test              p50       p95       p99
----------------  --------  --------  --------
Sequential Write  7.84 ms (avg), max 226 ms
Sequential Read   11.2 ms (avg), max 386 ms
Random Read       230 ms    531 ms    1,401 ms
Random Write      13.7 ms   31.3 ms   35.4 ms
Mixed R/W         ~200 ms   ~827 ms   ~1.2 s

Notes:

  • Sequential Read: Dramatically slower than data1/data2 (240–262 MiB/s). High-utilization ZFS pools suffer severe fragmentation, forcing reads to chase scattered blocks.

  • Random Read: Catastrophically slow – 28–30x worse than data1/data2 (~12k IOPS). No bimodal cache-hit pattern: even the median (230 ms) is disk-bound, so neither the ARC nor the L2ARC is serving these reads and nearly every read hits fragmented on-disk blocks.

  • Random Write: Roughly half the IOPS of data1/data2. At 91% full, ZFS COW allocation must hunt harder for free blocks.

  • Mixed R/W: Similar throughput to data1 at the 256 KiB block size, but tail latency is severe (p99 ~1.2 s).

Comparison with Other HDD Pools

Test               data0 (91%)  data1 (77%)  data2 (12%)  data0 vs. data2
-----------------  -----------  -----------  -----------  ---------------
Seq. Write         127 MiB/s    187 MiB/s    240 MiB/s    47% slower
Seq. Read          88.8 MiB/s   240 MiB/s    262 MiB/s    66% slower
Random Read IOPS   419          12,700       11,800       28x slower
Random Write IOPS  8,803        15,600       16,300       46% slower
Mixed Read BW      37.2 MiB/s   37 MiB/s     46 MiB/s     19% slower
Mixed Write BW     16.8 MiB/s   17 MiB/s     20 MiB/s     16% slower
Mixed p95 lat      827 ms       827 ms       380 ms       2.2x higher

Observations

  • Random read IOPS is 419, against ~12k on data1 and data2 – a 28–30x gap. This single metric disqualifies data0 for any I/O-sensitive workload.

  • Sequential read collapses to 88.8 MiB/s (66% slower than data2). Fragmentation likely scatters logically sequential blocks across the disks, so a sequential read turns into a series of seeks.

  • Sequential write degrades moderately (127 MiB/s, 47% penalty). ZFS COW must search harder for contiguous free extents at high fill.

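  • Random writes degrade about as much as sequential writes (8,803 IOPS, ~46% below data2); only the mixed-workload bandwidth is affected less (16–19%). ZFS's TXG batching and write coalescing still provide some buffering.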

  • Mixed workload shows a floor effect. The 256 KiB block size partially masks fragmentation, so bandwidth is only 16–19% below data2, but per-I/O tail latency (p99 ~1.2 s) reveals the underlying problem.
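The "slower" figures in the comparison table follow from two small formulas: a percentage for modest gaps and a ratio for large ones. The sketch below recomputes them from the data0 and data2 columns; the function names are illustrative, not part of any tool used for this report.

```python
def pct_slower(value: float, baseline: float) -> float:
    """Percentage by which `value` falls short of `baseline`."""
    return (1.0 - value / baseline) * 100.0


def times_slower(value: float, baseline: float) -> float:
    """How many times smaller `value` is than `baseline`."""
    return baseline / value


# (data0, data2) pairs taken from the comparison table.
ROWS = {
    "Seq. Write (MiB/s)": (127, 240),
    "Seq. Read (MiB/s)": (88.8, 262),
    "Random Write IOPS": (8803, 16300),
    "Mixed Read BW (MiB/s)": (37.2, 46),
    "Mixed Write BW (MiB/s)": (16.8, 20),
}

if __name__ == "__main__":
    for name, (data0, data2) in ROWS.items():
        print(f"{name}: data0 is {pct_slower(data0, data2):.0f}% slower")
    # Random read is better expressed as a ratio.
    print(f"Random Read IOPS: {times_slower(419, 11800):.1f}x slower than data2")
    print(f"Random Read IOPS: {times_slower(419, 12700):.1f}x slower than data1")
```

The last two lines show where the "28–30x" range comes from: roughly 28x against data2 and roughly 30x against data1.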


Pool: data1 (HDD raidz1, 77% used)

Results

Test                             Bandwidth                IOPS                 Avg Latency  Total I/O
-------------------------------  -----------------------  -------------------  -----------  -------------------------
Sequential Write (1 MiB, 1 job)  187 MiB/s (196 MB/s)     187                  5.3 ms       5,612 MiB
Sequential Read (1 MiB, 1 job)   240 MiB/s (251 MB/s)     239                  4.2 ms       7,187 MiB
Random Read (4 KiB, 4j x 32d)    50 MiB/s (52 MB/s)       12,700               9.7 ms       1,489 MiB
Random Write (4 KiB, 4j x 32d)   61 MiB/s (64 MB/s)       15,600               7.9 ms       1,833 MiB
Mixed R/W (256 KiB, 4j x 16d)    R: 37 / W: 17 MiB/s      R: 148 / W: 67       ~274 ms      R: 1,116 MiB + W: 505 MiB

Latency Percentiles

Test              p50       p95       p99
----------------  --------  --------  --------
Sequential Write  5.34 ms (avg), max 197 ms
Sequential Read   4.17 ms (avg), max 160 ms
Random Read       125 us    396 us    451 ms
Random Write      7.1 ms    16.3 ms   19.3 ms
Mixed R/W         ~199 ms   ~827 ms   ~1.2 s

Notes:

  • Random Read: Bimodal – most reads are served from the ARC (sub-ms), while uncached reads hit disk (p99.5 = 566 ms, p99.9 = 1.28 s).

  • Mixed R/W: Noticeably worse than data2 – higher utilization and raidz1 parity work increase COW overhead under contention.
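One simple way to put a number on "bimodal" is the ratio of p99 to p50 random-read latency: a cache-served median with a disk-bound tail gives a ratio in the thousands, while a uniformly slow distribution gives a single-digit ratio. The sketch below applies this to the random-read percentiles reported for each pool. The threshold of 100 is an arbitrary assumption chosen for illustration, not a ZFS or fio convention.

```python
def tail_ratio(p50_ms: float, p99_ms: float) -> float:
    """Ratio of p99 to p50 latency; large values indicate a heavy tail."""
    return p99_ms / p50_ms


def looks_bimodal(p50_ms: float, p99_ms: float, threshold: float = 100.0) -> bool:
    """Heuristic: treat the distribution as bimodal above an assumed ratio."""
    return tail_ratio(p50_ms, p99_ms) >= threshold


# Random-read (p50, p99) in milliseconds, from the per-pool percentile tables.
RANDOM_READ = {
    "nvme": (4.1, 14.1),
    "data0": (230.0, 1401.0),
    "data1": (0.125, 451.0),
    "data2": (0.135, 354.0),
}

if __name__ == "__main__":
    for pool, (p50, p99) in RANDOM_READ.items():
        ratio = tail_ratio(p50, p99)
        print(f"{pool}: p99/p50 = {ratio:,.1f}, bimodal = {looks_bimodal(p50, p99)}")
```

Under this heuristic data1 and data2 are flagged as bimodal, while data0 is not: its median is already disk-bound, which matches the observation that caching is not helping that pool.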

Comparison with data2

Test               data1      data2      Difference
-----------------  ---------  ---------  ---------------------------------
Seq. Write         187 MiB/s  240 MiB/s  data2 is 28% faster
Seq. Read          240 MiB/s  262 MiB/s  data2 is 9% faster
Random Read IOPS   12,700     11,800     ~equal (within noise)
Random Write IOPS  15,600     16,300     ~equal
Mixed Read BW      37 MiB/s   46 MiB/s   data2 is 24% faster
Mixed Write BW     17 MiB/s   20 MiB/s   data2 is 18% faster
Mixed p95 lat      827 ms     380 ms     data2 has 2.2x lower tail latency

Observations

  • Sequential throughput: data2 is consistently faster (28% write, 9% read). data2 is a 2-disk stripe while data1 is raidz1 (4 disks, 1 parity) – raidz1 spends some write throughput on parity in exchange for single-disk fault tolerance.

  • Random IOPS: Both pools are comparable (~12–16k), suggesting similar underlying disk populations (both use 14 TB WD drives).

  • Mixed workload: data2 substantially outperforms data1, especially in tail latency (p95: 380 ms vs. 827 ms). Higher utilization (77% vs. 12%) and raidz1 parity work increase write amplification and COW overhead under contention.

  • Both pools are HDD-class: Neither approaches SSD/NVMe performance.


Pool: data2 (HDD stripe, 12% used)

Results

Test                             Bandwidth                IOPS                 Avg Latency  Total I/O
-------------------------------  -----------------------  -------------------  -----------  -------------------------
Sequential Write (1 MiB, 1 job)  240 MiB/s (251 MB/s)     239                  4.2 ms       7,191 MiB
Sequential Read (1 MiB, 1 job)   262 MiB/s (275 MB/s)     262                  3.8 ms       7,875 MiB
Random Read (4 KiB, 4j x 32d)    46 MiB/s (48 MB/s)       11,800               10.2 ms      1,379 MiB
Random Write (4 KiB, 4j x 32d)   64 MiB/s (67 MB/s)       16,300               7.6 ms       1,912 MiB
Mixed R/W (256 KiB, 4j x 16d)    R: 46 / W: 20 MiB/s      R: 183 / W: 81       ~225 ms      R: 1,382 MiB + W: 613 MiB

Latency Percentiles

Test              p50       p95       p99
----------------  --------  --------  --------
Sequential Write  4.17 ms (avg), max 101 ms
Sequential Read   3.8 ms (avg), max 75.7 ms
Random Read       135 us    363 us    354 ms
Random Write      6.7 ms    17.7 ms   21.6 ms
Mixed R/W         ~199 ms   ~380 ms   ~840 ms

Notes:

  • Random Read: Heavily bimodal – most reads are served from the ARC (sub-ms), while uncached reads hit spinning disk (p99.5 = 566 ms).

  • Random Write: ZFS TXG batching absorbs random writes, giving smoother latency than random reads.

  • Mixed R/W: High latency under mixed load – ZFS COW overhead + disk seeks when reads and writes compete.

Analysis for pclean

  • Sequential throughput (~240–262 MiB/s): This pool is a 2-disk stripe (no redundancy), which gives good throughput from striping across both drives. With 8 Dask workers writing concurrently, aggregate demand could reach ~1–2 GB/s and saturate this pool. Keep nworkers <= 4 and cube_chunksize moderate to avoid contention.

  • Random IOPS (~12–16k): Adequate for CASA table system metadata operations. The bimodal random read latency (sub-ms median, ~350 ms tail) shows ZFS ARC caching is effective for hot data but uncached accesses hit spinning-disk latency.

  • Mixed workload (46 + 20 MiB/s): Significant latency increase (p50 ~200 ms). This is the regime where Dask continuum parallelism operates – concurrent visibility reads overlapping with intermediate image writes. Continuum imaging with many workers will be I/O-limited rather than CPU-limited.


NVMe vs. HDD Reference Comparison

Metric             nvme pool   Best HDD (data2)  Speedup
-----------------  ----------  ----------------  -------
Seq. Write         317 MiB/s   240 MiB/s         1.3x
Seq. Read          1837 MiB/s  262 MiB/s         7x
Random Read IOPS   21,500      11,800            1.8x
Random Write IOPS  48,300      16,300            3x
Mixed Read BW      693 MiB/s   46 MiB/s          15x
Mixed Write BW     300 MiB/s   20 MiB/s          15x

Compared to a bare NVMe SSD (no ZFS overhead):

Metric             nvme pool (ZFS)  Bare NVMe (typical)  Overhead
-----------------  ---------------  -------------------  ---------------------
Seq. Write         317 MiB/s        3,000+ MiB/s         ~9x loss from ZFS COW
Seq. Read          1837 MiB/s       5,000+ MiB/s         ~2.7x
Random Read IOPS   21,500           500,000+             ~23x
Random Write IOPS  48,300           300,000+             ~6x


Key Findings

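  1. ZFS fragmentation is catastrophic at 91% utilization. The 77%-full pool (data1) is modestly slower than the 12%-full pool (data2), while the 91%-full pool (data0) falls off a cliff, most visibly in random reads (28–30x). This is consistent with the ZFS best-practice threshold of ~80% maximum utilization, although the three pools also differ in layout and drive model, so utilization is not the only variable.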

  2. Clear utilization–performance ordering: data2 (12%) > data1 (77%) >> data0 (91%) for most metrics. The exceptions are random IOPS, where data1 and data2 are equivalent, and mixed bandwidth, where data1 and data0 are equivalent (37 + 17 MiB/s).

  3. The nvme pool outperforms the best HDD pool (data2) by 1.3–15x depending on the access pattern. Mixed workloads see the largest gap (15x); against data0 the random-read gap is about 51x.

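  4. ZFS adds significant overhead even on NVMe. Sequential write is only 317 MiB/s (vs. ~3 GB/s bare), a ~9x gap attributed to COW, checksumming, and TXG commit. The bare-device figures are typical values rather than measurements on xenon, and the sequential tests ran with a single job at queue depth 1, so part of the gap may come from the test configuration. The remainder is the cost of ZFS’s data integrity guarantees.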

  5. All HDD pools are HDD-class. Neither data1 nor data2 approaches NVMe performance. For I/O-intensive pclean runs, local NVMe is always preferred.


Recommendations for pclean

Scenario                                        Preferred Pool  Max Workers         Notes
----------------------------------------------  --------------  ------------------  ------------------------------------------------------------------------
Working directory (imagename, local_directory)  nvme            CPU/memory limited  NVMe I/O unlikely to be first bottleneck.
Working directory (ZFS only)                    data2 > data1   3–4                 I/O contention becomes limiting before CPU.
Archival / final products                       data1 or data2  N/A                 Sequential write is sufficient for archival.
Never use for I/O-sensitive work                data0           1 (last resort)     28x worse random read; even single-worker runs will be I/O-bottlenecked.

Specific Guidance

  • Use local NVMe for imagename and local_directory; keep final products on ZFS.

  • Set local_directory to fast local storage so Dask spill-to-disk does not add ZFS latency.

  • Keep cube_chunksize moderate on HDD pools to avoid many tiny sub-cubes that increase metadata overhead.

  • Consider freeing space on data0 (target < 80% / ~17.4 TiB used, i.e. roughly 2.4 TiB less than today) to recover usable performance.

  • ZFS recordsize tuning: If a pool is dedicated to imaging, recordsize=1M (matching dominant I/O block size) may improve sequential throughput.

  • Monitor nvme utilization. At 84% it is approaching the caution zone; crossing 90% risks the same fragmentation cliff seen on data0.
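The capacity targets in the guidance above are plain arithmetic on the pool sizes and fill percentages from the "Pools Under Test" table. The sketch below assumes the "Raw Size" column is the capacity against which "Used" is measured, and that the percentages are rounded, so the results are approximate. The function names are illustrative.

```python
def used_tib(size_tib: float, used_frac: float) -> float:
    """Space in TiB occupied at a given fill fraction."""
    return size_tib * used_frac


def tib_to_free(size_tib: float, used_frac: float, target_frac: float) -> float:
    """Space in TiB to free to bring a pool down to the target fill fraction."""
    return max(0.0, size_tib * (used_frac - target_frac))


def headroom_tib(size_tib: float, used_frac: float, limit_frac: float) -> float:
    """Space in TiB that can still be written before reaching the limit."""
    return max(0.0, size_tib * (limit_frac - used_frac))


if __name__ == "__main__":
    # data0: 21.8 TiB at 91%, target below 80%
    print(f"data0 target used:  {used_tib(21.8, 0.80):.2f} TiB")
    print(f"data0 to free:      {tib_to_free(21.8, 0.91, 0.80):.1f} TiB")
    # nvme: 1.81 TiB at 84%, caution threshold at 90%
    print(f"nvme headroom:      {headroom_tib(1.81, 0.84, 0.90):.2f} TiB")
```

With the figures in this report, data0 needs roughly 2.4 TiB freed to get back under 80%, and the nvme pool has only about 0.11 TiB of headroom before it crosses 90%.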