I/O Report – ZFS Pools on `xenon`¶

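Date: 2026-02-27
Host: xenon (Ubuntu 24.04.4 LTS)
Tool: fio 3.36, ioengine=libaio, direct=1 (O_DIRECT requested to avoid the page cache; reads can still be served from the ZFS ARC, as the bimodal random-read latencies on data1 and data2 show)
Duration: 30 seconds per test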
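
For reproducibility, the test matrix can be written down as fio invocations. The following is a minimal sketch, not the exact commands used for this report: the engine, direct flag, runtime, block sizes, job counts, and queue depths come from the report, while the job names, the `--size` value, and the use of `--group_reporting` and JSON output are assumptions made for illustration.

```python
import shlex

# Common settings taken from the report header.
COMMON = {
    "ioengine": "libaio",
    "direct": "1",
    "runtime": "30",
}

# (name, rw pattern, block size, numjobs, iodepth) as listed in the summary table.
# The job names are hypothetical labels.
TESTS = [
    ("seq-write", "write", "1M", 1, 1),
    ("seq-read", "read", "1M", 1, 1),
    ("rand-read", "randread", "4k", 4, 32),
    ("rand-write", "randwrite", "4k", 4, 32),
    ("mixed", "randrw", "256k", 4, 16),
]


def fio_command(directory, name, rw, bs, numjobs, iodepth, size="4G"):
    """Build one fio argv list. `size` is an assumed placeholder, not from the report."""
    args = [
        "fio",
        f"--name={name}",
        f"--directory={directory}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--numjobs={numjobs}",
        f"--iodepth={iodepth}",
        f"--size={size}",
        "--time_based",          # run for the full runtime even if `size` is reached
        "--group_reporting",     # aggregate the per-job statistics
        "--output-format=json",
    ]
    args += [f"--{key}={value}" for key, value in COMMON.items()]
    if rw == "randrw":
        args.append("--rwmixread=70")  # 70% reads / 30% writes
    return args


if __name__ == "__main__":
    for test in TESTS:
        print(shlex.join(fio_command("/pool/data2/benchmark", *test)))
```

Printing the commands instead of executing them keeps the sketch safe to run on any machine; swap the directory for the benchmark path of the pool under test.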

Pools Under Test¶

Pool	Backing	Layout	Disks	Pool Size	Used	Benchmark Path
`nvme`	NVMe SSD	stripe (2 drives)	2x 932 GB SK Hynix P31	1.81 TiB	84%	`/pool/nvme/benchmark`
`data0`	HDD	3x mirror (6 drives) + L2ARC	6x 8 TB WD + 120 GB SSD cache	21.8 TiB	91%	`/pool/data0/benchmark`
`data1`	HDD	raidz1 (4 drives)	4x 14 TB WD	50.9 TiB	77%	`/pool/data1/benchmark`
`data2`	HDD	stripe (2 drives)	2x 14 TB WD	25.4 TiB	12%	`/pool/data2/benchmark`

Note: nvme and data2 are striped (no redundancy). data0 has mirror redundancy and an SSD read cache (L2ARC). data1 has single-parity raidz1.

Cross-Pool Summary¶

Test	Pattern	Block Size	Jobs	Depth	nvme (84%)	data0 (91%)	data1 (77%)	data2 (12%)
Seq. Write	write	1 MiB	1	1	317 MiB/s	127 MiB/s	187 MiB/s	240 MiB/s
Seq. Read	read	1 MiB	1	1	1837 MiB/s	88.8 MiB/s	240 MiB/s	262 MiB/s
Rand Read IOPS	randread	4 KiB	4	32	21,500	419	12,700	11,800
Rand Write IOPS	randwrite	4 KiB	4	32	48,300	8,803	15,600	16,300
Mixed Read BW	randrw 70/30	256 KiB	4	16	693 MiB/s	37.2 MiB/s	37 MiB/s	46 MiB/s
Mixed Write BW	randrw 70/30	256 KiB	4	16	300 MiB/s	16.8 MiB/s	17 MiB/s	20 MiB/s

Pool: `nvme` (NVMe SSD stripe, 84% used)¶

Results¶

Test	Bandwidth	IOPS	Avg Latency	Total I/O
Sequential Write (1 MiB, 1 job)	317 MiB/s (333 MB/s)	317	3.15 ms	9,522 MiB
Sequential Read (1 MiB, 1 job)	1837 MiB/s (1926 MB/s)	1837	0.54 ms	53.8 GiB
Random Read (4 KiB, 4j x 32d)	84.0 MiB/s (88.1 MB/s)	21,500	5.95 ms	2,520 MiB
Random Write (4 KiB, 4j x 32d)	189 MiB/s (198 MB/s)	48,300	2.65 ms	5,659 MiB
Mixed R/W (256 KiB, 4j x 16d)	R: 693 / W: 300 MiB/s	R: 2,770 / W: 1,198	~15.8 ms	R: 20.3 GiB + W: 8,987 MiB
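
The average latencies in this table follow from the queue depth and the IOPS via Little's law (outstanding I/Os = completion rate × mean latency). The sketch below is a sanity check on the reported numbers, assuming every job keeps its queue full for the whole run; the function name is illustrative.

```python
def expected_latency_ms(jobs, iodepth, iops):
    """Little's law: mean latency = outstanding I/Os / completion rate."""
    outstanding = jobs * iodepth
    return outstanding / iops * 1000.0


print(round(expected_latency_ms(4, 32, 21_500), 2))  # random read  -> 5.95
print(round(expected_latency_ms(4, 32, 48_300), 2))  # random write -> 2.65
print(round(expected_latency_ms(1, 1, 317), 2))      # seq. write   -> 3.15
```

The same check applied to the mixed workload (64 outstanding I/Os, 2,770 + 1,198 IOPS) gives about 16 ms, in line with the ~15.8 ms and ~16.7 ms averages reported for reads and writes.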

Latency Percentiles¶

Test	p50 (avg where noted)	p95	p99 (max where noted)
Sequential Write	3.15 ms (avg)	—	max 10.8 ms
Sequential Read	0.54 ms (avg)	—	max 20.5 ms
Random Read	4.1 ms	10.0 ms	14.1 ms
Random Write	1.9 ms	5.7 ms	9.2 ms
Mixed R (avg)	~15.8 ms	—	—
Mixed W (avg)	~16.7 ms	—	—
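
Percentiles like these can be pulled from fio's JSON output, where each job carries a `clat_ns` block with a `percentile` map keyed by strings such as `"50.000000"`. The sketch below parses a hand-built sample in that shape, not real fio output: the values are the nvme random-read percentiles from the table converted to nanoseconds, and treating them as completion latency (`clat_ns`) is an assumption.

```python
import json

# Hand-built sample mimicking the relevant part of fio's JSON report.
SAMPLE = json.dumps({
    "jobs": [{
        "jobname": "rand-read",  # hypothetical job name
        "read": {
            "iops": 21500.0,
            "clat_ns": {
                "percentile": {
                    "50.000000": 4_100_000,
                    "95.000000": 10_000_000,
                    "99.000000": 14_100_000,
                }
            },
        },
    }]
})


def percentiles_ms(fio_json, direction="read", wanted=(50, 95, 99)):
    """Return {percentile: latency in ms} for the first job in a fio JSON report."""
    job = json.loads(fio_json)["jobs"][0]
    table = job[direction]["clat_ns"]["percentile"]
    # fio keys its percentiles with six decimal places, e.g. "99.000000".
    return {p: table[f"{p:.6f}"] / 1e6 for p in wanted}


print(percentiles_ms(SAMPLE))
```

For the mixed workload, call the function once with `direction="read"` and once with `direction="write"`.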

Observations¶

NVMe-backed and faster than the HDD pools in every metric, by 1.3x (sequential write) to 15x (mixed workload) relative to data2, the best HDD pool.
Sequential read throughput (1.8 GiB/s) and mixed workload numbers make this pool suitable for high-throughput pclean workloads.
Random IOPS are 1.8x (read) to 3x (write) higher than on data2, and about 51x higher than on data0 for random reads.
Utilization (84%) is already past the ~80% best-practice threshold. If the pool reaches >90%, the fragmentation effects seen on data0 could appear; keep free space above ~15–20% where possible.

Pool: `data0` (HDD 3x mirror + L2ARC, 91% used)¶

Results¶

Test	Bandwidth	IOPS	Avg Latency	Total I/O
Sequential Write (1 MiB, 1 job)	127 MiB/s (134 MB/s)	127	7.8 ms	3,824 MiB
Sequential Read (1 MiB, 1 job)	88.8 MiB/s (93.2 MB/s)	88	11.2 ms	2,666 MiB
Random Read (4 KiB, 4j x 32d)	1.64 MiB/s (1.72 MB/s)	419	283 ms	49.2 MiB
Random Write (4 KiB, 4j x 32d)	34.4 MiB/s (36.1 MB/s)	8,803	14.1 ms	1,032 MiB
Mixed R/W (256 KiB, 4j x 16d)	R: 37.2 / W: 16.8 MiB/s	R: 148 / W: 67	~274 ms	R: 1,116 MiB + W: 505 MiB

Latency Percentiles¶

Test	p50 (avg where noted)	p95	p99 (max where noted)	Notes
Sequential Write	7.84 ms (avg)	—	max 226 ms	—
Sequential Read	11.2 ms (avg)	—	max 386 ms	Dramatically slower than data1/data2 (240–262 MiB/s). High-utilization ZFS pools suffer severe fragmentation, forcing reads to chase scattered blocks.
Random Read	230 ms	531 ms	1,401 ms	Catastrophically slow – 30x worse than data1/data2 (~12k IOPS). No bimodal cache-hit pattern; at 91% the ARC is ineffective and nearly every read hits fragmented on-disk blocks.
Random Write	13.7 ms	31.3 ms	35.4 ms	Roughly half the IOPS of data1/data2. ZFS COW at 91% must hunt harder for free blocks.
Mixed R/W	~200 ms	~827 ms	~1.2 s	Similar throughput to data1 at the 256 KiB block size, but tail latency is severe (p99 of about 1.2 s).

Comparison with Other HDD Pools¶

Test	data0 (91%)	data1 (77%)	data2 (12%)	data0 vs. data2
Seq. Write	127 MiB/s	187 MiB/s	240 MiB/s	47% slower
Seq. Read	88.8 MiB/s	240 MiB/s	262 MiB/s	66% slower
Random Read IOPS	419	12,700	11,800	28x slower
Random Write IOPS	8,803	15,600	16,300	46% slower
Mixed Read BW	37.2 MiB/s	37 MiB/s	46 MiB/s	19% slower
Mixed Write BW	16.8 MiB/s	17 MiB/s	20 MiB/s	16% slower
Mixed p95 lat	827 ms	827 ms	380 ms	2.2x higher
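
The last column can be reproduced from the raw figures. As a small worked check, the sketch below computes the percentage shortfall relative to data2 for the bandwidth and IOPS rows, and a plain ratio for the random-read row, where the gap is too large for a percentage to be meaningful. The helper names are illustrative.

```python
def pct_slower(value, baseline):
    """Percentage by which `value` falls short of `baseline` (higher is better)."""
    return (1 - value / baseline) * 100


def times_slower(value, baseline):
    """Ratio baseline/value, for gaps too large to express as a percentage."""
    return baseline / value


# (data0, data2) pairs from the comparison table.
rows = {
    "Seq. Write (MiB/s)": (127, 240),
    "Seq. Read (MiB/s)": (88.8, 262),
    "Random Write IOPS": (8803, 16300),
    "Mixed Read BW (MiB/s)": (37.2, 46),
    "Mixed Write BW (MiB/s)": (16.8, 20),
}
for name, (data0, data2) in rows.items():
    print(f"{name}: {pct_slower(data0, data2):.0f}% slower")
print(f"Random Read IOPS: {times_slower(419, 11800):.0f}x slower")
```

Using data1 as the baseline instead (12,700 IOPS) gives the 30x figure quoted in the latency notes.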

Observations¶

Random read IOPS drops from ~12k to 419 – a 28–30x degradation. This single metric disqualifies data0 for any I/O-sensitive workload.
Sequential read collapses to 88.8 MiB/s (66% slower than data2). Fragmentation scatters logically sequential blocks across the disks, so a nominally sequential read incurs many extra seeks.
Sequential write degrades moderately (127 MiB/s, 47% penalty). ZFS COW must search harder for contiguous free extents at high fill.
Random writes are least affected (8,803 IOPS, ~46% loss). ZFS’s transaction-group (TXG) batching and write coalescing still absorb much of the random-write load.
Mixed workload shows a floor effect. The 256 KiB block size partially masks fragmentation in the bandwidth numbers, but per-I/O tail latency (p99 of about 1.2 s) reveals the underlying problem.

Pool: `data1` (HDD raidz1, 77% used)¶

Results¶

Test	Bandwidth	IOPS	Avg Latency	Total I/O
Sequential Write (1 MiB, 1 job)	187 MiB/s (196 MB/s)	187	5.3 ms	5,612 MiB
Sequential Read (1 MiB, 1 job)	240 MiB/s (251 MB/s)	239	4.2 ms	7,187 MiB
Random Read (4 KiB, 4j x 32d)	50 MiB/s (52 MB/s)	12,700	9.7 ms	1,489 MiB
Random Write (4 KiB, 4j x 32d)	61 MiB/s (64 MB/s)	15,600	7.9 ms	1,833 MiB
Mixed R/W (256 KiB, 4j x 16d)	R: 37 / W: 17 MiB/s	R: 148 / W: 67	~274 ms	R: 1,116 MiB + W: 505 MiB
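
The Bandwidth and IOPS columns are two views of the same measurement: bandwidth is IOPS multiplied by the block size. The sketch below shows the conversion, including the MiB/s versus MB/s distinction used in the tables; it is a consistency check on the reported figures, not additional data.

```python
KIB = 1024
MIB = 1024 ** 2


def bandwidth_from_iops(iops, block_size_bytes):
    """Return (MiB/s, MB/s) for a given IOPS rate and block size."""
    bytes_per_s = iops * block_size_bytes
    return bytes_per_s / MIB, bytes_per_s / 1e6


# data1 random read: 12,700 IOPS at 4 KiB.
mib_s, mb_s = bandwidth_from_iops(12_700, 4 * KIB)
print(f"{mib_s:.1f} MiB/s ({mb_s:.1f} MB/s)")  # 49.6 MiB/s (52.0 MB/s)

# data1 mixed read: 148 IOPS at 256 KiB.
mib_s, mb_s = bandwidth_from_iops(148, 256 * KIB)
print(f"{mib_s:.1f} MiB/s ({mb_s:.1f} MB/s)")  # 37.0 MiB/s (38.8 MB/s)
```

The table rounds 49.6 MiB/s to 50 MiB/s; the mixed-read value matches the reported 37 MiB/s.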

Latency Percentiles¶

Test	p50 (avg where noted)	p95	p99 (max where noted)	Notes
Sequential Write	5.34 ms (avg)	—	max 197 ms	—
Sequential Read	4.17 ms (avg)	—	max 160 ms	—
Random Read	125 us	396 us	451 ms	Bimodal – most reads from ARC cache (sub-ms), uncached reads hit disk (p99.5 = 566 ms, p99.9 = 1.28 s).
Random Write	7.1 ms	16.3 ms	19.3 ms	—
Mixed R/W	~199 ms	~827 ms	~1.2 s	Noticeably worse than data2 – higher utilization and raidz1 parity overhead amplify COW overhead under contention.

Comparison with `data2`¶

Test	data1	data2	Difference
Seq. Write	187 MiB/s	240 MiB/s	data2 is 28% faster
Seq. Read	240 MiB/s	262 MiB/s	data2 is 9% faster
Random Read IOPS	12,700	11,800	~equal (within noise)
Random Write IOPS	15,600	16,300	~equal
Mixed Read BW	37 MiB/s	46 MiB/s	data2 is 24% faster
Mixed Write BW	17 MiB/s	20 MiB/s	data2 is 18% faster
Mixed p95 lat	827 ms	380 ms	data2 has 2.2x lower tail latency

Observations¶

Sequential throughput: data2 is consistently faster (28% write, 9% read). data2 is a 2-disk stripe while data1 is raidz1 (4 disks, 1 parity) – raidz1 gives up some write throughput to parity computation in exchange for single-disk fault tolerance.
Random IOPS: Both pools are comparable (~12–16k), suggesting similar underlying disk populations (both use 14 TB WD drives).
Mixed workload: data2 substantially outperforms data1, especially in tail latency (p95: 380 ms vs. 827 ms). Higher utilization (77% vs. 12%) and raidz1 parity overhead increase write amplification and COW overhead under contention.
Both pools are HDD-class: Neither approaches SSD/NVMe performance.

Pool: `data2` (HDD stripe, 12% used)¶

Results¶

Test	Bandwidth	IOPS	Avg Latency	Total I/O
Sequential Write (1 MiB, 1 job)	240 MiB/s (251 MB/s)	239	4.2 ms	7,191 MiB
Sequential Read (1 MiB, 1 job)	262 MiB/s (275 MB/s)	262	3.8 ms	7,875 MiB
Random Read (4 KiB, 4j x 32d)	46 MiB/s (48 MB/s)	11,800	10.2 ms	1,379 MiB
Random Write (4 KiB, 4j x 32d)	64 MiB/s (67 MB/s)	16,300	7.6 ms	1,912 MiB
Mixed R/W (256 KiB, 4j x 16d)	R: 46 / W: 20 MiB/s	R: 183 / W: 81	~225 ms	R: 1,382 MiB + W: 613 MiB

Latency Percentiles¶

Test	p50 (avg where noted)	p95	p99 (max where noted)	Notes
Sequential Write	4.17 ms (avg)	—	max 101 ms	—
Sequential Read	3.8 ms (avg)	—	max 75.7 ms	—
Random Read	135 us	363 us	354 ms	Heavily bimodal – most reads from ARC cache (sub-ms), uncached reads hit spinning disk (p99.5 = 566 ms).
Random Write	6.7 ms	17.7 ms	21.6 ms	ZFS TXG batching absorbs random writes, giving smoother latency than random reads.
Mixed R/W	~199 ms	~380 ms	~840 ms	High latency under mixed load – ZFS COW overhead + disk seeks when reads and writes compete.

Analysis for pclean¶

Sequential throughput (~240–262 MiB/s): This pool is a 2-disk stripe (no redundancy), which gives good throughput from striping across both drives. With 8 Dask workers writing concurrently, aggregate demand could reach ~1–2 GB/s and saturate this pool. Keep nworkers <= 4 and cube_chunksize moderate to avoid contention.
Random IOPS (~12–16k): Adequate for CASA table system metadata operations. The bimodal random read latency (sub-ms median, ~350 ms tail) shows ZFS ARC caching is effective for hot data but uncached accesses hit spinning-disk latency.
Mixed workload (46 + 20 MiB/s): Significant latency increase (p50 ~200 ms). This is the regime where Dask continuum parallelism operates – concurrent visibility reads overlapping with intermediate image writes. Continuum imaging with many workers will be I/O-limited rather than CPU-limited.
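
The saturation argument for sequential writes can be made concrete with a back-of-the-envelope ratio of aggregate demand to measured pool throughput. In the sketch below, the per-worker rates of 125–250 MB/s are not measured; they are simply the ~1–2 GB/s aggregate quoted above divided by 8 workers, and the result is a worst case in which every worker writes at its peak rate at the same time.

```python
MIB = 1024 ** 2


def oversubscription(n_workers, per_worker_mb_s, pool_mib_s):
    """Ratio of aggregate worker demand to measured pool throughput (>1 = saturated)."""
    demand_bytes = n_workers * per_worker_mb_s * 1e6
    capacity_bytes = pool_mib_s * MIB
    return demand_bytes / capacity_bytes


POOL_SEQ_WRITE_MIB_S = 240  # data2 sequential write from the results table

# Derived assumption: 1-2 GB/s across 8 workers is roughly 125-250 MB/s each.
for workers in (8, 4, 2):
    low = oversubscription(workers, 125, POOL_SEQ_WRITE_MIB_S)
    high = oversubscription(workers, 250, POOL_SEQ_WRITE_MIB_S)
    print(f"{workers} workers: {low:.1f}x to {high:.1f}x pool capacity")
```

Under these assumptions 8 workers demand roughly 4–8x what the pool can absorb, and halving the worker count halves that ratio, which is the motivation for capping nworkers on this pool.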

NVMe vs. HDD Reference Comparison¶

Metric	nvme pool	Best HDD (data2)	Speedup
Seq. Write	317 MiB/s	240 MiB/s	1.3x
Seq. Read	1837 MiB/s	262 MiB/s	7x
Random Read IOPS	21,500	11,800	1.8x
Random Write IOPS	48,300	16,300	3x
Mixed Read BW	693 MiB/s	46 MiB/s	15x
Mixed Write BW	300 MiB/s	20 MiB/s	15x

Compared to a bare NVMe SSD (no ZFS overhead):

Metric	nvme pool (ZFS)	Bare NVMe (typical)	Overhead
Seq. Write	317 MiB/s	3,000+ MiB/s	~9x loss from ZFS COW
Seq. Read	1837 MiB/s	5,000+ MiB/s	~2.7x
Random Read IOPS	21,500	500,000+	~23x
Random Write IOPS	48,300	300,000+	~6x

Key Findings¶

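ZFS fragmentation is catastrophic at 91% utilization. Between data2 (12%) and data1 (77%) the penalty is modest (9–28% on sequential throughput), while data0 (91%) collapses on reads (66% slower sequential reads and 28x fewer random-read IOPS than data2). The three pools also differ in layout and drive model, so utilization is not the only variable, but the results are consistent with the ZFS best-practice threshold of ~80% maximum utilization.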
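Clear utilization–performance ordering: data2 (12%) > data1 (77%) >> data0 (91%) for most metrics. The exceptions are random IOPS, where data1 and data2 are equivalent, and mixed-workload bandwidth, where data0 and data1 are nearly identical (~37 MiB/s read, ~17 MiB/s write).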
The nvme pool outperforms the best HDD pool (data2) by 1.3–15x depending on the access pattern. Mixed workloads see the largest gap (15x).
ZFS adds significant overhead even on NVMe. Sequential write is only 317 MiB/s against a typical ~3 GB/s for a bare drive (a reference figure, not measured on xenon), a ~9x penalty attributed to COW, checksumming, and TXG commit. This is the cost of ZFS’s data integrity guarantees.
No HDD pool approaches NVMe performance. Even data1 and data2, the healthier HDD pools, trail the nvme pool by up to 15x on mixed workloads. For I/O-intensive pclean runs, local NVMe is always preferred.

Recommendations for pclean¶

Scenario	Preferred Pool	Max Workers	Notes
Working directory (`imagename`, `local_directory`)	nvme	CPU/memory limited	NVMe I/O unlikely to be first bottleneck.
Working directory (HDD pools only)	data2 > data1	3–4	I/O contention becomes limiting before CPU.
Archival / final products	data1 or data2	N/A	Sequential write is sufficient for archival.
Never use for I/O-sensitive work	data0	1 (last resort)	28x worse random read; even single-worker runs will be I/O-bottlenecked.
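
The worker limits in this table can be encoded as a small lookup so that a launch script never exceeds them. This is a minimal sketch: the dictionary and function names are hypothetical, the upper bound of the 3–4 range is used for data1 and data2, and `cpu_limit` stands for whatever worker count CPU and memory would otherwise allow.

```python
# Worker caps from the recommendations table; None means "limited by CPU/memory".
MAX_WORKERS = {"nvme": None, "data2": 4, "data1": 4, "data0": 1}


def choose_nworkers(pool, cpu_limit):
    """Return the worker count to use for a working directory on `pool`."""
    cap = MAX_WORKERS[pool]
    return cpu_limit if cap is None else min(cpu_limit, cap)


for pool in ("nvme", "data2", "data1", "data0"):
    print(pool, choose_nworkers(pool, cpu_limit=16))
```

The returned value is intended to be passed as pclean's nworkers setting; an unknown pool name raises `KeyError` rather than silently picking a default.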

Specific Guidance¶

Use the nvme pool for imagename and local_directory; keep final products on the HDD pools (data1 or data2).
Set local_directory to fast local storage so Dask spill-to-disk does not incur HDD-pool latency.
Keep cube_chunksize moderate on HDD pools to avoid many tiny sub-cubes that increase metadata overhead.
Consider freeing space on data0 (target < 80% / ~17.4 TiB used) to recover usable performance.
ZFS recordsize tuning: If a pool is dedicated to imaging, recordsize=1M (matching dominant I/O block size) may improve sequential throughput.
Monitor nvme utilization. At 84% it is already past the ~80% best-practice threshold; crossing 90% risks the same fragmentation cliff seen on data0.
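
The utilization guidance above lends itself to a simple periodic check. The sketch below parses the tab-separated output of `zpool list -H -p -o name,capacity` and classifies each pool against the 80% and 90% marks discussed in this report; it also estimates how much data must be freed to get back under a target. The function names are illustrative, and the sample text is built from the fill levels in this report rather than captured from xenon.

```python
import subprocess

ZPOOL_CMD = ["zpool", "list", "-H", "-p", "-o", "name,capacity"]


def parse_capacity(text):
    """Parse `name capacity` lines into {pool: percent used}."""
    pools = {}
    for line in text.strip().splitlines():
        name, cap = line.split()
        pools[name] = float(cap.rstrip("%"))  # tolerate a trailing percent sign
    return pools


def classify(percent_used, caution=80.0, critical=90.0):
    """Map a fill level to 'ok', 'caution', or 'critical'."""
    if percent_used >= critical:
        return "critical"
    if percent_used >= caution:
        return "caution"
    return "ok"


def tib_to_free(size_tib, percent_used, target_percent=80.0):
    """TiB that must be freed to bring a pool down to the target fill level."""
    return max(0.0, size_tib * (percent_used - target_percent) / 100.0)


def current_capacity():
    """Query the live system; requires the zpool CLI to be installed."""
    out = subprocess.run(ZPOOL_CMD, capture_output=True, text=True, check=True).stdout
    return parse_capacity(out)


# Fill levels from this report, in the shape `zpool list -H` prints them.
sample = "nvme\t84\ndata0\t91\ndata1\t77\ndata2\t12\n"
for pool, used in parse_capacity(sample).items():
    print(pool, classify(used))
print(f"data0: free about {tib_to_free(21.8, 91):.1f} TiB to reach 80%")
```

With the report's numbers, nvme is flagged as caution and data0 as critical, and data0 needs roughly 2.4 TiB freed to return to 80%.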
