I/O Report – ZFS Pools on xenon
Date: 2026-02-27
Host: xenon (Ubuntu 24.04.4 LTS)
Tool: fio 3.36, ioengine=libaio, direct=1 (O_DIRECT, bypasses page cache)
Duration: 30 seconds per test
Pools Under Test

| Pool | Backing | Layout | Disks | Raw Size | Used | Benchmark Path |
|---|---|---|---|---|---|---|
| `nvme` | NVMe SSD | stripe (2 drives) | 2x 932 GiB SK Hynix P31 | 1.81 TiB | 84% | |
| `data0` | HDD | 3x mirror (6 drives) + L2ARC | 6x 8 TB WD + 120 GB SSD cache | 21.8 TiB | 91% | |
| `data1` | HDD | raidz1 (4 drives) | 4x 14 TB WD | 50.9 TiB | 77% | |
| `data2` | HDD | stripe (2 drives) | 2x 14 TB WD | 25.4 TiB | 12% | |

Note: `nvme` and `data2` are striped (no redundancy). `data0` has mirror redundancy and an SSD read cache (L2ARC). `data1` has single-parity raidz1.
Cross-Pool Summary

| Test | Pattern | Block Size | Jobs | Depth | nvme (84%) | data0 (91%) | data1 (77%) | data2 (12%) |
|---|---|---|---|---|---|---|---|---|
| Seq. Write | write | 1 MiB | 1 | 1 | 317 MiB/s | 127 MiB/s | 187 MiB/s | 240 MiB/s |
| Seq. Read | read | 1 MiB | 1 | 1 | 1837 MiB/s | 88.8 MiB/s | 240 MiB/s | 262 MiB/s |
| Rand Read IOPS | randread | 4 KiB | 4 | 32 | 21,500 | 419 | 12,700 | 11,800 |
| Rand Write IOPS | randwrite | 4 KiB | 4 | 32 | 48,300 | 8,803 | 15,600 | 16,300 |
| Mixed Read BW | randrw 70/30 | 256 KiB | 4 | 16 | 693 MiB/s | 37.2 MiB/s | 37 MiB/s | 46 MiB/s |
| Mixed Write BW | randrw 70/30 | 256 KiB | 4 | 16 | 300 MiB/s | 16.8 MiB/s | 17 MiB/s | 20 MiB/s |

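The five test definitions in the summary table map fairly directly onto fio command-line options. The sketch below rebuilds plausible command lines from the parameters stated in this report (libaio, direct=1, 30 s runtime, pattern, block size, jobs, depth). The job names, the `--size` value, and the target directory are assumptions for illustration only; the report does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FioTest:
    name: str                        # hypothetical job name, not from the report
    rw: str                          # fio --rw pattern
    bs: str                          # block size
    numjobs: int
    iodepth: int
    rwmixread: Optional[int] = None  # read percentage, mixed pattern only


# Pattern, block size, jobs and depth are taken from the summary table.
TESTS = [
    FioTest("seq-write", "write", "1M", 1, 1),
    FioTest("seq-read", "read", "1M", 1, 1),
    FioTest("rand-read", "randread", "4k", 4, 32),
    FioTest("rand-write", "randwrite", "4k", 4, 32),
    FioTest("mixed-rw", "randrw", "256k", 4, 16, rwmixread=70),
]


def fio_command(test: FioTest, directory: str, size: str = "4G") -> List[str]:
    """Build an fio argv list; `size` is an assumed per-job file size."""
    cmd = [
        "fio",
        f"--name={test.name}",
        f"--directory={directory}",
        f"--rw={test.rw}",
        f"--bs={test.bs}",
        f"--numjobs={test.numjobs}",
        f"--iodepth={test.iodepth}",
        f"--size={size}",
        "--ioengine=libaio",
        "--direct=1",
        "--time_based",
        "--runtime=30",
        "--group_reporting",
        "--output-format=json",
    ]
    if test.rwmixread is not None:
        cmd.append(f"--rwmixread={test.rwmixread}")
    return cmd


if __name__ == "__main__":
    for t in TESTS:
        # Placeholder path; substitute a directory on the pool under test.
        print(" ".join(fio_command(t, "/path/to/benchmark/dir")))
```

Returning an argv list rather than a shell string keeps the sketch usable with `subprocess.run` without quoting concerns.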
Pool: nvme (NVMe SSD stripe, 84% used)
Results

| Test | Bandwidth | IOPS | Avg Latency | Total I/O |
|---|---|---|---|---|
| Sequential Write (1 MiB, 1 job) | 317 MiB/s (333 MB/s) | 317 | 3.15 ms | 9,522 MiB |
| Sequential Read (1 MiB, 1 job) | 1837 MiB/s (1926 MB/s) | 1837 | 0.54 ms | 53.8 GiB |
| Random Read (4 KiB, 4j x 32d) | 84.0 MiB/s (88.1 MB/s) | 21,500 | 5.95 ms | 2,520 MiB |
| Random Write (4 KiB, 4j x 32d) | 189 MiB/s (198 MB/s) | 48,300 | 2.65 ms | 5,659 MiB |
| Mixed R/W (256 KiB, 4j x 16d) | R: 693 / W: 300 MiB/s | R: 2,770 / W: 1,198 | ~15.8 ms | R: 20.3 GiB + W: 8,987 MiB |

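The bandwidth, IOPS, and average latency columns are not independent, which makes a quick sanity check possible. Bandwidth should equal IOPS times block size, and when the queues stay full the average latency should be close to the number of outstanding I/Os (jobs x depth) divided by IOPS (Little's law). The following is a minimal sketch of that arithmetic using the random-I/O rows above. The function names are illustrative, and the full-queue assumption is mine rather than something the report states.

```python
def bandwidth_mib_s(iops: float, block_kib: float) -> float:
    """Bandwidth implied by an IOPS figure at a fixed block size."""
    return iops * block_kib / 1024


def expected_latency_ms(numjobs: int, iodepth: int, iops: float) -> float:
    """Little's law: mean latency = outstanding I/Os / throughput.

    Assumes every job keeps its queue full for the whole run.
    """
    return numjobs * iodepth / iops * 1000


if __name__ == "__main__":
    # Random read row: 21,500 IOPS at 4 KiB, 4 jobs x depth 32.
    print(f"{bandwidth_mib_s(21_500, 4):.1f} MiB/s")         # 84.0 MiB/s
    print(f"{expected_latency_ms(4, 32, 21_500):.2f} ms")    # 5.95 ms

    # Random write row: 48,300 IOPS at 4 KiB, 4 jobs x depth 32.
    print(f"{bandwidth_mib_s(48_300, 4):.1f} MiB/s")         # 188.7 MiB/s
    print(f"{expected_latency_ms(4, 32, 48_300):.2f} ms")    # 2.65 ms
```

Both rows agree with the reported values, which suggests the table is internally consistent for the random-I/O tests.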
Latency Percentiles

| Test | p50 | p95 | p99 |
|---|---|---|---|
| Sequential Write | 3.15 ms (avg) | - | max 10.8 ms |
| Sequential Read | 0.54 ms (avg) | - | max 20.5 ms |
| Random Read | 4.1 ms | 10.0 ms | 14.1 ms |
| Random Write | 1.9 ms | 5.7 ms | 9.2 ms |
| Mixed R (avg) | ~15.8 ms | - | - |
| Mixed W (avg) | ~16.7 ms | - | - |

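Percentile figures like these can be pulled out of fio's JSON output rather than read off the text report. The sketch below assumes the fio 3.x JSON layout as I understand it, where each entry in `jobs` carries per-direction `clat_ns` statistics with a `percentile` map keyed by strings such as `"99.000000"`. Treat the key names as an assumption and verify them against real output from the fio version in use.

```python
import json
from typing import Dict, Iterable


def latency_percentiles_ms(
    fio_json: dict,
    direction: str = "read",
    wanted: Iterable[float] = (50, 95, 99),
) -> Dict[str, float]:
    """Return selected completion-latency percentiles in milliseconds.

    Reads the first entry of "jobs"; with --group_reporting that entry
    is the aggregate over all jobs in the group.
    """
    job = fio_json["jobs"][0]
    percentiles = job[direction]["clat_ns"]["percentile"]
    # fio reports nanoseconds; keys are formatted with six decimals.
    return {f"p{p:g}": percentiles[f"{p:.6f}"] / 1e6 for p in wanted}


def load_fio_result(path: str) -> dict:
    """Load a result file written with --output-format=json --output=<path>."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `latency_percentiles_ms(load_fio_result("randread.json"))` would then yield a small dict with `p50`, `p95`, and `p99` entries; the file name here is only a placeholder.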
Observations
NVMe-backed and faster than the HDD pools in every metric, from 1.3x (sequential write) to 15x (mixed workload) relative to `data2`.
Sequential read throughput (1.8 GiB/s) and mixed workload numbers make this pool suitable for high-throughput pclean workloads.
Random IOPS are 1.8x (read) to 3x (write) higher than `data2`, and about 51x higher than `data0` for random reads.
Utilization (84%) is approaching the caution zone. If the pool exceeds 90%, the fragmentation effects seen on `data0` could appear; keep free space above ~15–20% where possible.
Pool: data0 (HDD 3x mirror + L2ARC, 91% used)
Results

| Test | Bandwidth | IOPS | Avg Latency | Total I/O |
|---|---|---|---|---|
| Sequential Write (1 MiB, 1 job) | 127 MiB/s (134 MB/s) | 127 | 7.8 ms | 3,824 MiB |
| Sequential Read (1 MiB, 1 job) | 88.8 MiB/s (93.2 MB/s) | 88 | 11.2 ms | 2,666 MiB |
| Random Read (4 KiB, 4j x 32d) | 1.64 MiB/s (1.72 MB/s) | 419 | 283 ms | 49.2 MiB |
| Random Write (4 KiB, 4j x 32d) | 34.4 MiB/s (36.1 MB/s) | 8,803 | 14.1 ms | 1,032 MiB |
| Mixed R/W (256 KiB, 4j x 16d) | R: 37.2 / W: 16.8 MiB/s | R: 148 / W: 67 | ~274 ms | R: 1,116 MiB + W: 505 MiB |

Latency Percentiles

| Test | p50 | p95 | p99 | Notes |
|---|---|---|---|---|
| Sequential Write | 7.84 ms (avg) | - | max 226 ms | - |
| Sequential Read | 11.2 ms (avg) | - | max 386 ms | Dramatically slower than data1/data2 (240–262 MiB/s). High-utilization ZFS pools suffer severe fragmentation, forcing reads to chase scattered blocks. |
| Random Read | 230 ms | 531 ms | 1,401 ms | Catastrophically slow – 30x worse than data1/data2 (~12k IOPS). No bimodal cache-hit pattern; at 91% the ARC is ineffective and nearly every read hits fragmented on-disk blocks. |
| Random Write | 13.7 ms | 31.3 ms | 35.4 ms | Roughly half the IOPS of data1/data2. ZFS COW at 91% must hunt harder for free blocks. |
| Mixed R/W | ~200 ms | ~827 ms | ~1.2 s | Similar throughput to data1 at the 256 KiB block size, but tail latency is severe (p99 > 1.2 s). |

Comparison with Other HDD Pools

| Test | data0 (91%) | data1 (77%) | data2 (12%) | data0 vs. data2 |
|---|---|---|---|---|
| Seq. Write | 127 MiB/s | 187 MiB/s | 240 MiB/s | 47% slower |
| Seq. Read | 88.8 MiB/s | 240 MiB/s | 262 MiB/s | 66% slower |
| Random Read IOPS | 419 | 12,700 | 11,800 | 28x slower |
| Random Write IOPS | 8,803 | 15,600 | 16,300 | 46% slower |
| Mixed Read BW | 37.2 MiB/s | 37 MiB/s | 46 MiB/s | 19% slower |
| Mixed Write BW | 16.8 MiB/s | 17 MiB/s | 20 MiB/s | 16% slower |
| Mixed p95 lat | 827 ms | 827 ms | 380 ms | 2.2x higher |

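The last column mixes two conventions: a percentage when the gap is under 2x and a multiplier when it is larger. As a worked check, the sketch below recomputes both forms from the data0 and data2 columns. The helper names are illustrative and not part of any tool used in this report.

```python
def pct_slower(value: float, baseline: float) -> float:
    """Percentage by which `value` falls short of `baseline`."""
    return (1 - value / baseline) * 100


def times_slower(value: float, baseline: float) -> float:
    """How many times larger `baseline` is than `value`."""
    return baseline / value


if __name__ == "__main__":
    # (metric, data0, data2) taken from the comparison table.
    throughput_rows = [
        ("Seq. Write MiB/s", 127, 240),
        ("Seq. Read MiB/s", 88.8, 262),
        ("Random Write IOPS", 8_803, 16_300),
        ("Mixed Read MiB/s", 37.2, 46),
        ("Mixed Write MiB/s", 16.8, 20),
    ]
    for name, data0, data2 in throughput_rows:
        print(f"{name}: {pct_slower(data0, data2):.0f}% slower")

    # Random read is far enough apart that a multiplier reads better.
    print(f"Random Read IOPS: {times_slower(419, 11_800):.0f}x slower")

    # Latency is "lower is better", so the ratio is data0 / data2.
    print(f"Mixed p95 latency: {827 / 380:.1f}x higher")
```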
Observations
Random read IOPS drops from ~12k to 419 – a 28–30x degradation. This single metric disqualifies `data0` for any I/O-sensitive workload.
Sequential read collapses to 88.8 MiB/s (66% slower than data2). Fragmentation scatters what should be sequential blocks across multiple disk seeks.
Sequential write degrades moderately (127 MiB/s, 47% penalty). ZFS COW must search harder for contiguous free extents at high fill.
Random writes degrade by a similar margin (8,803 IOPS, ~46% loss). ZFS’s ZIL and write coalescing still provide some buffering.
Mixed workload shows a floor effect: throughput is only 16–19% below data2 because the 256 KiB block size partially masks fragmentation, but per-I/O tail latency (p99 > 1.2 s) reveals the underlying degradation.
Pool: data1 (HDD raidz1, 77% used)
Results

| Test | Bandwidth | IOPS | Avg Latency | Total I/O |
|---|---|---|---|---|
| Sequential Write (1 MiB, 1 job) | 187 MiB/s (196 MB/s) | 187 | 5.3 ms | 5,612 MiB |
| Sequential Read (1 MiB, 1 job) | 240 MiB/s (251 MB/s) | 239 | 4.2 ms | 7,187 MiB |
| Random Read (4 KiB, 4j x 32d) | 50 MiB/s (52 MB/s) | 12,700 | 9.7 ms | 1,489 MiB |
| Random Write (4 KiB, 4j x 32d) | 61 MiB/s (64 MB/s) | 15,600 | 7.9 ms | 1,833 MiB |
| Mixed R/W (256 KiB, 4j x 16d) | R: 37 / W: 17 MiB/s | R: 148 / W: 67 | ~274 ms | R: 1,116 MiB + W: 505 MiB |

Latency Percentiles

| Test | p50 | p95 | p99 | Notes |
|---|---|---|---|---|
| Sequential Write | 5.34 ms (avg) | - | max 197 ms | - |
| Sequential Read | 4.17 ms (avg) | - | max 160 ms | - |
| Random Read | 125 us | 396 us | 451 ms | Bimodal – most reads from ARC cache (sub-ms), uncached reads hit disk (p99.5 = 566 ms, p99.9 = 1.28 s). |
| Random Write | 7.1 ms | 16.3 ms | 19.3 ms | - |
| Mixed R/W | ~199 ms | ~827 ms | ~1.2 s | Noticeably worse than data2 – higher utilization and raidz1 parity overhead amplify COW overhead under contention. |

Comparison with data2

| Test | data1 | data2 | Difference |
|---|---|---|---|
| Seq. Write | 187 MiB/s | 240 MiB/s | data2 is 28% faster |
| Seq. Read | 240 MiB/s | 262 MiB/s | data2 is 9% faster |
| Random Read IOPS | 12,700 | 11,800 | ~equal (within noise) |
| Random Write IOPS | 15,600 | 16,300 | ~equal |
| Mixed Read BW | 37 MiB/s | 46 MiB/s | data2 is 24% faster |
| Mixed Write BW | 17 MiB/s | 20 MiB/s | data2 is 18% faster |
| Mixed p95 lat | 827 ms | 380 ms | data2 has 2.2x lower tail latency |

Observations
Sequential throughput: `data2` is consistently faster (28% write, 9% read). `data2` is a 2-disk stripe while `data1` is raidz1 (4 disks, 1 parity) – raidz1 trades some write throughput for parity computation and single-disk fault tolerance.
Random IOPS: Both pools are comparable (~12–16k), suggesting similar underlying disk populations (both use 14 TB WD drives).
Mixed workload: `data2` substantially outperforms `data1`, especially in tail latency (p95: 380 ms vs. 827 ms). Higher utilization (77% vs. 12%) and raidz1 parity overhead amplify write-amplification and COW overhead under contention.
Both pools are HDD-class: Neither approaches SSD/NVMe performance.
Pool: data2 (HDD stripe, 12% used)
Results

| Test | Bandwidth | IOPS | Avg Latency | Total I/O |
|---|---|---|---|---|
| Sequential Write (1 MiB, 1 job) | 240 MiB/s (251 MB/s) | 239 | 4.2 ms | 7,191 MiB |
| Sequential Read (1 MiB, 1 job) | 262 MiB/s (275 MB/s) | 262 | 3.8 ms | 7,875 MiB |
| Random Read (4 KiB, 4j x 32d) | 46 MiB/s (48 MB/s) | 11,800 | 10.2 ms | 1,379 MiB |
| Random Write (4 KiB, 4j x 32d) | 64 MiB/s (67 MB/s) | 16,300 | 7.6 ms | 1,912 MiB |
| Mixed R/W (256 KiB, 4j x 16d) | R: 46 / W: 20 MiB/s | R: 183 / W: 81 | ~225 ms | R: 1,382 MiB + W: 613 MiB |

Latency Percentiles

| Test | p50 | p95 | p99 | Notes |
|---|---|---|---|---|
| Sequential Write | 4.17 ms (avg) | - | max 101 ms | - |
| Sequential Read | 3.8 ms (avg) | - | max 75.7 ms | - |
| Random Read | 135 us | 363 us | 354 ms | Heavily bimodal – most reads from ARC cache (sub-ms), uncached reads hit spinning disk (p99.5 = 566 ms). |
| Random Write | 6.7 ms | 17.7 ms | 21.6 ms | ZFS TXG batching absorbs random writes, giving smoother latency than random reads. |
| Mixed R/W | ~199 ms | ~380 ms | ~840 ms | High latency under mixed load – ZFS COW overhead + disk seeks when reads and writes compete. |

Analysis for pclean
Sequential throughput (~240–262 MiB/s): This pool is a 2-disk stripe (no redundancy), which gives good throughput from striping across both drives. With 8 Dask workers writing concurrently, aggregate demand could reach ~1–2 GB/s and saturate this pool. Keep `nworkers <= 4` and `cube_chunksize` moderate to avoid contention.
Random IOPS (~12–16k): Adequate for CASA table system metadata operations. The bimodal random read latency (sub-ms median, ~350 ms tail) shows ZFS ARC caching is effective for hot data but uncached accesses hit spinning-disk latency.
Mixed workload (46 + 20 MiB/s): Significant latency increase (p50 ~200 ms). This is the regime where Dask continuum parallelism operates – concurrent visibility reads overlapping with intermediate image writes. Continuum imaging with many workers will be I/O-limited rather than CPU-limited.
NVMe vs. HDD Reference Comparison

| Metric | nvme pool | Best HDD (data2) | Speedup |
|---|---|---|---|
| Seq. Write | 317 MiB/s | 240 MiB/s | 1.3x |
| Seq. Read | 1837 MiB/s | 262 MiB/s | 7x |
| Random Read IOPS | 21,500 | 11,800 | 1.8x |
| Random Write IOPS | 48,300 | 16,300 | 3x |
| Mixed Read BW | 693 MiB/s | 46 MiB/s | 15x |
| Mixed Write BW | 300 MiB/s | 20 MiB/s | 15x |

Compared to a bare NVMe SSD (no ZFS overhead):

| Metric | nvme pool (ZFS) | Bare NVMe (typical) | Overhead |
|---|---|---|---|
| Seq. Write | 317 MiB/s | 3,000+ MiB/s | ~9x loss from ZFS COW |
| Seq. Read | 1837 MiB/s | 5,000+ MiB/s | ~2.7x |
| Random Read IOPS | 21,500 | 500,000+ | ~23x |
| Random Write IOPS | 48,300 | 300,000+ | ~6x |

Key Findings
ZFS fragmentation is catastrophic at 91% utilization. The penalty between 12% and 77% is modest (data2 is at most 28% faster than data1), then performance falls off a cliff at 91%. This is consistent with the ZFS best-practice threshold of ~80% maximum utilization, although the three HDD pools also differ in layout and drive model, so fill level is not the only variable.
Clear utilization–performance ordering: `data2` (12%) > `data1` (77%) >> `data0` (91%) for every metric except random IOPS, where data1 and data2 are equivalent, and mixed throughput, where data0 and data1 are equivalent.
The `nvme` pool outperforms the best HDD pool (`data2`) by 1.3–15x depending on the access pattern. Mixed workloads see the largest gap (15x).
ZFS adds significant overhead even on NVMe. Sequential write is only 317 MiB/s (vs. ~3 GB/s typical for a bare drive), a ~9x penalty from COW, checksumming, and TXG commit. This is the cost of ZFS’s data integrity guarantees.
No HDD pool approaches NVMe performance. For I/O-intensive pclean runs, local NVMe is always preferred.
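To make the ~80% threshold concrete, the amount of data that would have to leave a pool to get back under it follows directly from the pool size and fill level in the Pools Under Test table. This is a rough sketch: the "Used" percentages are rounded, so the results are approximate, and the function name is illustrative.

```python
def tib_to_free(pool_size_tib: float, used_frac: float,
                target_frac: float = 0.80) -> float:
    """TiB that must be freed to bring a pool down to `target_frac` full.

    Returns 0.0 when the pool is already at or below the target.
    """
    return max(0.0, pool_size_tib * (used_frac - target_frac))


if __name__ == "__main__":
    # (pool, size in TiB, used fraction) from the Pools Under Test table.
    pools = [
        ("nvme", 1.81, 0.84),
        ("data0", 21.8, 0.91),
        ("data1", 50.9, 0.77),
        ("data2", 25.4, 0.12),
    ]
    for name, size_tib, used in pools:
        print(f"{name}: free about {tib_to_free(size_tib, used):.2f} TiB")
```

For `data0` this works out to roughly 2.4 TiB; `data1` and `data2` are already under the target and report zero.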
Recommendations for pclean

| Scenario | Preferred Pool | Max Workers | Notes |
|---|---|---|---|
| Working directory ( | nvme | CPU/memory limited | NVMe I/O unlikely to be first bottleneck. |
| Working directory (ZFS only) | data2 > data1 | 3–4 | I/O contention becomes limiting before CPU. |
| Archival / final products | data1 or data2 | N/A | Sequential write is sufficient for archival. |
| Never use for I/O-sensitive work | data0 | 1 (last resort) | 28x worse random read; even single-worker runs will be I/O-bottlenecked. |

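The worker limits in this table could be encoded as a small lookup so that a launch script picks a sensible `nworkers` for whichever pool hosts the working directory. The helper below is purely hypothetical and not part of pclean. It takes the upper end of the "3–4" range for the HDD pools and treats the `nvme` row as bounded only by the CPU count.

```python
import os
from typing import Dict, Optional

# Worker caps from the recommendations table.
# None means "CPU/memory limited" (no I/O-driven cap).
MAX_WORKERS: Dict[str, Optional[int]] = {
    "nvme": None,
    "data2": 4,   # upper end of the 3-4 range
    "data1": 4,   # upper end of the 3-4 range
    "data0": 1,   # last resort only
}


def pick_nworkers(pool: str, cpu_count: Optional[int] = None) -> int:
    """Suggest a worker count for a working directory on `pool`."""
    cpus = cpu_count or os.cpu_count() or 1
    cap = MAX_WORKERS[pool]
    return cpus if cap is None else min(cap, cpus)


if __name__ == "__main__":
    for pool in MAX_WORKERS:
        print(pool, pick_nworkers(pool, cpu_count=16))
```

Memory limits are deliberately left out; on the `nvme` pool the real ceiling may be RAM per worker rather than core count.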
Specific Guidance
Use local NVMe for `imagename` and `local_directory`; keep final products on ZFS.
Set `local_directory` to fast local storage so Dask spill-to-disk does not add ZFS latency.
Keep `cube_chunksize` moderate on HDD pools to avoid many tiny sub-cubes that increase metadata overhead.
Consider freeing space on `data0` (target < 80% / ~17.4 TiB used) to recover usable performance.
ZFS recordsize tuning: If a pool is dedicated to imaging, `recordsize=1M` (matching dominant I/O block size) may improve sequential throughput.
Monitor `nvme` utilization. At 84% it is approaching the caution zone; crossing 90% risks the same fragmentation cliff seen on `data0`.
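Monitoring pool utilization could be automated with a small script around `zpool list`. The sketch below assumes `zpool list -H -o name,capacity` prints one whitespace-separated `name capacity` pair per pool, with capacity as a percentage. The 80% and 90% cut-offs are the caution and cliff levels discussed in this report, not ZFS-defined limits, and the function names are illustrative.

```python
import subprocess
from typing import Dict

CAUTION_PCT = 80   # best-practice ceiling cited in this report
CRITICAL_PCT = 90  # level at which data0-style degradation is a risk


def parse_zpool_capacity(text: str) -> Dict[str, int]:
    """Parse `zpool list -H -o name,capacity` output into {pool: percent}."""
    pools = {}
    for line in text.strip().splitlines():
        name, capacity = line.split()
        pools[name] = int(capacity.rstrip("%"))
    return pools


def classify(capacity_pct: int) -> str:
    """Map a fill percentage onto ok / caution / critical."""
    if capacity_pct >= CRITICAL_PCT:
        return "critical"
    if capacity_pct >= CAUTION_PCT:
        return "caution"
    return "ok"


def read_zpool_capacity() -> Dict[str, int]:
    """Query the live system; requires the zpool CLI to be installed."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity"],
        capture_output=True, text=True, check=True,
    )
    return parse_zpool_capacity(result.stdout)


if __name__ == "__main__":
    for pool, pct in read_zpool_capacity().items():
        print(f"{pool}: {pct}% ({classify(pct)})")
```

With the fill levels reported here, `nvme` (84%) would be flagged as caution and `data0` (91%) as critical.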