# I/O Report -- ZFS Pools on `xenon` **Date:** 2026-02-27 **Host:** xenon (Ubuntu 24.04.4 LTS) **Tool:** fio 3.36, `ioengine=libaio`, `direct=1` (O_DIRECT, bypasses page cache) **Duration:** 30 seconds per test --- ## Pools Under Test | Pool | Backing | Layout | Disks | Raw Size | Used | Benchmark Path | |------|---------|--------|-------|----------|------|----------------| | `nvme` | NVMe SSD | stripe (2 drives) | 2x 932 GB SK Hynix P31 | 1.81 TiB | 84% | `/pool/nvme/benchmark` | | `data0` | HDD | 3x mirror (6 drives) + L2ARC | 6x 8 TB WD + 120 GB SSD cache | 21.8 TiB | 91% | `/pool/data0/benchmark` | | `data1` | HDD | raidz1 (4 drives) | 4x 14 TB WD | 50.9 TiB | 77% | `/pool/data1/benchmark` | | `data2` | HDD | stripe (2 drives) | 2x 14 TB WD | 25.4 TiB | 12% | `/pool/data2/benchmark` | > **Note:** `nvme` and `data2` are striped (no redundancy). `data0` has mirror > redundancy and an SSD read cache (L2ARC). `data1` has single-parity raidz1. --- ## Cross-Pool Summary | Test | Pattern | Block Size | Jobs | Depth | nvme (84%) | data0 (91%) | data1 (77%) | data2 (12%) | |------|---------|-----------|------|-------|------------|-------------|-------------|-------------| | Seq. Write | write | 1 MiB | 1 | 1 | **317 MiB/s** | 127 MiB/s | 187 MiB/s | 240 MiB/s | | Seq. Read | read | 1 MiB | 1 | 1 | **1837 MiB/s** | 88.8 MiB/s | 240 MiB/s | 262 MiB/s | | Rand Read IOPS | randread | 4 KiB | 4 | 32 | **21,500** | 419 | 12,700 | 11,800 | | Rand Write IOPS | randwrite | 4 KiB | 4 | 32 | **48,300** | 8,803 | 15,600 | 16,300 | | Mixed Read BW | randrw 70/30 | 256 KiB | 4 | 16 | **693 MiB/s** | 37.2 MiB/s | 37 MiB/s | 46 MiB/s | | Mixed Write BW | randrw 70/30 | 256 KiB | 4 | 16 | **300 MiB/s** | 16.8 MiB/s | 17 MiB/s | 20 MiB/s | --- ## Pool: `nvme` (NVMe SSD stripe, 84% used) ### Results | Test | Bandwidth | IOPS | Avg Latency | Total I/O | |------|-----------|------|-------------|-----------| | Sequential Write (1 MiB, 1 job) | 317 MiB/s (333 MB/s) | 317 | 3.15 ms | 9,522 MiB | | Sequential Read (1 MiB, 1 job) | 1837 MiB/s (1926 MB/s) | 1837 | 0.54 ms | 53.8 GiB | | Random Read (4 KiB, 4j x 32d) | 84.0 MiB/s (88.1 MB/s) | 21,500 | 5.95 ms | 2,520 MiB | | Random Write (4 KiB, 4j x 32d) | 189 MiB/s (198 MB/s) | 48,300 | 2.65 ms | 5,659 MiB | | Mixed R/W (256 KiB, 4j x 16d) | R: 693 / W: 300 MiB/s | R: 2,770 / W: 1,198 | ~15.8 ms | R: 20.3 GiB + W: 8,987 MiB | ### Latency Percentiles | Test | p50 | p95 | p99 | |------|-----|-----|-----| | Sequential Write | 3.15 ms (avg) | — | max 10.8 ms | | Sequential Read | 0.54 ms (avg) | — | max 20.5 ms | | Random Read | 4.1 ms | 10.0 ms | 14.1 ms | | Random Write | 1.9 ms | 5.7 ms | 9.2 ms | | Mixed R (avg) | ~15.8 ms | — | — | | Mixed W (avg) | ~16.7 ms | — | — | ### Observations - NVMe-backed and far faster than the HDD pools in every metric. - Sequential read throughput (1.8 GiB/s) and mixed workload numbers make this pool suitable for high-throughput pclean workloads. - Random IOPS are orders of magnitude higher than the HDD pools. - Utilization (84%) is approaching the caution zone. If the pool reaches >90%, the fragmentation effects seen on `data0` could appear; keep free space above ~15--20% where possible. --- ## Pool: `data0` (HDD 3x mirror + L2ARC, 91% used) ### Results | Test | Bandwidth | IOPS | Avg Latency | Total I/O | |------|-----------|------|-------------|-----------| | Sequential Write (1 MiB, 1 job) | 127 MiB/s (134 MB/s) | 127 | 7.8 ms | 3,824 MiB | | Sequential Read (1 MiB, 1 job) | 88.8 MiB/s (93.2 MB/s) | 88 | 11.2 ms | 2,666 MiB | | Random Read (4 KiB, 4j x 32d) | 1.68 MiB/s (1.72 MB/s) | 419 | 283 ms | 49.2 MiB | | Random Write (4 KiB, 4j x 32d) | 34.4 MiB/s (36.1 MB/s) | 8,803 | 14.1 ms | 1,032 MiB | | Mixed R/W (256 KiB, 4j x 16d) | R: 37.2 / W: 16.8 MiB/s | R: 148 / W: 67 | ~274 ms | R: 1,116 MiB + W: 505 MiB | ### Latency Percentiles | Test | p50 | p95 | p99 | Notes | |------|-----|-----|-----|-------| | Sequential Write | 7.84 ms (avg) | — | max 226 ms | — | | Sequential Read | 11.2 ms (avg) | — | max 386 ms | Dramatically slower than data1/data2 (240--262 MiB/s). High-utilization ZFS pools suffer severe fragmentation, forcing reads to chase scattered blocks. | | Random Read | 230 ms | 531 ms | 1,401 ms | Catastrophically slow -- **30x worse** than data1/data2 (~12k IOPS). No bimodal cache-hit pattern; at 91% the ARC is ineffective and nearly every read hits fragmented on-disk blocks. | | Random Write | 13.7 ms | 31.3 ms | 35.4 ms | Roughly half the IOPS of data1/data2. ZFS COW at 91% must hunt harder for free blocks. | | Mixed R/W | ~200 ms | ~827 ms | ~1.2 s | Similar throughput to data1 at the 256 KiB block size, but tail latency is severe (p99 > 1.2 s). | ### Comparison with Other HDD Pools | Test | data0 (91%) | data1 (77%) | data2 (12%) | data0 vs. data2 | |------|-------------|-------------|-------------|-----------------| | Seq. Write | 127 MiB/s | 187 MiB/s | 240 MiB/s | **47% slower** | | Seq. Read | 88.8 MiB/s | 240 MiB/s | 262 MiB/s | **66% slower** | | Random Read IOPS | 419 | 12,700 | 11,800 | **28x slower** | | Random Write IOPS | 8,803 | 15,600 | 16,300 | **46% slower** | | Mixed Read BW | 37.2 MiB/s | 37 MiB/s | 46 MiB/s | **19% slower** | | Mixed Write BW | 16.8 MiB/s | 17 MiB/s | 20 MiB/s | **16% slower** | | Mixed p95 lat | 827 ms | 827 ms | 380 ms | **2.2x higher** | ### Observations - **Random read IOPS drops from ~12k to 419 -- a 28--30x degradation.** This single metric disqualifies `data0` for any I/O-sensitive workload. - **Sequential read collapses to 88.8 MiB/s** (66% slower than data2). Fragmentation scatters what should be sequential blocks across multiple disk seeks. - **Sequential write degrades moderately** (127 MiB/s, 47% penalty). ZFS COW must search harder for contiguous free extents at high fill. - **Random writes are least affected** (8,803 IOPS, ~46% loss). ZFS's ZIL and write coalescing still provide some buffering. - **Mixed workload shows a floor effect.** The 256 KiB block size partially masks fragmentation, but per-I/O tail latency (p99 > 1.2 s) reveals the underlying illness. --- ## Pool: `data1` (HDD raidz1, 77% used) ### Results | Test | Bandwidth | IOPS | Avg Latency | Total I/O | |------|-----------|------|-------------|-----------| | Sequential Write (1 MiB, 1 job) | 187 MiB/s (196 MB/s) | 187 | 5.3 ms | 5,612 MiB | | Sequential Read (1 MiB, 1 job) | 240 MiB/s (251 MB/s) | 239 | 4.2 ms | 7,187 MiB | | Random Read (4 KiB, 4j x 32d) | 50 MiB/s (52 MB/s) | 12,700 | 9.7 ms | 1,489 MiB | | Random Write (4 KiB, 4j x 32d) | 61 MiB/s (64 MB/s) | 15,600 | 7.9 ms | 1,833 MiB | | Mixed R/W (256 KiB, 4j x 16d) | R: 37 / W: 17 MiB/s | R: 148 / W: 67 | ~274 ms | R: 1,116 MiB + W: 505 MiB | ### Latency Percentiles | Test | p50 | p95 | p99 | Notes | |------|-----|-----|-----|-------| | Sequential Write | 5.34 ms (avg) | — | max 197 ms | — | | Sequential Read | 4.17 ms (avg) | — | max 160 ms | — | | Random Read | 125 us | 396 us | 451 ms | Bimodal -- most reads from ARC cache (sub-ms), uncached reads hit disk (p99.5 = 566 ms, p99.9 = 1.28 s). | | Random Write | 7.1 ms | 16.3 ms | 19.3 ms | — | | Mixed R/W | ~199 ms | ~827 ms | ~1.2 s | Noticeably worse than data2 -- higher utilization and raidz1 parity overhead amplify COW overhead under contention. | ### Comparison with `data2` | Test | data1 | data2 | Difference | |------|-------|-------|------------| | Seq. Write | 187 MiB/s | 240 MiB/s | data2 is **28% faster** | | Seq. Read | 240 MiB/s | 262 MiB/s | data2 is **9% faster** | | Random Read IOPS | 12,700 | 11,800 | ~equal (within noise) | | Random Write IOPS | 15,600 | 16,300 | ~equal | | Mixed Read BW | 37 MiB/s | 46 MiB/s | data2 is **24% faster** | | Mixed Write BW | 17 MiB/s | 20 MiB/s | data2 is **18% faster** | | Mixed p95 lat | 827 ms | 380 ms | data2 has **2.2x lower** tail latency | ### Observations - **Sequential throughput:** `data2` is consistently faster (28% write, 9% read). `data2` is a 2-disk stripe while `data1` is raidz1 (4 disks, 1 parity) -- raidz1 trades some write throughput for parity computation and single-disk fault tolerance. - **Random IOPS:** Both pools are comparable (~12--16k), suggesting similar underlying disk populations (both use 14 TB WD drives). - **Mixed workload:** `data2` substantially outperforms `data1`, especially in tail latency (p95: 380 ms vs. 827 ms). Higher utilization (77% vs. 12%) and raidz1 parity overhead amplify write-amplification and COW overhead under contention. - **Both pools are HDD-class:** Neither approaches SSD/NVMe performance. --- ## Pool: `data2` (HDD stripe, 12% used) ### Results | Test | Bandwidth | IOPS | Avg Latency | Total I/O | |------|-----------|------|-------------|-----------| | Sequential Write (1 MiB, 1 job) | 240 MiB/s (251 MB/s) | 239 | 4.2 ms | 7,191 MiB | | Sequential Read (1 MiB, 1 job) | 262 MiB/s (275 MB/s) | 262 | 3.8 ms | 7,875 MiB | | Random Read (4 KiB, 4j x 32d) | 46 MiB/s (48 MB/s) | 11,800 | 10.2 ms | 1,379 MiB | | Random Write (4 KiB, 4j x 32d) | 64 MiB/s (67 MB/s) | 16,300 | 7.6 ms | 1,912 MiB | | Mixed R/W (256 KiB, 4j x 16d) | R: 46 / W: 20 MiB/s | R: 183 / W: 81 | ~225 ms | R: 1,382 MiB + W: 613 MiB | ### Latency Percentiles | Test | p50 | p95 | p99 | Notes | |------|-----|-----|-----|-------| | Sequential Write | 4.17 ms (avg) | — | max 101 ms | — | | Sequential Read | 3.8 ms (avg) | — | max 75.7 ms | — | | Random Read | 135 us | 363 us | 354 ms | Heavily bimodal -- most reads from ARC cache (sub-ms), uncached reads hit spinning disk (p99.5 = 566 ms). | | Random Write | 6.7 ms | 17.7 ms | 21.6 ms | ZFS TXG batching absorbs random writes, giving smoother latency than random reads. | | Mixed R/W | ~199 ms | ~380 ms | ~840 ms | High latency under mixed load -- ZFS COW overhead + disk seeks when reads and writes compete. | ### Analysis for pclean - **Sequential throughput (~240--262 MiB/s):** This pool is a 2-disk stripe (no redundancy), which gives good throughput from striping across both drives. With 8 Dask workers writing concurrently, aggregate demand could reach ~1--2 GB/s and saturate this pool. Keep `nworkers <= 4` and `cube_chunksize` moderate to avoid contention. - **Random IOPS (~12--16k):** Adequate for CASA table system metadata operations. The bimodal random read latency (sub-ms median, ~350 ms tail) shows ZFS ARC caching is effective for hot data but uncached accesses hit spinning-disk latency. - **Mixed workload (46 + 20 MiB/s):** Significant latency increase (p50 ~200 ms). This is the regime where Dask continuum parallelism operates -- concurrent visibility reads overlapping with intermediate image writes. Continuum imaging with many workers will be I/O-limited rather than CPU-limited. --- ## NVMe vs. HDD Reference Comparison | Metric | nvme pool | Best HDD (data2) | Speedup | |--------|-----------|-------------------|---------| | Seq. Write | 317 MiB/s | 240 MiB/s | 1.3x | | Seq. Read | 1837 MiB/s | 262 MiB/s | 7x | | Random Read IOPS | 21,500 | 11,800 | 1.8x | | Random Write IOPS | 48,300 | 16,300 | 3x | | Mixed Read BW | 693 MiB/s | 46 MiB/s | 15x | | Mixed Write BW | 300 MiB/s | 20 MiB/s | 15x | Compared to a bare NVMe SSD (no ZFS overhead): | Metric | nvme pool (ZFS) | Bare NVMe (typical) | Overhead | |--------|-----------------|---------------------|----------| | Seq. Write | 317 MiB/s | 3,000+ MiB/s | ~9x loss from ZFS COW | | Seq. Read | 1837 MiB/s | 5,000+ MiB/s | ~2.7x | | Random Read IOPS | 21,500 | 500,000+ | ~23x | | Random Write IOPS | 48,300 | 300,000+ | ~6x | --- ## Key Findings 1. **ZFS fragmentation is catastrophic at 91% utilization.** Performance degrades roughly linearly from 12% to 77%, then **falls off a cliff** approaching 91%. The ZFS best-practice threshold of ~80% maximum utilization is confirmed empirically. 2. **Clear utilization--performance ordering:** `data2` (12%) > `data1` (77%) >> `data0` (91%) for every metric except random IOPS where data1 and data2 are equivalent. 3. **The `nvme` pool dominates all HDD pools** by 1.3--15x depending on the access pattern. Mixed workloads see the largest gap (15x). 4. **ZFS adds significant overhead even on NVMe.** Sequential write is only 317 MiB/s (vs. ~3 GB/s bare), a ~9x penalty from COW, checksumming, and TXG commit. This is the cost of ZFS's data integrity guarantees. 5. **All HDD pools are HDD-class.** Neither data1 nor data2 approaches NVMe performance. For I/O-intensive pclean runs, local NVMe is always preferred. --- ## Recommendations for pclean | Scenario | Preferred Pool | Max Workers | Notes | |----------|---------------|-------------|-------| | Working directory (`imagename`, `local_directory`) | **nvme** | CPU/memory limited | NVMe I/O unlikely to be first bottleneck. | | Working directory (ZFS only) | **data2** > data1 | 3--4 | I/O contention becomes limiting before CPU. | | Archival / final products | data1 or data2 | N/A | Sequential write is sufficient for archival. | | **Never** use for I/O-sensitive work | **data0** | 1 (last resort) | 28x worse random read; even single-worker runs will be I/O-bottlenecked. | ### Specific Guidance - **Use local NVMe** for `imagename` and `local_directory`; keep final products on ZFS. - **Set `local_directory`** to fast local storage so Dask spill-to-disk does not add ZFS latency. - **Keep `cube_chunksize` moderate** on HDD pools to avoid many tiny sub-cubes that increase metadata overhead. - **Consider freeing space on `data0`** (target < 80% / ~17.6 TiB used) to recover usable performance. - **ZFS recordsize tuning:** If a pool is dedicated to imaging, `recordsize=1M` (matching dominant I/O block size) may improve sequential throughput. - **Monitor `nvme` utilization.** At 84% it is approaching the caution zone; crossing 90% risks the same fragmentation cliff seen on `data0`.