ALMA Cube Imaging — pclean Performance Report (v1)¶

1. Test Configuration¶

Parameter	Value
Dataset	ALMA `uid___A002_Xf0fd41_X5f5a_target.ms`, SPW 25, IRC+10216
Image size	8000 × 8000 px, 0.0046 arcsec/px
Spectral	specmode=cube, 1000 channels
Weighting	briggsbwtaper, robust=0.5
Deconvolver	hogbom, niter=50000 — configured but 0 iterations executed (see note below)
Parallelism	Dask `LocalCluster`, 10 workers, `cube_chunksize=1`
Subcubes	1000 (one per channel)
Log files	`test_alma_pclean_1.log` / `test_alma_pclean_1.rec`

2. Result Summary¶

Outcome	Value
Completed	Yes
Total wall time	13h 47m 33s
Imaging wall time	10h 25m 17s
Concat wall time	3h 20m (24% of total)
OOM killed	No
Peak RSS	58.6 GB

Note on deconvolution: niter=50000 was set in the run script but no Hogbom iterations were actually executed. SynthesisDeconvolver::setupDeconvolution was called (hogbom configured), initMinorCycle ran, and setupMask (auto-multithresh) ran. However, executeminorcycle never appears in the log. The first convergence check logged “Peak residual within mask : 0” (empty mask), which likely caused iterbotsink.cleanComplete() to set an early-stop state before the mask was populated. The run output is equivalent to PSF + residual + restore only. Fixed (post-v1): SerialImager.run() now calls update_mask() before the first has_converged() check so that initminorcycle() sees a non-empty mask and cleanComplete() correctly returns False.

3. Phase Timings¶

Phase	Start	End	Duration
Cluster start + partition	06:49:54	06:49:56	~2 s
Parallel imaging (1000 subcubes, 10 workers)	06:49:56	17:15:13	10h 25m 17s
Pre-concat collect + handoff	17:15:13	17:15:49	~36 s
`.image` concat (1000 inputs)	17:15:49	18:20:20	64m 31s
`.residual` concat (1000 inputs)	18:20:41	19:07:31	46m 50s
`.psf` concat (1000 inputs)	19:07:51	19:42:50	34m 59s
`.model` concat (1000 inputs)	19:43:10	19:55:48	12m 38s
`.pb` concat (1000 inputs)	19:56:08	20:21:52	25m 44s
`.mask` concat (1000 inputs)	20:22:12	20:34:57	12m 45s
`.sumwt` concat (1000 inputs)	20:35:19	20:35:45	26s
Cleanup	20:35:45	20:37:27	~1m 42s
Total	06:49:54	20:37:27	13h 47m 33s

Extensions concatenated: 7 (.image.pbcor and .weight not created at niter=0).

Note: .model (all-zero) and .sumwt (tiny: 1×1×1×1000) concat in <13 min and 26 s respectively — pixel volume matters far more than file count.

4. Resource Usage¶

Source: psrecord

4.1 Processes¶

Phase	NProc
Startup	2
Workers launched	13 (main + 10 workers + Dask scheduler)
Post-imaging concat	13 (workers idle but alive)

4.2 Memory¶

Metric	Value
Peak RSS	58.6 GB (during parallel imaging)
Peak virtual	81.6 GB
Peak MMap RSS	52.4 GB (CASA memory-mapped image files)
Swap used	~5 MB (negligible — no thrashing)
RSS during concat	~4–5 GB (workers idle, only coordinator active)
System page cache (start)	122 GB
System page cache (end)	15.4 GB (exhausted by concat I/O)

Peak RSS of 58.6 GB is spread across 10 workers simultaneously, each holding one channel’s worth of PSF + sensitivity store + gridder scratch (~5–6 GB each). Memory drops after imaging completes because Dask reclaims worker process memory.

4.3 CPU¶

Phase	CPU % (aggregate)
Worker ramp-up	900–1000% (all 10 cores)
Sustained imaging	~1000–1095% (near-linear 10-core scaling)
Concat (serial `imageconcat`)	~160% (single-threaded I/O)

Peak CPU of 1095% indicates all 10 workers plus the coordinator were fully active. Concat drops to ~160% because ia.imageconcat() is single-threaded and I/O-bound.

4.4 I/O¶

Metric	Value
Total bytes read	6.71 TB
Total bytes written	9.83 TB
Final output dir size	1.42 TB (7 extensions)
Write amplification	~6.9× final output size

Write amplification breakdown (estimated):

Imaging phase (~2.3 TB): 1000 workers × 9 subcube extensions. Per-channel single plane: 8000 × 8000 × 4 B = 256 MB × 9 extensions. Plus weight scratch files, gridder temp files, per-worker table caches.
Concat phase: reads all 1000 subcube inputs (~2.3 TB) per extension and writes merged output (physical copy doubles I/O).

5. Key Observations¶

5.1 Per-channel parallelism avoids OOM¶

Each Dask worker loads only a single channel at a time (~5–6 GB per worker). This keeps peak RSS at 58.6 GB across 10 workers — 45% less than tclean’s 107.8 GB across 25 MPI ranks which hold all 1000 planes simultaneously.

5.2 Concat was the primary bottleneck¶

At 3h 20m (24% of total wall time), subcube concatenation was the single largest post-imaging overhead. Two root causes:

Serial extension loop — 7 independent extensions processed one at a time, making total time proportional to their sum.
Physical-copy only — ia.imageconcat() was called without forwarding the mode parameter, so every pixel was read and rewritten.

5.3 Page cache exhaustion during concat¶

System page cache collapsed from 122 GB → 15.4 GB as the coordinator sequentially read 1000 subcubes per extension. This is disk-throughput bounded, not CPU bounded.

6. Bottleneck Analysis & Optimization Options¶

Option A — Parallel extension concat (implemented)¶

Use ProcessPoolExecutor (spawn context) to run independent extensions in parallel worker processes. CASA releases the GIL during I/O, so processes can overlap I/O and CPU work effectively. With max_workers=4, estimated wall time drops from ~3h 20m to ~65 min.

Option B — Virtual / reference concat (implemented)¶

Forward mode='nomovevirtual' to ia.imageconcat() to create a lightweight reference catalog instead of copying pixels. Near-instant (<1 min) but requires subcubes to remain on disk (keep_subcubes=True).

Option C — `movevirtual` mode (implemented)¶

mode='movevirtual' renames subcube directories into the output image. Near-instant on the same filesystem. Subcubes are consumed in the process.

Option D — Larger `cube_chunksize`¶

With cube_chunksize=10, 1000 channels → 100 subcubes, reducing file-open overhead by ~10×.

`cube_chunksize`	Subcubes	Est. concat time (paged, serial)
1	1000	~3h 20m
5	200	~1.5–2 h
10	100	~45–60 min
50	20	~10 min

Resolution¶

Options A + B + C have been implemented via the concat_mode parameter in ClusterConfig (values: auto, paged, virtual, movevirtual). See notes/concat_optimize.md for implementation details.

7. Code References¶

src/pclean/utils/image_concat.py — concat_images(), concat_subcubes()
src/pclean/parallel/cube_parallel.py — concat_subcubes() call site + _cleanup_subcubes()
src/pclean/config.py — ClusterConfig.concat_mode