ALMA Cube Imaging — pclean vs tclean Comparison (v1)¶

Source reports: alma_pclean_perf_v1.md / alma_tclean_perf_v1.md

Caveat: The pclean run used niter=50000 (same as tclean) but no Hogbom iterations were actually executed (confirmed: no executeminorcycle calls appear in the log). The most likely cause is that auto-multithresh generated an empty mask on the very first has_converged() check (logged: “Peak residual within mask : 0”), causing iterbotsink.cleanComplete() to signal early convergence before any CLEAN components could be subtracted. The run output is therefore equivalent to niter=0 (PSF + residual + restore only), but the root cause is a convergence-state issue, not a deliberate niter=0 setting. tclean was also killed before any minor cycle; both runs produced PSF-only output. The comparison is valid for the imaging and memory characteristics but not for deconvolution quality.

1. Run Outcome¶

	tclean	pclean
Completed	No — OOM-killed	Yes
Wall time (before kill / total)	~10h 24m (killed)	13h 47m 33s
Last phase reached	`initMinorCycle` (automask setup)	cleanup
Kill signal	SIGKILL (OOM, rank 1)	—

Takeaway: tclean ran for 10h 24m and failed before executing a single CLEAN iteration. pclean completed the equivalent output (PSF + residual) in 13h 47m.

2. Memory¶

Metric	tclean	pclean	Difference
Peak RSS	154.2 GB (06:59:18)	58.6 GB	pclean −62%
Peak virtual	171.2 GB	81.6 GB	pclean −52%
Peak MMap RSS	146.5 GB	52.4 GB	pclean −64%
Swap used	0 MB	~5 MB	—
Page cache (start)	122.1 GB	122 GB	same
Page cache (end)	7.4 GB (−115 GB)	15.4 GB (−107 GB)	tclean exhausts more

Takeaway: pclean uses ~62% less peak RSS (58.6 vs 154.2 GB). Each Dask worker loads one channel at a time (~5–6 GB), while tclean’s 11 CASA MPI ranks hold all 1000 planes collectively from the start of makePSF. This difference is what separates a successful run from an OOM kill.

Note: tclean’s peak of 154.2 GB occurred at 06:59:18 — just 9 minutes into makePSF — as all weight grids for 1000 channels were allocated upfront. The commonly cited value of 107.8 GB was the post-kill cleanup sample (17:20:01), not the run peak.

3. Phase Timings¶

Phase	tclean	pclean
Startup / setup	~13 s	~2 s
Total parallel imaging	4h 7m 35s (makePSF only¹)	10h 25m 17s (full pipeline²)
Major cycle 1	6h 16m 24s (partial, killed)	N/A (0 iterations executed³)
Subcube concat	N/A (monolithic)	3h 20m (24% of total)
Cleanup	~5 min (post-kill)	~1m 42s

¹ tclean’s 4h 7m 35s is precisely timed from CASA log markers (INFO makePSF → INFO executeMajorCycle).

² pclean’s 10h 25m 17s is the total Dask imaging wall time (06:49:56 → 17:15:13) covering setup, make_psf, make_pb, run_major_cycle (initial residual), and restore — all executed serially within each worker. Per-step breakdown was not logged in the v1 run. Sub-step timing has since been added to SerialImager.run() (INFO lines: make_psf: Xs, make_pb: Xs, etc.) and will be available in future runs.

³ niter=50000 was passed but executeminorcycle never appears in the log. The auto-multithresh mask was empty on the first has_converged() call (“Peak residual within mask : 0”), which likely caused cleanComplete() to set an early-stop state in the iterbotsink. After update_mask() the mask became non-zero (~70–75 mJy peaks), but no Hogbom iterations were dispatched. Fixed (post-v1): SerialImager.run() now calls update_mask() before the first has_converged() check so that initminorcycle() sees a non-empty mask and cleanComplete() correctly returns False.

Takeaway: A direct makePSF-vs-makePSF comparison is not possible from the v1 pclean log. tclean grids all 1000 channels at once across 11 MPI ranks; pclean processes one channel per worker serially — the 10h 25m bound includes all per-channel overhead beyond just PSF computation. pclean’s 3h 20m concat overhead is a known bottleneck addressed via concat_mode (see §6 of alma_pclean_perf_v1.md).

4. CPU¶

Metric	tclean	pclean
Peak CPU	~1105%	~1095%
Imaging phase	~1000–1105% (11 ranks)	~1000–1095% (10 workers)
Concat phase	N/A	~160% (serial, I/O-bound)

Takeaway: Peak CPU is nearly identical — both saturate ~10–11 cores. pclean’s concat phase drops to ~160% because ia.imageconcat() is single-threaded and disk-throughput limited.

5. I/O¶

Metric	tclean	pclean
Total reads	2.33 TB	6.71 TB (×2.9 more)
Total writes	8.24 TB	9.83 TB
Final output size	1.48 TB (partial, killed)	1.42 TB (complete, 7 ext.)
Write amplification	~5.6×	~6.9×

Takeaway: pclean reads nearly 3× more data than tclean. The extra ~4.4 TB of reads comes from the concat phase: each extension requires reading all 1000 subcube inputs (~2.3 TB total) to write the merged output. tclean avoids this because it writes a single monolithic image directly, but pays for it in memory.

pclean’s higher write amplification (~6.9× vs ~5.6×) reflects the same physical-copy concat: subcube pixels are read and rewritten into the merged image. The virtual/movevirtual concat modes eliminate this cost.

6. Parallelism Model¶

	tclean	pclean
Framework	CASA MPI (`parallel=True`)	Dask `LocalCluster`
Worker count	11 CASA ranks	10 Dask workers
Total NProc (psrecord)	25 (includes MPI infrastructure)	13 (main + workers + scheduler)
Memory model	All channels live across all ranks	One channel per worker at a time
Concat	None (monolithic output)	Serial extension loop (addressed)

7. Summary¶

	tclean	pclean
Completed	No	Yes
Peak RSS	154.2 GB	58.6 GB
makePSF speed	Faster (4h 7m, confirmed)	Unknown (10h 25m is full imaging pipeline)
Total wall time	N/A (killed at 10h 24m)	13h 47m (complete)
OOM risk	Fatal at 128 GB RAM	None
Scalability	Limited by $O(\text{nchan})$ memory	$O(1)$ memory per worker

pclean trades PSF speed for memory safety. For a 1000-channel, 8000×8000 cube on a 128 GB node, tclean is simply not viable regardless of tuning. pclean’s concat overhead (the remaining gap) is addressed by concat_mode.

Detailed Reports