ALMA Cube Imaging — pclean vs tclean Comparison (v1)

Source reports: alma_pclean_perf_v1.md / alma_tclean_perf_v1.md

Caveat: The pclean run used niter=50000 (same as tclean) but no Hogbom iterations were actually executed (confirmed: no executeminorcycle calls appear in the log). The most likely cause is that auto-multithresh generated an empty mask on the very first has_converged() check (logged: “Peak residual within mask : 0”), causing iterbotsink.cleanComplete() to signal early convergence before any CLEAN components could be subtracted. The run output is therefore equivalent to niter=0 (PSF + residual + restore only), but the root cause is a convergence-state issue, not a deliberate niter=0 setting. tclean was also killed before any minor cycle; both runs produced PSF-only output. The comparison is valid for the imaging and memory characteristics but not for deconvolution quality.


1. Run Outcome

tclean

pclean

Completed

No — OOM-killed

Yes

Wall time (before kill / total)

~10h 24m (killed)

13h 47m 33s

Last phase reached

initMinorCycle (automask setup)

cleanup

Kill signal

SIGKILL (OOM, rank 1)

Takeaway: tclean ran for 10h 24m and failed before executing a single CLEAN iteration. pclean completed the equivalent output (PSF + residual) in 13h 47m.


2. Memory

Metric

tclean

pclean

Difference

Peak RSS

154.2 GB (06:59:18)

58.6 GB

pclean −62%

Peak virtual

171.2 GB

81.6 GB

pclean −52%

Peak MMap RSS

146.5 GB

52.4 GB

pclean −64%

Swap used

0 MB

~5 MB

Page cache (start)

122.1 GB

122 GB

same

Page cache (end)

7.4 GB (−115 GB)

15.4 GB (−107 GB)

tclean exhausts more

Takeaway: pclean uses ~62% less peak RSS (58.6 vs 154.2 GB). Each Dask worker loads one channel at a time (~5–6 GB), while tclean’s 11 CASA MPI ranks hold all 1000 planes collectively from the start of makePSF. This difference is what separates a successful run from an OOM kill.

Note: tclean’s peak of 154.2 GB occurred at 06:59:18 — just 9 minutes into makePSF — as all weight grids for 1000 channels were allocated upfront. The commonly cited value of 107.8 GB was the post-kill cleanup sample (17:20:01), not the run peak.


3. Phase Timings

Phase

tclean

pclean

Startup / setup

~13 s

~2 s

Total parallel imaging

4h 7m 35s (makePSF only¹)

10h 25m 17s (full pipeline²)

Major cycle 1

6h 16m 24s (partial, killed)

N/A (0 iterations executed³)

Subcube concat

N/A (monolithic)

3h 20m (24% of total)

Cleanup

~5 min (post-kill)

~1m 42s

¹ tclean’s 4h 7m 35s is precisely timed from CASA log markers (INFO makePSFINFO executeMajorCycle).

² pclean’s 10h 25m 17s is the total Dask imaging wall time (06:49:56 → 17:15:13) covering setup, make_psf, make_pb, run_major_cycle (initial residual), and restore — all executed serially within each worker. Per-step breakdown was not logged in the v1 run. Sub-step timing has since been added to SerialImager.run() (INFO lines: make_psf: Xs, make_pb: Xs, etc.) and will be available in future runs.

³ niter=50000 was passed but executeminorcycle never appears in the log. The auto-multithresh mask was empty on the first has_converged() call (“Peak residual within mask : 0”), which likely caused cleanComplete() to set an early-stop state in the iterbotsink. After update_mask() the mask became non-zero (~70–75 mJy peaks), but no Hogbom iterations were dispatched. Fixed (post-v1): SerialImager.run() now calls update_mask() before the first has_converged() check so that initminorcycle() sees a non-empty mask and cleanComplete() correctly returns False.

Takeaway: A direct makePSF-vs-makePSF comparison is not possible from the v1 pclean log. tclean grids all 1000 channels at once across 11 MPI ranks; pclean processes one channel per worker serially — the 10h 25m bound includes all per-channel overhead beyond just PSF computation. pclean’s 3h 20m concat overhead is a known bottleneck addressed via concat_mode (see §6 of alma_pclean_perf_v1.md).


4. CPU

Metric

tclean

pclean

Peak CPU

~1105%

~1095%

Imaging phase

~1000–1105% (11 ranks)

~1000–1095% (10 workers)

Concat phase

N/A

~160% (serial, I/O-bound)

Takeaway: Peak CPU is nearly identical — both saturate ~10–11 cores. pclean’s concat phase drops to ~160% because ia.imageconcat() is single-threaded and disk-throughput limited.


5. I/O

Metric

tclean

pclean

Total reads

2.33 TB

6.71 TB (×2.9 more)

Total writes

8.24 TB

9.83 TB

Final output size

1.48 TB (partial, killed)

1.42 TB (complete, 7 ext.)

Write amplification

~5.6×

~6.9×

Takeaway: pclean reads nearly 3× more data than tclean. The extra ~4.4 TB of reads comes from the concat phase: each extension requires reading all 1000 subcube inputs (~2.3 TB total) to write the merged output. tclean avoids this because it writes a single monolithic image directly, but pays for it in memory.

pclean’s higher write amplification (~6.9× vs ~5.6×) reflects the same physical-copy concat: subcube pixels are read and rewritten into the merged image. The virtual/movevirtual concat modes eliminate this cost.


6. Parallelism Model

tclean

pclean

Framework

CASA MPI (parallel=True)

Dask LocalCluster

Worker count

11 CASA ranks

10 Dask workers

Total NProc (psrecord)

25 (includes MPI infrastructure)

13 (main + workers + scheduler)

Memory model

All channels live across all ranks

One channel per worker at a time

Concat

None (monolithic output)

Serial extension loop (addressed)


7. Summary

tclean

pclean

Completed

No

Yes

Peak RSS

154.2 GB

58.6 GB

makePSF speed

Faster (4h 7m, confirmed)

Unknown (10h 25m is full imaging pipeline)

Total wall time

N/A (killed at 10h 24m)

13h 47m (complete)

OOM risk

Fatal at 128 GB RAM

None

Scalability

Limited by $O(\text{nchan})$ memory

$O(1)$ memory per worker

pclean trades PSF speed for memory safety. For a 1000-channel, 8000×8000 cube on a 128 GB node, tclean is simply not viable regardless of tuning. pclean’s concat overhead (the remaining gap) is addressed by concat_mode.