ALMA Cube Imaging — pclean vs tclean Comparison (v1)¶
Source reports: alma_pclean_perf_v1.md / alma_tclean_perf_v1.md
Caveat: The pclean run used
niter=50000(same as tclean) but no Hogbom iterations were actually executed (confirmed: noexecuteminorcyclecalls appear in the log). The most likely cause is thatauto-multithreshgenerated an empty mask on the very firsthas_converged()check (logged: “Peak residual within mask : 0”), causingiterbotsink.cleanComplete()to signal early convergence before any CLEAN components could be subtracted. The run output is therefore equivalent toniter=0(PSF + residual + restore only), but the root cause is a convergence-state issue, not a deliberateniter=0setting. tclean was also killed before any minor cycle; both runs produced PSF-only output. The comparison is valid for the imaging and memory characteristics but not for deconvolution quality.
1. Run Outcome¶
tclean |
pclean |
|
|---|---|---|
Completed |
No — OOM-killed |
Yes |
Wall time (before kill / total) |
~10h 24m (killed) |
13h 47m 33s |
Last phase reached |
|
cleanup |
Kill signal |
SIGKILL (OOM, rank 1) |
— |
Takeaway: tclean ran for 10h 24m and failed before executing a single CLEAN iteration. pclean completed the equivalent output (PSF + residual) in 13h 47m.
2. Memory¶
Metric |
tclean |
pclean |
Difference |
|---|---|---|---|
Peak RSS |
154.2 GB (06:59:18) |
58.6 GB |
pclean −62% |
Peak virtual |
171.2 GB |
81.6 GB |
pclean −52% |
Peak MMap RSS |
146.5 GB |
52.4 GB |
pclean −64% |
Swap used |
0 MB |
~5 MB |
— |
Page cache (start) |
122.1 GB |
122 GB |
same |
Page cache (end) |
7.4 GB (−115 GB) |
15.4 GB (−107 GB) |
tclean exhausts more |
Takeaway: pclean uses ~62% less peak RSS (58.6 vs 154.2 GB). Each Dask
worker loads one channel at a time (~5–6 GB), while tclean’s 11 CASA MPI ranks
hold all 1000 planes collectively from the start of makePSF. This difference
is what separates a successful run from an OOM kill.
Note: tclean’s peak of 154.2 GB occurred at 06:59:18 — just 9 minutes into
makePSF— as all weight grids for 1000 channels were allocated upfront. The commonly cited value of 107.8 GB was the post-kill cleanup sample (17:20:01), not the run peak.
3. Phase Timings¶
Phase |
tclean |
pclean |
|---|---|---|
Startup / setup |
~13 s |
~2 s |
Total parallel imaging |
4h 7m 35s (makePSF only¹) |
10h 25m 17s (full pipeline²) |
Major cycle 1 |
6h 16m 24s (partial, killed) |
N/A (0 iterations executed³) |
Subcube concat |
N/A (monolithic) |
3h 20m (24% of total) |
Cleanup |
~5 min (post-kill) |
~1m 42s |
¹ tclean’s 4h 7m 35s is precisely timed from CASA log markers (INFO makePSF
→ INFO executeMajorCycle).
² pclean’s 10h 25m 17s is the total Dask imaging wall time (06:49:56 →
17:15:13) covering setup, make_psf, make_pb, run_major_cycle (initial
residual), and restore — all executed serially within each worker. Per-step
breakdown was not logged in the v1 run. Sub-step timing has since been added
to SerialImager.run() (INFO lines: make_psf: Xs, make_pb: Xs, etc.) and
will be available in future runs.
³ niter=50000 was passed but executeminorcycle never appears in the log.
The auto-multithresh mask was empty on the first has_converged() call
(“Peak residual within mask : 0”), which likely caused cleanComplete() to
set an early-stop state in the iterbotsink. After update_mask() the mask
became non-zero (~70–75 mJy peaks), but no Hogbom iterations were dispatched.
Fixed (post-v1): SerialImager.run() now calls update_mask() before
the first has_converged() check so that initminorcycle() sees a
non-empty mask and cleanComplete() correctly returns False.
Takeaway: A direct makePSF-vs-makePSF comparison is not possible from the
v1 pclean log. tclean grids all 1000 channels at once across 11 MPI ranks;
pclean processes one channel per worker serially — the 10h 25m bound includes
all per-channel overhead beyond just PSF computation. pclean’s 3h 20m concat
overhead is a known bottleneck addressed via concat_mode
(see §6 of alma_pclean_perf_v1.md).
4. CPU¶
Metric |
tclean |
pclean |
|---|---|---|
Peak CPU |
~1105% |
~1095% |
Imaging phase |
~1000–1105% (11 ranks) |
~1000–1095% (10 workers) |
Concat phase |
N/A |
~160% (serial, I/O-bound) |
Takeaway: Peak CPU is nearly identical — both saturate ~10–11 cores.
pclean’s concat phase drops to ~160% because ia.imageconcat() is
single-threaded and disk-throughput limited.
5. I/O¶
Metric |
tclean |
pclean |
|---|---|---|
Total reads |
2.33 TB |
6.71 TB (×2.9 more) |
Total writes |
8.24 TB |
9.83 TB |
Final output size |
1.48 TB (partial, killed) |
1.42 TB (complete, 7 ext.) |
Write amplification |
~5.6× |
~6.9× |
Takeaway: pclean reads nearly 3× more data than tclean. The extra ~4.4 TB of reads comes from the concat phase: each extension requires reading all 1000 subcube inputs (~2.3 TB total) to write the merged output. tclean avoids this because it writes a single monolithic image directly, but pays for it in memory.
pclean’s higher write amplification (~6.9× vs ~5.6×) reflects the same
physical-copy concat: subcube pixels are read and rewritten into the merged
image. The virtual/movevirtual concat modes eliminate this cost.
6. Parallelism Model¶
tclean |
pclean |
|
|---|---|---|
Framework |
CASA MPI ( |
Dask |
Worker count |
11 CASA ranks |
10 Dask workers |
Total NProc (psrecord) |
25 (includes MPI infrastructure) |
13 (main + workers + scheduler) |
Memory model |
All channels live across all ranks |
One channel per worker at a time |
Concat |
None (monolithic output) |
Serial extension loop (addressed) |
7. Summary¶
tclean |
pclean |
|
|---|---|---|
Completed |
No |
Yes |
Peak RSS |
154.2 GB |
58.6 GB |
makePSF speed |
Faster (4h 7m, confirmed) |
Unknown (10h 25m is full imaging pipeline) |
Total wall time |
N/A (killed at 10h 24m) |
13h 47m (complete) |
OOM risk |
Fatal at 128 GB RAM |
None |
Scalability |
Limited by $O(\text{nchan})$ memory |
$O(1)$ memory per worker |
pclean trades PSF speed for memory safety. For a 1000-channel, 8000×8000
cube on a 128 GB node, tclean is simply not viable regardless of tuning.
pclean’s concat overhead (the remaining gap) is addressed by concat_mode.