ALMA Cube Imaging — pclean Performance Report (v1)

1. Test Configuration

Parameter

Value

Dataset

ALMA uid___A002_Xf0fd41_X5f5a_target.ms, SPW 25, IRC+10216

Image size

8000 × 8000 px, 0.0046 arcsec/px

Spectral

specmode=cube, 1000 channels

Weighting

briggsbwtaper, robust=0.5

Deconvolver

hogbom, niter=50000 — configured but 0 iterations executed (see note below)

Parallelism

Dask LocalCluster, 10 workers, cube_chunksize=1

Subcubes

1000 (one per channel)

Log files

test_alma_pclean_1.log / test_alma_pclean_1.rec

2. Result Summary

Outcome

Value

Completed

Yes

Total wall time

13h 47m 33s

Imaging wall time

10h 25m 17s

Concat wall time

3h 20m (24% of total)

OOM killed

No

Peak RSS

58.6 GB

Note on deconvolution: niter=50000 was set in the run script but no Hogbom iterations were actually executed. SynthesisDeconvolver::setupDeconvolution was called (hogbom configured), initMinorCycle ran, and setupMask (auto-multithresh) ran. However, executeminorcycle never appears in the log. The first convergence check logged “Peak residual within mask : 0” (empty mask), which likely caused iterbotsink.cleanComplete() to set an early-stop state before the mask was populated. The run output is equivalent to PSF + residual + restore only. Fixed (post-v1): SerialImager.run() now calls update_mask() before the first has_converged() check so that initminorcycle() sees a non-empty mask and cleanComplete() correctly returns False.

3. Phase Timings

Phase

Start

End

Duration

Cluster start + partition

06:49:54

06:49:56

~2 s

Parallel imaging (1000 subcubes, 10 workers)

06:49:56

17:15:13

10h 25m 17s

Pre-concat collect + handoff

17:15:13

17:15:49

~36 s

.image concat (1000 inputs)

17:15:49

18:20:20

64m 31s

.residual concat (1000 inputs)

18:20:41

19:07:31

46m 50s

.psf concat (1000 inputs)

19:07:51

19:42:50

34m 59s

.model concat (1000 inputs)

19:43:10

19:55:48

12m 38s

.pb concat (1000 inputs)

19:56:08

20:21:52

25m 44s

.mask concat (1000 inputs)

20:22:12

20:34:57

12m 45s

.sumwt concat (1000 inputs)

20:35:19

20:35:45

26s

Cleanup

20:35:45

20:37:27

~1m 42s

Total

06:49:54

20:37:27

13h 47m 33s

Extensions concatenated: 7 (.image.pbcor and .weight not created at niter=0).

Note: .model (all-zero) and .sumwt (tiny: 1×1×1×1000) concat in <13 min and 26 s respectively — pixel volume matters far more than file count.

4. Resource Usage

Source: psrecord

4.1 Processes

Phase

NProc

Startup

2

Workers launched

13 (main + 10 workers + Dask scheduler)

Post-imaging concat

13 (workers idle but alive)

4.2 Memory

Metric

Value

Peak RSS

58.6 GB (during parallel imaging)

Peak virtual

81.6 GB

Peak MMap RSS

52.4 GB (CASA memory-mapped image files)

Swap used

~5 MB (negligible — no thrashing)

RSS during concat

~4–5 GB (workers idle, only coordinator active)

System page cache (start)

122 GB

System page cache (end)

15.4 GB (exhausted by concat I/O)

Peak RSS of 58.6 GB is spread across 10 workers simultaneously, each holding one channel’s worth of PSF + sensitivity store + gridder scratch (~5–6 GB each). Memory drops after imaging completes because Dask reclaims worker process memory.

4.3 CPU

Phase

CPU % (aggregate)

Worker ramp-up

900–1000% (all 10 cores)

Sustained imaging

~1000–1095% (near-linear 10-core scaling)

Concat (serial imageconcat)

~160% (single-threaded I/O)

Peak CPU of 1095% indicates all 10 workers plus the coordinator were fully active. Concat drops to ~160% because ia.imageconcat() is single-threaded and I/O-bound.

4.4 I/O

Metric

Value

Total bytes read

6.71 TB

Total bytes written

9.83 TB

Final output dir size

1.42 TB (7 extensions)

Write amplification

~6.9× final output size

Write amplification breakdown (estimated):

  1. Imaging phase (~2.3 TB): 1000 workers × 9 subcube extensions. Per-channel single plane: 8000 × 8000 × 4 B = 256 MB × 9 extensions. Plus weight scratch files, gridder temp files, per-worker table caches.

  2. Concat phase: reads all 1000 subcube inputs (~2.3 TB) per extension and writes merged output (physical copy doubles I/O).

5. Key Observations

5.1 Per-channel parallelism avoids OOM

Each Dask worker loads only a single channel at a time (~5–6 GB per worker). This keeps peak RSS at 58.6 GB across 10 workers — 45% less than tclean’s 107.8 GB across 25 MPI ranks which hold all 1000 planes simultaneously.

5.2 Concat was the primary bottleneck

At 3h 20m (24% of total wall time), subcube concatenation was the single largest post-imaging overhead. Two root causes:

  1. Serial extension loop — 7 independent extensions processed one at a time, making total time proportional to their sum.

  2. Physical-copy onlyia.imageconcat() was called without forwarding the mode parameter, so every pixel was read and rewritten.

5.3 Page cache exhaustion during concat

System page cache collapsed from 122 GB → 15.4 GB as the coordinator sequentially read 1000 subcubes per extension. This is disk-throughput bounded, not CPU bounded.

6. Bottleneck Analysis & Optimization Options

Option A — Parallel extension concat (implemented)

Use ProcessPoolExecutor (spawn context) to run independent extensions in parallel worker processes. CASA releases the GIL during I/O, so processes can overlap I/O and CPU work effectively. With max_workers=4, estimated wall time drops from ~3h 20m to ~65 min.

Option B — Virtual / reference concat (implemented)

Forward mode='nomovevirtual' to ia.imageconcat() to create a lightweight reference catalog instead of copying pixels. Near-instant (<1 min) but requires subcubes to remain on disk (keep_subcubes=True).

Option C — movevirtual mode (implemented)

mode='movevirtual' renames subcube directories into the output image. Near-instant on the same filesystem. Subcubes are consumed in the process.

Option D — Larger cube_chunksize

With cube_chunksize=10, 1000 channels → 100 subcubes, reducing file-open overhead by ~10×.

cube_chunksize

Subcubes

Est. concat time (paged, serial)

1

1000

~3h 20m

5

200

~1.5–2 h

10

100

~45–60 min

50

20

~10 min

Resolution

Options A + B + C have been implemented via the concat_mode parameter in ClusterConfig (values: auto, paged, virtual, movevirtual). See notes/concat_optimize.md for implementation details.

7. Code References

  • src/pclean/utils/image_concat.pyconcat_images(), concat_subcubes()

  • src/pclean/parallel/cube_parallel.pyconcat_subcubes() call site + _cleanup_subcubes()

  • src/pclean/config.pyClusterConfig.concat_mode