ALMA Cube Imaging — pclean Performance Report (v1)¶
1. Test Configuration¶
Parameter |
Value |
|---|---|
Dataset |
ALMA |
Image size |
8000 × 8000 px, 0.0046 arcsec/px |
Spectral |
specmode=cube, 1000 channels |
Weighting |
briggsbwtaper, robust=0.5 |
Deconvolver |
hogbom, niter=50000 — configured but 0 iterations executed (see note below) |
Parallelism |
Dask |
Subcubes |
1000 (one per channel) |
Log files |
|
2. Result Summary¶
Outcome |
Value |
|---|---|
Completed |
Yes |
Total wall time |
13h 47m 33s |
Imaging wall time |
10h 25m 17s |
Concat wall time |
3h 20m (24% of total) |
OOM killed |
No |
Peak RSS |
58.6 GB |
Note on deconvolution:
niter=50000was set in the run script but no Hogbom iterations were actually executed.SynthesisDeconvolver::setupDeconvolutionwas called (hogbom configured),initMinorCycleran, andsetupMask(auto-multithresh) ran. However,executeminorcyclenever appears in the log. The first convergence check logged “Peak residual within mask : 0” (empty mask), which likely causediterbotsink.cleanComplete()to set an early-stop state before the mask was populated. The run output is equivalent to PSF + residual + restore only. Fixed (post-v1):SerialImager.run()now callsupdate_mask()before the firsthas_converged()check so thatinitminorcycle()sees a non-empty mask andcleanComplete()correctly returnsFalse.
3. Phase Timings¶
Phase |
Start |
End |
Duration |
|---|---|---|---|
Cluster start + partition |
06:49:54 |
06:49:56 |
~2 s |
Parallel imaging (1000 subcubes, 10 workers) |
06:49:56 |
17:15:13 |
10h 25m 17s |
Pre-concat collect + handoff |
17:15:13 |
17:15:49 |
~36 s |
|
17:15:49 |
18:20:20 |
64m 31s |
|
18:20:41 |
19:07:31 |
46m 50s |
|
19:07:51 |
19:42:50 |
34m 59s |
|
19:43:10 |
19:55:48 |
12m 38s |
|
19:56:08 |
20:21:52 |
25m 44s |
|
20:22:12 |
20:34:57 |
12m 45s |
|
20:35:19 |
20:35:45 |
26s |
Cleanup |
20:35:45 |
20:37:27 |
~1m 42s |
Total |
06:49:54 |
20:37:27 |
13h 47m 33s |
Extensions concatenated: 7 (.image.pbcor and .weight not created at niter=0).
Note: .model (all-zero) and .sumwt (tiny: 1×1×1×1000) concat in <13 min
and 26 s respectively — pixel volume matters far more than file count.
4. Resource Usage¶
Source: psrecord
4.1 Processes¶
Phase |
NProc |
|---|---|
Startup |
2 |
Workers launched |
13 (main + 10 workers + Dask scheduler) |
Post-imaging concat |
13 (workers idle but alive) |
4.2 Memory¶
Metric |
Value |
|---|---|
Peak RSS |
58.6 GB (during parallel imaging) |
Peak virtual |
81.6 GB |
Peak MMap RSS |
52.4 GB (CASA memory-mapped image files) |
Swap used |
~5 MB (negligible — no thrashing) |
RSS during concat |
~4–5 GB (workers idle, only coordinator active) |
System page cache (start) |
122 GB |
System page cache (end) |
15.4 GB (exhausted by concat I/O) |
Peak RSS of 58.6 GB is spread across 10 workers simultaneously, each holding one channel’s worth of PSF + sensitivity store + gridder scratch (~5–6 GB each). Memory drops after imaging completes because Dask reclaims worker process memory.
4.3 CPU¶
Phase |
CPU % (aggregate) |
|---|---|
Worker ramp-up |
900–1000% (all 10 cores) |
Sustained imaging |
~1000–1095% (near-linear 10-core scaling) |
Concat (serial |
~160% (single-threaded I/O) |
Peak CPU of 1095% indicates all 10 workers plus the coordinator were fully
active. Concat drops to ~160% because ia.imageconcat() is single-threaded
and I/O-bound.
4.4 I/O¶
Metric |
Value |
|---|---|
Total bytes read |
6.71 TB |
Total bytes written |
9.83 TB |
Final output dir size |
1.42 TB (7 extensions) |
Write amplification |
~6.9× final output size |
Write amplification breakdown (estimated):
Imaging phase (~2.3 TB): 1000 workers × 9 subcube extensions. Per-channel single plane: 8000 × 8000 × 4 B = 256 MB × 9 extensions. Plus weight scratch files, gridder temp files, per-worker table caches.
Concat phase: reads all 1000 subcube inputs (~2.3 TB) per extension and writes merged output (physical copy doubles I/O).
5. Key Observations¶
5.1 Per-channel parallelism avoids OOM¶
Each Dask worker loads only a single channel at a time (~5–6 GB per worker). This keeps peak RSS at 58.6 GB across 10 workers — 45% less than tclean’s 107.8 GB across 25 MPI ranks which hold all 1000 planes simultaneously.
5.2 Concat was the primary bottleneck¶
At 3h 20m (24% of total wall time), subcube concatenation was the single largest post-imaging overhead. Two root causes:
Serial extension loop — 7 independent extensions processed one at a time, making total time proportional to their sum.
Physical-copy only —
ia.imageconcat()was called without forwarding themodeparameter, so every pixel was read and rewritten.
5.3 Page cache exhaustion during concat¶
System page cache collapsed from 122 GB → 15.4 GB as the coordinator sequentially read 1000 subcubes per extension. This is disk-throughput bounded, not CPU bounded.
6. Bottleneck Analysis & Optimization Options¶
Option A — Parallel extension concat (implemented)¶
Use ProcessPoolExecutor (spawn context) to run independent extensions in parallel worker processes.
CASA releases the GIL during I/O, so processes can overlap I/O and CPU work effectively.
With max_workers=4, estimated wall time drops from ~3h 20m to ~65 min.
Option B — Virtual / reference concat (implemented)¶
Forward mode='nomovevirtual' to ia.imageconcat() to create a lightweight
reference catalog instead of copying pixels. Near-instant (<1 min) but
requires subcubes to remain on disk (keep_subcubes=True).
Option C — movevirtual mode (implemented)¶
mode='movevirtual' renames subcube directories into the output image.
Near-instant on the same filesystem. Subcubes are consumed in the process.
Option D — Larger cube_chunksize¶
With cube_chunksize=10, 1000 channels → 100 subcubes, reducing file-open
overhead by ~10×.
|
Subcubes |
Est. concat time (paged, serial) |
|---|---|---|
1 |
1000 |
~3h 20m |
5 |
200 |
~1.5–2 h |
10 |
100 |
~45–60 min |
50 |
20 |
~10 min |
Resolution¶
Options A + B + C have been implemented via the concat_mode parameter
in ClusterConfig (values: auto, paged, virtual, movevirtual).
See notes/concat_optimize.md for implementation details.
7. Code References¶
src/pclean/utils/image_concat.py—concat_images(),concat_subcubes()src/pclean/parallel/cube_parallel.py—concat_subcubes()call site +_cleanup_subcubes()src/pclean/config.py—ClusterConfig.concat_mode