ALMA Cube Imaging: pclean Performance Report (v2)
1. Test Configuration
| Parameter | Value |
|---|---|
| Dataset | ALMA |
| Image size | 8000 × 8000 px, 0.0046 arcsec/px |
| Spectral | specmode=cube, 1000 channels, start=268.5 GHz, width=0.244 MHz |
| Weighting | briggsbwtaper, robust=0.5, fracBW=0.000907945 |
| Deconvolver | hogbom, niter=50000, threshold=4.0 mJy |
| Masking | auto-multithresh (sidelobethreshold=2.5, noisethreshold=5.0, lownoisethreshold=1.5, negativethreshold=7.0, minbeamfrac=0.3, growiterations=50) |
| Parallelism | Dask |
| Subcubes | 1000 (one per channel) |
| Concat | |
| Log files | |
| Script | |
Changes from v1
| Aspect | v1 | v2 |
|---|---|---|
| Deconvolution | 0 iterations (empty-mask bug) | ~204 Hogbom iters / channel (bug fixed) |
| Concat mode | serial, | parallel ( |
| concat extensions | 7 (serial loop) | 9 (7 physical + parallel workers) |
| Stopping criterion | n/a | peak residual divergence (3× from minimum) |
Bug fix (post-v1):
SerialImager.run() now calls update_mask() before the first has_converged() check, so initminorcycle() sees a populated auto-multithresh mask and deconvolution proceeds correctly.
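The effect of that ordering can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the pclean code: `ToyImager`, its `fixed` flag, the 25-pixel mask, and the 3-round cap are all hypothetical, and only the method names `run`, `update_mask`, and `has_converged` are borrowed from the description above.

```python
class ToyImager:
    """Toy stand-in for an imaging loop; hypothetical, not SerialImager itself."""

    def __init__(self, fixed: bool) -> None:
        self.fixed = fixed        # True mimics the post-v1 call ordering
        self.mask_pixels = 0      # the mask is empty until update_mask() runs
        self.rounds = 0           # counts minor-cycle rounds that actually ran

    def update_mask(self) -> None:
        # Pretend auto-masking found 25 pixels of emission (made-up number).
        self.mask_pixels = 25

    def has_converged(self) -> bool:
        # With an empty mask there is nothing to clean, so the loop reports
        # convergence immediately. The cap of 3 rounds is arbitrary.
        return self.mask_pixels == 0 or self.rounds >= 3

    def run(self) -> int:
        if self.fixed:
            self.update_mask()    # populate the mask before the first check
        while not self.has_converged():
            self.rounds += 1      # stands in for one minor-cycle round
            self.update_mask()
        return self.rounds


print(ToyImager(fixed=False).run())  # v1-like ordering: no rounds run
print(ToyImager(fixed=True).run())   # v2-like ordering: rounds run
```

In the toy, the only difference between the two runs is whether the mask is populated before the first convergence check, which is the behavior change the bug fix describes.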
2. Result Summary
| Outcome | Value |
|---|---|
| Completed | Yes |
| Total wall time | 68h 22m 49s (246169 s) |
| Imaging wall time | ~65h 23m (2026-03-06 18:30:08 → 2026-03-09 11:53:14) |
| Concat wall time | 2h 59m 41s (10781 s, ~4.4% of total) |
| OOM killed | No |
| Peak RSS | 44.1 GB (during parallel imaging) |
3. Phase Timings
| Phase | Start | End | Duration |
|---|---|---|---|
| Cluster start | 2026-03-06 18:30:06 | 18:30:08 | ~2 s |
| Data selection + freq grid | 18:30:08 | 18:33:13 | ~3m 5s |
| Parallel imaging (1000 subcubes, 10 workers) | 18:33:13 | 2026-03-09 11:53:14 | ~65h 20m |
| | 11:53:14 | 12:17:11 | 23m 57s (1434 s) |
| | 12:18:51 | 13:34:07 | 1h 40m 56s (6049 s) |
| | 13:35:06 | 14:18:31 | 2h 25m 14s (8714 s)† |
| | 14:18:58 | 14:25:15 | 2h 31m 57s (9117 s)† |
| | 14:25:59 | 14:26:09 | 7m 37s (457 s)† |
| | 12:55:02 | 14:38:27 | 2h 21m 15s (8475 s)† |
| | 14:22:23 | 14:52:55 | 1h 18m 48s (4728 s)† |
| Total | 2026-03-06 18:30:06 | 2026-03-09 14:52:55 | 68h 22m 49s |
† With max_workers=4, extensions are concatenated in parallel. Wall
times listed are elapsed per extension; the critical-path wall time for
concatenation is the longest extension (.residual, 9117 s). Reported
durations are the per-extension I/O times logged by image_concat, which
overlap because up to 4 run concurrently.
Total concat wall time: 10781 s ≈ 3h 0m (parallel execution, limited by the slowest extension set finishing).
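As a rough cross-check of the overlap, the per-extension durations can be summed and compared with the concat wall time. The sketch below uses only the seconds values from the table; treating all seven rows as per-extension concat times is an assumption, since only five carry the † marker.

```python
# Per-extension durations in seconds, in table order.
# Assumption: all seven unnamed rows are per-extension concat times.
per_extension_s = [1434, 6049, 8714, 9117, 457, 8475, 4728]
concat_wall_s = 10781  # reported total concat wall time

summed_s = sum(per_extension_s)            # time a strictly serial loop would need
effective_overlap = summed_s / concat_wall_s
longest_s = max(per_extension_s)           # the longest single extension

print(f"sum of per-extension times: {summed_s} s ({summed_s / 3600:.1f} h)")
print(f"effective overlap: {effective_overlap:.1f}x (upper bound is max_workers=4)")
print(f"longest extension: {longest_s} s, {longest_s / concat_wall_s:.0%} of concat wall time")
```

An effective overlap close to 4 would indicate that the worker pool stayed busy for most of the concat phase; a value well below 4 would point to idle workers or to one extension dominating the critical path.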
Per-Subcube Timing Breakdown
| Step | Min | Mean | Median | Max |
|---|---|---|---|---|
| | 210 s | 329 s | 365 s | 471 s |
| | 252 s | 285 s | 283 s | 350 s |
| deconvolution (minor cycles, in-cycle) | n/a | ~1500 s | n/a | n/a |
| | 7 s | 12 s | 11 s | 46 s |
| run total | 1596 s | 2342 s | 2431 s | 3267 s |
Each subcube completed exactly 1 major cycle (major_cycle(0)) containing multiple minor-cycle rounds before triggering the stopping criterion.
4. Deconvolution Statistics
| Metric | Value |
|---|---|
| Minor cycles (executeminorcycle calls) | 2971 across 1000 subcubes |
| Mean minor-cycle rounds / subcube | ~3.0 |
| Total Hogbom iterations | 203,902 |
| Mean Hogbom iters / minor-cycle round | 68.6 |
| Mean Hogbom iters / subcube | ~204 |
| Convergence: cyclethreshold | 2971 (inner loop stops) |
| Convergence: peak residual divergence (3×) | 1000 (outer loop stops) |
| Major cycles / subcube | 1 |
All 1000 subcubes terminated via the pclean stopping criterion “peak residual increased by more than 3× from minimum” after ~3 minor-cycle rounds within a single major cycle. This indicates the auto-multithresh mask is allowing deconvolution but the peak residual diverges after a few rounds, consistent with a source that requires careful cleaning depth control.
First-batch deconvolution example (subcube.0–9, channel 0–9):
| Round | Typical iters | Peak residual (Jy) | Model flux (Jy) | Stop reason |
|---|---|---|---|---|
| 1 | 64–71 | 0.27→0.08 | 0→0.9 | cyclethreshold |
| 2 | 54–60 | 0.44→0.13 | 0.9→−0.4 | cyclethreshold |
| 3 | 50–54 | 0.72→0.22 | −0.4→1.5 | cyclethreshold |
| n/a | n/a | peak > 3× min | n/a | divergence |
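The "3× from minimum" rule can be sketched as a running-minimum check. This is an illustrative reading of the criterion, not the pclean implementation: the function name `divergence_stop_index` is hypothetical, and sampling the peak residual at the end of each round is an assumption. The first three values are the end-of-round peaks from the table; the fourth (0.30 Jy) is invented to show the trigger.

```python
from typing import Optional, Sequence


def divergence_stop_index(peaks: Sequence[float], factor: float = 3.0) -> Optional[int]:
    """Return the index of the first peak exceeding factor x the running minimum.

    Returns None if no sample diverges. Hypothetical helper for illustration.
    """
    running_min = float("inf")
    for i, peak in enumerate(peaks):
        if peak > factor * running_min:
            return i
        running_min = min(running_min, peak)
    return None


# End-of-round peak residuals (Jy) from the table: no trigger yet, since
# 0.13 and 0.22 are both below 3 x 0.08 = 0.24.
print(divergence_stop_index([0.08, 0.13, 0.22]))

# Appending a made-up fourth sample of 0.30 Jy crosses the 0.24 Jy limit.
print(divergence_stop_index([0.08, 0.13, 0.22, 0.30]))
```

Note that the start-of-round peaks in the table (0.44 and 0.72 Jy) already exceed 0.24 Jy, so the exact point at which pclean samples the residual matters for when the rule fires.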
5. Resource Usage
Source: psrecord (test_alma_pclean_2.rec)
Note: psrecord monitoring covered 2026-03-06 18:30:06 → 2026-03-08 02:39:35 (~32 h), ending during the imaging phase. No psrecord data is available for roughly the second half of imaging or for the concat phase.
5.1 Processes
| Phase | NProc |
|---|---|
| Startup | 2 |
| Workers launched | 13 (main + scheduler + 10 workers + nanny) |
| Peak NProc | 13 |
5.2 Memory
| Metric | Value |
|---|---|
| Peak RSS | 44.1 GB (44070 MB) |
| Peak virtual | 66.8 GB (68377 MB) |
| Peak MMap RSS | 29.2 GB (29872 MB, CASA memory-mapped image files) |
| Swap used | 0 (none) |
| System page cache (start) | 43.8 GB |
| System page cache (end, at monitoring stop) | 54.8 GB |
Peak RSS of 44.1 GB is 25% lower than v1’s 58.6 GB despite deconvolution actually executing. This is within the expected range: the deconvolution inner loop (Hogbom) operates on already-allocated image buffers and does not allocate significant additional memory beyond what make_psf + major_cycle create.
Estimated per-worker RSS: ~4.0–4.4 GB (44 GB / 10 workers + overhead).
5.3 CPU
| Phase | CPU % (aggregate) |
|---|---|
| Sustained imaging | ~980–1080% (near-linear 10-core scaling) |
| Peak CPU | 1078% |
Peak CPU of 1078% is consistent with v1 (1095%), confirming all 10 workers were fully utilized during imaging.
5.4 I/O
| Metric | Value |
|---|---|
| Total bytes read | 19.0 TB (19,881 GB) |
| Total bytes written | 18.1 TB (19,024 GB) |
| Final output dir size | 12.5 TB (13,076 GB)† |
| Write amplification | ~1.5× final dir size |
† Directory size includes both the 1000 subcubes (keep_subcubes=True) and
the 7 concatenated output extensions — this is much larger than v1’s 1.42 TB
because subcubes are retained.
I/O comparison with v1:
| Metric | v1 | v2 | Factor |
|---|---|---|---|
| Read | 6.71 TB | 19.0 TB | 2.8× |
| Write | 9.83 TB | 18.1 TB | 1.8× |
| Dir size | 1.42 TB | 12.5 TB | 8.8׆ |
† v1 deleted subcubes; v2 retained them (keep_subcubes=True).
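Part of the higher I/O in v2 comes from the deconvolution minor cycles: each minor cycle reads/writes the residual, model, and mask images (3 × 256 MB/plane × ~3 rounds × 1000 channels ≈ 2.3 TB additional). This estimate covers only a fraction of the measured increase (about 12.3 TB more read and 8.3 TB more written than v1); the remainder is not itemized in this report. Because total wall time grew 5.0× while read and write volumes grew only 2.8× and 1.8×, the mean sustained I/O bandwidth is lower than in v1.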
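The minor-cycle estimate and the mean sustained rates can be reproduced from the numbers in this report. The sketch below assumes 4-byte (float32) pixels, which is what makes an 8000 × 8000 plane come out at 256 MB, and uses decimal units (1 TB = 10^6 MB); the helper name `mean_rate_mb_s` is hypothetical.

```python
# One image plane of the 8000 x 8000 px grid, assuming float32 pixels.
plane_mb = 8000 * 8000 * 4 / 1e6

# Residual, model, and mask (3 images) touched in ~3 rounds for 1000 channels.
minor_cycle_tb = 3 * plane_mb * 3 * 1000 / 1e6


def mean_rate_mb_s(volume_tb: float, wall_s: float) -> float:
    """Mean transfer rate in MB/s over the whole run (decimal units)."""
    return volume_tb * 1e6 / wall_s


V1_WALL_S = 13 * 3600 + 48 * 60   # 13h 48m total wall time (v1)
V2_WALL_S = 246169                # 68h 22m 49s total wall time (v2)

rates = {
    "v1 read": mean_rate_mb_s(6.71, V1_WALL_S),
    "v1 write": mean_rate_mb_s(9.83, V1_WALL_S),
    "v2 read": mean_rate_mb_s(19.0, V2_WALL_S),
    "v2 write": mean_rate_mb_s(18.1, V2_WALL_S),
}

print(f"plane size: {plane_mb:.0f} MB, minor-cycle estimate: {minor_cycle_tb:.1f} TB")
for label, rate in rates.items():
    print(f"{label}: {rate:.0f} MB/s")
```

These are whole-run averages; they say nothing about burst rates during imaging or concat, which psrecord totals alone cannot resolve.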
6. Comparison with v1
| Metric | v1 | v2 | Change |
|---|---|---|---|
| Total wall time | 13h 48m | 68h 23m | 5.0× slower |
| Imaging wall time | 10h 25m | ~65h 20m | 6.3× slower |
| Concat wall time | 3h 20m | 3h 0m | 10% faster |
| Deconvolution | 0 iters | 203,902 iters | ∞ (bug fixed) |
| Major cycles / subcube | 0 | 1 | n/a |
| Minor-cycle rounds / subcube | 0 | ~3 | n/a |
| Peak RSS | 58.6 GB | 44.1 GB | 25% lower |
| Peak MMap RSS | 52.4 GB | 29.2 GB | 44% lower |
| Peak CPU | 1095% | 1078% | ~same |
| Total I/O read | 6.71 TB | 19.0 TB | 2.8× |
| Total I/O write | 9.83 TB | 18.1 TB | 1.8× |
Why v2 is 5× slower
v1 was effectively PSF+residual+restore only (0 deconvolution iterations). v2 runs full deconvolution: ~3 minor-cycle rounds per subcube, each requiring a full Hogbom iteration loop plus mask evaluation. The per-subcube run time increased from ~375 s (v1) to ~2342 s (v2), a roughly 6.2× increase that is in line with the 6.3× growth in imaging wall time and is largely attributable to the deconvolution work (~1500 s of the mean run time).
Why peak RSS is lower in v2
v1’s 58.6 GB peak was measured with psrecord covering the full run. v2’s 44.1 GB peak was measured over only ~32 h of the ~65 h imaging phase; the true peak may be higher. The deconvolver itself adds negligible allocation — it reuses the already-allocated image stores — so the per-worker footprint is similar in both runs.
Concat improvement
With parallel extension concat (max_workers=4), v2 concat completed in
~3 h vs v1’s 3h 20m (serial, 7 extensions in sequence). The per-extension
times are higher in v2 (larger images after deconvolution produced model
content), but parallelism compensates.
7. Key Observations
7.1 Deconvolution now executes correctly
The v1 empty-mask bug is fixed. All 1000 channels underwent ~3 minor-cycle rounds (203,902 total Hogbom iterations). The auto-multithresh mask successfully identifies emission regions, and the divergence stopping criterion (peak residual > 3× minimum) provides a sensible halt.
7.2 Divergence after ~3 minor cycles suggests under-cleaning
All subcubes stop after ~3 minor-cycle rounds due to peak residual divergence (not threshold or niter convergence). This pattern, in which the residual first decreases and then increases, is characteristic of an auto-multithresh mask that is too aggressive or of minor cycles that clean too deeply before the residual is recomputed. Potential improvements:
- Tune cyclefactor to reduce the per-cycle cleaning depth. In CASA's tclean convention, cyclethreshold scales with cyclefactor, so a larger value makes each minor cycle shallower; confirm that pclean follows the same convention before changing it.
- Adjust negativethreshold (currently 7.0) to limit divergent behavior.
- Schedule a second major cycle to narrow the mask after initial cleaning.
7.3 Per-channel parallelism still avoids OOM
Peak RSS of 44.1 GB across 10 workers (~4.4 GB/worker) remains well within the 128.2 GB system RAM. The deconvolution adds negligible memory overhead since Hogbom operates on already-allocated image buffers in-place.
7.4 Parallel concat effective
The max_workers=4 parallel concat completed 7 extensions in ~3 h total
wall time, consistent with the longest single extension (.residual, 9117 s
≈ 2.5 h). This is comparable to v1’s serialized 3h 20m because
per-extension times are longer, but true wall-clock overlap provides a net
improvement.
7.5 psrecord monitoring ended early
The .rec file covers only ~32 h of the ~68 h run. For future tests,
ensure psrecord is configured with a sufficiently long duration or no
timeout to capture the full run including concat.
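One way to avoid an accidental time limit is to build the psrecord command explicitly and omit `--duration` unless a limit is really wanted. The sketch below only assembles the argument list; the helper name `psrecord_command`, the PID 12345, and the 10 s sampling interval are hypothetical, and whether the v2 recording stopped because of a duration limit is not established by this report.

```python
from typing import List, Optional, Union


def psrecord_command(
    target: Union[int, str],
    log_path: str,
    interval_s: float = 10.0,
    duration_s: Optional[float] = None,
) -> List[str]:
    """Assemble a psrecord command line (hypothetical helper).

    target is a PID or a command string. With duration_s=None no --duration
    flag is passed, so recording continues until the process exits.
    """
    cmd = [
        "psrecord", str(target),
        "--log", log_path,
        "--interval", str(interval_s),
        "--include-children",   # also account for the Dask worker processes
    ]
    if duration_s is not None:
        cmd += ["--duration", str(duration_s)]
    return cmd


# Example: attach to an already running driver process (PID is made up).
print(" ".join(psrecord_command(12345, "test_alma_pclean_2.rec")))
```

The resulting list can be passed to `subprocess.Popen` from the test script so that monitoring starts and ends with the run itself.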
8. Optimization Opportunities
8.1 Reduce per-subcube imaging time
Per-subcube run time (mean 2342 s) is dominated by deconvolution overhead. Options:
- Lower niter/threshold: The current threshold (4.0 mJy) is never reached before divergence. A lower niter limit (e.g., 200) would cap the per-channel work without changing the output quality, since divergence stops it anyway.
- Cycle factor tuning: Fewer minor-cycle iterations per major cycle would allow a second major cycle with an updated mask.
8.2 Increase cube_chunksize
With cube_chunksize=1, each subcube is a single channel. Increasing to
5–10 would reduce Dask scheduling overhead and I/O setup costs, though
it increases per-worker memory footprint proportionally.
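The trade-off can be sketched by partitioning the 1000 channels into chunks and scaling the observed per-worker footprint. This is a rough illustration under stated assumptions: the helper names are hypothetical, the ~4.4 GB per worker at chunk size 1 and the 128.2 GB of system RAM come from this report, and strictly proportional memory scaling is the assumption stated above rather than a measurement.

```python
from typing import List, Tuple


def chunk_channels(nchan: int, chunksize: int) -> List[Tuple[int, int]]:
    """Split channels [0, nchan) into half-open (start, stop) ranges."""
    return [(start, min(start + chunksize, nchan))
            for start in range(0, nchan, chunksize)]


def estimated_total_rss_gb(chunksize: int, workers: int = 10,
                           per_worker_gb_at_1: float = 4.4) -> float:
    """Total RSS if per-worker memory scales linearly with chunk size."""
    return workers * per_worker_gb_at_1 * chunksize


SYSTEM_RAM_GB = 128.2

for size in (1, 2, 5, 10):
    ntasks = len(chunk_channels(1000, size))
    rss = estimated_total_rss_gb(size)
    verdict = "fits" if rss <= SYSTEM_RAM_GB else "exceeds RAM"
    print(f"chunksize={size:2d}: {ntasks:4d} subcubes, ~{rss:.0f} GB total ({verdict})")
```

Under these assumptions, chunk sizes of 5 to 10 would only fit in memory with fewer than 10 workers, so the scheduling savings would need to be weighed against reduced parallelism.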
8.3 Virtual concat for intermediate runs
For development/iteration runs where subcubes are retained, concat_mode='virtual'
or 'movevirtual' would reduce concat time from ~3 h to <1 min.
9. Code References
- scripts/test_alma_pclean_v2.py: test script
- src/pclean/imaging/serial_imager.py: SerialImager.run() (deconvolution loop, mask fix)
- src/pclean/utils/image_concat.py: concat_images(), parallel extension concat
- src/pclean/parallel/cube_parallel.py: Dask subcube orchestration
- src/pclean/config.py: ClusterConfig.concat_mode, keep_subcubes