ALMA Cube Imaging: tclean Performance Report (v1)
1. Test Configuration

| Parameter | Value |
|---|---|
| Dataset | ALMA |
| Image size | 8000 × 8000 px, 0.0046 arcsec/px |
| Spectral | specmode=cube, 1000 channels |
| Weighting | briggsbwtaper, robust=0.5, fracbw=0.000907945 |
| Deconvolver | hogbom, niter=50000, threshold=4.0mJy, usemask=auto-multithresh |
| Parallelism | CASA |
| CASA version | 6.7.3-21 |
| Log files | |
2. Result Summary

| Outcome | Value |
|---|---|
| Completed | No (OOM-killed) |
| psrecord window | 2026-03-04T06:50:14 → 2026-03-04T17:20:01 |
| Elapsed before kill (log) | ~10h 24m (06:50:14 → 17:14:26) |
| Last phase reached | |
| Kill signal | SIGKILL on MPI rank 1 (signal 9) |
| Peak RSS | 154.2 GB at 06:59:18 (early PSF setup) |
| Final sample RSS | 107.8 GB at 17:20:01 (post-kill cleanup state) |

The 50000-iteration deconvolution was never executed.
3. Phase Timings

| Phase | Start | End | Duration |
|---|---|---|---|
| casahouse init / MPI startup | 06:50:14 | 06:50:25 | ~11 s |
| | 06:50:25 | 06:50:27 | ~2 s |
| makePSF | 06:50:27 | 10:58:02 | 4h 7m 35s |
| Major cycle 1 | 10:58:02 | 17:14:26 | 6h 16m 24s |
| | 17:14:26 | 17:14:27 | < 1 s |
| SIGKILL (rank 1) | 17:14:27 | n/a | OOM-killed |
| Cleanup / lingering procs | 17:14:27 | 17:20:01 | ~5.5 min |
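
The durations in the table can be re-derived from the start and end stamps. The snippet below is a minimal sketch of that arithmetic; the `elapsed` helper and the boundary labels are illustrative names, not identifiers taken from the CASA log, and it assumes all stamps fall on the same day (2026-03-04).

```python
from datetime import datetime

# Phase boundaries copied from the timing table (all on 2026-03-04).
# Labels are descriptive placeholders, not names taken from the CASA log.
BOUNDARIES = [
    ("run start", "06:50:14"),
    ("makePSF start", "06:50:27"),
    ("major cycle 1 start", "10:58:02"),
    ("minor cycle entry", "17:14:26"),
    ("last psrecord sample", "17:20:01"),
]


def elapsed(start: str, end: str) -> str:
    """Return the wall time between two same-day HH:MM:SS stamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours, rest = divmod(int(delta.total_seconds()), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes}m {seconds}s"


# Print the wall time of each consecutive interval.
for (name_a, t_a), (name_b, t_b) in zip(BOUNDARIES, BOUNDARIES[1:]):
    print(f"{name_a} -> {name_b}: {elapsed(t_a, t_b)}")
```

Differencing adjacent boundaries, rather than hard-coding each duration, keeps the table self-consistent if a stamp is later corrected.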
4. Resource Usage
Source: psrecord

4.1 Processes

| Metric | Value |
|---|---|
| CASA worker ranks | 11 ( |
| Total NProc (psrecord peak) | 25 (includes prte/pmix daemons, mpirun, coordinator) |
4.2 Memory

| Metric | Value | Timestamp / note |
|---|---|---|
| Peak RSS | 154.2 GB | 2026-03-04T06:59:18 (early PSF setup) |
| Peak virtual | 171.2 GB | 2026-03-04T06:59:18 |
| Peak MMap RSS | 146.5 GB | 2026-03-04T06:59:18 |
| Swap used | 0 MB | (no swap at any point) |
| RSS at last sample (17:20:01) | 107.8 GB | post-kill cleanup |
| Virtual at last sample | 126.8 GB | post-kill cleanup |
| MMap at last sample | 81.9 GB | post-kill cleanup |
| System page cache (start) | 122.1 GB | 2026-03-04T06:51:39 |
| System page cache (end) | 7.4 GB | 2026-03-04T17:20:01 |

Note on peak RSS: The previously reported value of 107.8 GB was the final psrecord sample at 17:20:01 — measured during post-kill cleanup, not during the run. The true peak of 154.2 GB occurred at 06:59:18 during the early PSF weight-grid allocation.
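
The difference between "last sample" and "peak" is easy to reproduce. The sketch below is illustrative only: it uses just the two RSS samples quoted in this report, whereas a real psrecord log would contain many more rows, and the `samples` list layout is an assumption rather than psrecord's actual file format.

```python
# (timestamp, rss_gb) pairs. Only these two samples are quoted in the report;
# the list layout is a hypothetical stand-in for a parsed psrecord log.
samples = [
    ("2026-03-04T06:59:18", 154.2),
    ("2026-03-04T17:20:01", 107.8),
]

# The peak is the maximum over all samples, not the final row.
peak_time, peak_rss = max(samples, key=lambda s: s[1])
last_time, last_rss = samples[-1]

print(f"peak RSS {peak_rss} GB at {peak_time}")
print(f"last RSS {last_rss} GB at {last_time}")
```

Reporting the maximum over the whole series, rather than the final row, avoids quoting a value measured after the process was already killed.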
The page cache collapsing from 122 GB → 7 GB is consistent with severe memory pressure from holding all 1000 channel planes simultaneously across 11 CASA worker ranks.
4.3 CPU

| Phase | CPU % (aggregate) |
|---|---|
| Sustained imaging | ~1000–1105% (11 CASA ranks; peak 1105.2%) |
4.4 I/O

| Metric | Value | Note |
|---|---|---|
| Total bytes read | 2.33 TB | decimal (psrecord final) |
| Total bytes written | 8.24 TB | decimal (psrecord final) |
| Output dir size | 1.48 TB | decimal; 1484099 MB at 17:20:01 |
| Write amplification | ~5.6× output size | 8.24 / 1.48 |

Write volume breakdown (estimated):
Each channel plane is 8000 × 8000 × 4 B = 256 MB per extension, so 1000 channels × ~9 extensions gives ~2.3 TB of expected final output. Total writes of 8.24 TB are ≈ 3.6× that expected output, due to:
- Weight maps (per-channel BriggsBWTaper weight grids)
- Partial gridded visibility scratch data per MPI rank
- Repeated residual/model writes across major cycle iterations
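
The estimate above is plain arithmetic and can be checked with a few lines. This is a sketch under the report's own assumptions: 4-byte pixels, roughly 9 image extensions, and decimal units (1 TB = 10^12 B). The variable names are illustrative.

```python
NX = NY = 8000          # image size in pixels
BYTES_PER_PIXEL = 4     # 4 B per pixel, as stated in the estimate
NCHAN = 1000            # spectral channels
N_EXTENSIONS = 9        # approximate count of image extensions (assumed)

plane_bytes = NX * NY * BYTES_PER_PIXEL                  # one channel plane
expected_tb = plane_bytes * NCHAN * N_EXTENSIONS / 1e12  # full expected output

WRITTEN_TB = 8.24   # total bytes written (psrecord final)
OUTPUT_TB = 1.48    # output directory size when the run was killed

ratio_vs_expected = WRITTEN_TB / expected_tb  # writes vs. complete output
ratio_vs_actual = WRITTEN_TB / OUTPUT_TB      # writes vs. partial output

print(f"plane: {plane_bytes / 1e6:.0f} MB")
print(f"expected output: {expected_tb:.2f} TB")
print(f"writes / expected: {ratio_vs_expected:.1f}x")
print(f"writes / actual:   {ratio_vs_actual:.1f}x")
```

The two ratios differ because the run was killed with only 1.48 TB of the estimated ~2.3 TB on disk.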
5. Key Observations

5.1 All 1000 channel planes live in memory simultaneously
All 11 CASA worker ranks grid 1000 channels collectively (330 subcubes per pass, as logged: “Subcubes: 330. Processes on node: 11”), requiring the full briggsbwtaper weight grid before any PSF plane is finalized. This drives RSS to 154.2 GB at 06:59:18, just 9 minutes after makePSF started.

5.2 OOM at minor cycle entry
After makePSF (4h 7m) and major cycle 1 (6h 16m), CASA reaches SynthesisDeconvolver::initMinorCycle at 17:14:26. The log shows:
2026-03-04 17:14:26 initMinorCycle Absolute Peak residual over full image: 0.134128
2026-03-04 17:14:26 setupMask Setting up an auto-mask
signal 9 (Killed).
The kernel OOM-kills rank 1 during automask setup, before any CLEAN component is subtracted. The page cache had collapsed from 122 GB → 7 GB (−115 GB) by this point.
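
Locating the last logged step before a kill can be automated. The sketch below runs on the three-line excerpt quoted above; the regular expression assumes the abbreviated `date time origin message` layout shown in this report, which may differ from the full CASA log format, and `last_step_before_kill` is a hypothetical helper name.

```python
import re

# Excerpt copied from this report; a real CASA log has more fields per line.
LOG_EXCERPT = """\
2026-03-04 17:14:26 initMinorCycle Absolute Peak residual over full image: 0.134128
2026-03-04 17:14:26 setupMask Setting up an auto-mask
signal 9 (Killed).
"""

# Assumed layout: "YYYY-MM-DD HH:MM:SS <origin> <message>".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def last_step_before_kill(text):
    """Return (timestamp, origin) of the last log line before 'signal 9'."""
    last = None
    for line in text.splitlines():
        if "signal 9" in line:
            return last
        match = LINE_RE.match(line)
        if match:
            last = (match.group(1), match.group(2))
    return None  # no kill marker found


print(last_step_before_kill(LOG_EXCERPT))
```

On this excerpt the helper returns the `setupMask` line, matching the statement that the kill happened during automask setup.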

5.3 I/O write amplification (8.24 TB for 1.48 TB output)
Between weight scratch, per-rank partial grids, and per-iteration residual writes, tclean writes ~5.6× the size of the partial output. This saturates the NVMe subsystem and contributes to the 6h 16m major cycle time.

5.4 No concat phase
tclean produces a single monolithic CASA image directly, with no subcube concatenation step. However, this comes at the cost of requiring all channel data to be live in memory simultaneously.

6. Bottleneck Analysis
The fundamental bottleneck is memory: tclean’s MPI parallel model holds all 1000 channel planes across all ranks simultaneously, and the minor cycle requires the full residual + PSF + automask structure in memory at once.
No tuning of MPI rank count or I/O can work around this — the monolithic cube model requires $O(\text{nchan})$ memory, which exceeds the system’s 128 GB for this cube size.
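
To make the scaling concrete, the sketch below computes the size of a single full-cube image extension and compares it with the 128 GB of system memory. It assumes 4-byte pixels and decimal gigabytes; `cube_bytes` is an illustrative helper, and the calculation ignores any per-rank overhead or sharing, so it is a rough bound rather than a model of tclean's actual allocation.

```python
def cube_bytes(nx, ny, nchan, bytes_per_pixel=4):
    """Size in bytes of one image cube, assuming a fixed pixel width."""
    return nx * ny * nchan * bytes_per_pixel


SYSTEM_RAM_BYTES = 128 * 10**9  # 128 GB, decimal units assumed

# One 8000 x 8000 x 1000 extension at 4 B per pixel.
one_extension = cube_bytes(8000, 8000, 1000)

# Largest channel count for which a single extension still fits in RAM.
max_channels_in_ram = SYSTEM_RAM_BYTES // cube_bytes(8000, 8000, 1)

print(f"one extension: {one_extension / 1e9:.0f} GB")
print(f"channels of one extension that fit in RAM: {max_channels_in_ram}")
```

Under these assumptions a single extension already needs twice the available memory, before counting the residual, PSF, and mask together.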

7. Comparison with pclean

| Metric | tclean (MPI, 11 CASA ranks) | pclean (10 Dask workers) |
|---|---|---|
| niter | 50000 (killed before minor) | 0 (makePSF only) |
| Parallelism | 11 CASA MPI ranks (NProc=25 total) | 10 Dask workers (NProc=13 total) |
| makePSF wall time | 4h 7m 35s (06:50:27→10:58:02) | ≤ 10h 25m total (PSF+save) |
| Major cycle 1 | 6h 16m 24s (10:58:02→17:14:26, partial) | N/A |
| Concat overhead | N/A (monolithic image) | 3h 20m (7 extensions, serial) |
| Peak RSS | 154.2 GB (06:59:18) | 58.6 GB |
| Peak virtual | 171.2 GB | 81.6 GB |
| Peak MMap RSS | 146.5 GB | 52.4 GB |
| Swap | 0 MB | ~5 MB |
| Page cache depleted | 122 → 7.4 GB | 122 → 15.4 GB |
| Peak CPU | ~1105% | ~1095% |
| Total I/O reads | 2.33 TB | 6.71 TB |
| Total I/O writes | 8.24 TB (incomplete) | 9.83 TB (complete) |
| Final output size | 1.48 TB (partial, killed) | 1.42 TB (7 extensions, complete) |
| OOM killed | Yes | No |
| Run completed | No | Yes (13h 47m) |

pclean uses ~62% less peak memory (58.6 GB vs 154.2 GB) because each worker loads only one channel at a time, while tclean holds all 1000 planes collectively.
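
The relative savings can be computed for each memory metric in the comparison table. This is a small sketch using only the peak values listed there; `reduction_pct` is an illustrative helper name.

```python
def reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (1.0 - value / baseline)


# (tclean peak, pclean peak) in GB, copied from the comparison table.
peaks_gb = {
    "RSS": (154.2, 58.6),
    "virtual": (171.2, 81.6),
    "MMap RSS": (146.5, 52.4),
}

for metric, (tclean_gb, pclean_gb) in peaks_gb.items():
    print(f"{metric}: {reduction_pct(tclean_gb, pclean_gb):.0f}% lower with pclean")
```

Because the two runs did not reach the same minor-cycle depth, these percentages describe the measured peaks only, not a like-for-like workload.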
Note: The pclean run was configured with niter=50000, but converged after 0 minor-cycle iterations (effectively makePSF-only / niter=0 in practice). A strict apples-to-apples comparison would require a pclean run that reaches the same minor-cycle depth as the tclean configuration.

8. Conclusions
For a 1000-channel, 8000×8000 cube with briggsbwtaper weighting:
- tclean with parallel=True is OOM-killed before completing even one minor cycle
- pclean completes PSF across 1000 channels in ~10h without OOM
- pclean’s per-channel parallelism trades concat overhead for a dramatically lower memory footprint; the concat bottleneck has since been addressed (see alma_pclean_perf_v1.md § 6)