ALMA Cube Imaging — tclean Performance Report (v1)

1. Test Configuration

Parameter	Value
Dataset	ALMA `uid___A002_Xf0fd41_X5f5a_target.ms`, SPW 25, IRC+10216
Image size	8000 × 8000 px, 0.0046 arcsec/px
Spectral	specmode=cube, 1000 channels
Weighting	briggsbwtaper, robust=0.5, fracbw=0.000907945
Deconvolver	hogbom, niter=50000, threshold=4.0mJy, usemask=auto-multithresh
Parallelism	CASA `parallel=True` (MPI), 11 CASA worker ranks (`OMPI_COMM_WORLD_SIZE=11`); psrecord NProc peak 25 (includes MPI infrastructure processes)
CASA version	6.7.3-21
Log files	`test_alma_tclean_1.log` / `test_alma_tclean_1.rec`
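
For orientation, the configuration above might map onto `tclean` keyword arguments roughly as follows. This is a sketch, not the script used for the run: `imagename` is a hypothetical output name, `perchanweightdensity=True` is an assumption (the table does not list it), and `fracbw` is recorded only as a comment because the report does not say whether it was passed explicitly or taken from the log. The call itself is left commented out so the dictionary can be inspected without CASA installed.

```python
# Parameter set transcribed from the Test Configuration table.
# Entries marked ASSUMPTION are not stated in the report.
tclean_kwargs = {
    "vis": "uid___A002_Xf0fd41_X5f5a_target.ms",
    "imagename": "irc10216_spw25_cube",  # ASSUMPTION: hypothetical output name
    "spw": "25",
    "imsize": [8000, 8000],
    "cell": "0.0046arcsec",
    "specmode": "cube",
    "nchan": 1000,
    "weighting": "briggsbwtaper",
    "robust": 0.5,
    "perchanweightdensity": True,  # ASSUMPTION: not listed in the table
    "deconvolver": "hogbom",
    "niter": 50000,
    "threshold": "4.0mJy",
    "usemask": "auto-multithresh",
    "parallel": True,
}
# The report lists fracbw=0.000907945; it is not passed here because the
# report does not say how that value was supplied.

# Inside an MPI-enabled CASA session one would then run:
# from casatasks import tclean
# tclean(**tclean_kwargs)
```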

2. Result Summary

Outcome	Value
Completed	No — OOM-killed
psrecord window	2026-03-04T06:50:14 → 2026-03-04T17:20:01
Elapsed before kill (log)	~10h 24m (06:50:14 → 17:14:26)
Last phase reached	`initMinorCycle` (before any deconvolution)
Kill signal	SIGKILL on MPI rank 1 (signal 9)
Peak RSS	154.2 GB at 06:59:18 (early PSF setup)
Final sample RSS	107.8 GB at 17:20:01 (post-kill cleanup state)

The 50000-iteration deconvolution was never executed.

3. Phase Timings

Phase	Start	End	Duration
casahouse init / MPI startup	06:50:14	06:50:25	~11 s
`tclean()` call / setup	06:50:25	06:50:27	~2 s
`makePSF` (1000 ch, parallel)	06:50:27	10:58:02	4h 7m 35s
`executeMajorCycle 1`	10:58:02	17:14:26	6h 16m 24s
`initMinorCycle` / automask setup	17:14:26	17:14:27	< 1 s
SIGKILL (rank 1)	17:14:27	—	OOM-killed
Cleanup / lingering procs	17:14:27	17:20:01	~5 min
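
The durations in the table can be re-derived from the start and end timestamps. The snippet below is a minimal check using only the Python standard library; the `phases` list and the `duration` helper are illustrative names, and all timestamps are assumed to fall on the same calendar day (2026-03-04).

```python
from datetime import datetime

FMT = "%H:%M:%S"

# (phase, start, end) copied from the Phase Timings table.
phases = [
    ("makePSF", "06:50:27", "10:58:02"),
    ("executeMajorCycle 1", "10:58:02", "17:14:26"),
    ("run start to SIGKILL", "06:50:14", "17:14:27"),
]

def duration(start, end):
    """Return end - start as a timedelta (same calendar day assumed)."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

durations = {name: duration(s, e) for name, s, e in phases}
for name, d in durations.items():
    print(f"{name}: {d}")
```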

4. Resource Usage

Source: psrecord

4.1 Processes

Metric	Value
CASA worker ranks	11 (`OMPI_COMM_WORLD_SIZE=11`; CASA reports “Processes on node: 11”)
Total NProc (psrecord peak)	25 (includes prte/pmix daemons, mpirun, coordinator)

4.2 Memory

Metric	Value	Timestamp
Peak RSS	154.2 GB	2026-03-04T06:59:18 (early PSF setup)
Peak virtual	171.2 GB	2026-03-04T06:59:18
Peak MMap RSS	146.5 GB	2026-03-04T06:59:18
Swap used	0 MB	(no swap at any point)
RSS at last sample (17:20:01)	107.8 GB	post-kill cleanup
Virtual at last sample	126.8 GB	post-kill cleanup
MMap at last sample	81.9 GB	post-kill cleanup
System page cache (start)	122.1 GB	2026-03-04T06:51:39
System page cache (end)	7.4 GB	2026-03-04T17:20:01

Note on peak RSS: The previously reported value of 107.8 GB was the final psrecord sample at 17:20:01 — measured during post-kill cleanup, not during the run. The true peak of 154.2 GB occurred at 06:59:18 during the early PSF weight-grid allocation.
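
The distinction between the peak and the final sample is easy to get wrong when summarising a monitoring trace. A minimal sketch of the difference, assuming the `.rec` file has already been parsed into `(timestamp, rss_gb)` tuples (the parsing step and the `samples` structure are hypothetical; only the two values quoted in this report are included):

```python
# (timestamp, RSS in GB) samples in chronological order.
# Only the two samples quoted in this report are shown; a real trace
# would contain every psrecord sample in between.
samples = [
    ("2026-03-04T06:59:18", 154.2),
    ("2026-03-04T17:20:01", 107.8),
]

# Peak RSS: maximum over the whole trace.
peak_ts, peak_rss = max(samples, key=lambda s: s[1])

# Final RSS: last sample, which here falls in post-kill cleanup.
last_ts, last_rss = samples[-1]

print(f"peak  {peak_rss} GB at {peak_ts}")
print(f"final {last_rss} GB at {last_ts}")
```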

The page cache shrinking from 122.1 GB to 7.4 GB indicates severe memory pressure, consistent with all 1000 channel planes being held simultaneously across the 11 CASA worker ranks.

4.3 CPU

Phase	CPU % (aggregate)
Sustained imaging	~1000–1105% (11 CASA ranks; peak 1105.2%)

4.4 I/O

Metric	Value	Note
Total bytes read	2.33 TB	decimal (psrecord final)
Total bytes written	8.24 TB	decimal (psrecord final)
Output dir size	1.48 TB	decimal; 1484099 MB at 17:20:01
Write amplification	~5.6× output size	8.24 / 1.48

Write volume breakdown (estimated):

Each channel plane: 8000 × 8000 × 4 B = 256 MB per extension. 1000 channels × ~9 extensions ≈ 2.3 TB expected final output. Total writes of 8.24 TB are ≈ 3.6× the expected output, due to:

Weight maps (per-channel BriggsBWTaper weight grids)
Partial gridded visibility scratch data per MPI rank
Repeated residual/model writes across major cycle iterations
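
The write-volume estimate can be reproduced with a few lines of arithmetic. The sketch below uses decimal units throughout (1 MB = 10^6 B, 1 TB = 10^12 B), matching the psrecord totals; the extension count of 9 is the approximate figure from the estimate, and the variable names are illustrative.

```python
NX = NY = 8000
BYTES_PER_PIXEL = 4        # 4 B per pixel, as in the estimate
NCHAN = 1000
N_EXTENSIONS = 9           # approximate count used in the estimate
TOTAL_WRITTEN_TB = 8.24    # psrecord final
OUTPUT_DIR_TB = 1.48       # output directory size at 17:20:01

# Size of one channel plane of one image extension, in MB.
plane_mb = NX * NY * BYTES_PER_PIXEL / 1e6

# Expected size of the complete output, in TB.
expected_tb = plane_mb * NCHAN * N_EXTENSIONS / 1e6

# Bytes written relative to the expected and to the actual (partial) output.
ratio_vs_expected = TOTAL_WRITTEN_TB / expected_tb
ratio_vs_actual = TOTAL_WRITTEN_TB / OUTPUT_DIR_TB

print(f"plane: {plane_mb:.0f} MB, expected output: {expected_tb:.2f} TB")
print(f"writes / expected: {ratio_vs_expected:.2f}x")
print(f"writes / actual:   {ratio_vs_actual:.2f}x")
```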

5. Key Observations

5.1 All 1000 channel planes live in memory simultaneously

All 11 CASA worker ranks grid 1000 channels collectively (330 subcubes per pass, as logged: “Subcubes: 330. Processes on node: 11”), requiring the full briggsbwtaper weight grid before any PSF plane is finalized. This drives RSS to 154.2 GB at 06:59:18 — just 9 minutes after makePSF started.

5.2 OOM at minor cycle entry

After makePSF (4h 7m) and major cycle 1 (6h 16m), CASA reaches SynthesisDeconvolver::initMinorCycle at 17:14:26. The log shows:

2026-03-04 17:14:26  initMinorCycle  Absolute Peak residual over full image: 0.134128
2026-03-04 17:14:26  setupMask  Setting up an auto-mask
signal 9 (Killed).

The kernel OOM-kills rank 1 during automask setup, before any CLEAN component is subtracted. The page cache had collapsed from 122 GB → 7 GB (−115 GB) by this point.
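
Finding the last stage reached before the kill can be automated when triaging long logs. The following is a minimal sketch that assumes the abbreviated `timestamp  origin  message` layout of the excerpt above; real CASA log lines carry additional columns, so the regular expression and the helper name `last_stage_before_kill` are illustrative rather than a general-purpose parser.

```python
import re

# Matches "YYYY-MM-DD HH:MM:SS  <origin>  <message>" as in the excerpt.
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")

def last_stage_before_kill(lines):
    """Return (timestamp, origin) of the last log line before 'signal 9'.

    Returns None if no kill marker is found.
    """
    last = None
    for line in lines:
        if "signal 9" in line:
            return last
        m = LOG_LINE.match(line)
        if m:
            last = (m.group(1), m.group(2))
    return None

excerpt = [
    "2026-03-04 17:14:26  initMinorCycle  Absolute Peak residual over full image: 0.134128",
    "2026-03-04 17:14:26  setupMask  Setting up an auto-mask",
    "signal 9 (Killed).",
]
stage = last_stage_before_kill(excerpt)
print(stage)
```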

5.3 I/O write amplification (8.24 TB for 1.48 TB output)

Between weight scratch, per-rank partial grids, and per-iteration residual writes, tclean writes ~5.6× the size of the partial output (8.24 TB written for 1.48 TB on disk). This write volume places a heavy load on the NVMe subsystem and likely contributes to the 6h 16m major cycle time.

5.4 No concat phase

tclean produces a single monolithic CASA image directly, with no subcube concatenation step. However, this comes at the cost of keeping all channel data live in memory simultaneously.

6. Bottleneck Analysis

The fundamental bottleneck is memory: tclean’s MPI parallel model holds all 1000 channel planes across all ranks simultaneously, and the minor cycle requires the full residual + PSF + automask structure in memory at once.

No tuning of MPI rank count or I/O can work around this — the monolithic cube model requires $O(\text{nchan})$ memory, which exceeds the system’s 128 GB for this cube size.
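
To put the $O(\text{nchan})$ scaling in numbers, the sketch below computes the size of one full-cube image product at 4 B per pixel and the number of channel planes of a single product that would fit in 128 GB. It is a back-of-the-envelope estimate only: it assumes decimal gigabytes and a fully resident product, and it ignores the chunking and memory-mapping behaviour of CASA, which this report does not describe. The function and variable names are illustrative.

```python
def cube_bytes(nx, ny, nchan, bytes_per_pixel=4):
    """Size in bytes of one image product with nchan channel planes."""
    return nx * ny * nchan * bytes_per_pixel

SYSTEM_RAM_BYTES = 128 * 10**9   # ASSUMPTION: 128 GB taken as decimal GB

# One full-cube product (e.g. a residual or PSF cube) for this run.
one_cube_gb = cube_bytes(8000, 8000, 1000) / 1e9

# Channel planes of a single product that fit in RAM, ignoring all else.
max_resident_channels = SYSTEM_RAM_BYTES // cube_bytes(8000, 8000, 1)

print(f"one full cube product: {one_cube_gb:.0f} GB")
print(f"planes of one product that fit in RAM: {max_resident_channels}")
```

Under these assumptions a single fully resident product already exceeds system memory, before counting the additional products a minor cycle needs.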

7. Comparison with pclean

Metric	tclean (MPI, 11 CASA ranks)	pclean (10 Dask workers)
niter	50000 (killed before minor)	0 (makePSF only)
Parallelism	11 CASA MPI ranks (NProc=25 total)	10 Dask workers (NProc=13 total)
makePSF wall time	4h 7m 35s (06:50:27→10:58:02)	≤ 10h 25m total (PSF+save)
Major cycle 1	6h 16m 24s (10:58:02→17:14:26, partial)	N/A
Concat overhead	N/A (monolithic image)	3h 20m (7 extensions, serial)
Peak RSS	154.2 GB (06:59:18)	58.6 GB
Peak virtual	171.2 GB	81.6 GB
Peak MMap RSS	146.5 GB	52.4 GB
Swap	0 MB	~5 MB
Page cache depleted	122 → 7.4 GB	122 → 15.4 GB
Peak CPU	~1105%	~1095%
Total I/O reads	2.33 TB	6.71 TB
Total I/O writes	8.24 TB (incomplete)	9.83 TB (complete)
Final output size	1.48 TB (partial, killed)	1.42 TB (7 extensions, complete)
OOM killed	Yes	No
Run completed	No	Yes (13h 47m)

pclean uses ~62% less peak memory (58.6 GB vs 154.2 GB) because each worker loads only one channel at a time, while tclean holds all 1000 planes collectively.
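
The relative memory savings follow directly from the table values. A short sketch (the dictionary layout and the name `reduction_pct` are illustrative) computes the percentage reduction for each memory metric:

```python
# (tclean, pclean) peak values in GB, from the comparison table.
peaks_gb = {
    "RSS": (154.2, 58.6),
    "virtual": (171.2, 81.6),
    "MMap RSS": (146.5, 52.4),
}

# Percentage by which pclean's peak is lower than tclean's.
reduction_pct = {
    metric: 100 * (1 - pclean / tclean)
    for metric, (tclean, pclean) in peaks_gb.items()
}

for metric, pct in reduction_pct.items():
    print(f"{metric}: {pct:.1f}% lower with pclean")
```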

Note: The pclean run was configured with niter=50000, but converged after 0 minor-cycle iterations (effectively makePSF-only / niter=0 in practice). A strict apples-to-apples comparison would require a pclean run that reaches the same minor-cycle depth as the tclean configuration.

8. Conclusions

For a 1000-channel, 8000×8000 cube with briggsbwtaper weighting:

tclean with parallel=True is OOM-killed before completing even one minor cycle
pclean completes PSF across 1000 channels in ~10h without OOM
pclean’s per-channel parallelism trades concat overhead for a dramatically lower memory footprint; the concat bottleneck has since been addressed (see alma_pclean_perf_v1.md § 6)