# ALMA Cube Imaging — tclean Performance Report (v1)

## 1. Test Configuration

| Parameter | Value |
|---|---|
| Dataset | ALMA `uid___A002_Xf0fd41_X5f5a_target.ms`, SPW 25, IRC+10216 |
| Image size | 8000 × 8000 px, 0.0046 arcsec/px |
| Spectral | specmode=cube, 1000 channels |
| Weighting | briggsbwtaper, robust=0.5, fracbw=0.000907945 |
| Deconvolver | hogbom, niter=50000, threshold=4.0mJy, usemask=auto-multithresh |
| Parallelism | CASA `parallel=True` (MPI), 11 CASA worker ranks (`OMPI_COMM_WORLD_SIZE=11`); psrecord NProc peak 25 (includes MPI infrastructure processes) |
| CASA version | 6.7.3-21 |
| Log files | `test_alma_tclean_1.log` / `test_alma_tclean_1.rec` |
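
For reference, the configuration above maps onto `tclean` arguments roughly as sketched below. This is a reconstruction from the table, not the script that was actually run. The `imagename` is hypothetical, the string forms of `spw`, `cell`, and `threshold` are assumed, and `perchanweightdensity=True` is included on the assumption that `briggsbwtaper` weighting needs it. `fracbw` is omitted because it is assumed to be a value CASA derives and logs rather than a user argument.

```python
def build_tclean_kwargs():
    """Return tclean keyword arguments reconstructed from the configuration table."""
    return {
        "vis": "uid___A002_Xf0fd41_X5f5a_target.ms",
        "imagename": "irc10216_spw25_cube",  # hypothetical output name
        "spw": "25",                          # assumed string form of SPW 25
        "imsize": [8000, 8000],
        "cell": "0.0046arcsec",               # assumed string form of 0.0046 arcsec/px
        "specmode": "cube",
        "nchan": 1000,
        "weighting": "briggsbwtaper",
        "robust": 0.5,
        "perchanweightdensity": True,         # assumption, not stated in the table
        "deconvolver": "hogbom",
        "niter": 50000,
        "threshold": "4.0mJy",
        "usemask": "auto-multithresh",
        "parallel": True,                     # only effective under an MPI launch
    }


tclean_kwargs = build_tclean_kwargs()
for key, value in tclean_kwargs.items():
    print(f"{key:22s} = {value!r}")
```

Inside a CASA session started under MPI, the dictionary would be passed as `tclean(**build_tclean_kwargs())`.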

## 2. Result Summary

| Outcome | Value |
|---|---|
| **Completed** | **No — OOM-killed** |
| psrecord window | 2026-03-04T06:50:14 → 2026-03-04T17:20:01 |
| Elapsed before kill (log) | ~10h 24m (06:50:14 → 17:14:26) |
| Last phase reached | `initMinorCycle` (before any deconvolution) |
| Kill signal | SIGKILL on MPI rank 1 (signal 9) |
| **Peak RSS** | **154.2 GB** at 06:59:18 (early PSF setup) |
| Final sample RSS | 107.8 GB at 17:20:01 (post-kill cleanup state) |

The 50000-iteration deconvolution was **never executed**.

## 3. Phase Timings

| Phase | Start | End | Duration |
|---|---|---|---|
| casahouse init / MPI startup | 06:50:14 | 06:50:25 | ~11 s |
| `tclean()` call / setup | 06:50:25 | 06:50:27 | ~2 s |
| `makePSF` (1000 ch, parallel) | 06:50:27 | 10:58:02 | **4h 7m 35s** |
| `executeMajorCycle 1` | 10:58:02 | 17:14:26 | **6h 16m 24s** |
| `initMinorCycle` / automask setup | 17:14:26 | 17:14:27 | < 1 s |
| **SIGKILL (rank 1)** | 17:14:27 | — | **OOM-killed** |
| Cleanup / lingering procs | 17:14:27 | 17:20:01 | ~5 min |
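
The durations in the table can be rechecked from the start and end timestamps. A minimal sketch, assuming all timestamps fall on the same day (2026-03-04); the phase labels and helper names are illustrative:

```python
from datetime import datetime

# Phase boundaries copied from the table above.
PHASES = [
    ("init / MPI startup", "06:50:14", "06:50:25"),
    ("tclean() setup", "06:50:25", "06:50:27"),
    ("makePSF", "06:50:27", "10:58:02"),
    ("executeMajorCycle 1", "10:58:02", "17:14:26"),
    ("cleanup", "17:14:27", "17:20:01"),
]


def phase_seconds(start, end):
    """Return the elapsed seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def format_hms(seconds):
    """Format a number of seconds as 'Hh Mm Ss'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


durations = {name: phase_seconds(start, end) for name, start, end in PHASES}
for name, secs in durations.items():
    print(f"{name:22s} {format_hms(secs)}")
```

Summing the `makePSF` and `executeMajorCycle 1` entries gives 10h 23m 59s, which accounts for nearly all of the ~10h 24m logged before the kill.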

## 4. Resource Usage

*Source: psrecord*

### 4.1 Processes

| Metric | Value |
|---|---|
| CASA worker ranks | 11 (`OMPI_COMM_WORLD_SIZE=11`; CASA reports "Processes on node: 11") |
| Total NProc (psrecord peak) | 25 (includes prte/pmix daemons, mpirun, coordinator) |

### 4.2 Memory

| Metric | Value | Timestamp |
|---|---|---|
| **Peak RSS** | **154.2 GB** | 2026-03-04T06:59:18 (early PSF setup) |
| **Peak virtual** | **171.2 GB** | 2026-03-04T06:59:18 |
| **Peak MMap RSS** | **146.5 GB** | 2026-03-04T06:59:18 |
| Swap used | **0 MB** | (no swap at any point) |
| RSS at last sample (17:20:01) | 107.8 GB | post-kill cleanup |
| Virtual at last sample | 126.8 GB | post-kill cleanup |
| MMap at last sample | 81.9 GB | post-kill cleanup |
| System page cache (start) | 122.1 GB | 2026-03-04T06:51:39 |
| System page cache (end) | **7.4 GB** | 2026-03-04T17:20:01 |

> **Note on peak RSS**: The previously reported value of 107.8 GB was the
> *final* psrecord sample at 17:20:01 — measured during post-kill cleanup,
> not during the run.  The true peak of **154.2 GB** occurred at 06:59:18
> during the early PSF weight-grid allocation.

The page cache shrinking from 122.1 GB to 7.4 GB is consistent with severe
memory pressure from holding all 1000 channel planes simultaneously across
11 CASA worker ranks.
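
The distinction drawn in the note on peak RSS above, peak versus final sample, is easy to get wrong when summarizing a monitoring log. A minimal sketch, assuming the samples have already been parsed into `(timestamp, rss_gb)` pairs; parsing of the `.rec` file is not shown, and only the two samples quoted in this report are used:

```python
# The two RSS samples quoted in this report; a real log would hold many more.
samples = [
    ("2026-03-04T06:59:18", 154.2),
    ("2026-03-04T17:20:01", 107.8),
]


def peak_sample(samples):
    """Return the (timestamp, rss_gb) pair with the largest RSS."""
    return max(samples, key=lambda s: s[1])


def final_sample(samples):
    """Return the latest sample (ISO 8601 timestamps sort lexicographically)."""
    return max(samples, key=lambda s: s[0])


peak_ts, peak_rss = peak_sample(samples)
last_ts, last_rss = final_sample(samples)
print(f"peak RSS  {peak_rss} GB at {peak_ts}")
print(f"final RSS {last_rss} GB at {last_ts}")
```

Reporting `peak_sample` rather than `final_sample` avoids quoting a value taken during post-kill cleanup.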

### 4.3 CPU

| Phase | CPU % (aggregate) |
|---|---|
| Sustained imaging | ~1000–1105% (11 CASA ranks; peak 1105.2%) |

### 4.4 I/O

| Metric | Value | Note |
|---|---|---|
| Total bytes read | **2.33 TB** | decimal (psrecord final) |
| Total bytes written | **8.24 TB** | decimal (psrecord final) |
| Output dir size | **1.48 TB** | decimal; 1484099 MB at 17:20:01 |
| Write amplification | **~5.6×** output size | 8.24 / 1.48 |

**Write volume breakdown (estimated):**

Each channel plane: 8000 × 8000 px × 4 B = 256 MB per extension.
1000 channels × ~9 extensions × 256 MB ≈ 2.3 TB expected final output.
Total writes of 8.24 TB are ≈ 3.6× the expected output (8.24 / 2.3), due to:

- Weight maps (per-channel BriggsBWTaper weight grids)
- Partial gridded visibility scratch data per MPI rank
- Repeated residual/model writes across major cycle iterations
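
The estimate above can be reproduced with a few lines of arithmetic. A minimal sketch, assuming 4-byte (float32) pixels and the approximate extension count of 9 quoted above; the variable names are illustrative:

```python
NX = NY = 8000           # image size in pixels
BYTES_PER_PIXEL = 4      # float32, as assumed in the estimate above
NCHAN = 1000
N_EXTENSIONS = 9         # approximate count used in the estimate above

TOTAL_WRITTEN_TB = 8.24  # psrecord final, decimal TB
OUTPUT_DIR_TB = 1.48     # psrecord final, partial output

# Size of one channel plane of one extension, in decimal MB.
plane_mb = NX * NY * BYTES_PER_PIXEL / 1e6

# Expected size of the complete output, in decimal TB.
expected_output_tb = plane_mb * NCHAN * N_EXTENSIONS / 1e6

# Bytes written relative to the expected and to the actual (partial) output.
ratio_vs_expected = TOTAL_WRITTEN_TB / expected_output_tb
ratio_vs_actual = TOTAL_WRITTEN_TB / OUTPUT_DIR_TB

print(f"plane size            {plane_mb:.0f} MB")
print(f"expected output       {expected_output_tb:.2f} TB")
print(f"writes / expected     {ratio_vs_expected:.1f}x")
print(f"writes / partial dir  {ratio_vs_actual:.1f}x")
```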

## 5. Key Observations

### 5.1 All 1000 channel planes live in memory simultaneously

All 11 CASA worker ranks grid 1000 channels collectively (330 subcubes per
pass, as logged: *"Subcubes: 330. Processes on node: 11"*), requiring the
full `briggsbwtaper` weight grid before any PSF plane is finalized.  This
drives RSS to **154.2 GB** at 06:59:18 — just 9 minutes after `makePSF`
started.

### 5.2 OOM at minor cycle entry

After makePSF (4h 7m) and major cycle 1 (6h 16m), CASA reaches
`SynthesisDeconvolver::initMinorCycle` at 17:14:26.  The log shows:

```
2026-03-04 17:14:26  initMinorCycle  Absolute Peak residual over full image: 0.134128
2026-03-04 17:14:26  setupMask  Setting up an auto-mask
signal 9 (Killed).
```

The kernel OOM-kills rank 1 during automask setup, before any CLEAN
component is subtracted.  By the final psrecord sample at 17:20:01 the page
cache had fallen from 122.1 GB to 7.4 GB (−114.7 GB).
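
Locating the last step reached before the kill can be automated when many logs need checking. A minimal sketch, assuming the kill message always has the form shown in the excerpt above; the function name and pattern are illustrative, and the sketch only finds the message, it does not establish why the process was killed:

```python
import re

# Log excerpt quoted above.
LOG_LINES = [
    "2026-03-04 17:14:26  initMinorCycle  Absolute Peak residual over full image: 0.134128",
    "2026-03-04 17:14:26  setupMask  Setting up an auto-mask",
    "signal 9 (Killed).",
]

# Assumed format of the kill message, based only on the excerpt above.
KILL_PATTERN = re.compile(r"signal\s+(\d+)\s+\(Killed\)")


def last_step_before_kill(lines):
    """Return (signal_number, last non-empty line before the kill), or None."""
    previous = None
    for line in lines:
        match = KILL_PATTERN.search(line)
        if match:
            return int(match.group(1)), previous
        if line.strip():
            previous = line
    return None


result = last_step_before_kill(LOG_LINES)
if result is not None:
    signal_number, last_line = result
    print(f"killed by signal {signal_number} after: {last_line}")
```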

### 5.3 I/O write amplification (8.24 TB for 1.48 TB output)

Between weight scratch, per-rank partial grids, and per-iteration residual
writes, tclean writes **~5.6×** the size of the partial output.  This write
load appears to saturate the NVMe subsystem and likely contributes to the
6h 16m major cycle time.
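
The I/O totals can also be expressed as a run-averaged throughput. A minimal sketch, assuming the psrecord totals cover the whole psrecord window given in § 2; these are averages only and say nothing about burst rates within individual phases:

```python
from datetime import datetime

# psrecord window and final I/O totals from this report (decimal units).
WINDOW_START = datetime.fromisoformat("2026-03-04T06:50:14")
WINDOW_END = datetime.fromisoformat("2026-03-04T17:20:01")
BYTES_WRITTEN = 8.24e12
BYTES_READ = 2.33e12

# Length of the monitoring window in seconds.
window_s = (WINDOW_END - WINDOW_START).total_seconds()

# Average throughput over the window, in decimal MB/s.
avg_write_mb_s = BYTES_WRITTEN / window_s / 1e6
avg_read_mb_s = BYTES_READ / window_s / 1e6

print(f"window      {window_s:.0f} s")
print(f"avg write   {avg_write_mb_s:.0f} MB/s")
print(f"avg read    {avg_read_mb_s:.0f} MB/s")
```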

### 5.4 No concat phase

tclean produces a single monolithic CASA image directly, with no subcube
concatenation step.  The cost is that all channel data must be live in
memory simultaneously.

## 6. Bottleneck Analysis

The fundamental bottleneck is **memory**: tclean's MPI parallel model holds
all 1000 channel planes across all ranks simultaneously, and the minor cycle
requires the full residual + PSF + automask structure in memory at once.

Tuning the MPI rank count or I/O does not work around this: the monolithic
cube model needs memory that scales as $O(\text{nchan})$, and for this cube
size that exceeds the system's 128 GB of RAM.
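
The linear scaling with channel count can be made concrete with a rough upper bound. A minimal sketch, assuming float32 pixels (as in § 4.4) and treating the residual, PSF, and automask as three same-sized cubes; this is a bound for fully resident cubes, not a model of CASA's actual allocation pattern, and the function names are illustrative:

```python
def cube_gb(nx, ny, nchan, bytes_per_pixel=4):
    """Size in decimal GB of one image cube with nchan planes."""
    return nx * ny * nchan * bytes_per_pixel / 1e9


def planes_that_fit(ram_gb, nx, ny, n_cubes, bytes_per_pixel=4):
    """Largest channel count for which n_cubes same-sized cubes fit in ram_gb."""
    plane_gb = nx * ny * bytes_per_pixel / 1e9
    return int(ram_gb // (plane_gb * n_cubes))


one_cube = cube_gb(8000, 8000, 1000)   # a single extension of the full cube
three_cubes = 3 * one_cube             # residual + PSF + mask (assumed set)
fit = planes_that_fit(128, 8000, 8000, n_cubes=3)

print(f"one cube      {one_cube:.0f} GB")
print(f"three cubes   {three_cubes:.0f} GB")
print(f"channels that fit in 128 GB with three cubes: {fit}")
```

Under these assumptions a single full-cube extension is already twice the 128 GB of RAM, which is why the footprint has to be bounded per channel rather than per cube.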

## 7. Comparison with pclean

| | tclean (MPI, 11 CASA ranks) | pclean (10 Dask workers) |
|---|---|---|
| niter | 50000 (killed before first minor cycle) | 50000 configured; 0 minor-cycle iterations run (effectively makePSF only) |
| Parallelism | 11 CASA MPI ranks (NProc=25 total) | 10 Dask workers (NProc=13 total) |
| makePSF wall time | **4h 7m 35s** (06:50:27→10:58:02) | **≤ 10h 25m total** (PSF+save) |
| Major cycle 1 | **6h 16m 24s** (10:58:02→17:14:26, partial) | N/A |
| Concat overhead | N/A (monolithic image) | 3h 20m (7 extensions, serial) |
| **Peak RSS** | **154.2 GB** (06:59:18) | **58.6 GB** |
| Peak virtual | **171.2 GB** | **81.6 GB** |
| Peak MMap RSS | **146.5 GB** | **52.4 GB** |
| Swap | **0 MB** | ~5 MB |
| Page cache depleted | 122 → 7.4 GB | 122 → 15.4 GB |
| Peak CPU | ~1105% | ~1095% |
| Total I/O reads | **2.33 TB** | **6.71 TB** |
| Total I/O writes | **8.24 TB** (incomplete) | **9.83 TB** (complete) |
| Final output size | **1.48 TB** (partial, killed) | **1.42 TB** (7 extensions, complete) |
| OOM killed | **Yes** | **No** |
| Run completed | **No** | **Yes** (13h 47m) |

pclean uses **~62% less peak memory** (58.6 GB vs 154.2 GB) because each
worker loads only one channel at a time, while the tclean ranks collectively
hold all 1000 planes.
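
The relative savings can be recomputed for each memory metric in the comparison table. A minimal sketch using only the peak values listed there; the helper name is illustrative:

```python
# Peak memory figures from the comparison table, in GB: (tclean, pclean).
PEAKS = {
    "RSS": (154.2, 58.6),
    "virtual": (171.2, 81.6),
    "MMap RSS": (146.5, 52.4),
}


def percent_reduction(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline


reductions = {
    name: percent_reduction(tclean_gb, pclean_gb)
    for name, (tclean_gb, pclean_gb) in PEAKS.items()
}
for name, pct in reductions.items():
    print(f"{name:9s} {pct:.1f}% lower for pclean")
```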

> **Note**: The pclean run was configured with `niter=50000`, but converged
> after 0 minor-cycle iterations (effectively makePSF-only / `niter=0` in
> practice). A strict apples-to-apples comparison would require a pclean run
> that reaches the same minor-cycle depth as the tclean configuration.

## 8. Conclusions

For a 1000-channel, 8000×8000 cube with briggsbwtaper weighting:

- tclean with `parallel=True` is **OOM-killed** at entry to the first minor
  cycle, before any deconvolution
- pclean completes the PSF across 1000 channels in ~10h without OOM
- pclean's per-channel parallelism trades a serial concat step (3h 20m) for
  a much lower memory footprint; the concat bottleneck has since been
  addressed (see `alma_pclean_perf_v1.md` § 6)