# ALMA Cube Imaging — tclean Performance Report (v1) ## 1. Test Configuration | Parameter | Value | |---|---| | Dataset | ALMA `uid___A002_Xf0fd41_X5f5a_target.ms`, SPW 25, IRC+10216 | | Image size | 8000 × 8000 px, 0.0046 arcsec/px | | Spectral | specmode=cube, 1000 channels | | Weighting | briggsbwtaper, robust=0.5, fracbw=0.000907945 | | Deconvolver | hogbom, niter=50000, threshold=4.0mJy, usemask=auto-multithresh | | Parallelism | CASA `parallel=True` (MPI), 11 CASA worker ranks (`OMPI_COMM_WORLD_SIZE=11`); psrecord NProc peak 25 (includes MPI infrastructure processes) | | CASA version | 6.7.3-21 | | Log files | `test_alma_tclean_1.log` / `test_alma_tclean_1.rec` | ## 2. Result Summary | Outcome | Value | |---|---| | **Completed** | **No — OOM-killed** | | psrecord window | 2026-03-04T06:50:14 → 2026-03-04T17:20:01 | | Elapsed before kill (log) | ~10h 24m (06:50:14 → 17:14:26) | | Last phase reached | `initMinorCycle` (before any deconvolution) | | Kill signal | SIGKILL on MPI rank 1 (signal 9) | | **Peak RSS** | **154.2 GB** at 06:59:18 (early PSF setup) | | Final sample RSS | 107.8 GB at 17:20:01 (post-kill cleanup state) | The 50000-iteration deconvolution was **never executed**. ## 3. Phase Timings | Phase | Start | End | Duration | |---|---|---|---| | casahouse init / MPI startup | 06:50:14 | 06:50:25 | ~11 s | | `tclean()` call / setup | 06:50:25 | 06:50:27 | ~2 s | | `makePSF` (1000 ch, parallel) | 06:50:27 | 10:58:02 | **4h 7m 35s** | | `executeMajorCycle 1` | 10:58:02 | 17:14:26 | **6h 16m 24s** | | `initMinorCycle` / automask setup | 17:14:26 | 17:14:27 | < 1 s | | **SIGKILL (rank 1)** | 17:14:27 | — | **OOM-killed** | | Cleanup / lingering procs | 17:14:27 | 17:20:01 | ~5 min | ## 4. Resource Usage *Source: psrecord* ### 4.1 Processes | Metric | Value | |---|---| | CASA worker ranks | 11 (`OMPI_COMM_WORLD_SIZE=11`; CASA reports "Processes on node: 11") | | Total NProc (psrecord peak) | 25 (includes prte/pmix daemons, mpirun, coordinator) | ### 4.2 Memory | Metric | Value | Timestamp | |---|---|---| | **Peak RSS** | **154.2 GB** | 2026-03-04T06:59:18 (early PSF setup) | | **Peak virtual** | **171.2 GB** | 2026-03-04T06:59:18 | | **Peak MMap RSS** | **146.5 GB** | 2026-03-04T06:59:18 | | Swap used | **0 MB** | (no swap at any point) | | RSS at last sample (17:20:01) | 107.8 GB | post-kill cleanup | | Virtual at last sample | 126.8 GB | post-kill cleanup | | MMap at last sample | 81.9 GB | post-kill cleanup | | System page cache (start) | 122.1 GB | 2026-03-04T06:51:39 | | System page cache (end) | **7.4 GB** | 2026-03-04T17:20:01 | > **Note on peak RSS**: The previously reported value of 107.8 GB was the > *final* psrecord sample at 17:20:01 — measured during post-kill cleanup, > not during the run. The true peak of **154.2 GB** occurred at 06:59:18 > during the early PSF weight-grid allocation. The page cache collapsing from 122 GB → 7 GB confirms severe memory pressure from holding all 1000 channel planes simultaneously across 11 CASA worker ranks. ### 4.3 CPU | Phase | CPU % (aggregate) | |---|---| | Sustained imaging | ~1000–1105% (11 CASA ranks; peak 1105.2%) | ### 4.4 I/O | Metric | Value | Note | |---|---|---| | Total bytes read | **2.33 TB** | decimal (psrecord final) | | Total bytes written | **8.24 TB** | decimal (psrecord final) | | Output dir size | **1.48 TB** | decimal; 1484099 MB at 17:20:01 | | Write amplification | **~5.6×** output size | 8.24 / 1.48 | **Write volume breakdown (estimated):** Each channel plane: 8000 × 8000 × 4 B = 256 MB per extension. 1000 channels × ~9 extensions = ~2.3 TB expected final output. Total writes 8.24 TB ≈ 3.5× the expected output, due to: - Weight maps (per-channel BriggsBWTaper weight grids) - Partial gridded visibility scratch data per MPI rank - Repeated residual/model writes across major cycle iterations ## 5. Key Observations ### 5.1 All 1000 channel planes live in memory simultaneously All 11 CASA worker ranks grid 1000 channels collectively (330 subcubes per pass, as logged: *"Subcubes: 330. Processes on node: 11"*), requiring the full `briggsbwtaper` weight grid before any PSF plane is finalized. This drives RSS to **154.2 GB** at 06:59:18 — just 9 minutes after `makePSF` started. ### 5.2 OOM at minor cycle entry After makePSF (4h 7m) and major cycle 1 (6h 16m), CASA reaches `SynthesisDeconvolver::initMinorCycle` at 17:14:26. The log shows: ``` 2026-03-04 17:14:26 initMinorCycle Absolute Peak residual over full image: 0.134128 2026-03-04 17:14:26 setupMask Setting up an auto-mask signal 9 (Killed). ``` The kernel OOM-kills rank 1 during automask setup, before any CLEAN component is subtracted. The page cache had collapsed from 122 GB → 7 GB (−115 GB) by this point. ### 5.3 I/O write amplification (8.24 TB for 1.48 TB output) Between weight scratch, per-rank partial grids, and per-iteration residual writes, tclean writes **~5.6×** more data than the partial output size. This saturates the NVMe subsystem and contributes to the 6h 16m major cycle time. ### 5.4 No concat phase tclean produces a single monolithic CASA image directly — no subcube concatenation step. However, this comes at the cost of requiring all channel data live in memory simultaneously. ## 6. Bottleneck Analysis The fundamental bottleneck is **memory**: tclean's MPI parallel model holds all 1000 channel planes across all ranks simultaneously, and the minor cycle requires the full residual + PSF + automask structure in memory at once. No tuning of MPI rank count or I/O can work around this — the monolithic cube model requires $O(\text{nchan})$ memory, which exceeds the system's 128 GB for this cube size. ## 7. Comparison with pclean | | tclean (MPI, 11 CASA ranks) | pclean (10 Dask workers) | |---|---|---| | niter | 50000 (killed before minor) | 0 (makePSF only) | | Parallelism | 11 CASA MPI ranks (NProc=25 total) | 10 Dask workers (NProc=13 total) | | makePSF wall time | **4h 7m 35s** (06:50:27→10:58:02) | **≤ 10h 25m total** (PSF+save) | | Major cycle 1 | **6h 16m 24s** (10:58:02→17:14:26, partial) | N/A | | Concat overhead | N/A (monolithic image) | 3h 20m (7 extensions, serial) | | **Peak RSS** | **154.2 GB** (06:59:18) | **58.6 GB** | | Peak virtual | **171.2 GB** | **81.6 GB** | | Peak MMap RSS | **146.5 GB** | **52.4 GB** | | Swap | **0 MB** | ~5 MB | | Page cache depleted | 122 → 7.4 GB | 122 → 15.4 GB | | Peak CPU | ~1105% | ~1095% | | Total I/O reads | **2.33 TB** | **6.71 TB** | | Total I/O writes | **8.24 TB** (incomplete) | **9.83 TB** (complete) | | Final output size | **1.48 TB** (partial, killed) | **1.42 TB** (7 extensions, complete) | | OOM killed | **Yes** | **No** | | Run completed | **No** | **Yes** (13h 47m) | pclean uses **~63% less peak memory** (58.6 GB vs 154.2 GB) because each worker loads only one channel at a time, while tclean holds all 1000 planes collectively. > **Note**: The pclean run was configured with `niter=50000`, but converged > after 0 minor-cycle iterations (effectively makePSF-only / `niter=0` in > practice). A strict apples-to-apples comparison would require a pclean run > that reaches the same minor-cycle depth as the tclean configuration. ## 8. Conclusions For a 1000-channel, 8000×8000 cube with briggsbwtaper weighting: - tclean with `parallel=True` is **OOM-killed** before completing even one minor cycle - pclean completes PSF across 1000 channels in ~10h without OOM - pclean's per-channel parallelism trades concat overhead for dramatically lower memory footprint — a bottleneck that has since been addressed (see `alma_pclean_perf_v1.md` § 6)