# ALMA Cube Imaging — pclean Performance Report (v1)

## 1. Test Configuration

| Parameter | Value |
|---|---|
| Dataset | ALMA `uid___A002_Xf0fd41_X5f5a_target.ms`, SPW 25, IRC+10216 |
| Image size | 8000 × 8000 px, 0.0046 arcsec/px |
| Spectral | specmode=cube, 1000 channels |
| Weighting | briggsbwtaper, robust=0.5 |
| Deconvolver | hogbom, **niter=50000** — configured but 0 iterations executed (see note below) |
| Parallelism | Dask `LocalCluster`, 10 workers, `cube_chunksize=1` |
| Subcubes | 1000 (one per channel) |
| Log files | `test_alma_pclean_1.log` / `test_alma_pclean_1.rec` |

## 2. Result Summary

| Outcome | Value |
|---|---|
| **Completed** | **Yes** |
| Total wall time | **13h 47m 33s** |
| Imaging wall time | 10h 25m 17s |
| Concat wall time | 3h 20m (24% of total) |
| OOM killed | No |
| Peak RSS | 58.6 GB |

> **Note on deconvolution**: `niter=50000` was set in the run script but
> no Hogbom iterations were actually executed.  `SynthesisDeconvolver::setupDeconvolution`
> was called (hogbom configured), `initMinorCycle` ran, and `setupMask`
> (auto-multithresh) ran.  However, `executeminorcycle` never appears in the
> log.  The first convergence check logged *"Peak residual within mask : 0"*
> (empty mask), which likely caused `iterbotsink.cleanComplete()` to set an
> early-stop state before the mask was populated.  The run output is
> equivalent to PSF + residual + restore only.
> **Fixed (post-v1)**: `SerialImager.run()` now calls `update_mask()` before
> the first `has_converged()` check so that `initminorcycle()` sees a
> non-empty mask and `cleanComplete()` correctly returns `False`.

## 3. Phase Timings

| Phase | Start | End | Duration |
|---|---|---|---|
| Cluster start + partition | 06:49:54 | 06:49:56 | ~2 s |
| Parallel imaging (1000 subcubes, 10 workers) | 06:49:56 | 17:15:13 | **10h 25m 17s** |
| Pre-concat collect + handoff | 17:15:13 | 17:15:49 | ~36 s |
| `.image` concat (1000 inputs) | 17:15:49 | 18:20:20 | **64m 31s** |
| `.residual` concat (1000 inputs) | 18:20:41 | 19:07:31 | **46m 50s** |
| `.psf` concat (1000 inputs) | 19:07:51 | 19:42:50 | **34m 59s** |
| `.model` concat (1000 inputs) | 19:43:10 | 19:55:48 | **12m 38s** |
| `.pb` concat (1000 inputs) | 19:56:08 | 20:21:52 | **25m 44s** |
| `.mask` concat (1000 inputs) | 20:22:12 | 20:34:57 | **12m 45s** |
| `.sumwt` concat (1000 inputs) | 20:35:19 | 20:35:45 | **26s** |
| Cleanup | 20:35:45 | 20:37:27 | ~1m 42s |
| **Total** | **06:49:54** | **20:37:27** | **13h 47m 33s** |

Extensions concatenated: 7 (`.image.pbcor` and `.weight` not created at `niter=0`).

**Note**: `.model` (all-zero) and `.sumwt` (tiny: 1×1×1×1000) concat in <13 min
and 26 s respectively — pixel volume matters far more than file count.

## 4. Resource Usage

*Source: psrecord*

### 4.1 Processes

| Phase | NProc |
|---|---|
| Startup | 2 |
| Workers launched | 13 (main + 10 workers + Dask scheduler) |
| Post-imaging concat | 13 (workers idle but alive) |

### 4.2 Memory

| Metric | Value |
|---|---|
| Peak RSS | **58.6 GB** (during parallel imaging) |
| Peak virtual | **81.6 GB** |
| Peak MMap RSS | **52.4 GB** (CASA memory-mapped image files) |
| Swap used | ~5 MB (negligible — no thrashing) |
| RSS during concat | ~4–5 GB (workers idle, only coordinator active) |
| System page cache (start) | 122 GB |
| System page cache (end) | **15.4 GB** (exhausted by concat I/O) |

Peak RSS of 58.6 GB is spread across 10 workers simultaneously, each holding
one channel's worth of PSF + sensitivity store + gridder scratch (~5–6 GB each).
Memory drops after imaging completes because Dask reclaims worker process memory.

### 4.3 CPU

| Phase | CPU % (aggregate) |
|---|---|
| Worker ramp-up | 900–1000% (all 10 cores) |
| Sustained imaging | ~1000–1095% (near-linear 10-core scaling) |
| Concat (serial `imageconcat`) | ~160% (single-threaded I/O) |

Peak CPU of **1095%** indicates all 10 workers plus the coordinator were fully
active.  Concat drops to ~160% because `ia.imageconcat()` is single-threaded
and I/O-bound.

### 4.4 I/O

| Metric | Value |
|---|---|
| Total bytes read | **6.71 TB** |
| Total bytes written | **9.83 TB** |
| Final output dir size | **1.42 TB** (7 extensions) |
| Write amplification | **~6.9×** final output size |

**Write amplification breakdown (estimated):**

1. **Imaging phase** (~2.3 TB): 1000 workers × 9 subcube extensions.
   Per-channel single plane: 8000 × 8000 × 4 B = 256 MB × 9 extensions.
   Plus weight scratch files, gridder temp files, per-worker table caches.
2. **Concat phase**: reads all 1000 subcube inputs (~2.3 TB) per
   extension and writes merged output (physical copy doubles I/O).

## 5. Key Observations

### 5.1 Per-channel parallelism avoids OOM

Each Dask worker loads only a single channel at a time (~5–6 GB per worker).
This keeps peak RSS at 58.6 GB across 10 workers — **45% less** than tclean's
107.8 GB across 25 MPI ranks which hold all 1000 planes simultaneously.

### 5.2 Concat was the primary bottleneck

At 3h 20m (24% of total wall time), subcube concatenation was the single
largest post-imaging overhead.  Two root causes:

1. **Serial extension loop** — 7 independent extensions processed one at a
   time, making total time proportional to their sum.
2. **Physical-copy only** — `ia.imageconcat()` was called without forwarding
   the `mode` parameter, so every pixel was read and rewritten.

### 5.3 Page cache exhaustion during concat

System page cache collapsed from 122 GB → 15.4 GB as the coordinator
sequentially read 1000 subcubes per extension.  This is disk-throughput
bounded, not CPU bounded.

## 6. Bottleneck Analysis & Optimization Options

### Option A — Parallel extension concat (implemented)

Use `ProcessPoolExecutor` (spawn context) to run independent extensions in parallel worker processes.
CASA releases the GIL during I/O, so processes can overlap I/O and CPU work effectively.
With `max_workers=4`, estimated wall time drops from ~3h 20m to ~65 min.

### Option B — Virtual / reference concat (implemented)

Forward `mode='nomovevirtual'` to `ia.imageconcat()` to create a lightweight
reference catalog instead of copying pixels.  Near-instant (<1 min) but
requires subcubes to remain on disk (`keep_subcubes=True`).

### Option C — `movevirtual` mode (implemented)

`mode='movevirtual'` renames subcube directories into the output image.
Near-instant on the same filesystem.  Subcubes are consumed in the process.

### Option D — Larger `cube_chunksize`

With `cube_chunksize=10`, 1000 channels → 100 subcubes, reducing file-open
overhead by ~10×.

| `cube_chunksize` | Subcubes | Est. concat time (paged, serial) |
|---|---|---|
| 1 | 1000 | ~3h 20m |
| 5 | 200 | ~1.5–2 h |
| 10 | 100 | ~45–60 min |
| 50 | 20 | ~10 min |

### Resolution

Options A + B + C have been implemented via the `concat_mode` parameter
in `ClusterConfig` (values: `auto`, `paged`, `virtual`, `movevirtual`).
See `notes/concat_optimize.md` for implementation details.

## 7. Code References

- `src/pclean/utils/image_concat.py` — `concat_images()`, `concat_subcubes()`
- `src/pclean/parallel/cube_parallel.py` — `concat_subcubes()` call site + `_cleanup_subcubes()`
- `src/pclean/config.py` — `ClusterConfig.concat_mode`