Utilities API

Memory Estimation

Heuristic RAM estimator for parallel CASA imaging workers.

CASA’s C++ imaging engine (casatools) allocates multiple image-sized buffers during gridding that Python and Dask cannot track or free. This module provides a rough estimate of peak RAM usage so that users can choose an appropriate nworkers for their system.

Memory model

During active imaging of a single sub-cube, CASA keeps approximately the following buffers resident (per channel):

Buffer                     Dtype      Bytes/pixel
Complex visibility grid    complex64  8
Weight grid                complex64  8
FFT workspace (in + out)   complex64  16
Residual image             float32    4
Model image                float32    4
PSF image                  float32    4
Weight image (sumwt)       float32    4
Primary beam (PB)          float32    4
Mask                       float32    4
Temporary / bookkeeping    mixed      ~20

This sums to roughly 76 bytes per pixel per channel for a standard gridder with deconvolver='hogbom' and Stokes I.
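The per-pixel figure can be checked by adding up the buffer sizes listed in the memory model. This is a minimal sketch; the dictionary name is illustrative and not part of pclean.

```python
# Bytes per pixel per channel for each resident buffer, copied from the
# memory model in this section.
BUFFER_BYTES_PER_PIXEL = {
    "complex visibility grid": 8,    # complex64
    "weight grid": 8,                # complex64
    "fft workspace (in + out)": 16,  # two complex64 buffers
    "residual image": 4,             # float32
    "model image": 4,                # float32
    "psf image": 4,                  # float32
    "weight image (sumwt)": 4,       # float32
    "primary beam (pb)": 4,          # float32
    "mask": 4,                       # float32
    "temporary / bookkeeping": 20,   # approximate, mixed dtypes
}

total_bytes_per_pixel = sum(BUFFER_BYTES_PER_PIXEL.values())
print(total_bytes_per_pixel)  # 76
```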

Scaling factors (multiplicative):

  • Mosaic gridder — each pointing requires a convolution function (CF) table; memory scales with the number of fields and CF support size. A 1.5x–3x multiplier over standard is typical.

  • MTMFS deconvolver — internal Hessian products scale as nterms squared.

  • Multi-channel sub-cubes — linear in nchan_per_task.

Calibration

The 76 B/pix/chan constant was calibrated against an ALMA Band 6 cube-imaging run (IRC+10216, 8000 x 8000, 40 antennas, 449 280 rows, gridder='standard', deconvolver='hogbom'), where each worker consumed ~4.9 GiB of C++ memory with 1 channel per task.

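4.9 GiB / (8000 * 8000 * 1 chan) ≈ 76 B/pix/chan (treating the 4.9 figure as 4.9 × 10^9 bytes; read strictly as binary GiB, the same ratio is closer to 82 B/pix/chan)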

The MS row count (nrows) contributes negligibly — visibilities are processed in row chunks that occupy a few MB, dwarfed by the multi-GiB image grids. It is included only as a minor additive term.

pclean.utils.memory_estimate.BYTES_PER_PIXEL_STANDARD: float = 76.0

Bytes per pixel per channel for the standard gridder (Stokes I, hogbom).

pclean.utils.memory_estimate.WORKER_BASE_OVERHEAD_GIB: float = 0.7

Python + Dask worker process baseline overhead (GiB).
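For the standard gridder with hogbom, the two constants combine roughly as sketched here. This is not the pclean implementation: the function name is hypothetical, and the sketch leaves out the minor row-count term as well as the mosaic and mtmfs multipliers.

```python
BYTES_PER_PIXEL_STANDARD = 76.0  # documented constant
WORKER_BASE_OVERHEAD_GIB = 0.7   # documented constant
GIB = 1024 ** 3


def sketch_standard_worker_gib(imsize, nchan_per_task=1):
    """Rough per-worker peak RAM (GiB) for gridder='standard', deconvolver='hogbom'."""
    # A scalar imsize is treated as a square image.
    nx, ny = (imsize, imsize) if isinstance(imsize, int) else tuple(imsize)
    image_bytes = BYTES_PER_PIXEL_STANDARD * nx * ny * nchan_per_task
    return image_bytes / GIB + WORKER_BASE_OVERHEAD_GIB


print(round(sketch_standard_worker_gib(8000), 2))  # 5.23
```

For an 8000 x 8000 image with one channel per task, this lands within rounding of the 5.22... shown in the estimate_worker_memory_gib example.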

pclean.utils.memory_estimate.estimate_worker_memory_gib(imsize, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1)

Estimate peak RAM (GiB) consumed by a single worker.

Parameters:
  • imsize (Sequence[int] | int) – Image dimensions in pixels. A scalar is treated as a square.

  • nchan_per_task (int) – Number of channels each worker images (cube_chunksize).

  • gridder (str) – Gridder name (standard, mosaic, wproject, etc.).

  • deconvolver (str) – Deconvolver name. mtmfs triggers the nterms multiplier.

  • nterms (int) – Number of Taylor terms (only relevant for mtmfs).

  • nfields (int) – Number of mosaic pointings (used to scale mosaic overhead).

Returns:

Estimated peak memory in GiB.

Return type:

float

Examples:

>>> estimate_worker_memory_gib(imsize=8000, nchan_per_task=1)
5.22...
>>> estimate_worker_memory_gib(imsize=[1280, 1024], gridder='mosaic',
...                            deconvolver='mtmfs', nterms=2)
5.08...

pclean.utils.memory_estimate.estimate_peak_ram_gib(nworkers, imsize, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1)

Estimate peak system RAM (GiB) for nworkers concurrent tasks.

Returns:

Estimated total peak RAM in GiB.

Return type:

float

pclean.utils.memory_estimate.recommend_nworkers(available_ram_gib=None, imsize=4096, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1, ram_safety_factor=0.85)

Suggest the maximum number of workers that fit in available RAM.

Returns:

Recommended number of workers (at least 1).

Return type:

int
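How a per-worker estimate might translate into a system-wide peak and a worker recommendation can be sketched as follows. Both helper names are hypothetical, and the model (all workers peaking simultaneously, a floor division against a safety-scaled RAM budget) is an assumption rather than the confirmed pclean logic.

```python
import math


def sketch_peak_ram_gib(nworkers, per_worker_gib):
    """Assumed model: every worker reaches its peak at the same time."""
    return nworkers * per_worker_gib


def sketch_recommend_nworkers(available_ram_gib, per_worker_gib,
                              ram_safety_factor=0.85):
    """Largest worker count whose assumed peak fits the usable RAM budget."""
    usable_gib = available_ram_gib * ram_safety_factor
    # Never recommend fewer than one worker, even if one does not fit.
    return max(1, math.floor(usable_gib / per_worker_gib))


print(sketch_recommend_nworkers(64.0, 5.23))  # 10
print(sketch_recommend_nworkers(4.0, 5.23))   # 1
```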

Partitioning

Data and image partitioning utilities.

Uses casatools.synthesisutils to divide data for continuum (row-based) and cube (frequency-based) parallelism, and also provides pure-Python fallback partitioners.

pclean.utils.partition.partition_continuum(config, nparts)

Partition data by visibility rows for parallel continuum imaging.

Uses synthesisutils.contdatapartition() to split each MS across nparts workers. Each returned dict is a CASA-native parameter bundle with selection narrowed to its row chunk and a unique partial image name.

Parameters:
  • config (PcleanConfig) – Full imaging configuration.

  • nparts (int) – Number of partitions.

Returns:

One CASA-native bundle (dict) per worker.

Return type:

list[dict]

pclean.utils.partition.partition_cube(config, nparts)

Partition the output cube by frequency channels for parallel cube imaging.

Uses synthesisutils.cubedataimagepartition() when possible, falling back to an even-split heuristic.

Parameters:
  • config (PcleanConfig) – Full imaging configuration.

  • nparts (int) – Number of partitions.

Returns:

One PcleanConfig per worker, covering a non-overlapping range of output channels.

Return type:

list[PcleanConfig]
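An even-split fallback of the kind mentioned above could look like the following. This is a sketch under the assumption that channels are divided into contiguous, non-overlapping index ranges; the function name is hypothetical, and the real fallback may work on frequencies or on PcleanConfig objects instead.

```python
def sketch_even_split_channels(nchan, nparts):
    """Split range(nchan) into contiguous, non-overlapping (start, stop) ranges."""
    if nchan < 1:
        raise ValueError("nchan must be at least 1")
    # Never create more partitions than there are channels.
    nparts = max(1, min(nparts, nchan))
    base, extra = divmod(nchan, nparts)
    ranges = []
    start = 0
    for i in range(nparts):
        # The first `extra` partitions take one additional channel.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


print(sketch_even_split_channels(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```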

pclean.utils.partition.partial_image_name(base, part_index)

Return the partial-image path for a given partition index.

Return type:

str

Image Concatenation

Image concatenation utilities.

After parallel cube imaging each worker produces a sub-cube. This module concatenates them into the final output cube, mirroring the ia.imageconcat() call used in CASA’s parallel cube imager.

Three concatenation modes are supported (via concat_mode in ClusterConfig):

  • paged (mode='paged'): Pixel data are physically copied into a new self-contained CASA image. Slower, but fully independent of the subcubes after completion. This is the default mode argument of concat_images() and concat_subcubes().

  • virtual (mode='nomovevirtual'): The output image is a lightweight reference catalog that points at the original subcube files. Near-instant but requires the subcubes to stay on disk (keep_subcubes=True).

  • movevirtual (mode='movevirtual'): The subcube directories are renamed (moved) into the output image. Near-instant on the same filesystem; the subcubes are consumed in the process.

When concat_mode='auto' (the default), the mode is derived from keep_subcubes: True → virtual, False → paged.

When multiple extensions need concatenating (e.g. .image, .residual, .psf, …), a ProcessPoolExecutor (spawn start method) is used for paged mode so that each subprocess gets its own casacore TableCache and there is no shared C++ state between workers. Virtual modes are run sequentially because they are near-instant and write shared catalog metadata.
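The mapping from concat_mode and keep_subcubes to an imageconcat mode string, as described above, can be sketched like this. The function and dictionary names are hypothetical; only the mode names and the auto rule come from this section.

```python
# Documented concat_mode names and the ia.imageconcat() mode strings they map to.
_MODE_STRINGS = {
    "paged": "paged",
    "virtual": "nomovevirtual",
    "movevirtual": "movevirtual",
}


def sketch_resolve_concat_mode(concat_mode, keep_subcubes):
    """Return the imageconcat mode string for a concat_mode / keep_subcubes pair."""
    if concat_mode == "auto":
        # auto: subcubes that stay on disk can be referenced; otherwise copy pixels.
        concat_mode = "virtual" if keep_subcubes else "paged"
    try:
        return _MODE_STRINGS[concat_mode]
    except KeyError:
        raise ValueError(f"unknown concat_mode: {concat_mode!r}") from None


print(sketch_resolve_concat_mode("auto", keep_subcubes=True))   # nomovevirtual
print(sketch_resolve_concat_mode("auto", keep_subcubes=False))  # paged
```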

pclean.utils.image_concat.concat_images(outimage, inimages, axis=-1, relax=True, overwrite=True, mode='paged')

Concatenate a list of CASA images along axis.

Parameters:
  • outimage (str) – Path for the output concatenated image.

  • inimages (list[str]) – Ordered list of input sub-images.

  • axis (int) – Axis to concatenate along (default -1, i.e. the spectral axis).

  • relax (bool) – Relax axis checks.

  • overwrite (bool) – Overwrite outimage if it exists.

  • mode (str) – CASA imageconcat mode. 'paged' (default) physically copies data. 'nomovevirtual' creates a reference catalog (near-instant, but requires input images to stay on disk). 'movevirtual' creates a virtual concatenation by moving subcube directories into the output image.

Return type:

None

pclean.utils.image_concat.concat_subcubes(base_imagename, nparts, extensions=None, mode='paged', max_workers=4, virtual=None, _pool_cls=None)

Concatenate all standard image products from numbered sub-cubes.

Products include .image, .residual, .psf, etc.

The mode parameter is forwarded directly to ia.imageconcat():

  • 'paged' — pixel data are physically copied (default, always safe). Extensions are concatenated in parallel via ProcessPoolExecutor (spawn context) so each subprocess owns an independent casacore TableCache — true I/O parallelism, no shared C++ state.

  • 'nomovevirtual' — lightweight reference catalog, near-instant but subcube files must remain on disk. Run sequentially because virtual-catalog metadata is shared across calls.

  • 'movevirtual' — renames subcubes into the output directory (near-instant, subcubes are consumed). Also run sequentially.

Deprecated: The virtual parameter is deprecated. Pass mode explicitly.

Parameters:
  • base_imagename (str) – The original imagename (without .subcube.N).

  • nparts (int) – Number of sub-cubes.

  • extensions (list[str] | None) – Image extensions to concatenate. Defaults to a standard set.

  • mode (str) – CASA imageconcat mode string.

  • max_workers (int) – Maximum parallel concatenation workers (paged mode only).

  • virtual (bool | None) – Deprecated. True maps to mode='nomovevirtual'.

Return type:

None
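One way the deprecated virtual flag could be folded into mode is sketched here. The helper name is hypothetical. Only the True case is documented; letting virtual=False fall back to the given mode is an assumption.

```python
import warnings


def sketch_resolve_mode(mode="paged", virtual=None):
    """Translate the deprecated `virtual` flag into an imageconcat mode string."""
    if virtual is not None:
        warnings.warn(
            "'virtual' is deprecated; pass mode explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        if virtual:
            # Documented mapping: True -> 'nomovevirtual'.
            return "nomovevirtual"
    # Assumption: virtual=False (or None) leaves the explicit mode untouched.
    return mode
```

Callers that still pass virtual=True get the reference-catalog behaviour plus a DeprecationWarning, while new code passes mode directly.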

ADIOS2 Checks

Quick diagnostic to verify Adios2StMan availability in the current casatools build.

class pclean.utils.check_adios2.CasatoolsInfo(version='unknown', origin='unknown', conda_build_string='', adios2_supported=False, details=<factory>)

Bases: object

Summary of the casatools installation.

version: str = 'unknown'
origin: str = 'unknown'
conda_build_string: str = ''
adios2_supported: bool = False
details: dict[str, str]

pclean.utils.check_adios2.get_casatools_info()

Detect casatools version and whether it was installed via conda or pip.

Inspects conda-meta records first (definitive for conda installs), then falls back to importlib.metadata / pip provenance checks.

Returns:

A populated CasatoolsInfo dataclass.

Return type:

CasatoolsInfo

pclean.utils.check_adios2.check_adios2_support(*, cleanup=True)

Create a throwaway CASA table with Adios2StMan and report whether it succeeds.

This attempts to bind a single float column to the Adios2StMan storage manager. If the underlying casacore was not compiled with ADIOS2 support (i.e. the nompi variant), casacore raises a RuntimeError about an unknown storage manager, and the check returns False.

Parameters:

cleanup (bool) – Remove the temporary table directory after the check.

Returns:

True if Adios2StMan is available, False otherwise.

Return type:

bool

pclean.utils.check_adios2.ms_uses_adios2(ms_path)

Check whether any column in the given MS is managed by Adios2StMan.

Opens the table read-only, inspects getdminfo(), and returns True if at least one data-manager entry has TYPE == 'Adios2StMan'.

Parameters:

ms_path (str) – Path to a MeasurementSet directory.

Returns:

True if the MS contains ADIOS2-managed columns.

Return type:

bool
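The data-manager test itself reduces to a scan over the dictionary returned by getdminfo(). The sketch below works on a hand-written dictionary modelled on that layout, so it runs without casatools. The helper name and the example entries are illustrative.

```python
def sketch_dminfo_uses_adios2(dminfo):
    """Return True if any data-manager entry has TYPE == 'Adios2StMan'."""
    return any(entry.get("TYPE") == "Adios2StMan" for entry in dminfo.values())


# Hand-written stand-in for a getdminfo() result (illustrative values only).
example_dminfo = {
    "*1": {"NAME": "StandardStMan", "TYPE": "StandardStMan", "SEQNR": 0,
           "COLUMNS": ["ANTENNA1", "ANTENNA2"], "SPEC": {}},
    "*2": {"NAME": "Adios2StMan", "TYPE": "Adios2StMan", "SEQNR": 1,
           "COLUMNS": ["DATA"], "SPEC": {}},
}

print(sketch_dminfo_uses_adios2(example_dminfo))  # True
```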

pclean.utils.check_adios2.force_omp_single_thread()

Force the OpenMP runtime to use exactly 1 thread.

General thread-safety precaution for ADIOS2-backed storage managers. CASA gridding internals can launch OpenMP tasks that concurrently access the MS; limiting to a single thread avoids potential data races in the ADIOS2 engine.

os.environ['OMP_NUM_THREADS'] alone is insufficient because libgomp reads the variable only once (at the first OpenMP call, typically during import casatools). This helper therefore also calls omp_set_num_threads(1) via ctypes to override the cached value immediately.

Return type:

None
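The two-step approach described above (environment variable plus a direct runtime call) can be sketched as follows. The function name is hypothetical, and the sketch assumes the GNU libgomp runtime; a build linked against a different OpenMP runtime would need a different library name.

```python
import ctypes
import ctypes.util
import os


def sketch_force_omp_single_thread():
    """Set OMP_NUM_THREADS=1 and, if libgomp can be loaded, call omp_set_num_threads(1).

    Returns True when the runtime call was made, False when no runtime was found.
    """
    # Covers OpenMP runtimes that have not been initialised yet.
    os.environ["OMP_NUM_THREADS"] = "1"

    # Assumption: GNU libgomp. Other runtimes use other library names.
    libname = ctypes.util.find_library("gomp")
    if libname is None:
        return False
    try:
        runtime = ctypes.CDLL(libname)
        runtime.omp_set_num_threads.argtypes = [ctypes.c_int]
        runtime.omp_set_num_threads.restype = None
        # Overrides the thread count the runtime cached at initialisation.
        runtime.omp_set_num_threads(1)
    except (OSError, AttributeError):
        return False
    return True
```

If the process has already loaded the same shared library, the dynamic loader hands back that instance, so the call affects the runtime in use rather than a private copy.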

ADIOS2 Conversion

Convert a MeasurementSet to use the Adios2StMan storage manager.

pclean.utils.convert_adios2.convert_ms_to_adios2(input_ms, output_ms, *, target_columns=('DATA', 'CORRECTED_DATA', 'MODEL_DATA', 'FLAG', 'WEIGHT', 'SIGMA'), overwrite=False, engine_type='BP4', engine_params=None, adios2_xml=None, taql=None)

Copy a MeasurementSet, rebinding heavy columns to Adios2StMan.

The function reads the existing dminfo from input_ms, replaces the storage-manager type for every manager that handles one of the target_columns, and performs a deep valuecopy so that the bulk data is physically rewritten through the ADIOS2 C++ backend.

Sub-tables (ANTENNA, FIELD, SPECTRAL_WINDOW, etc.) are left on their default storage managers because their I/O footprint is negligible.
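The dminfo rewrite described above can be illustrated on a plain dictionary. This is a sketch, not the pclean code: the helper name is hypothetical, only the ENGINETYPE SPEC field is taken from this page, and whether the real function also renames or merges managers is not stated here.

```python
import copy


def sketch_rebind_dminfo(dminfo, target_columns, engine_type="BP4"):
    """Return a copy of dminfo with managers of target columns set to Adios2StMan."""
    targets = set(target_columns)
    rebound = copy.deepcopy(dminfo)  # leave the caller's dictionary untouched
    for entry in rebound.values():
        # A manager is rebound if it handles at least one target column.
        if targets.intersection(entry.get("COLUMNS", [])):
            entry["TYPE"] = "Adios2StMan"
            entry["SPEC"] = {"ENGINETYPE": engine_type}
    return rebound


# Hand-written stand-in for a getdminfo() result (illustrative values only).
source_dminfo = {
    "*1": {"NAME": "StandardStMan", "TYPE": "StandardStMan",
           "COLUMNS": ["ANTENNA1", "TIME"], "SPEC": {}},
    "*2": {"NAME": "TiledData", "TYPE": "TiledShapeStMan",
           "COLUMNS": ["DATA"], "SPEC": {}},
}
new_dminfo = sketch_rebind_dminfo(source_dminfo, ("DATA", "FLAG"))
print(new_dminfo["*2"]["TYPE"])  # Adios2StMan
```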

Note

Adios2StMan requires the copy to happen in a single Table::deepCopy pass. Manual row-level approaches (addrows + putcol, or copyrows) are not supported because the ADIOS2 engine needs cell shapes established through casacore’s internal copy path and does not allow reopening a table for append.

Casacore’s C++ deepCopy streams data row-by-row, but the ADIOS2 BP engine accumulates all Put() data within a single step — EndStep() / Close() run only in the Adios2StMan destructor. Use engine_params to control buffer sizing.

The default ADIOS2 engine (usually BP5) ignores MaxBufferSize; it only honours BufferChunkSize. This function defaults to BP4 and passes the engine type via the ENGINETYPE dminfo SPEC field so casacore’s Adios2StMan::makeObject sets the correct engine before opening.

Parameters:
  • input_ms (str) – Path to the source MeasurementSet.

  • output_ms (str) – Destination path for the ADIOS2-backed copy.

  • target_columns (tuple[str, ...] | list[str]) – Column names to rebind to Adios2StMan.

  • overwrite (bool) – If True, remove output_ms if it already exists.

  • engine_type (str) – ADIOS2 engine type. 'BP4' is recommended because BP4 respects MaxBufferSize and flushes to disk when the buffer exceeds that cap. BP5 uses a different allocation model (see BufferChunkSize).

  • engine_params (dict[str, str] | None) –

    ADIOS2 engine parameters. Useful keys:

    • MaxBufferSize — triggers flush when exceeded (BP4 only, e.g. '2Gb').

    • InitialBufferSize — starting allocation (BP4).

    • BufferGrowthFactor — growth multiplier (BP4).

    • BufferChunkSize — per-chunk size (BP5).

  • adios2_xml (str | None) – Path to a user-supplied ADIOS2 XML config file. If provided, engine_type and engine_params are ignored.

  • taql (str | None) – Optional TaQL WHERE clause to select a subset of rows before copying (e.g. 'DATA_DESC_ID IN [0]'). When set, tb.query(taql) is used as the copy source, so only matching rows are written. Sub-tables are copied as-is.

Returns:

The output_ms path on success.

Return type:

str

pclean.utils.convert_adios2.split_and_convert_ms_to_adios2(input_ms, output_dir, *, target_columns=('DATA', 'CORRECTED_DATA', 'MODEL_DATA', 'FLAG', 'WEIGHT', 'SIGMA'), overwrite=False, engine_type='BP4', engine_params=None, adios2_xml=None)

Select rows by SPW and convert each subset to Adios2StMan in one pass.

This implements Workaround 3 from the Adios2StMan debug notes. For each SPW the function builds a TaQL DATA_DESC_ID IN [...] clause and passes it to convert_ms_to_adios2 via the taql parameter. The row selection and ADIOS2 rebinding happen in a single deepCopy — no intermediate MS is written.

Sub-tables (SPECTRAL_WINDOW, DATA_DESCRIPTION, etc.) are copied as-is and therefore still contain entries for all SPWs. This is cosmetic; the imager only accesses rows present in the main table.

Parameters:
  • input_ms (str) – Path to the source MeasurementSet.

  • output_dir (str) – Directory under which per-SPW ADIOS2 datasets are written (<output_dir>/<basename>_spw<N>.ms).

  • target_columns (tuple[str, ...] | list[str]) – Columns to rebind to Adios2StMan.

  • overwrite (bool) – If True, remove existing outputs.

  • engine_type (str) – ADIOS2 engine type (forwarded to convert_ms_to_adios2).

  • engine_params (dict[str, str] | None) – ADIOS2 engine parameters.

  • adios2_xml (str | None) – Path to a user-supplied ADIOS2 XML config.

Returns:

List of output ADIOS2-backed MS paths.

Return type:

list[str]
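The per-SPW selection used by split_and_convert_ms_to_adios2 can be sketched without touching a real MS. The helper name is hypothetical. The input list stands in for the SPECTRAL_WINDOW_ID column of the DATA_DESCRIPTION sub-table, and stripping the .ms suffix to form the basename is an assumption.

```python
import os


def sketch_spw_jobs(input_ms, output_dir, spw_ids_per_ddid):
    """Map each SPW to (taql_clause, output_path).

    spw_ids_per_ddid[i] is the SPECTRAL_WINDOW_ID of DATA_DESCRIPTION row i.
    """
    # Group DATA_DESC_ID values by the SPW they point at.
    ddids_by_spw = {}
    for ddid, spw in enumerate(spw_ids_per_ddid):
        ddids_by_spw.setdefault(spw, []).append(ddid)

    basename = os.path.splitext(os.path.basename(input_ms.rstrip("/")))[0]
    jobs = {}
    for spw, ddids in sorted(ddids_by_spw.items()):
        taql = f"DATA_DESC_ID IN [{', '.join(map(str, ddids))}]"
        out_path = os.path.join(output_dir, f"{basename}_spw{spw}.ms")
        jobs[spw] = (taql, out_path)
    return jobs


jobs = sketch_spw_jobs("/data/obs.ms", "/out", [0, 1, 0])
print(jobs[0][0])  # DATA_DESC_ID IN [0, 2]
```

Each (taql, out_path) pair would then be passed to convert_ms_to_adios2 through its taql and output_ms parameters.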