Utilities API¶
Memory Estimation¶
Heuristic RAM estimator for parallel CASA imaging workers.
CASA’s C++ imaging engine (casatools) allocates multiple image-sized
buffers during gridding that Python and Dask cannot track or free.
This module provides a rough estimate of peak RAM usage so that
users can choose an appropriate nworkers for their system.
Memory model¶
During active imaging of a single sub-cube, CASA keeps approximately the following buffers resident (per channel):
Buffer |
Dtype |
Bytes/pixel |
|---|---|---|
Complex visibility grid |
complex64 |
8 |
Weight grid |
complex64 |
8 |
FFT workspace (in + out) |
complex64 |
16 |
Residual image |
float32 |
4 |
Model image |
float32 |
4 |
PSF image |
float32 |
4 |
Weight image (sumwt) |
float32 |
4 |
Primary beam (PB) |
float32 |
4 |
Mask |
float32 |
4 |
Temporary / bookkeeping |
mixed |
~20 |
This sums to roughly 76 bytes per pixel per channel for a
standard gridder with deconvolver='hogbom' and Stokes I.
Scaling factors (multiplicative):
Mosaic gridder — each pointing requires a convolution function (CF) table; memory scales with the number of fields and CF support size. A 1.5x–3x multiplier over standard is typical.
MTMFS deconvolver — internal Hessian products scale as nterms squared.
Multi-channel sub-cubes — linear in
nchan_per_task.
Calibration¶
The 76 B/pix/chan constant was calibrated against an ALMA Band 6
cube-imaging run (IRC+10216, 8000 x 8000, 40 antennas, 449 280 rows,
gridder='standard', deconvolver='hogbom'), where each worker
consumed ~4.9 GiB of C++ memory with 1 channel per task.
4.9 GiB / (8000 * 8000 * 1 chan) ≈ 76 B/pix/chan
The MS row count (nrows) contributes negligibly — visibilities are processed in row chunks that occupy a few MB, dwarfed by the multi-GiB image grids. It is included only as a minor additive term.
- pclean.utils.memory_estimate.BYTES_PER_PIXEL_STANDARD: float = 76.0¶
Bytes per pixel per channel for the standard gridder (Stokes I, hogbom).
- pclean.utils.memory_estimate.WORKER_BASE_OVERHEAD_GIB: float = 0.7¶
Python + Dask worker process baseline overhead (GiB).
- pclean.utils.memory_estimate.estimate_worker_memory_gib(imsize, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1)[source]¶
Estimate peak RAM (GiB) consumed by a single worker.
- Parameters:
imsize (Sequence[int] | int) – Image dimensions in pixels. A scalar is treated as a square.
nchan_per_task (int) – Number of channels each worker images (
cube_chunksize).gridder (str) – Gridder name (
standard,mosaic,wproject, etc.).deconvolver (str) – Deconvolver name.
mtmfstriggers the nterms multiplier.nterms (int) – Number of Taylor terms (only relevant for
mtmfs).nfields (int) – Number of mosaic pointings (used to scale mosaic overhead).
- Returns:
Estimated peak memory in GiB.
- Return type:
Examples:
>>> estimate_worker_memory_gib(imsize=8000, nchan_per_task=1) 5.22... >>> estimate_worker_memory_gib(imsize=[1280, 1024], gridder='mosaic', ... deconvolver='mtmfs', nterms=2) 5.08...
- pclean.utils.memory_estimate.estimate_peak_ram_gib(nworkers, imsize, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1)[source]¶
Estimate peak system RAM (GiB) for nworkers concurrent tasks.
- Parameters:
nworkers (int) – Number of concurrent Dask workers.
imsize (Sequence[int] | int) – Forwarded to
estimate_worker_memory_gib().nchan_per_task (int) – Forwarded to
estimate_worker_memory_gib().gridder (str) – Forwarded to
estimate_worker_memory_gib().deconvolver (str) – Forwarded to
estimate_worker_memory_gib().nterms (int) – Forwarded to
estimate_worker_memory_gib().nfields (int) – Forwarded to
estimate_worker_memory_gib().
- Returns:
Estimated total peak RAM in GiB.
- Return type:
- pclean.utils.memory_estimate.recommend_nworkers(available_ram_gib=None, imsize=4096, nchan_per_task=1, gridder='standard', deconvolver='hogbom', nterms=1, nfields=1, ram_safety_factor=0.85)[source]¶
Suggest the maximum number of workers that fit in available RAM.
- Parameters:
available_ram_gib (float | None) – Total system RAM in GiB.
Nonereads from the OS.imsize (Sequence[int] | int) – Forwarded to
estimate_worker_memory_gib().nchan_per_task (int) – Forwarded to
estimate_worker_memory_gib().gridder (str) – Forwarded to
estimate_worker_memory_gib().deconvolver (str) – Forwarded to
estimate_worker_memory_gib().nterms (int) – Forwarded to
estimate_worker_memory_gib().nfields (int) – Forwarded to
estimate_worker_memory_gib().ram_safety_factor (float) – Fraction of available RAM to target (default 0.85 = 85%).
- Returns:
Recommended number of workers (at least 1).
- Return type:
Partitioning¶
Data and image partitioning utilities.
Uses casatools.synthesisutils to divide data for continuum
(row-based) and cube (frequency-based) parallelism, and also
provides pure-Python fallback partitioners.
- pclean.utils.partition.partition_continuum(config, nparts)[source]¶
Partition data by visibility rows for parallel continuum imaging.
Uses
synthesisutils.contdatapartition()to split each MS across nparts workers. Each returned dict is a CASA-native parameter bundle with selection narrowed to its row chunk and a unique partial image name.- Parameters:
config (PcleanConfig) – Full imaging configuration.
nparts (int) – Number of partitions.
- Returns:
One CASA-native bundle (dict) per worker.
- Return type:
- pclean.utils.partition.partition_cube(config, nparts)[source]¶
Partition the output cube by frequency channels for parallel cube imaging.
Uses
synthesisutils.cubedataimagepartition()when possible, falling back to an even-split heuristic.- Parameters:
config (PcleanConfig) – Full imaging configuration.
nparts (int) – Number of partitions.
- Returns:
One
PcleanConfigper worker, covering a non-overlapping range of output channels.- Return type:
Image Concatenation¶
Image concatenation utilities.
After parallel cube imaging each worker produces a sub-cube. This
module concatenates them into the final output cube, mirroring the
ia.imageconcat() call used in CASA’s parallel cube imager.
Three concatenation modes are supported (via concat_mode in
ClusterConfig):
paged (default): Pixel data are physically copied into a new self-contained CASA image. Slower but fully independent of the subcubes after completion.
virtual (
mode='nomovevirtual'): The output image is a lightweight reference catalog that points at the original subcube files. Near-instant but requires the subcubes to stay on disk (keep_subcubes=True).movevirtual (
mode='movevirtual'): The subcube directories are renamed (moved) into the output image. Near-instant on the same filesystem; the subcubes are consumed in the process.
When concat_mode='auto' (the default), the mode is derived from
keep_subcubes: True → virtual, False → paged.
When multiple extensions need concatenating (e.g. .image, .residual,
.psf, …), a ProcessPoolExecutor (spawn start method) is used for
paged mode so that each subprocess gets its own casacore TableCache and
there is no shared C++ state between workers. Virtual modes are run
sequentially because they are near-instant and write shared catalog metadata.
- pclean.utils.image_concat.concat_images(outimage, inimages, axis=-1, relax=True, overwrite=True, mode='paged')[source]¶
Concatenate a list of CASA images along axis.
- Parameters:
outimage (str) – Path for the output concatenated image.
axis (int) – Axis to concatenate along (default -1 -> spectral).
relax (bool) – Relax axis checks.
overwrite (bool) – Overwrite outimage if it exists.
mode (str) – CASA imageconcat mode.
'paged'(default) physically copies data.'nomovevirtual'creates a reference catalog (near-instant, but requires input images to stay on disk).'movevirtual'creates a virtual concatenation by moving subcube directories into the output image.
- Return type:
None
- pclean.utils.image_concat.concat_subcubes(base_imagename, nparts, extensions=None, mode='paged', max_workers=4, virtual=None, _pool_cls=None)[source]¶
Concatenate all standard image products from numbered sub-cubes.
Products include
.image,.residual,.psf, etc.The mode parameter is forwarded directly to
ia.imageconcat():'paged'— pixel data are physically copied (default, always safe). Extensions are concatenated in parallel viaProcessPoolExecutor(spawncontext) so each subprocess owns an independent casacoreTableCache— true I/O parallelism, no shared C++ state.'nomovevirtual'— lightweight reference catalog, near-instant but subcube files must remain on disk. Run sequentially because virtual-catalog metadata is shared across calls.'movevirtual'— renames subcubes into the output directory (near-instant, subcubes are consumed). Also run sequentially.
Deprecated since version The: virtual parameter is deprecated. Pass mode explicitly.
- Parameters:
base_imagename (str) – The original
imagename(without.subcube.N).nparts (int) – Number of sub-cubes.
extensions (list[str] | None) – Image extensions to concatenate. Defaults to a standard set.
mode (str) – CASA
imageconcatmode string.max_workers (int) – Maximum parallel concatenation workers (paged mode only).
virtual (bool | None) – Deprecated.
Truemaps tomode='nomovevirtual'.
- Return type:
None
ADIOS2 Checks¶
Quick diagnostic to verify Adios2StMan availability in the current casatools build.
- class pclean.utils.check_adios2.CasatoolsInfo(version='unknown', origin='unknown', conda_build_string='', adios2_supported=False, details=<factory>)[source]¶
Bases:
objectSummary of the casatools installation.
- Parameters:
- pclean.utils.check_adios2.get_casatools_info()[source]¶
Detect casatools version and whether it was installed via conda or pip.
Inspects conda-meta records first (definitive for conda installs), then falls back to
importlib.metadata/pipprovenance checks.- Returns:
A populated
CasatoolsInfodataclass.- Return type:
- pclean.utils.check_adios2.check_adios2_support(*, cleanup=True)[source]¶
Create a throwaway CASA table with Adios2StMan and report whether it succeeds.
This attempts to bind a single float column to the
Adios2StManstorage manager. If the underlyingcasacorewas not compiled with ADIOS2 support (i.e. thenompivariant), aRuntimeErrorabout an unknown storage manager is raised.
- pclean.utils.check_adios2.ms_uses_adios2(ms_path)[source]¶
Check whether any column in the given MS is managed by Adios2StMan.
Opens the table read-only, inspects
getdminfo(), and returnsTrueif at least one data-manager entry hasTYPE == 'Adios2StMan'.
- pclean.utils.check_adios2.force_omp_single_thread()[source]¶
Force the OpenMP runtime to use exactly 1 thread.
General thread-safety precaution for ADIOS2-backed storage managers. CASA gridding internals can launch OpenMP tasks that concurrently access the MS; limiting to a single thread avoids potential data races in the ADIOS2 engine.
os.environ['OMP_NUM_THREADS']alone is insufficient becauselibgompreads the variable only once (at the first OpenMP call, typically duringimport casatools). This helper therefore also callsomp_set_num_threads(1)viactypesto override the cached value immediately.- Return type:
None
ADIOS2 Conversion¶
Convert a MeasurementSet to use the Adios2StMan storage manager.
- pclean.utils.convert_adios2.convert_ms_to_adios2(input_ms, output_ms, *, target_columns=('DATA', 'CORRECTED_DATA', 'MODEL_DATA', 'FLAG', 'WEIGHT', 'SIGMA'), overwrite=False, engine_type='BP4', engine_params=None, adios2_xml=None, taql=None)[source]¶
Copy a MeasurementSet, rebinding heavy columns to Adios2StMan.
The function reads the existing
dminfofrom input_ms, replaces the storage-manager type for every manager that handles one of the target_columns, and performs a deepvaluecopyso that the bulk data is physically rewritten through the ADIOS2 C++ backend.Sub-tables (
ANTENNA,FIELD,SPECTRAL_WINDOW, etc.) are left on their default storage managers because their I/O footprint is negligible.Note
Adios2StMan requires the copy to happen in a single
Table::deepCopypass. Manual row-level approaches (addrows+putcol, orcopyrows) are not supported because the ADIOS2 engine needs cell shapes established through casacore’s internal copy path and does not allow reopening a table for append.Casacore’s C++
deepCopystreams data row-by-row, but the ADIOS2 BP engine accumulates allPut()data within a single step —EndStep()/Close()run only in the Adios2StMan destructor. Use engine_params to control buffer sizing.The default ADIOS2 engine (usually BP5) ignores
MaxBufferSize; it only honoursBufferChunkSize. This function defaults toBP4and passes the engine type via theENGINETYPEdminfo SPEC field so casacore’sAdios2StMan::makeObjectsets the correct engine before opening.- Parameters:
input_ms (str) – Path to the source MeasurementSet.
output_ms (str) – Destination path for the ADIOS2-backed copy.
target_columns (tuple[str, ...] | list[str]) – Column names to rebind to Adios2StMan.
overwrite (bool) – If
True, remove output_ms if it already exists.engine_type (str) – ADIOS2 engine type.
'BP4'is recommended because BP4 respectsMaxBufferSizeand flushes to disk when the buffer exceeds that cap. BP5 uses a different allocation model (seeBufferChunkSize).engine_params (dict[str, str] | None) –
ADIOS2 engine parameters. Useful keys:
MaxBufferSize— triggers flush when exceeded (BP4 only, e.g.'2Gb').InitialBufferSize— starting allocation (BP4).BufferGrowthFactor— growth multiplier (BP4).BufferChunkSize— per-chunk size (BP5).
adios2_xml (str | None) – Path to a user-supplied ADIOS2 XML config file. If provided, engine_type and engine_params are ignored.
taql (str | None) – Optional TaQL
WHEREclause to select a subset of rows before copying (e.g.'DATA_DESC_ID IN [0]'). When set,tb.query(taql)is used as the copy source, so only matching rows are written. Sub-tables are copied as-is.
- Returns:
The output_ms path on success.
- Raises:
FileNotFoundError – If input_ms does not exist.
FileExistsError – If output_ms exists and overwrite is
False.RuntimeError – If Adios2StMan is not available in the current build.
- Return type:
- pclean.utils.convert_adios2.split_and_convert_ms_to_adios2(input_ms, output_dir, *, target_columns=('DATA', 'CORRECTED_DATA', 'MODEL_DATA', 'FLAG', 'WEIGHT', 'SIGMA'), overwrite=False, engine_type='BP4', engine_params=None, adios2_xml=None)[source]¶
Select rows by SPW and convert each subset to Adios2StMan in one pass.
This implements Workaround 3 from the Adios2StMan debug notes. For each SPW the function builds a TaQL
DATA_DESC_ID IN [...]clause and passes it to convert_ms_to_adios2 via the taql parameter. The row selection and ADIOS2 rebinding happen in a singledeepCopy— no intermediate MS is written.Sub-tables (
SPECTRAL_WINDOW,DATA_DESCRIPTION, etc.) are copied as-is and therefore still contain entries for all SPWs. This is cosmetic; the imager only accesses rows present in the main table.- Parameters:
input_ms (str) – Path to the source MeasurementSet.
output_dir (str) – Directory under which per-SPW ADIOS2 datasets are written (
<output_dir>/<basename>_spw<N>.ms).target_columns (tuple[str, ...] | list[str]) – Columns to rebind to Adios2StMan.
overwrite (bool) – If
True, remove existing outputs.engine_type (str) – ADIOS2 engine type (forwarded to convert_ms_to_adios2).
engine_params (dict[str, str] | None) – ADIOS2 engine parameters.
adios2_xml (str | None) – Path to a user-supplied ADIOS2 XML config.
- Returns:
List of output ADIOS2-backed MS paths.
- Return type: