MVF version 4 (MeerKAT)¶
The version 4 format is the standard format for MeerKAT visibility data. Unlike previous versions, the data for an observation does not reside in a single HDF5 file, as such files would be unmanageably large. Instead, the data is split into chunks, each in its own file, which are loaded from disk or the network on demand. For this reason, the term “data set” is preferred over “file”.
Concepts¶
- Streams
- A stream is a collection of data and any associated metadata, whether multicast, queryable (e.g., sensors) or stored on disk. Every stream in a subarray product has a unique name and a type. A stream may consist of multiple items of related data; for example, visibilities, flags and weights may form a single stream.
- Subarray product
- A collection of streams in the MeerKAT Science Data Processor (SDP) forms a subarray product.
- Capture block
A capture block is a contiguous period over which data is captured from a specific subarray product. A subarray product can only capture one capture block at a time.
Each visibility belongs to a specific stream and capture block within a subarray product.
Capture block IDs are currently numbers representing the start time in the UNIX epoch, but they should be treated as opaque strings.
- Chunk store
- A location (such as a local disk or the MeerKAT archive) that stores the data from a capture block.
Metadata¶
The metadata for a data set is stored in a Redis dump file (extension .rdb), which is exported by katsdptelstate. Refer to katsdptelstate for details of how attributes and sensors are encoded in the Redis database.
A single .rdb file contains metadata for a single subarray but potentially for multiple streams and capture blocks. The default capture block and stream to access are stored in capture_block_id and stream_name.
Keys are stored in one of the following namespaces:
- the global namespace
- stream_name (the “stream namespace”)
- capture_block_id (the “capture-block namespace”)
- capture_block_id.stream_name (the “capture-stream namespace”)
Here . is used to indicate a sub-namespace, but the actual separator is subject to change and one should always use join() to construct compound namespace names.
Keys may move between these namespaces without notice. Readers should search for keys from the most specific to the least specific appropriate namespace (see for example katdal.datasources.view_capture_stream()).
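The search order described above can be sketched as follows. This is only an illustration that uses a plain dictionary in place of a real telstate database: the SEP separator, the local join() stand-in, the lookup() helper and the toy keys are all hypothetical, and the relative order of the capture-block and stream namespaces is an assumption. Real code should build names with katsdptelstate's join() rather than a hard-coded separator.

```python
# Hypothetical separator for illustration only; the real one is subject to
# change, so production code should build names with katsdptelstate's join().
SEP = "."


def join(*names):
    """Local stand-in for join(): build a compound namespace name."""
    return SEP.join(names)


def lookup(db, key, capture_block_id, stream_name):
    """Return the value of `key`, trying the most specific namespace first."""
    namespaces = [
        join(capture_block_id, stream_name),  # capture-stream namespace
        capture_block_id,                     # capture-block namespace (order assumed)
        stream_name,                          # stream namespace
        "",                                   # global namespace
    ]
    for namespace in namespaces:
        full_key = join(namespace, key) if namespace else key
        if full_key in db:
            return db[full_key]
    raise KeyError(key)


# Toy database standing in for the contents of an .rdb file.
db = {
    "sdp_image_tag": "tag-a",
    "sdp_l0.stream_type": "sdp.vis",
    "1234567890.sdp_l0.int_time": 8.0,
}
```

With this toy database, lookup(db, "int_time", "1234567890", "sdp_l0") is resolved in the capture-stream namespace, while "sdp_image_tag" falls through to the global namespace.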
Where values contain strings, they might contain either raw bytes (which should be decoded as UTF-8) or Unicode text. Readers should be prepared to accept either. The goal is to eventually migrate all such fields to use text. katdal recursively converts all strings to the Python interpreter’s native string type.
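A reader that does not use katdal could normalise such values along the following lines. This is a minimal sketch, not katdal's actual implementation; the to_text() helper is hypothetical and only handles dictionaries, lists and tuples as containers.

```python
def to_text(value):
    """Recursively convert bytes to str (decoded as UTF-8).

    Values that are already text, and non-string values, are returned as-is.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, dict):
        return {to_text(k): to_text(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type with converted items
        return type(value)(to_text(item) for item in value)
    return value
```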
katsdptelstate stores two types of values: immutable “attributes”, and “sensors” which are lists of timestamped values. In the documentation below, keys hold attributes unless they are explicitly marked as sensors.
Global metadata¶
A subset of the sensors in the MeerKAT system are stored in the file, in the global namespace. Documenting the MeerKAT sensors is beyond the scope of this documentation.
The following keys are also stored.
sdp_config (dict)
- The JSON object used to configure the SDP subarray product. It is not intended to be parsed (the relevant information is intended to be available via other means), but it contains a wealth of debugging information about the streams and connections between them.
sdp_capture_block_id (string; sensor)
- All capture block IDs seen so far. This should not be confused with capture_block_id, which indicates the default capture block ID that should be consulted when the file is opened without specifying a capture block ID.
sdp_image_tag (string)
- The Docker image tag for the Docker images forming the realtime SDP capture system. This is the closest thing to a “version number” for the implementation.
sdp_image_overrides (dict)
- Alternative Docker image tags for specific services within SDP, overriding sdp_image_tag. Overriding individual images is a debugging tool and it should always be empty for science observations.
config.* (dict)
- Command-line options passed to each of the services within SDP.
sdp_task_details (dict)
- Debug information about each of the services launched for the subarray product, including the host on which it ran and the Mesos TaskInfo structure.
Common stream metadata¶
The list of streams that can be accessed from the archive is available in sdp_archived_streams (in the global namespace). Within each stream, the following keys may be defined (not all make sense for every stream type).
Only stream_type and src_streams are guaranteed to be in the stream namespace, i.e. independent of the capture block. The others may appear either in the capture-stream namespace or the stream namespace.
inherit (string)
- If present, it indicates another stream from which this stream inherits properties. Any property that cannot be found in the namespace of the current stream should first be looked up in that stream’s namespace.
  This is typically used where a single multicast stream is recorded in multiple places. Each copy inherits the majority of metadata from the original and overrides a few keys.
stream_type (string)
- Valid values are:
  sdp.vis
  - Uncalibrated visibilities, flags and weights.
  sdp.flags
  - Similar to sdp.vis, but containing only flags.
  sdp.cal
  - Calibration solutions. Older files may contain a cal stream which omits the stream information and which does not appear in sdp_archived_streams, so that should be considered as a fallback.
  sdp.continuum_image
  - Continuum image (as a list of CLEAN components) and self-calibration solutions. FITS files will be stored in the MeerKAT archive but katdal does not currently support accessing them.
  sdp.spectral_image
  - Spectral-line image. FITS files will be stored in the MeerKAT archive but katdal does not currently support accessing them.
src_streams (list of string)
- The streams from which the current stream was computed. These are not necessarily listed in sdp_archived_streams, particularly if they were produced by the MeerKAT Correlator/Beamformer (CBF) rather than the SDP.
n_chans (int)
- Number of channels in a channelised product.
n_chans_per_substream (int)
- Number of channels in each SPEAD heap. Not relevant when loading archived data.
bandwidth (float, Hz)
- Bandwidth of the stream.
center_freq (float, Hz)
- Middle of the central channel. Note that if the number of channels is even, this is actually half a channel higher than the middle of the band.
channel_range (int, int)
- A half-open range of channels taken from the source stream. The length of this range might not equal n_chans due to channel averaging.
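As an illustration of how n_chans, bandwidth and center_freq fit together, the centre frequency of every channel can be derived as sketched below. The channel_freqs() helper is hypothetical, and it assumes that the “central channel” has index n_chans // 2, which is consistent with the note that center_freq lies half a channel above the middle of the band when n_chans is even. The numbers in the example are illustrative only.

```python
def channel_freqs(center_freq, bandwidth, n_chans):
    """Return the centre frequency (Hz) of each channel.

    Assumes the central channel has index n_chans // 2 and that channels are
    evenly spaced with width bandwidth / n_chans.
    """
    channel_width = bandwidth / n_chans
    central = n_chans // 2
    return [center_freq + (i - central) * channel_width for i in range(n_chans)]


# Illustrative values only
freqs = channel_freqs(center_freq=1284e6, bandwidth=856e6, n_chans=4096)
```

Under these assumptions freqs[2048] equals center_freq, and the lower edge of the band lies half a channel width below freqs[0].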
Visibility stream metadata¶
The following are relevant to sdp.vis and sdp.flags streams.
n_bls (int)
- Number of baselines. Note that a baseline is a correlation between two polarised inputs (a single entry in a Jones matrix).
bls_ordering (either a list of string pairs or a 2D array)
- An array of pairs of strings. Each pair names two antenna inputs that form a baseline. There will be n_bls rows. Note that this can be either a list of 2-element lists or a NumPy array.
sync_time, int_time, first_timestamp (float)
- Refer to Timestamps below.
excise (bool)
- True if RFI detected in the source stream is excised during time and channel averaging. If missing, assume it is true.
calibrations_applied (list of string)
- Names of sdp.cal streams whose corrections have been applied to the data.
need_weights_power_scale (bool)
- Refer to Weights below. If missing, assume it is false.
s3_endpoint_url (string), chunk_info
- Refer to Data below.
Calibration solutions¶
Streams of type sdp.cal have the following keys.
antlist (list of string, length n_ants)
- List of antenna names. Arrays of calibration solutions use this order along the antenna axis.
pol_ordering (list of string, length n_pols)
- List of polarisations (from v and h). Arrays of calibration solutions use this order along the polarisation axis.
bls_ordering (either a list of string pairs or a 2D array)
- Same meaning as for sdp.vis streams, but describes the internal ordering used within the calibration pipeline and is not of much use to users.
param_*
- Parameters used to configure the calibration.
refant (string)
- Name of the selected reference antenna (which will also appear in antlist). The reference antenna is only chosen when first needed in a capture block, so this key may be absent if there was no calibration yet. In older datasets this key contains the katpoint antenna description string instead of the name.
product_G (array of complex, shape (n_pols, n_ants); sensor)
- Gain solutions (derived e.g. on a phase calibrator), indexed by polarisation and antenna. The complex values in the array apply to the entire band.
product_K (array of float, shape (n_pols, n_ants); sensor)
- Delay solutions (in seconds), indexed by polarisation and antenna. To correct data at frequency \(\nu\), multiply it by \(e^{-2\pi i\cdot K\cdot \nu}\).
product_B_parts (int)
- Number of keys across which bandpass-like solutions are split.
product_BN (array of complex, shape (n_chans, n_pols, n_ants); sensor)
- Bandpass solutions, indexed by channel, polarisation and antenna.
  For implementation reasons, the bandpass solutions are split across multiple keys. N is in the range [0, product_B_parts), and these pieces should be concatenated along the channel (first) axis to reconstruct the full solution. If some pieces are missing (which is rare but can occur), they should be assumed to have the same shape as the present pieces.
product_KCROSS_DIODE (array of float, shape (n_pols, n_ants); sensor)
- Cross-hand delay solutions (in seconds), indexed by polarisation and antenna. Derived using noise diode firings.
  Data at a given frequency is corrected in the same manner as product_K. One polarisation will serve as the reference polarisation and have all zero solutions.
product_KCROSS (array of float, shape (n_pols, n_ants); sensor)
- Cross-hand delay solutions (in seconds), indexed by polarisation and antenna.
  Solutions are similar to product_KCROSS_DIODE but solved for using a celestial source instead of a noise diode.
product_BCROSS_DIODEN (array of complex, shape (n_chans, n_pols, n_ants); sensor)
- Cross-hand bandpass phase solutions, indexed by channel, polarisation and antenna.
  Amplitudes for these solutions should always be one. One polarisation will serve as the reference polarisation and have all zero phase solutions.
  As for product_BN, the cross-hand bandpass solutions are split across multiple keys indexed by N, where N is in the range [0, product_B_parts). The full solution should be reconstructed as for product_BN, by concatenating along the channel (first) axis.
shared_solve_N, last_dump_indexN
- These are used for internal communication between the calibration processes, and are not intended for external use.
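The reconstruction of a split bandpass solution can be sketched as follows. To stay self-contained the sketch uses nested lists rather than NumPy arrays (in practice one would more likely use numpy.concatenate along axis 0). The reconstruct_bandpass() and nan_like() helpers are hypothetical, and filling a missing piece with NaN is an assumption made here: the format only states that a missing piece has the same shape as the present ones.

```python
NAN = complex(float("nan"), float("nan"))


def nan_like(piece):
    """Build a NaN-filled piece with the same nested shape as `piece`."""
    return [[[NAN for _ in pol] for pol in chan] for chan in piece]


def reconstruct_bandpass(pieces):
    """Join bandpass pieces along the channel (first) axis.

    `pieces` holds one entry per N in [0, product_B_parts), each indexed as
    [channel][polarisation][antenna]. A missing piece is passed as None and
    is replaced by NaNs of the same shape as a present piece (assumption).
    """
    present = [p for p in pieces if p is not None]
    if not present:
        raise ValueError("no bandpass pieces are present")
    full = []
    for piece in pieces:
        full.extend(piece if piece is not None else nan_like(present[0]))
    return full
```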
Some common points to note about the solutions:
- Solutions describe the systematic errors. To correct data, it must be divided by the solutions.
- The key will only be present if at least one solution was computed.
- The timestamp associated with each sensor value is the timestamp of the middle of the data that was used to compute the solution.
- Solutions may contain NaN values, which indicates that there was insufficient information to compute a solution (for example, because all the data was flagged).
- Solutions are only valid as long as the system gain controls are not altered. Re-using gains from one capture block to correct data from another capture block may yield incorrect results unless one takes extra steps to correct for changes in the system gains.
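The relationship between “dividing by the solution” and the delay correction given for product_K can be illustrated with a short sketch. The helper names and the numbers are hypothetical. The sketch treats a single complex value and a single solution, and it does not cover how per-input solutions are combined for the two inputs of a baseline.

```python
import cmath


def delay_solution(delay, freq):
    """Express a delay solution (seconds) as a complex factor at `freq` (Hz)."""
    return cmath.exp(2j * cmath.pi * delay * freq)


def apply_solution(data, solution):
    """Correct `data` by dividing it by the solution.

    A NaN solution produces a NaN result, so missing solutions stay visible.
    """
    return data / solution


# Illustrative numbers only: a 0.25 ns delay observed at 1 GHz
delay, freq = 0.25e-9, 1.0e9
raw = 1j
corrected = apply_solution(raw, delay_solution(delay, freq))
# Equivalent form quoted for product_K: multiply by exp(-2*pi*i*K*nu)
same = raw * cmath.exp(-2j * cmath.pi * delay * freq)
```

Both forms give the same result up to rounding, which shows that the multiplication rule for delays is a special case of dividing by the solution.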
Image stream metadata¶
The following apply to sdp.continuum_image and sdp.spectral_image streams.
target_list (dict)
- This is only applicable for imaging streams. Each key is a katpoint target description and the value is the normalised target name, which is a string used to form target-specific sub-namespaces of the stream and capture-stream namespaces. A normalised target name looks similar to the target name but has a limited character set (suitable for forming filenames and telstate namespaces) and, where necessary, a sequence number appended to ensure uniqueness.
For each sdp.continuum_image stream, there is a sub-namespace per target (named with the normalised target name) with the following keys (keeping in mind that . is used to indicate whichever separator is in use by katsdptelstate for this database):
target0.clean_components (dict)
- Image of the target field as a set of point sources. The target0 sub-namespace is used to allow for possible alternative ways to run the continuum imager in which a single execution would image multiple fields, in which case there would be targetN sub-namespaces up to some N. This is not currently expected for MeerKAT science observations.
  The dictionary has two keys:
Each sub-namespace per target contains a further sub-sub-namespace called selfcal that contains the self-calibration solutions. It behaves like an sdp.cal stream namespace and has the following keys:
antlist (list of string, length n_ants)
- List of antenna names. Arrays of self-calibration solutions use this order along the antenna axis.
pol_ordering (list of string, length n_pols)
- List of polarisations (from v and h). Arrays of self-calibration solutions use this order along the polarisation axis.
n_chans (int)
- Number of channels in the self-calibration solutions, which corresponds to the number of “IFs” or sub-bands in the continuum imager.
bandwidth (float, Hz)
- Bandwidth of the self-calibration solutions.
center_freq (float, Hz)
- Middle of the central channel. Note that if the number of channels is even, this is actually half a channel higher than the middle of the band.
product_GPHASE (array of complex, shape (n_chans, n_pols, n_ants); sensor)
- Phase-only self-calibration solutions, indexed by channel, polarisation and antenna.
  Amplitudes for these solutions will be very close to one (to within numerical precision).
product_GAMP_PHASE (array of complex, shape (n_chans, n_pols, n_ants); sensor)
- Amplitude + phase self-calibration solutions, indexed by channel, polarisation and antenna.
Timestamps¶
Timestamps are not stored explicitly. Instead, the first timestamp and the interval between dumps are stored, from which timestamps can be synthesised. The ith dump has a central timestamp (in the UNIX epoch) of \(\text{sync_time} + \text{first_timestamp} + i \times \text{int_time}\). The split of the initial timestamp into two parts is for technical reasons.
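The formula above translates directly into code. In this sketch the dump_timestamps() helper and the n_dumps parameter are hypothetical (the number of dumps would have to come from elsewhere, for example the length of the time axis of the data), and the numbers are illustrative only.

```python
def dump_timestamps(sync_time, first_timestamp, int_time, n_dumps):
    """Synthesise the central UNIX timestamp of each dump."""
    return [sync_time + first_timestamp + i * int_time for i in range(n_dumps)]


# Illustrative values only
timestamps = dump_timestamps(
    sync_time=1500000000.0, first_timestamp=100.0, int_time=8.0, n_dumps=3
)
```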
There is also first_timestamp_adc, which is the same as first_timestamp but in units of the digitiser ADC counts. It is stored only for internal implementation reasons and should not be relied upon.
Light RDB files¶
The MeerKAT system also writes a “light” version of each RDB file, which contains only a subset of the keys. It is intended to contain enough information to read the uncalibrated visibilities and some high-level metadata about the observation itself. It does not contain information about antenna pointing, calibration, or CLEAN components.
Data¶
Visibilities, flags and weights are subdivided into small chunks. The
chunking model is based on dask. Visibilities are treated as a 3D
array, with axes for time, frequency and baseline. The data is divided
into pieces along each axis. Each piece is stored in a separate file
in the archive, in .npy format. The metadata necessary to reconstruct
the array is stored in the telescope state and documented in more detail
later. It is possible that some chunks will be missing, because they
were lost during the capture process. On load, katdal will replace such
chunks with default values and set the data_lost flag for them.
Weights and flags are similarly treated.
Chunks are named type/AAAAA_BBBBB_CCCCC.npy where type is one of correlator_data (visibilities), flags, weights; and AAAAA, BBBBB and CCCCC are the (zero-based) indices of the first element in the chunk along each axis, padded to a minimum of five digits. Additionally, there are chunks named weights_channel/AAAAA_BBBBB.npy, explained below.
Note that the chunking scheme typically differs between visibilities, flags and weights, so files with the same base name start at the same point but do not necessarily have the same extent.
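The naming rule can be expressed as a small helper. The chunk_name() function is hypothetical and only formats the name; it does not add the path prefix of the chunk store.

```python
def chunk_name(array_type, start_indices):
    """Format a chunk name from the array type and the zero-based index of
    the chunk's first element along each axis.

    Each index is zero-padded to at least five digits; larger indices simply
    use more digits.
    """
    indices = "_".join(f"{index:05d}" for index in start_indices)
    return f"{array_type}/{indices}.npy"
```

For example, chunk_name("correlator_data", (0, 1024, 0)) gives correlator_data/00000_01024_00000.npy, and an index of 120000 is written out in full as 120000.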
All the data for one stream is located in a single chunk store. If it is in the MeerKAT archive, the URL to the base of this chunk store (implementing the S3 protocol) is stored in s3_endpoint_url.
Capture-stream specific information is stored in chunk_info, a two-level dictionary. The outer key is the type listed above, and the inner key is one of:
prefix (string)
- A path prefix for the data. In the case of S3, this is the bucket name. For local storage, it is a directory name (the parent of the type directory).
dtype (string)
- Numpy dtype of the data, which is expected to match the dtype encoded in the individual chunk files.
shape (tuple)
- Shape of the virtual dask array obtained by joining together all the chunks.
chunks (tuple of tuples)
- Sizes of the chunks along each axis, in the format used by dask.
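Because chunk names encode the index of the first element along each axis, the full set of expected chunks can be derived from the chunks entry alone. The following sketch shows one way to do this; the chunk_starts() helper and the example chunks value are hypothetical.

```python
from itertools import accumulate, product


def chunk_starts(chunks):
    """Return the start indices of every chunk, given dask-style chunk sizes.

    `chunks` is a tuple with one tuple of chunk sizes per axis. The start of
    each chunk along an axis is the running sum of the sizes before it.
    """
    per_axis = [[0] + list(accumulate(sizes))[:-1] for sizes in chunks]
    return list(product(*per_axis))


# Illustrative chunking only: 5 dumps, 2048 channels and 40 baselines
chunks = ((2, 2, 1), (1024, 1024), (40,))
starts = chunk_starts(chunks)
# The shape of the virtual array is the sum of the chunk sizes on each axis
shape = tuple(sum(sizes) for sizes in chunks)
```

Each tuple in starts corresponds to one chunk file, so the list can be compared against the files actually present to detect missing chunks.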
Weights¶
To save space, the weights are represented in an indirect form that requires some calculation to reconstruct. The actual weight for a visibility is the product of three values:
- The value in the weights chunk.
- A baseline-independent value in the weights_channel chunk.
- If the stream has a need_weights_power_scale key in telstate and the value is true, the inverse of the product of the autocorrelation power for the two inputs in the baseline.
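For a single visibility the three factors listed above combine as in the following sketch. The actual_weight() helper and its parameter names are hypothetical, and scalar values are used for clarity; a real reader would apply the same arithmetic to whole arrays.

```python
def actual_weight(stored_weight, channel_weight, auto_power_a, auto_power_b,
                  need_power_scale=False):
    """Reconstruct the weight of one visibility.

    stored_weight     value from the weights chunk
    channel_weight    baseline-independent value from the weights_channel chunk
    auto_power_a/b    autocorrelation power of the two inputs in the baseline
    need_power_scale  value of need_weights_power_scale (false if missing)
    """
    weight = stored_weight * channel_weight
    if need_power_scale:
        # Multiply by the inverse of the product of the autocorrelation powers
        weight /= auto_power_a * auto_power_b
    return weight
```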
Flags¶
Each flag is a bitfield. The meaning of the individual bits is documented in the katdal.flags module. Note that it is possible that a flag chunk is present but the corresponding visibility or weight data is missing, in which case it is the reader’s responsibility to set the data_lost bit.
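Setting the data_lost bit for missing data amounts to a bitwise OR, as sketched below. The mark_lost() helper is hypothetical, and the DATA_LOST value used here is an assumption for illustration; the authoritative bit assignments are those in the katdal.flags module.

```python
# Assumed bit value for illustration; consult katdal.flags for the real one.
DATA_LOST = 1 << 3


def mark_lost(flags, data_present):
    """Set the data_lost bit wherever the matching data is missing.

    `flags` is a sequence of integer bitfields and `data_present` a sequence
    of booleans of the same length. Existing flag bits are preserved.
    """
    return [
        flag if present else flag | DATA_LOST
        for flag, present in zip(flags, data_present)
    ]
```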
The MeerKAT Science Data Processor typically uses two levels of
flagging: a conservative first-pass flagger run directly on the
correlator output, and a more accurate flagger that operates on
data that has been averaged and (in some cases) calibrated. The latter
appears in a stream of type sdp.flags, which contains only flags. It
can be linked to the corresponding visibilities and weights by checking
its source streams. The flags in this stream are a
superset of the flags in the originating stream and are guaranteed to
have the same timestamp and frequency metadata, so can be used in place
of the original flags. However, due to data loss it is possible that
the replacement flags will have slightly more or fewer dumps at the end,
which will need to be handled.