Reading MDS Data

open_mdsdataset

All loading of data in xmitgcm occurs through the function open_mdsdataset. Its full documentation below enumerates all the possible options.

xmitgcm.open_mdsdataset(data_dir, grid_dir=None, iters='all', prefix=None, read_grid=True, delta_t=1, ref_date=None, calendar='gregorian', levels=None, geometry='sphericalpolar', grid_vars_to_coords=True, swap_dims=None, endian='>', chunks=None, ignore_unknown_vars=False, default_dtype=None, nx=None, ny=None, nz=None, llc_method='smallchunks', extra_metadata=None, extra_variables=None)[source]

Open MITgcm-style mds (.data / .meta) file output as an xarray Dataset.

Parameters

data_dir : string

Path to the directory where the mds .data and .meta files are stored

grid_dir : string, optional

Path to the directory where the grid files are stored, if different from data_dir.

iters : list, optional

The iteration numbers of the files to be read. If None, no data files will be read. If 'all' (default), all iterations will be read.

prefix : list, optional

List of different filename prefixes to read. Default (None) is to read all available files.

read_grid : bool, optional

Whether to read the grid data

delta_t : number, optional

The timestep used in the model. (Can’t be inferred.)

ref_date : string, optional

An ISO date string corresponding to the zero timestep, e.g. "1990-1-1 0:0:0" (See CF conventions [R1])

calendar : string, optional

A calendar allowed by CF conventions [R1]

levels : list or slice, optional

A list or slice of the indexes of the grid levels to read. Uses the same syntax as the data.diagnostics file.

geometry : {‘sphericalpolar’, ‘cartesian’, ‘llc’, ‘curvilinear’, ‘cs’}

MITgcm grid geometry specifier

grid_vars_to_coords : boolean, optional

Whether to promote grid variables to coordinate status

swap_dims : boolean, optional

Whether to swap the logical dimensions for physical ones. If None, will be set to False for geometry==llc and True otherwise.

endian : {‘=’, ‘>’, ‘<’}, optional

Endianness of variables. Default for MITgcm is “>” (big endian)

chunks : int or dict, optional

If chunks is provided, it is used to load the new dataset into dask arrays.

ignore_unknown_vars : boolean, optional

Don’t raise an error if unknown variables are encountered while reading the dataset.

default_dtype : numpy.dtype, optional

A datatype to fall back on if the metadata can’t be read.

nx, ny, nz : int, optional

The numerical dimensions of the model. These will be inferred from XC.meta and RC.meta if they are not specified. If geometry==llc, ny does not have to be specified.

llc_method : {“smallchunks”, “bigchunks”}, optional

Which routine to use for reading LLC data. "smallchunks" splits the file into an individual dask chunk of size (nx x nx) for each face of each level (i.e. the total number of chunks is 13 * nz). "bigchunks" loads the whole raw data file (either into memory or as a numpy.memmap), splits it into faces, and concatenates those faces together using dask.array.concatenate. The different methods will have different memory and I/O performance depending on the details of the system configuration.

extra_metadata : dict, optional

Allows passing information on an llc-type grid (global or regional). The additional metadata typically looks like:

aste = {'has_faces': True, 'ny': 1350, 'nx': 270,
        'ny_facets': [450, 0, 270, 180, 450],
        'pad_before_y': [90, 0, 0, 0, 0],
        'pad_after_y': [0, 0, 0, 90, 90],
        'face_facets': [0, 0, 2, 3, 4, 4],
        'facet_orders': ['C', 'C', 'C', 'F', 'F'],
        'face_offsets': [0, 1, 0, 0, 0, 1],
        'transpose_face': [False, False, False, True, True, True]}

For global llc grids, no extra metadata is required and the code defaults to the global llc configuration.

extra_variables : dict, optional

Allows passing variables not listed in variables.py or in available_diagnostics.log. extra_variables must be a dict containing the variable names as keys, with the corresponding values being a dict with the keys dims and attrs.

Syntax:

extra_variables = dict(varname=dict(dims=list_of_dims, attrs=dict(optional_attrs)))

where optional_attrs can contain standard_name, long_name, and units as keys.

Example:

extra_variables = dict(
    ADJtheta=dict(
        dims=['k', 'j', 'i'],
        attrs=dict(
            standard_name='Sensitivity_to_theta',
            long_name='Sensitivity of cost function to theta',
            units='[J]/degC')))

Returns

dset : xarray.Dataset

Dataset object containing all coordinates and variables.

References

R1: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch04s04.html

The optional arguments are explained in more detail in the following sections and examples.

Lazy Evaluation and Dask Chunking

open_mdsdataset does not actually load all of the data into memory when it is invoked. Rather, it opens the files using numpy.memmap, which means that the data is not read until it is actually needed for a computation. This is a cheap form of lazy evaluation.
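The memmap mechanism can be illustrated on its own with a throwaway file. The sketch below (with a purely illustrative filename; this is not xmitgcm's internal code) writes big-endian float32 values, the default on-disk format for MITgcm output, and maps them without reading eagerly:

```python
import os
import tempfile

import numpy as np

# Illustrative only: create a fake .data file of big-endian float32
# values, the default on-disk format for MITgcm MDS output.
tmpdir = tempfile.mkdtemp()
fname = os.path.join(tmpdir, "T.0000000090.data")
np.arange(12, dtype=">f4").tofile(fname)

# np.memmap maps the file into memory without reading it eagerly;
# bytes are only fetched from disk when elements are accessed.
lazy = np.memmap(fname, dtype=">f4", mode="r", shape=(3, 4))
print(lazy[1, 2])  # touching an element triggers the actual read
```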

Additional performance benefits can be achieved with xarray dask chunking:

ds_chunked = open_mdsdataset(data_dir, chunks={'Z':1, 'Zl':1})

In the example above, each horizontal slice of the model is assigned to its own chunk; dask will automatically try to parallelize operations across these chunks using all your CPU cores. For this small example dataset, no performance boost is expected (due to the overhead of parallelization), but very large simulations will certainly benefit from chunking.

When chunking is applied, the data are represented by dask arrays, and all operations are evaluated lazily. No computation actually takes place until you call .load():

# take the mean of the squared zonal velocity
(ds_chunked.U**2).mean()
>>> <xarray.DataArray 'U' ()>
dask.array<mean_ag..., shape=(), dtype=float32, chunksize=()>
# load the value and execute the dask computational graph
(ds_chunked.U**2).mean().load()
>>> <xarray.DataArray 'U' ()>
array(0.00020325234800111502, dtype=float32)

Note

In certain cases, chunking is applied automatically. These cases are

  • If there is more than one timestep to be read (see Expected Files) the time dimension is automatically chunked.

  • In llc simulations, (see Grid Geometries) the face dimension is automatically chunked.

Expected Files

MITgcm writes MDS output as pairs of files with the suffixes *.data and *.meta. The model timestep iteration number is represented by a ten-digit number at the end of the filename, e.g. T.0000000090.data and T.0000000090.meta. MDS files without an iteration number are grid files.
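A filename of this form can be split into its prefix and iteration number with a small amount of parsing. The helper below is an illustrative sketch of that convention, not xmitgcm's actual implementation:

```python
import re

# Timestep files look like "<prefix>.<10-digit iteration>.data";
# grid files (e.g. XC.data) have no iteration number and match nothing.
MDS_PATTERN = re.compile(r"^(?P<prefix>.+)\.(?P<iter>\d{10})\.data$")

def parse_mds_filename(fname):
    """Return (prefix, iteration) for timestep files, or None for grid files."""
    m = MDS_PATTERN.match(fname)
    if m is None:
        return None
    return m.group("prefix"), int(m.group("iter"))

print(parse_mds_filename("T.0000000090.data"))  # ('T', 90)
print(parse_mds_filename("XC.data"))            # None
```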

xmitgcm has certain expectations regarding the files that will be present in datadir. In particular, it assumes datadir is an MITgcm “run” directory. By default, open_mdsdataset will read the grid files which describe the geometry of the computational domain. If these files are not present in datadir, this behavior can be turned off by setting read_grid=False.

In order to determine the dimensions of the model domain, open_mdsdataset needs to peek at the metadata in two grid files, XC.meta and RC.meta (even when read_grid=False). If these files are not available, you have the option to manually specify the parameters nx, ny, and nz as keyword arguments to open_mdsdataset. (ny is not required for geometry='llc'.)
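Since .meta files are plain text, the domain size they record can be recovered with a short parser. The following is a hedged sketch, not xmitgcm's parser: the sample dimList is hypothetical (a 90 x 40 grid), and real .meta files contain additional fields.

```python
import re

# Illustrative .meta fragment; in MITgcm .meta files each dimension is
# described by a triplet (global size, first index, last index).
SAMPLE_META = """
 nDims = [   2 ];
 dimList = [
    90,    1,   90,
    40,    1,   40
 ];
 dataprec = [ 'float32' ];
"""

def read_dim_sizes(meta_text):
    """Extract the global size of each dimension from a .meta dimList."""
    block = re.search(r"dimList\s*=\s*\[(.*?)\]", meta_text, re.DOTALL).group(1)
    numbers = [int(n) for n in re.findall(r"\d+", block)]
    # the first entry of each triplet is the global dimension size
    return [numbers[i] for i in range(0, len(numbers), 3)]

print(read_dim_sizes(SAMPLE_META))  # [90, 40]
```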

By default, open_mdsdataset attempts to read all the data files in datadir. The files and iteration numbers to read are determined in the following way:

  1. First datadir is scanned to determine all iteration numbers present in the directory. To override this behavior and manually specify the iteration numbers to read, use the iters keyword argument, e.g. iters=[10,20,30].

  2. To determine the file prefixes to read, open_mdsdataset looks for all *.data filenames which match the first iteration number. To override this behavior and manually specify the file prefixes, use the prefix keyword argument, e.g. prefix=['UTave', 'VTave'].

  3. open_mdsdataset then looks for each file prefix at each iteration number.

This approach works for the test examples, but perhaps it does not suit your model configuration. Suggestions are welcome on how to improve the discovery of files in the form of issues and pull requests.
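The three discovery steps can be sketched as follows, run against a temporary directory of empty placeholder files (the prefixes T and S are assumed purely for illustration):

```python
import os
import re
import tempfile

# Fake run directory with two prefixes at two iterations plus a grid file.
run_dir = tempfile.mkdtemp()
for name in ["T.0000000000.data", "T.0000000010.data",
             "S.0000000000.data", "S.0000000010.data", "XC.data"]:
    open(os.path.join(run_dir, name), "w").close()

pattern = re.compile(r"^(.+)\.(\d{10})\.data$")
matches = [m for m in (pattern.match(f) for f in os.listdir(run_dir)) if m]

# Step 1: all iteration numbers present in the directory
iters = sorted({int(m.group(2)) for m in matches})
# Step 2: prefixes available at the first iteration
prefixes = sorted({m.group(1) for m in matches if int(m.group(2)) == iters[0]})
# Step 3: every (prefix, iteration) pair is then expected to exist
expected = [(p, i) for i in iters for p in prefixes]

print(iters)     # [0, 10]
print(prefixes)  # ['S', 'T']
```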

Warning

If a file prefix is present at the first iteration but missing at a later one (e.g. T.0000000000.data exists but, with iters=[0,10], there is no T.0000000010.data file), open_mdsdataset will raise an error because it can’t find the expected files. To overcome this, you need to manually specify iters and / or prefix.

To determine the variable metadata, xmitgcm is able to parse the model’s available_diagnostics.log file. If you use diagnostic output, the available_diagnostics.log file corresponding with your model run should be present in datadir.

Note

If the available_diagnostics.log file can’t be found, a default version will be used. This could lead to problems, since you may have custom diagnostics enabled in your run that are not present in the default. The default available_diagnostics.log file was taken from the ECCOv4 global_oce_llc90 experiment.

For non-diagnostic output (e.g. default “state” or “timeave” output), xmitgcm assigns the variable metadata based on filenames. The additional metadata makes the internal representation of the model variables more verbose and ensures compliance with CF Conventions.

Dimensions and Coordinates

One major advantage of using xarray to represent data is that the variable dimensions are labeled, much like netCDF data structures. This labeling enables much clearer code. For example, to take a time average of a Dataset, one just says ds.mean(dim='time') without having to remember which logical axis is the time dimension.

xmitgcm distinguishes between logical dimensions and physical dimensions or coordinates. open_mdsdataset will attempt to assign physical dimensions to the data. The physical dimensions correspond to the axes of the MITgcm grids in cartesian or sphericalpolar coordinates. The standard names have been assigned according to CF Conventions.

name                   standard_name
----                   -------------
time                   time
XC                     longitude
YC                     latitude
XG                     longitude_at_f_location
YG                     latitude_at_f_location
Zl                     depth_at_lower_w_location
Zu                     depth_at_upper_w_location
Z                      depth
Zp1                    depth_at_w_location
layer_1RHO_center      ocean_layer_coordinate_1RHO_center
layer_1RHO_interface   ocean_layer_coordinate_1RHO_interface
layer_1RHO_bounds      ocean_layer_coordinate_1RHO_bounds

The physical dimensions of typical variables are:

print(ds.THETA.dims)
>>> ('time', 'Z', 'YC', 'XC')
print(ds.UVEL.dims)
>>> ('time', 'Z', 'YC', 'XG')
print(ds.VVEL.dims)
>>> ('time', 'Z', 'YG', 'XC')
print(ds.WVEL.dims)
>>> ('time', 'Zl', 'YC', 'XG')

In order for physical dimensions to be assigned, open_mdsdataset must be invoked with read_grid=True (the default). For a more minimalistic approach, one can use read_grid=False and work with logical dimensions. Logical dimensions can also be chosen by explicitly setting swap_dims=False, even with read_grid=True. Physical dimensions only work with geometry=='cartesian' or geometry=='sphericalpolar'. For geometry=='llc' or geometry=='curvilinear', it is not possible to replace the logical dimensions with physical dimensions, and setting swap_dims=True will raise an error.

Logical dimensions follow the naming conventions of the MITgcm numerical grid. The dimension names are augmented with metadata attributes from the Comodo conventions. These logical spatial dimensions are

name   standard_name                      axis   c_grid_axis_shift
----   -------------                      ----   -----------------
i      x_grid_index                       X
i_g    x_grid_index_at_u_location         X      -0.5
j      y_grid_index                       Y
j_g    y_grid_index_at_v_location         Y      -0.5
k      z_grid_index                       Z
k_l    z_grid_index_at_lower_w_location   Z      -0.5
k_u    z_grid_index_at_upper_w_location   Z      0.5
k_p1   z_grid_index_at_w_location         Z      (-0.5, 0.5)

As explained in the Comodo documentation, the use of different dimensions is necessary to represent the fact that, in c-grid ocean models, different variables are staggered in different ways with respect to the model cells. For example, tracers and velocity components all have different logical dimensions:

print(ds.THETA.dims)
>>> ('time', 'k', 'j', 'i')
print(ds.UVEL.dims)
>>> ('time', 'k', 'j', 'i_g')
print(ds.VVEL.dims)
>>> ('time', 'k', 'j_g', 'i')
print(ds.WVEL.dims)
>>> ('time', 'k_l', 'j', 'i')

xarray distinguishes between “coordinates” and “data_vars”. By default, open_mdsdataset will promote all grid variables to coordinates. To turn off this behavior and treat grid variables as data_vars, use grid_vars_to_coords=False.

Warning

For vertical coordinates (Zl, k_l, etc.) the notion of “lower” and “upper” always refers to the logical grid position, NOT the physical grid position. This can be a major point of confusion. In array index space, the point k_l is indeed lower than k_u; however, for typical MITgcm ocean model configurations, in physical space the point k_l is physically above k_u. This is because the Z axis starts from the ocean surface and increases downward. In contrast, for atmospheric configurations, which start from the ground and increase upward, the logical and physical positions are self-consistent.

Time

open_mdsdataset attempts to determine the time dimension based on iters. However, additional input is required from the user to fully exploit this capability. If the user specifies delta_t, the numerical timestep used for the MITgcm simulation, it is multiplied by iters to determine the time in seconds. Additionally, if the user specifies ref_date (an ISO date string, e.g. “1990-1-1 0:0:0”), the time dimension will be converted into a datetime index, exposing all sorts of useful timeseries functionality within xarray.
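The conversion can be sketched with plain Python datetimes; the delta_t value below is an assumed one-hour timestep, not a default:

```python
from datetime import datetime, timedelta

# Sketch of how iteration numbers become timestamps: multiply by
# delta_t to get seconds, then offset from ref_date if one is given.
delta_t = 3600                    # model timestep in seconds (assumed)
ref_date = datetime(1990, 1, 1)   # corresponds to ref_date="1990-1-1 0:0:0"
iters = [0, 24, 48]

seconds = [i * delta_t for i in iters]
times = [ref_date + timedelta(seconds=s) for s in seconds]
print(times[1])  # 1990-01-02 00:00:00
```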

Grid Geometries

The grid geometry is not inferred; it must be specified via the geometry keyword. xmitgcm currently supports four MITgcm grid geometries: cartesian, sphericalpolar, curvilinear, and llc. The first two are straightforward. The curvilinear geometry is used for curvilinear cartesian grids. The llc (“lat-lon-cap”) geometry is more complicated. This grid consists of four distinct faces of the same size plus a smaller north-polar cap. Each face has a distinct relationship between its logical dimensions and its physical coordinates. Because netCDF and xarray.Dataset data structures do not support this sort of complex geometry (multiple faces of different sizes), our approach, inspired by nc-tiles, is to split the domain into 13 equal-sized “faces”. face then becomes an additional dimension of the data.

To download an example llc dataset, run the following shell commands:

$ curl -L -J -O https://ndownloader.figshare.com/files/6494721
$ tar -xvzf global_oce_llc90.tar.gz

And to read it, in python:

ds_llc = open_mdsdataset('./global_oce_llc90/', iters=8, geometry='llc')
print(ds_llc['S'].dims)
>>> ('time', 'k', 'face', 'j', 'i')

xmitgcm is not nearly as comprehensive as gcmfaces. It does not offer sophisticated operations involving exchanges at face boundaries, integrals across sections, etc. The goal of this package is simply to read the mds data. However, by outputting an xarray data structure, we can use all of xarray’s usual capabilities on the llc data, for example:

# calculate the mean squared salinity as a function of depth
(ds_llc.S**2).mean(dim=['face', 'j', 'i'])
>>> <xarray.DataArray 'S' (time: 1, k: 50)>
dask.array<mean_ag..., shape=(1, 50), dtype=float32, chunksize=(1, 50)>
Coordinates:
  * k        (k) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
    iter     (time) int64 8
  * time     (time) int64 8
    Z        (k) >f4 -5.0 -15.0 -25.0 -35.0 -45.0 -55.0 -65.0 -75.005 ...
    PHrefC   (k) >f4 49.05 147.15 245.25 343.35 441.45 539.55 637.65 735.799 ...
    drF      (k) >f4 10.0 10.0 10.0 10.0 10.0 10.0 10.0 10.01 10.03 10.11 ...

(Note that this simple example does not perform correct volume weighting or land masking in the average.)

open_mdsdataset offers two different strategies for reading LLC data, llc_method='smallchunks' (the default) and llc_method='bigchunks'. The details and tradeoffs of these strategies are described in Performance Issues.