Data Model

Analysing the results from simple climate models involves a lot of timeseries handling, including:

  • filtering

  • plotting

  • resampling

  • serialization/deserialisation

  • computation

As a result, scmdata’s approach to data handling focusses on efficient handling of timeseries.

The ScmRun class

The scmdata.ScmRun class represents a collection of timeseries data including metadata and provides methods for manipulating the data. Internally, ScmRun stores the timeseries data in a single pandas.DataFrame and the timeseries metadata pandas.MultiIndex of type pandas.Categorical, for efficient indexing.

This class is the primary way of handling timeseries data within the scmdata package. For example, the ScmRun can be filtered to only find the subset of data which have a "scenario" metadata label equal to "green" (see ScmRun.filter for full details). Other operations include grouping, setting and (basic) plotting.

The complete set of manipulation features can be found in the documentation pages of ScmRun.

ScmRun has three key properties and one key method, which allow the user to quickly access their data in more standard formats:

  • values returns all of the timeseries as a single numpy.ndarray without any metadata or indication of the time axis.

  • meta returns all of the timeseries’ metadata as a single pandas.DataFrame. This allows users to quickly have an overview of the timeseries held by scmdata.ScmRun without having to also view the data itself.

  • metadata < stores run-specific metadata, i.e. metadata which isn’t tied to any timeseries specifically.

  • timeseries() combines values and meta to form a pandas.DataFrame whose index is equal to scmdata.ScmRun.meta and whose values are equal to scmdata.ScmRun.values. The columns of the output of timeseries() are the time axis of the data.

Metadata handling

scmdata can store any kind of metadata about the timeseries, without restriction. This combination allows it to be a high performing, yet flexible library for timeseries data.

However, to do this it must make assumptions about the type of data it holds and these assumptions come with tradeoffs. In particular, scmdata cannot hold metadata at a level finer than a complete timeseries. For example, it couldn’t handle a case where one point in a timeseries needed to be labelled with an ‘erroneous’ label. In such a case the entire timeseries would have to be labelled ‘erroneous’ (or a new timeseries made with just that data point, which may not be very performant). If behaviour of this type is required, we suggest trying another data handling approach.

The ScmDatabase class

When handling large datasets which may not fit into memory, it is important to be able to query subsets of the dataset without having to iterate over the entire dataset. scmdata.database.ScmDatabase helps with this issue by disaggregating a dataset into subsets according to unique combinations of metadata. The metadata of interest is specified by the user so that the database can be adapted to any use-case or access pattern.

One of the major benefits of scmdata.database.ScmDatabase is that the taxonomy of metadata does not need to be known at database creation making it easy to add new data to the database. Each unique subset of the database is stored as a single netCDF file. This ensures that if timeseries with new metadata are saved to the database, the existing files in the database do not need to be modified. Instead new files are written expanding the directory structure to accommodate the new metadata values.

Filtering using the metadata columns of interest is also very simple as the contents of a given file can be determined from the directory structure without having to load the file. Each file can then be loaded as the data is needed, minimising the need for reading data which will then immediately be filtered away of extra data that is needed to be unnecessarily read and then filtered away.