scmdata.database

Database for handling large datasets in a performant but flexible way

Data is chunked using unique combinations of metadata. This allows the database to expand as new data is added without changing any of the existing data.

Subsets of the data can also be read without loading all the data and then filtering. For example, one could save model results from a number of different climate models and then load just the Surface Temperature data for all models.
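The chunking idea can be illustrated without scmdata at all. The sketch below is a simplified stand-in, not the library's implementation: it groups plain dictionaries by the unique combination of a few metadata columns, so that a subset can be selected by key without touching the other chunks. The names `chunk_by_levels` and `LEVELS`, and the sample rows, are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata columns used to split the data into chunks
LEVELS = ("climate_model", "variable")


def chunk_by_levels(rows, levels=LEVELS):
    """Group rows by the unique combination of their metadata values."""
    chunks = defaultdict(list)
    for row in rows:
        key = tuple(row[level] for level in levels)
        chunks[key].append(row)
    return dict(chunks)


# Made-up sample data, for illustration only
rows = [
    {"climate_model": "model_a", "variable": "Surface Temperature", "value": 1.1},
    {"climate_model": "model_a", "variable": "Emissions|CO2", "value": 10.0},
    {"climate_model": "model_b", "variable": "Surface Temperature", "value": 1.3},
]

chunks = chunk_by_levels(rows)

# Select only the Surface Temperature chunks, for all models.
# Only the matching keys are visited; other chunks are never read.
temperature_keys = [key for key in chunks if key[1] == "Surface Temperature"]
temperature_rows = [row for key in temperature_keys for row in chunks[key]]
```

Adding a row for a new climate model would create a new chunk and leave the existing ones unchanged, which is the property the description above relies on.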

class scmdata.database.DatabaseBackend(**kwargs)[source]

Bases: abc.ABC

Abstract backend for serialising/deserialising data

Data is stored as objects represented by keys. These keys can be used later to load data.

delete(key)[source]

Delete a given key

Parameters

key (str) – Key to delete

abstract get(filters)[source]

Get all matching keys for a given filter

Parameters

filters (dict of str) – String filters. If a level is missing then all values are fetched.

Returns

Each item is a key which may contain data which is of interest

Return type

list of str

abstract load(key)[source]

Load data at a given key

Parameters

key (str) – Key to load

Returns

Return type

scmdata.ScmRun

abstract save(sr)[source]

Save data

Parameters

sr (scmdata.ScmRun) – Data to save

Returns

Key where the data is stored

Return type

str
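To make the key-based contract concrete, here is a minimal in-memory sketch that mirrors the four documented method names. It does not import or subclass scmdata.database.DatabaseBackend; `ToyBackend` and `InMemoryBackend` are hypothetical, and a plain dict stands in for scmdata.ScmRun. The key format (level values joined with "/") is also an assumption made for the example.

```python
from abc import ABC, abstractmethod


class ToyBackend(ABC):
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    @abstractmethod
    def save(self, sr):
        """Store ``sr`` and return the key it was stored under."""

    @abstractmethod
    def load(self, key):
        """Return the object stored at ``key``."""

    @abstractmethod
    def get(self, filters):
        """Return all keys matching ``filters``."""

    def delete(self, key):
        """Remove ``key`` (no-op by default)."""


class InMemoryBackend(ToyBackend):
    """Stores objects in a dict; keys are level values joined by "/"."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._store = {}

    def save(self, sr):
        # Assumes level values do not themselves contain "/"
        key = "/".join(sr[level] for level in self.kwargs["levels"])
        self._store[key] = sr
        return key

    def load(self, key):
        return self._store[key]

    def get(self, filters):
        levels = self.kwargs["levels"]
        matching = []
        for key in self._store:
            parts = dict(zip(levels, key.split("/")))
            # A level that is absent from ``filters`` matches every value
            if all(parts[level] == value for level, value in filters.items()):
                matching.append(key)
        return matching

    def delete(self, key):
        self._store.pop(key, None)


backend = InMemoryBackend(levels=("climate_model", "variable"))
key = backend.save(
    {"climate_model": "model_a", "variable": "Surface Temperature", "values": [1.0, 1.1]}
)
backend.save(
    {"climate_model": "model_b", "variable": "Surface Temperature", "values": [1.2, 1.4]}
)
matching = backend.get({"climate_model": "model_a"})
```

The point of the design is that callers only ever hold keys: `get` narrows down which keys are of interest, and `load` is called only for those.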

class scmdata.database.NetCDFBackend(**kwargs)[source]

Bases: scmdata.database.DatabaseBackend

On-disk database handler for outputs from SCMs

Data is split into groups as specified by levels. This allows for fast reading and writing of new subsets of data when a single output file is no longer performant or data cannot all fit in memory.

delete(key)[source]

Delete a key

Parameters

key (str) – Key to delete

get(filters)[source]

Get all matching objects for a given filter

Parameters

filters (dict of str) – String filters. If a level is missing then all values are fetched.

Returns

Return type

list of str

get_key(sr)[source]

Get key where the data will be stored

The key is the root directory joined with the other information provided. The filepath is also cleaned to remove spaces and special characters.

Parameters

sr (scmdata.ScmRun) – Data to save

Raises
  • ValueError – If non-unique metadata is found for any of self.kwargs["levels"]

  • KeyError – If metadata is missing for any of self.kwargs["levels"]

Returns

Path in which to save the data without spaces or special characters

Return type

str
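The behaviour described for get_key can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `sketch_get_key` is a hypothetical function, the metadata is passed in as a plain dict of lists rather than read from an scmdata.ScmRun, and the cleaning rule (replace anything other than letters, digits, "-" and "_" with "-") is a guess at what "remove spaces and special characters" might mean. File naming and extension are left out.

```python
import os
import re


def sketch_get_key(root_dir, levels, metadata):
    """Build a storage path from exactly one metadata value per level.

    ``metadata`` maps each column name to the list of values in the data.
    """
    parts = []
    for level in levels:
        if level not in metadata:
            raise KeyError("Level {} not found in metadata".format(level))
        values = set(metadata[level])
        if len(values) != 1:
            raise ValueError(
                "Non-unique metadata for level {}: {}".format(level, sorted(values))
            )
        parts.append(str(values.pop()))

    # Assumed cleaning rule: keep letters, digits, "-" and "_" only
    cleaned = [re.sub(r"[^a-zA-Z0-9_-]", "-", part) for part in parts]
    return os.path.join(root_dir, *cleaned)


example_key = sketch_get_key(
    "db",
    ("climate_model", "variable"),
    {
        "climate_model": ["model_a"],
        "variable": ["Surface Temperature", "Surface Temperature"],
    },
)
```

Requiring a single value per level is what lets each chunk map to exactly one location on disk.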

load(key)[source]

Load data at a given key

Parameters

key (str) – Key to load

Returns

Return type

scmdata.ScmRun

save(sr)[source]

Save a ScmRun to the database

The dataset should not contain any duplicate metadata for the database levels

Parameters

sr (scmdata.ScmRun) – Data to save

Raises
  • ValueError – If duplicate metadata are present for the requested database levels

  • KeyError – If metadata for the requested database levels are not found

Returns

Key where the data is saved

Return type

str

class scmdata.database.ScmDatabase(root_dir, levels=('climate_model', 'variable', 'region', 'scenario'), backend='netcdf', backend_config=None)[source]

Bases: object

On-disk database handler for outputs from SCMs

Data is split into groups as specified by levels. This allows for fast reading and writing of new subsets of data when a single output file is no longer performant or data cannot all fit in memory.

available_data()[source]

Get all the data which is available to be loaded

If metadata includes non-alphanumeric characters then it might appear modified in the returned table. The original metadata values can still be used to filter data.

Returns

Return type

pd.DataFrame

delete(**filters)[source]

Delete data from the database

Parameters

filters (dict of str) –

Filters for the data to delete.

Defaults to deleting all data if nothing is specified.

Raises

ValueError – If a filter for a level not in levels is specified

load(disable_tqdm=False, **filters)[source]

Load data from the database

Parameters
  • disable_tqdm (bool) – If True, do not show the progress bar

  • filters (dict of str : [str, list[str]]) –

    Filters for the data to load.

    Defaults to loading all values for a level if it isn’t specified.

    If a filter is a list then OR logic is applied within the level. For example, if we have scenario=["ssp119", "ssp126"] then both the ssp119 and ssp126 scenarios will be loaded.

Returns

Loaded data

Return type

scmdata.ScmRun

Raises

ValueError – If a filter for a level not in levels is specified, or if no data matching filters is found
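The filter semantics described for load (AND across levels, OR within a level when a list is given, and a ValueError for unknown levels or empty results) can be sketched with plain dictionaries. The functions `check_filters`, `matches` and `select` are hypothetical helpers written for this example; they are not part of scmdata.

```python
def check_filters(filters, levels):
    """Reject filters that refer to a level the database does not use."""
    unknown = [level for level in filters if level not in levels]
    if unknown:
        raise ValueError("Unknown level(s): {}".format(unknown))


def matches(metadata, filters):
    """AND across levels; OR within a level when a list is given."""
    for level, wanted in filters.items():
        allowed = [wanted] if isinstance(wanted, str) else list(wanted)
        if metadata[level] not in allowed:
            return False
    return True


def select(available, levels, **filters):
    """Return the metadata entries that pass every filter."""
    check_filters(filters, levels)
    selected = [entry for entry in available if matches(entry, filters)]
    if not selected:
        raise ValueError("No data matching filters found")
    return selected


# Made-up metadata for the chunks in an imaginary database
levels = ("climate_model", "scenario")
available = [
    {"climate_model": "model_a", "scenario": "ssp119"},
    {"climate_model": "model_a", "scenario": "ssp126"},
    {"climate_model": "model_a", "scenario": "ssp245"},
    {"climate_model": "model_b", "scenario": "ssp119"},
]

picked = select(
    available, levels, climate_model="model_a", scenario=["ssp119", "ssp126"]
)
everything = select(available, levels)
```

Here `picked` holds the ssp119 and ssp126 entries for model_a only, while `everything` holds all four entries because no level was filtered.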

property root_dir

Root directory of the database.

Returns

Return type

str

save(scmrun, disable_tqdm=False)[source]

Save data to the database

The results are saved with one file for each unique combination of levels in a directory structure underneath root_dir.

Use available_data() to see what data is available. Subsets of data can then be loaded as an scmdata.ScmRun using load().

Parameters
  • scmrun (scmdata.ScmRun) –

    Data to save.

    The timeseries in this run should have valid metadata for each of the columns specified in levels.

  • disable_tqdm (bool) – If True, do not show the progress bar

Raises

KeyError – If metadata for any of levels is missing from the data
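The "one file for each unique combination of levels" layout can be sketched with the standard library. This is an illustration only: `save_by_levels` is a hypothetical function, JSON files stand in for the real storage format, and the file name `data.json` is an assumption.

```python
import json
import os
import tempfile
from collections import defaultdict


def save_by_levels(rows, root_dir, levels):
    """Write one JSON file per unique combination of level values."""
    groups = defaultdict(list)
    for row in rows:
        # A row missing one of the levels raises KeyError here
        groups[tuple(row[level] for level in levels)].append(row)

    written = []
    for combo, group in groups.items():
        out_dir = os.path.join(root_dir, *combo)
        os.makedirs(out_dir, exist_ok=True)
        out_file = os.path.join(out_dir, "data.json")
        with open(out_file, "w") as fh:
            json.dump(group, fh)
        written.append(out_file)
    return written


# Made-up rows; two of them share the same combination of levels
rows = [
    {"climate_model": "model_a", "scenario": "ssp119", "value": 1.0},
    {"climate_model": "model_a", "scenario": "ssp126", "value": 1.2},
    {"climate_model": "model_a", "scenario": "ssp126", "value": 1.3},
]

with tempfile.TemporaryDirectory() as root:
    files = save_by_levels(rows, root, ("climate_model", "scenario"))
    relative = sorted(os.path.relpath(path, root) for path in files)
```

Three rows produce two files, because the two ssp126 rows share a combination of levels and therefore land in the same file.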

scmdata.database.ensure_dir_exists(fp)[source]

Ensure directory exists

Parameters

fp (str) – Filepath whose containing directory should exist
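A helper of this kind is usually a thin wrapper around os.makedirs. The sketch below assumes that fp is a path to a file and that the directory to create is its parent; `ensure_parent_dir` is a hypothetical name, and the library's own function may differ in detail.

```python
import os
import tempfile


def ensure_parent_dir(fp):
    """Create the directory that would contain ``fp`` if it is missing."""
    dir_name = os.path.dirname(fp)
    if dir_name:
        # exist_ok=True makes repeated calls safe
        os.makedirs(dir_name, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "model_a", "Surface-Temperature", "data.nc")
    ensure_parent_dir(target)
    parent_created = os.path.isdir(os.path.dirname(target))
    file_created = os.path.exists(target)
```

Only the directories are created; the file itself is left for the caller to write.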