ScmDatabase

In this notebook, we provide an example of the ScmDatabase class. ScmDatabase helps read and write large collections of timeseries data by splitting them across multiple files on disk, so that users can read or write one selection at a time.

This allows handling very large datasets which may exceed the amount of system memory a user has available.

import tempfile
import traceback

import numpy as np
import pandas as pd

from scmdata import ScmRun, run_append
from scmdata.database import ScmDatabase
from scmdata.errors import NonUniqueMetadataError

pd.set_option("display.width", 160)
generator = np.random.default_rng(0)

Initialisation

There are two main things to think about when creating a ScmDatabase. Namely:

  • Where the data is going to be stored (root_dir)

  • How the data will be split up (levels)

When data is to be written to disk it is split into different files, each with a unique combination of metadata values. The levels option defines the metadata columns used to split up the data.

Choosing an appropriate value for levels can strongly affect read and write performance. For example, if you were storing output from a number of different climate models, you might define levels as ["climate_model", "scenario", "variable", "region"]. This would allow loading a particular variable and region, say Surface Temperature for the World region, from all climate models and scenarios without needing to load the other variables and regions. Specifying too many levels may slow down writing, because a very large number of database files then has to be written.

If you wish to load a subset of a particular metadata dimension, that dimension must be included in levels.
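To make the effect of levels concrete, the sketch below partitions a handful of hypothetical metadata rows by the chosen levels using only the standard library. It illustrates the idea and is not scmdata's implementation; the function name partition_by_levels and the rows are invented for this example.

```python
from collections import defaultdict

# Hypothetical metadata rows, one per timeseries (illustrative values only).
rows = [
    {"climate_model": "model_a", "scenario": "low", "ensemble_member": 0},
    {"climate_model": "model_a", "scenario": "low", "ensemble_member": 1},
    {"climate_model": "model_a", "scenario": "high", "ensemble_member": 0},
    {"climate_model": "model_b", "scenario": "low", "ensemble_member": 0},
]


def partition_by_levels(rows, levels):
    """Group rows by their values for each column named in ``levels``."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[level] for level in levels)
        groups[key].append(row)
    return dict(groups)


groups = partition_by_levels(rows, ["climate_model", "scenario"])
for key, members in sorted(groups.items()):
    # One group per unique combination of level values.
    print(key, len(members))
```

Each key stands for one file on disk. Here ensemble_member is not part of levels, so it varies within a group and could not be used to limit which files are read.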

print(ScmDatabase.__init__.__doc__)
        Initialise the database

        .. note::

            Creating a new :class:`ScmDatabase` does not modify any existing data on
            disk. To load an existing database ensure that the :attr:`root_dir`.
            :attr:`levels` and backend settings are the same as the previous instance.

        Parameters
        ----------
        root_dir : str
            The root directory of the database

        levels : tuple of str
            Specifies how the runs should be stored on disk.

            The data will be grouped by ``levels``. These levels should be adapted to
            best match the input data and desired access pattern. If there are any
            additional varying dimensions, they will be stored as dimensions.

        backend: str or :class:`BaseDatabaseBackend<scmdata.database.backends.BaseDatabaseBackend>`
            Determine the backend to serialize and deserialize data

            Defaults to using :class:`NetCDFDatabaseBackend<scmdata.database.backends.NetCDFDatabaseBackend>`
            which reads and writes data as netCDF files. Note that this requires the
            optional dependency of netCDF4 to be installed.

            If a custom backend class is being used, it must extend the
            :class:`BaseDatabaseBackend<scmdata.database.backends.BaseDatabaseBackend>` class.

        backend_config: dict
            Additional configuration to pass to the backend

            See the documentation for the target backend to determine which configuration
            options are available.


        
temp_out_dir = tempfile.TemporaryDirectory()
database = ScmDatabase(temp_out_dir.name, levels=["climate_model", "scenario"])
database
<scmdata.database.SCMDatabase (root_dir: /tmp/tmp0dng5d4h, levels: ('climate_model', 'scenario'))>

Saving data

Data can be added to the database using the save method. Subsequent calls merge new data into the database.

def create_timeseries(  # noqa: PLR0913
    n=500,
    count=1,
    b_factor=1 / 1000,
    model="example",
    scenario="ssp119",
    variable="Surface Temperature",
    unit="K",
    region="World",
    **kwargs,
):
    """
    Create an example timeseries
    """
    a = generator.random(count)
    b = generator.random(count) * b_factor
    data = a + np.arange(n)[:, np.newaxis] ** 2 * b
    index = 2000 + np.arange(n)
    return ScmRun(
        data,
        columns={
            "model": model,
            "scenario": scenario,
            "variable": variable,
            "region": region,
            "unit": unit,
            "ensemble_member": range(count),
            **kwargs,
        },
        index=index,
    )
runs_low = run_append(
    [
        create_timeseries(
            scenario="low",
            climate_model="model_a",
            count=10,
            b_factor=1 / 1000,
        ),
        create_timeseries(
            scenario="low",
            climate_model="model_b",
            count=10,
            b_factor=1 / 1000,
        ),
    ]
)
runs_high = run_append(
    [
        create_timeseries(
            scenario="high",
            climate_model="model_a",
            count=10,
            b_factor=2 / 1000,
        ),
        create_timeseries(
            scenario="high",
            climate_model="model_b",
            count=10,
            b_factor=2 / 1000,
        ),
    ]
)
run_append([runs_low, runs_high]).line_plot(hue="scenario", style="climate_model")
[Figure: line plot of the low and high runs, with hue set by scenario and line style by climate_model]
database.save(runs_low)

database.available_data()
  climate_model scenario
0       model_a      low
1       model_b      low

Internally, each row shown in available_data() is stored as a netCDF file in a directory structure following database.levels.

!find {temp_out_dir.name} -type f | sort
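Where shell tools for listing directories are unavailable, the layout can also be inspected from Python. The following is a minimal sketch using only pathlib; format_tree is a helper written for this example, and the demo directory and file name are hypothetical stand-ins (the real file names are chosen by the backend).

```python
import tempfile
from pathlib import Path


def format_tree(root):
    """Return the paths below ``root`` as sorted, indented lines."""
    root = Path(root)
    lines = []
    for path in sorted(root.rglob("*")):
        # Indent each entry by its depth relative to the root.
        depth = len(path.relative_to(root).parts) - 1
        suffix = "/" if path.is_dir() else ""
        lines.append("    " * depth + path.name + suffix)
    return lines


with tempfile.TemporaryDirectory() as demo_dir:
    # Hypothetical layout for illustration only.
    target = Path(demo_dir) / "model_a" / "low"
    target.mkdir(parents=True)
    (target / "data.nc").touch()
    tree_lines = format_tree(demo_dir)

print("\n".join(tree_lines))
```

In this notebook, calling format_tree(temp_out_dir.name) should list one directory level per entry in database.levels.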

Additional calls to save will merge the new data into the database, creating any new files as required.

If existing data is found, it is first loaded and merged with the new data before writing, so that nothing already stored is lost.

database.save(runs_high)

database.available_data()
  climate_model scenario
0       model_a     high
1       model_a      low
2       model_b     high
3       model_b      low

The merged data must still have unique metadata; otherwise, a NonUniqueMetadataError is raised.

try:
    database.save(runs_high)
except NonUniqueMetadataError:
    traceback.print_exc(limit=0, chain=False)
scmdata.errors.NonUniqueMetadataError: Duplicate metadata (numbers show how many times the given metadata is repeated).
  climate_model ensemble_member    model region scenario unit             variable  repeats
0       model_a               0  example  World     high    K  Surface Temperature        2
1       model_a               1  example  World     high    K  Surface Temperature        2
2       model_a               2  example  World     high    K  Surface Temperature        2
3       model_a               3  example  World     high    K  Surface Temperature        2
4       model_a               4  example  World     high    K  Surface Temperature        2
5       model_a               5  example  World     high    K  Surface Temperature        2
6       model_a               6  example  World     high    K  Surface Temperature        2
7       model_a               7  example  World     high    K  Surface Temperature        2
8       model_a               8  example  World     high    K  Surface Temperature        2
9       model_a               9  example  World     high    K  Surface Temperature        2
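The repeats column in the error above can be pictured as a count of identical metadata rows after the stored and incoming data are merged. The sketch below reproduces that idea with collections.Counter on hypothetical rows; find_repeats is a name invented here and is not part of scmdata.

```python
from collections import Counter


def find_repeats(existing, new, columns):
    """Count metadata rows that occur more than once after merging."""
    counts = Counter(
        tuple(row[column] for column in columns) for row in existing + new
    )
    return {key: n for key, n in counts.items() if n > 1}


columns = ["climate_model", "scenario", "ensemble_member"]
existing = [
    {"climate_model": "model_a", "scenario": "high", "ensemble_member": i}
    for i in range(3)
]

# Saving the same rows again repeats every metadata combination twice.
repeats = find_repeats(existing, existing, columns)

# Offsetting ensemble_member makes the new rows distinct from the stored ones.
shifted = [
    {**row, "ensemble_member": row["ensemble_member"] + 10} for row in existing
]
no_repeats = find_repeats(existing, shifted, columns)

print(repeats)
print(no_repeats)
```

Changing any one metadata column is enough to make the rows distinct, which is why shifting ensemble_member lets the next save succeed.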
runs_high_extra = runs_high.copy()
runs_high_extra["ensemble_member"] = runs_high_extra["ensemble_member"] + 10
database.save(runs_high_extra)

Loading data

When loading data, we can select a subset, similar to ScmRun.filter, but filtering is limited to the metadata columns specified in levels.

run = database.load(scenario="high")
run.meta

climate_model ensemble_member model region scenario unit variable
0 model_b 0 example World high K Surface Temperature
1 model_b 1 example World high K Surface Temperature
2 model_b 2 example World high K Surface Temperature
3 model_b 3 example World high K Surface Temperature
4 model_b 4 example World high K Surface Temperature
5 model_b 5 example World high K Surface Temperature
6 model_b 6 example World high K Surface Temperature
7 model_b 7 example World high K Surface Temperature
8 model_b 8 example World high K Surface Temperature
9 model_b 9 example World high K Surface Temperature
10 model_b 10 example World high K Surface Temperature
11 model_b 11 example World high K Surface Temperature
12 model_b 12 example World high K Surface Temperature
13 model_b 13 example World high K Surface Temperature
14 model_b 14 example World high K Surface Temperature
15 model_b 15 example World high K Surface Temperature
16 model_b 16 example World high K Surface Temperature
17 model_b 17 example World high K Surface Temperature
18 model_b 18 example World high K Surface Temperature
19 model_b 19 example World high K Surface Temperature
20 model_a 0 example World high K Surface Temperature
21 model_a 1 example World high K Surface Temperature
22 model_a 2 example World high K Surface Temperature
23 model_a 3 example World high K Surface Temperature
24 model_a 4 example World high K Surface Temperature
25 model_a 5 example World high K Surface Temperature
26 model_a 6 example World high K Surface Temperature
27 model_a 7 example World high K Surface Temperature
28 model_a 8 example World high K Surface Temperature
29 model_a 9 example World high K Surface Temperature
30 model_a 10 example World high K Surface Temperature
31 model_a 11 example World high K Surface Temperature
32 model_a 12 example World high K Surface Temperature
33 model_a 13 example World high K Surface Temperature
34 model_a 14 example World high K Surface Temperature
35 model_a 15 example World high K Surface Temperature
36 model_a 16 example World high K Surface Temperature
37 model_a 17 example World high K Surface Temperature
38 model_a 18 example World high K Surface Temperature
39 model_a 19 example World high K Surface Temperature
database.load(climate_model="model_b").meta

climate_model ensemble_member model region scenario unit variable
0 model_b 0 example World high K Surface Temperature
1 model_b 1 example World high K Surface Temperature
2 model_b 2 example World high K Surface Temperature
3 model_b 3 example World high K Surface Temperature
4 model_b 4 example World high K Surface Temperature
5 model_b 5 example World high K Surface Temperature
6 model_b 6 example World high K Surface Temperature
7 model_b 7 example World high K Surface Temperature
8 model_b 8 example World high K Surface Temperature
9 model_b 9 example World high K Surface Temperature
10 model_b 10 example World high K Surface Temperature
11 model_b 11 example World high K Surface Temperature
12 model_b 12 example World high K Surface Temperature
13 model_b 13 example World high K Surface Temperature
14 model_b 14 example World high K Surface Temperature
15 model_b 15 example World high K Surface Temperature
16 model_b 16 example World high K Surface Temperature
17 model_b 17 example World high K Surface Temperature
18 model_b 18 example World high K Surface Temperature
19 model_b 19 example World high K Surface Temperature
20 model_b 0 example World low K Surface Temperature
21 model_b 1 example World low K Surface Temperature
22 model_b 2 example World low K Surface Temperature
23 model_b 3 example World low K Surface Temperature
24 model_b 4 example World low K Surface Temperature
25 model_b 5 example World low K Surface Temperature
26 model_b 6 example World low K Surface Temperature
27 model_b 7 example World low K Surface Temperature
28 model_b 8 example World low K Surface Temperature
29 model_b 9 example World low K Surface Temperature
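Both load calls above read two of the four stored level combinations. A rough way to picture that selection is a filter over the combinations listed by available_data(), as sketched below. The helper select_entries is hypothetical, and rejecting filters on non-level columns is an assumption of this sketch; the library's actual behaviour may differ.

```python
LEVELS = ("climate_model", "scenario")

# Level combinations as listed by available_data() earlier in this notebook.
AVAILABLE = [
    {"climate_model": "model_a", "scenario": "high"},
    {"climate_model": "model_a", "scenario": "low"},
    {"climate_model": "model_b", "scenario": "high"},
    {"climate_model": "model_b", "scenario": "low"},
]


def select_entries(available, levels, **filters):
    """Return the entries whose level values match every given filter."""
    unknown = sorted(set(filters) - set(levels))
    if unknown:
        # Assumption of this sketch: only level columns can be filtered on.
        raise ValueError(f"can only filter on {levels}, got {unknown}")
    return [
        entry
        for entry in available
        if all(entry[key] == value for key, value in filters.items())
    ]


high = select_entries(AVAILABLE, LEVELS, scenario="high")
model_b = select_entries(AVAILABLE, LEVELS, climate_model="model_b")
everything = select_entries(AVAILABLE, LEVELS)
print(len(high), len(model_b), len(everything))
```

Only the matching entries need to be read from disk, which is what keeps a filtered load cheaper than loading everything.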

The entire dataset can also be loaded if needed. This may not be possible for very large datasets depending on the amount of system memory available.

all_data = database.load()
all_data.meta

climate_model ensemble_member model region scenario unit variable
0 model_b 0 example World high K Surface Temperature
1 model_b 1 example World high K Surface Temperature
2 model_b 2 example World high K Surface Temperature
3 model_b 3 example World high K Surface Temperature
4 model_b 4 example World high K Surface Temperature
5 model_b 5 example World high K Surface Temperature
6 model_b 6 example World high K Surface Temperature
7 model_b 7 example World high K Surface Temperature
8 model_b 8 example World high K Surface Temperature
9 model_b 9 example World high K Surface Temperature
10 model_b 10 example World high K Surface Temperature
11 model_b 11 example World high K Surface Temperature
12 model_b 12 example World high K Surface Temperature
13 model_b 13 example World high K Surface Temperature
14 model_b 14 example World high K Surface Temperature
15 model_b 15 example World high K Surface Temperature
16 model_b 16 example World high K Surface Temperature
17 model_b 17 example World high K Surface Temperature
18 model_b 18 example World high K Surface Temperature
19 model_b 19 example World high K Surface Temperature
20 model_b 0 example World low K Surface Temperature
21 model_b 1 example World low K Surface Temperature
22 model_b 2 example World low K Surface Temperature
23 model_b 3 example World low K Surface Temperature
24 model_b 4 example World low K Surface Temperature
25 model_b 5 example World low K Surface Temperature
26 model_b 6 example World low K Surface Temperature
27 model_b 7 example World low K Surface Temperature
28 model_b 8 example World low K Surface Temperature
29 model_b 9 example World low K Surface Temperature
30 model_a 0 example World high K Surface Temperature
31 model_a 1 example World high K Surface Temperature
32 model_a 2 example World high K Surface Temperature
33 model_a 3 example World high K Surface Temperature
34 model_a 4 example World high K Surface Temperature
35 model_a 5 example World high K Surface Temperature
36 model_a 6 example World high K Surface Temperature
37 model_a 7 example World high K Surface Temperature
38 model_a 8 example World high K Surface Temperature
39 model_a 9 example World high K Surface Temperature
40 model_a 10 example World high K Surface Temperature
41 model_a 11 example World high K Surface Temperature
42 model_a 12 example World high K Surface Temperature
43 model_a 13 example World high K Surface Temperature
44 model_a 14 example World high K Surface Temperature
45 model_a 15 example World high K Surface Temperature
46 model_a 16 example World high K Surface Temperature
47 model_a 17 example World high K Surface Temperature
48 model_a 18 example World high K Surface Temperature
49 model_a 19 example World high K Surface Temperature
50 model_a 0 example World low K Surface Temperature
51 model_a 1 example World low K Surface Temperature
52 model_a 2 example World low K Surface Temperature
53 model_a 3 example World low K Surface Temperature
54 model_a 4 example World low K Surface Temperature
55 model_a 5 example World low K Surface Temperature
56 model_a 6 example World low K Surface Temperature
57 model_a 7 example World low K Surface Temperature
58 model_a 8 example World low K Surface Temperature
59 model_a 9 example World low K Surface Temperature
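When the whole database is too large to load in one call, one option is to process a single level combination at a time and keep only a small summary of each. The sketch below shows the pattern with a stand-in loader; iter_subsets and fake_load are hypothetical names, and the numbers are made up.

```python
def iter_subsets(available, load_subset):
    """Yield (entry, data) pairs, loading one level combination at a time."""
    for entry in available:
        yield entry, load_subset(**entry)


def fake_load(climate_model, scenario):
    # Stand-in for a real loader: returns a short list of made-up values.
    base = 1.0 if scenario == "low" else 2.0
    return [base * i for i in range(5)]


available = [
    {"climate_model": "model_a", "scenario": "high"},
    {"climate_model": "model_a", "scenario": "low"},
    {"climate_model": "model_b", "scenario": "high"},
    {"climate_model": "model_b", "scenario": "low"},
]

peaks = {}
for entry, values in iter_subsets(available, fake_load):
    # Only the summary is kept; each subset can be discarded afterwards.
    peaks[(entry["climate_model"], entry["scenario"])] = max(values)

print(peaks)
```

In real use, fake_load might be replaced by a call such as database.load(**entry), with the entries taken from the rows of database.available_data(); that substitution is an untested assumption here.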
all_data.line_plot(hue="scenario", style="climate_model")
[Figure: line plot of all loaded data, with hue set by scenario and line style by climate_model]
temp_out_dir.cleanup()