ScmDatabase
In this notebook, we provide an example of the ScmDatabase class. ScmDatabase helps read and write large collections of timeseries data by splitting them into multiple files on disk and allowing users to read or write a selection at a time.
This makes it possible to handle very large datasets that may exceed the system memory available to a user.
import tempfile
import traceback
import numpy as np
import pandas as pd
from scmdata import ScmRun, run_append
from scmdata.database import ScmDatabase
from scmdata.errors import NonUniqueMetadataError
pd.set_option("display.width", 160)
generator = np.random.default_rng(0)
Initialisation
There are two main things to think about when creating a ScmDatabase:
- Where the data is going to be stored (root_dir)
- How the data will be split up (levels)
When data is written to disk, it is split into different files, each with a unique combination of metadata values. The levels option defines the metadata columns used to split up the data.
Choosing an appropriate value for levels can play a large role in determining read and write performance. For example, if you were storing output from a number of different climate models, you might define levels as ["climate_model", "scenario", "variable", "region"]. This would allow loading a particular variable and region, say Surface Temperature for the World region, from all climate models and scenarios without needing to load the other variables and regions. Specifying too many levels may result in slow writing if a very large number of database files are written.
If you wish to load a subset of a particular metadata dimension, that dimension must be specified in this list.
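The effect of levels on the number of groups can be pictured with a small standard-library sketch. This is only an illustration of the grouping idea, not scmdata's implementation: `split_by_levels` is a hypothetical helper, and the "Ocean Heat Uptake" variable name is made up for the example.

```python
from collections import defaultdict


def split_by_levels(rows, levels):
    """Group metadata rows by their values for the ``levels`` columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[level] for level in levels)
        groups[key].append(row)
    return dict(groups)


# Hypothetical metadata: 2 models x 2 scenarios x 2 variables = 8 timeseries
rows = [
    {"climate_model": model, "scenario": scenario, "variable": variable}
    for model in ("model_a", "model_b")
    for scenario in ("low", "high")
    for variable in ("Surface Temperature", "Ocean Heat Uptake")
]

coarse = split_by_levels(rows, ["climate_model", "scenario"])
fine = split_by_levels(rows, ["climate_model", "scenario", "variable"])

print(len(coarse))  # 4 groups, each holding both variables
print(len(fine))  # 8 groups, one per timeseries
```

With the coarser levels, selecting a single variable would still mean reading groups that contain both variables; with the finer levels, each variable sits in its own group at the cost of more groups to write.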
print(ScmDatabase.__init__.__doc__)
Initialise the database

.. note::
    Creating a new :class:`ScmDatabase` does not modify any existing data on
    disk. To load an existing database ensure that the :attr:`root_dir`.
    :attr:`levels` and backend settings are the same as the previous instance.

Parameters
----------
root_dir : str
    The root directory of the database
levels : tuple of str
    Specifies how the runs should be stored on disk.
    The data will be grouped by ``levels``. These levels should be adapted to
    best match the input data and desired access pattern. If there are any
    additional varying dimensions, they will be stored as dimensions.
backend: str or :class:`BaseDatabaseBackend<scmdata.database.backends.BaseDatabaseBackend>`
    Determine the backend to serialize and deserialize data
    Defaults to using :class:`NetCDFDatabaseBackend<scmdata.database.backends.NetCDFDatabaseBackend>`
    which reads and writes data as netCDF files. Note that this requires the
    optional dependency of netCDF4 to be installed.
    If a custom backend class is being used, it must extend the
    :class:`BaseDatabaseBackend<scmdata.database.backends.BaseDatabaseBackend>` class.
backend_config: dict
    Additional configuration to pass to the backend
    See the documentation for the target backend to determine which configuration
    options are available.
temp_out_dir = tempfile.TemporaryDirectory()
database = ScmDatabase(temp_out_dir.name, levels=["climate_model", "scenario"])
database
<scmdata.database.SCMDatabase (root_dir: /tmp/tmp0dng5d4h, levels: ('climate_model', 'scenario'))>
Saving data
Data can be added to the database using the save method. Subsequent calls merge new data into the database.
def create_timeseries(  # noqa: PLR0913
    n=500,
    count=1,
    b_factor=1 / 1000,
    model="example",
    scenario="ssp119",
    variable="Surface Temperature",
    unit="K",
    region="World",
    **kwargs,
):
    """
    Create an example timeseries
    """
    a = generator.random(count)
    b = generator.random(count) * b_factor
    data = a + np.arange(n)[:, np.newaxis] ** 2 * b
    index = 2000 + np.arange(n)
    return ScmRun(
        data,
        columns={
            "model": model,
            "scenario": scenario,
            "variable": variable,
            "region": region,
            "unit": unit,
            "ensemble_member": range(count),
            **kwargs,
        },
        index=index,
    )
runs_low = run_append(
    [
        create_timeseries(
            scenario="low",
            climate_model="model_a",
            count=10,
            b_factor=1 / 1000,
        ),
        create_timeseries(
            scenario="low",
            climate_model="model_b",
            count=10,
            b_factor=1 / 1000,
        ),
    ]
)
runs_high = run_append(
    [
        create_timeseries(
            scenario="high",
            climate_model="model_a",
            count=10,
            b_factor=2 / 1000,
        ),
        create_timeseries(
            scenario="high",
            climate_model="model_b",
            count=10,
            b_factor=2 / 1000,
        ),
    ]
)
run_append([runs_low, runs_high]).line_plot(hue="scenario", style="climate_model")
database.save(runs_low)
Saving to database: 2it [00:00, 17.99it/s]
database.available_data()
 | climate_model | scenario |
---|---|---|
0 | model_a | low |
1 | model_b | low |
Internally, each row shown in available_data() is stored as a netCDF file in a directory structure following database.levels.
!pushd {temp_out_dir.name}; tree; popd
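The shell command above relies on `pushd` and `tree`, which are not available in every environment. A portable alternative is to list the directory from Python. The sketch below uses only the standard library; `list_tree` is a hypothetical helper, and the demonstration runs on a throwaway directory with placeholder names rather than on the real database layout.

```python
import tempfile
from pathlib import Path


def list_tree(root):
    """Return every path below ``root``, relative to it, in sorted order."""
    root = Path(root)
    return sorted(path.relative_to(root).as_posix() for path in root.rglob("*"))


# Demonstration on a throwaway directory with placeholder names
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "group_1" / "group_2"
    nested.mkdir(parents=True)
    (nested / "data.txt").write_text("placeholder")
    listing = list_tree(tmp)

print(listing)
```

In this notebook, `list_tree(temp_out_dir.name)` would list the files written by the database.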
Additional calls to save will merge the new data into the database, creating any new files as required.
If existing data is found, it is first loaded and merged with the new data before writing, so that existing data is not lost.
database.save(runs_high)
Saving to database: 2it [00:00, 17.89it/s]
database.available_data()
 | climate_model | scenario |
---|---|---|
0 | model_a | high |
1 | model_a | low |
2 | model_b | high |
3 | model_b | low |
The data being saved must still have unique metadata once merged with the existing data; otherwise a NonUniqueMetadataError is raised.
try:
    database.save(runs_high)
except NonUniqueMetadataError:
    traceback.print_exc(limit=0, chain=False)
Saving to database: 0it [00:00, ?it/s]
scmdata.errors.NonUniqueMetadataError: Duplicate metadata (numbers show how many times the given metadata is repeated).
climate_model ensemble_member model region scenario unit variable repeats
0 model_a 0 example World high K Surface Temperature 2
1 model_a 1 example World high K Surface Temperature 2
2 model_a 2 example World high K Surface Temperature 2
3 model_a 3 example World high K Surface Temperature 2
4 model_a 4 example World high K Surface Temperature 2
5 model_a 5 example World high K Surface Temperature 2
6 model_a 6 example World high K Surface Temperature 2
7 model_a 7 example World high K Surface Temperature 2
8 model_a 8 example World high K Surface Temperature 2
9 model_a 9 example World high K Surface Temperature 2
runs_high_extra = runs_high.copy()
runs_high_extra["ensemble_member"] = runs_high_extra["ensemble_member"] + 10
database.save(runs_high_extra)
Saving to database: 2it [00:00, 9.63it/s]
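The merge-on-save behaviour and the uniqueness check shown above can be pictured with a small in-memory model. This is a sketch under stated assumptions, not scmdata's implementation: the store is a plain dictionary, rows carry only metadata (no timeseries values), and `DuplicateMetadataError` is a hypothetical stand-in for NonUniqueMetadataError.

```python
class DuplicateMetadataError(ValueError):
    """Hypothetical stand-in for scmdata's NonUniqueMetadataError."""


def save(store, rows, levels):
    """Merge ``rows`` into ``store``, one group per combination of ``levels``."""
    for row in rows:
        key = tuple(row[level] for level in levels)
        existing = store.setdefault(key, [])
        if row in existing:
            raise DuplicateMetadataError(f"Duplicate metadata: {row}")
        existing.append(row)


levels = ["climate_model", "scenario"]
store = {}
first = [{"climate_model": "model_a", "scenario": "high", "ensemble_member": 0}]
save(store, first, levels)

try:
    save(store, first, levels)  # identical metadata a second time
    duplicate_rejected = False
except DuplicateMetadataError:
    duplicate_rejected = True

# Changing a metadata value makes the rows distinguishable, so the merge succeeds
second = [{**first[0], "ensemble_member": 10}]
save(store, second, levels)

print(duplicate_rejected, len(store[("model_a", "high")]))
```

This mirrors the steps in the notebook: the repeated save is rejected, while shifting ensemble_member lets the extra runs be merged into the existing group.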
Loading data
When loading data, we can select a subset of the data, similar to ScmRun.filter but limited to filtering on the metadata columns specified in levels.
run = database.load(scenario="high")
run.meta
Loading files: 0%| | 0/2 [00:00<?, ?it/s]
 | climate_model | ensemble_member | model | region | scenario | unit | variable |
---|---|---|---|---|---|---|---|
0 | model_b | 0 | example | World | high | K | Surface Temperature |
1 | model_b | 1 | example | World | high | K | Surface Temperature |
2 | model_b | 2 | example | World | high | K | Surface Temperature |
3 | model_b | 3 | example | World | high | K | Surface Temperature |
4 | model_b | 4 | example | World | high | K | Surface Temperature |
5 | model_b | 5 | example | World | high | K | Surface Temperature |
6 | model_b | 6 | example | World | high | K | Surface Temperature |
7 | model_b | 7 | example | World | high | K | Surface Temperature |
8 | model_b | 8 | example | World | high | K | Surface Temperature |
9 | model_b | 9 | example | World | high | K | Surface Temperature |
10 | model_b | 10 | example | World | high | K | Surface Temperature |
11 | model_b | 11 | example | World | high | K | Surface Temperature |
12 | model_b | 12 | example | World | high | K | Surface Temperature |
13 | model_b | 13 | example | World | high | K | Surface Temperature |
14 | model_b | 14 | example | World | high | K | Surface Temperature |
15 | model_b | 15 | example | World | high | K | Surface Temperature |
16 | model_b | 16 | example | World | high | K | Surface Temperature |
17 | model_b | 17 | example | World | high | K | Surface Temperature |
18 | model_b | 18 | example | World | high | K | Surface Temperature |
19 | model_b | 19 | example | World | high | K | Surface Temperature |
20 | model_a | 0 | example | World | high | K | Surface Temperature |
21 | model_a | 1 | example | World | high | K | Surface Temperature |
22 | model_a | 2 | example | World | high | K | Surface Temperature |
23 | model_a | 3 | example | World | high | K | Surface Temperature |
24 | model_a | 4 | example | World | high | K | Surface Temperature |
25 | model_a | 5 | example | World | high | K | Surface Temperature |
26 | model_a | 6 | example | World | high | K | Surface Temperature |
27 | model_a | 7 | example | World | high | K | Surface Temperature |
28 | model_a | 8 | example | World | high | K | Surface Temperature |
29 | model_a | 9 | example | World | high | K | Surface Temperature |
30 | model_a | 10 | example | World | high | K | Surface Temperature |
31 | model_a | 11 | example | World | high | K | Surface Temperature |
32 | model_a | 12 | example | World | high | K | Surface Temperature |
33 | model_a | 13 | example | World | high | K | Surface Temperature |
34 | model_a | 14 | example | World | high | K | Surface Temperature |
35 | model_a | 15 | example | World | high | K | Surface Temperature |
36 | model_a | 16 | example | World | high | K | Surface Temperature |
37 | model_a | 17 | example | World | high | K | Surface Temperature |
38 | model_a | 18 | example | World | high | K | Surface Temperature |
39 | model_a | 19 | example | World | high | K | Surface Temperature |
database.load(climate_model="model_b").meta
Loading files: 0%| | 0/2 [00:00<?, ?it/s]
 | climate_model | ensemble_member | model | region | scenario | unit | variable |
---|---|---|---|---|---|---|---|
0 | model_b | 0 | example | World | high | K | Surface Temperature |
1 | model_b | 1 | example | World | high | K | Surface Temperature |
2 | model_b | 2 | example | World | high | K | Surface Temperature |
3 | model_b | 3 | example | World | high | K | Surface Temperature |
4 | model_b | 4 | example | World | high | K | Surface Temperature |
5 | model_b | 5 | example | World | high | K | Surface Temperature |
6 | model_b | 6 | example | World | high | K | Surface Temperature |
7 | model_b | 7 | example | World | high | K | Surface Temperature |
8 | model_b | 8 | example | World | high | K | Surface Temperature |
9 | model_b | 9 | example | World | high | K | Surface Temperature |
10 | model_b | 10 | example | World | high | K | Surface Temperature |
11 | model_b | 11 | example | World | high | K | Surface Temperature |
12 | model_b | 12 | example | World | high | K | Surface Temperature |
13 | model_b | 13 | example | World | high | K | Surface Temperature |
14 | model_b | 14 | example | World | high | K | Surface Temperature |
15 | model_b | 15 | example | World | high | K | Surface Temperature |
16 | model_b | 16 | example | World | high | K | Surface Temperature |
17 | model_b | 17 | example | World | high | K | Surface Temperature |
18 | model_b | 18 | example | World | high | K | Surface Temperature |
19 | model_b | 19 | example | World | high | K | Surface Temperature |
20 | model_b | 0 | example | World | low | K | Surface Temperature |
21 | model_b | 1 | example | World | low | K | Surface Temperature |
22 | model_b | 2 | example | World | low | K | Surface Temperature |
23 | model_b | 3 | example | World | low | K | Surface Temperature |
24 | model_b | 4 | example | World | low | K | Surface Temperature |
25 | model_b | 5 | example | World | low | K | Surface Temperature |
26 | model_b | 6 | example | World | low | K | Surface Temperature |
27 | model_b | 7 | example | World | low | K | Surface Temperature |
28 | model_b | 8 | example | World | low | K | Surface Temperature |
29 | model_b | 9 | example | World | low | K | Surface Temperature |
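Because the data is split by levels, a filtered load only needs to open the groups whose level values match, which is why the two loads above each read 2 files. The selection step might look like the sketch below. It is not scmdata's implementation: `select_keys` is a hypothetical helper, and raising an error for filters on non-level columns is a choice made for this sketch.

```python
def select_keys(keys, levels, **filters):
    """Return the group keys whose level values match every filter."""
    unknown = set(filters) - set(levels)
    if unknown:
        raise ValueError(f"Can only filter on levels, got: {sorted(unknown)}")
    positions = {level: i for i, level in enumerate(levels)}
    return [
        key
        for key in keys
        if all(key[positions[name]] == value for name, value in filters.items())
    ]


levels = ("climate_model", "scenario")
keys = [
    ("model_a", "high"),
    ("model_a", "low"),
    ("model_b", "high"),
    ("model_b", "low"),
]

print(select_keys(keys, levels, scenario="high"))
print(select_keys(keys, levels, climate_model="model_b"))
print(len(select_keys(keys, levels)))  # no filters selects every group
```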
The entire dataset can also be loaded if needed. This may not be possible for very large datasets depending on the amount of system memory available.
all_data = database.load()
all_data.meta
Loading files: 0%| | 0/4 [00:00<?, ?it/s]
Loading files: 75%|███████▌ | 3/4 [00:00<00:00, 25.62it/s]
 | climate_model | ensemble_member | model | region | scenario | unit | variable |
---|---|---|---|---|---|---|---|
0 | model_b | 0 | example | World | high | K | Surface Temperature |
1 | model_b | 1 | example | World | high | K | Surface Temperature |
2 | model_b | 2 | example | World | high | K | Surface Temperature |
3 | model_b | 3 | example | World | high | K | Surface Temperature |
4 | model_b | 4 | example | World | high | K | Surface Temperature |
5 | model_b | 5 | example | World | high | K | Surface Temperature |
6 | model_b | 6 | example | World | high | K | Surface Temperature |
7 | model_b | 7 | example | World | high | K | Surface Temperature |
8 | model_b | 8 | example | World | high | K | Surface Temperature |
9 | model_b | 9 | example | World | high | K | Surface Temperature |
10 | model_b | 10 | example | World | high | K | Surface Temperature |
11 | model_b | 11 | example | World | high | K | Surface Temperature |
12 | model_b | 12 | example | World | high | K | Surface Temperature |
13 | model_b | 13 | example | World | high | K | Surface Temperature |
14 | model_b | 14 | example | World | high | K | Surface Temperature |
15 | model_b | 15 | example | World | high | K | Surface Temperature |
16 | model_b | 16 | example | World | high | K | Surface Temperature |
17 | model_b | 17 | example | World | high | K | Surface Temperature |
18 | model_b | 18 | example | World | high | K | Surface Temperature |
19 | model_b | 19 | example | World | high | K | Surface Temperature |
20 | model_b | 0 | example | World | low | K | Surface Temperature |
21 | model_b | 1 | example | World | low | K | Surface Temperature |
22 | model_b | 2 | example | World | low | K | Surface Temperature |
23 | model_b | 3 | example | World | low | K | Surface Temperature |
24 | model_b | 4 | example | World | low | K | Surface Temperature |
25 | model_b | 5 | example | World | low | K | Surface Temperature |
26 | model_b | 6 | example | World | low | K | Surface Temperature |
27 | model_b | 7 | example | World | low | K | Surface Temperature |
28 | model_b | 8 | example | World | low | K | Surface Temperature |
29 | model_b | 9 | example | World | low | K | Surface Temperature |
30 | model_a | 0 | example | World | high | K | Surface Temperature |
31 | model_a | 1 | example | World | high | K | Surface Temperature |
32 | model_a | 2 | example | World | high | K | Surface Temperature |
33 | model_a | 3 | example | World | high | K | Surface Temperature |
34 | model_a | 4 | example | World | high | K | Surface Temperature |
35 | model_a | 5 | example | World | high | K | Surface Temperature |
36 | model_a | 6 | example | World | high | K | Surface Temperature |
37 | model_a | 7 | example | World | high | K | Surface Temperature |
38 | model_a | 8 | example | World | high | K | Surface Temperature |
39 | model_a | 9 | example | World | high | K | Surface Temperature |
40 | model_a | 10 | example | World | high | K | Surface Temperature |
41 | model_a | 11 | example | World | high | K | Surface Temperature |
42 | model_a | 12 | example | World | high | K | Surface Temperature |
43 | model_a | 13 | example | World | high | K | Surface Temperature |
44 | model_a | 14 | example | World | high | K | Surface Temperature |
45 | model_a | 15 | example | World | high | K | Surface Temperature |
46 | model_a | 16 | example | World | high | K | Surface Temperature |
47 | model_a | 17 | example | World | high | K | Surface Temperature |
48 | model_a | 18 | example | World | high | K | Surface Temperature |
49 | model_a | 19 | example | World | high | K | Surface Temperature |
50 | model_a | 0 | example | World | low | K | Surface Temperature |
51 | model_a | 1 | example | World | low | K | Surface Temperature |
52 | model_a | 2 | example | World | low | K | Surface Temperature |
53 | model_a | 3 | example | World | low | K | Surface Temperature |
54 | model_a | 4 | example | World | low | K | Surface Temperature |
55 | model_a | 5 | example | World | low | K | Surface Temperature |
56 | model_a | 6 | example | World | low | K | Surface Temperature |
57 | model_a | 7 | example | World | low | K | Surface Temperature |
58 | model_a | 8 | example | World | low | K | Surface Temperature |
59 | model_a | 9 | example | World | low | K | Surface Temperature |
all_data.line_plot(hue="scenario", style="climate_model")
temp_out_dir.cleanup()
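When the full dataset does not fit in memory, one option is to process one group at a time and keep only a small summary per group. The sketch below shows the pattern with the standard library only; `max_per_group` and `load_group` are hypothetical names, and the dictionary stands in for files on disk. With ScmDatabase, a similar loop could presumably iterate over the rows of `database.available_data()` and pass each row's values as filters to `database.load`, although that is an untested assumption.

```python
def max_per_group(keys, load_group):
    """Reduce one group at a time so only a single group is held in memory."""
    return {key: max(load_group(key)) for key in keys}


# Stand-in for files on disk: one list of values per (climate_model, scenario)
fake_files = {
    ("model_a", "low"): [1.0, 2.5, 4.0],
    ("model_a", "high"): [1.0, 3.5, 8.0],
}

summary = max_per_group(fake_files, fake_files.__getitem__)
print(summary)
```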