scmdata.processing

Miscellaneous functions for processing scmdata.ScmRun

These functions are intended to be used directly with scmdata.ScmRun.process_over().

scmdata.processing.calculate_crossing_times(scmrun, threshold, return_year=True)

Calculate the time at which each timeseries crosses a given threshold

Parameters:

scmrun (scmdata.ScmRun) – Data to calculate the crossing time of
threshold (float) – Value to use as the threshold for crossing
return_year (bool) – If True, return the year instead of the datetime

Returns:

Crossing time for scmrun, using the meta of scmrun as the output’s index. If the threshold is not crossed, pd.NA is returned.

Return type:

pd.Series

Notes

This function only returns times that are in the columns of scmrun. If you want a finer resolution then you should interpolate your data first. For example, if you have data on a ten-year timestep but want crossing times on an annual resolution, interpolate (or resample) to annual data before calling calculate_crossing_times.
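
As a rough illustration of the note above, the plain-Python sketch below finds the first time point at which a series exceeds a threshold, before and after linear interpolation to annual steps. It is not scmdata's implementation: the helper names first_crossing and interpolate_annual, the strict ">" comparison and the sample data are all assumptions made for this example.

```python
def first_crossing(times, values, threshold):
    """Return the first time at which value > threshold, or None if never crossed."""
    for t, v in zip(times, values):
        if v > threshold:
            return t
    return None


def interpolate_annual(times, values):
    """Linearly interpolate (year, value) pairs onto every year in the range."""
    out_times, out_values = [], []
    for (t0, v0), (t1, v1) in zip(zip(times, values), zip(times[1:], values[1:])):
        for year in range(t0, t1):
            frac = (year - t0) / (t1 - t0)
            out_times.append(year)
            out_values.append(v0 + frac * (v1 - v0))
    # The loop above stops one step short of the final time point, so add it here
    out_times.append(times[-1])
    out_values.append(values[-1])
    return out_times, out_values


# Hypothetical decadal data (values are illustrative only)
years = [2000, 2010, 2020, 2030]
temps = [1.0, 1.4, 1.8, 2.2]

decadal = first_crossing(years, temps, 1.5)  # limited to the decadal time points
annual = first_crossing(*interpolate_annual(years, temps), 1.5)  # annual resolution
never = first_crossing(years, temps, 3.0)  # threshold is never crossed
```

With the decadal data the crossing can only be reported at one of the four existing time points, whereas the interpolated data allows an earlier, annual answer.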

scmdata.processing.calculate_crossing_times_quantiles(crossing_times, groupby, quantiles=(0.05, 0.5, 0.95), nan_fill_value=1000000, out_nan_threshold=100000, interpolation='linear')

Calculate quantiles of crossing times

This calculation is non-trivial because some timeseries may never cross a given threshold. As a result, some care is required to return sensible quantiles. In this function, the quantiles are calculated as follows:

1. All nan values in crossing_times are filled with nan_fill_value.
2. Quantiles are calculated using pd.groupby.quantile.
3. Quantiles which never crossed are inferred by examining whether the output values are greater than out_nan_threshold. If the calculated value is greater than out_nan_threshold then nan is returned for this quantile.
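
The three steps above can be sketched for a single group in plain Python. This is only an illustration under stated assumptions: a hand-rolled linear-interpolation quantile stands in for the pandas call, None stands in for missing crossing times, and the function names linear_quantile and crossing_time_quantiles are hypothetical.

```python
import math


def linear_quantile(sorted_values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    pos = q * (len(sorted_values) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def crossing_time_quantiles(times, quantiles=(0.05, 0.5, 0.95),
                            nan_fill_value=1_000_000, out_nan_threshold=100_000):
    # Step 1: replace "never crossed" (None here) with a very large number
    filled = sorted(nan_fill_value if t is None else t for t in times)
    out = {}
    for q in quantiles:
        # Step 2: calculate the quantile of the filled values
        value = linear_quantile(filled, q)
        # Step 3: an implausibly large result means this quantile never crossed
        out[q] = math.nan if value > out_nan_threshold else value
    return out


# The z_model ensemble members from the Examples section of this function
z_model = [None, None, 2100, 2007, 2006]
result = crossing_time_quantiles(z_model)
```

Filling with a large value rather than dropping the missing entries keeps the members that never cross in the ranking, so a high quantile is reported as nan instead of being biased towards the members that do cross.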

Parameters:

crossing_times (pd.Series) – Crossing times, can be calculated using scmdata.processing.calculate_crossing_times()
groupby (list[str]) – Columns to group the output by
quantiles (float) – Quantiles to calculate
nan_fill_value (float) – Value to use to fill in nan values before calculating the quantiles
out_nan_threshold (float) – Threshold to decide whether a calculated quantile should be nan or not
interpolation (str) – Interpolation to use when calculating the quantiles, see pandas.Series.quantile()

Returns:

Crossing time quantiles

Return type:

pd.Series

Raises:

NotImplementedError – crossing_times contains datetime objects, please raise an issue if this is your use case.

Examples

>>> crossing_times = pd.Series(
...     [pd.NA, pd.NA, 2100, 2007, 2006, pd.NA, 2100, 2007, 2006, 2006],
...     index=pd.MultiIndex.from_product(
...         [["a_scenario"], ["z_model", "x_model"], range(5)],
...         names=["scenario", "climate_model", "ensemble_member"]
...     )
... )
>>> crossing_times
scenario    climate_model  ensemble_member
a_scenario  z_model        0                  <NA>
                           1                  <NA>
                           2                  2100
                           3                  2007
                           4                  2006
            x_model        0                  <NA>
                           1                  2100
                           2                  2007
                           3                  2006
                           4                  2006
dtype: object
>>> scmdata.processing.calculate_crossing_times_quantiles(
...     crossing_times, groupby=["climate_model", "scenario"]
... )
climate_model  scenario    quantile
x_model        a_scenario  0.05        2006.0
                           0.50        2007.0
                           0.95           NaN
z_model        a_scenario  0.05        2006.2
                           0.50        2100.0
                           0.95           NaN

scmdata.processing.calculate_exceedance_probabilities(scmrun, threshold, process_over_cols, output_name=None)

Calculate exceedance probability over all time

Parameters:

scmrun (scmdata.ScmRun) – Ensemble of which to calculate the exceedance probability
threshold (float) – Value to use as the threshold for exceedance
process_over_cols (list[str]) – Columns to not use when grouping the timeseries (typically “run_id” or “ensemble_member” or similar)
output_name (str) – If supplied, the name of the output series. If not supplied, “{threshold} exceedance probability” will be used.

Returns:

Exceedance probability over all time over all members of each group in scmrun

Return type:

pd.Series

Raises:

ValueError – scmrun has more than one variable or more than one unit (convert to a single unit before calling this function if needed)

Notes

See the notes of scmdata.processing.calculate_exceedance_probabilities_over_time() for an explanation of how the two calculations differ. For most purposes, this is the correct function to use.

scmdata.processing.calculate_exceedance_probabilities_over_time(scmrun, threshold, process_over_cols, output_name=None)

Calculate exceedance probability at each point in time

Parameters:

scmrun (scmdata.ScmRun) – Ensemble of which to calculate the exceedance probability over time
threshold (float) – Value to use as the threshold for exceedance
process_over_cols (list[str]) – Columns to not use when grouping the timeseries (typically “run_id” or “ensemble_member” or similar)
output_name (str) – If supplied, the value to put in the “variable” columns of the output pd.DataFrame. If not supplied, “{threshold} exceedance probability” will be used.

Returns:

Timeseries of exceedance probability over time

Return type:

pd.DataFrame

Raises:

ValueError – scmrun has more than one variable or more than one unit (convert to a single unit before calling this function if needed)

Notes

This differs from scmdata.processing.calculate_exceedance_probabilities() because it calculates the exceedance probability at each point in time, rather than counting the ensemble members which cross the threshold at any point in time and dividing by the total number of ensemble members. In general, the maximum exceedance probability produced by this function is less than or equal to the output of scmdata.processing.calculate_exceedance_probabilities(). In our opinion, scmdata.processing.calculate_exceedance_probabilities() is the correct function to use if you want to know the exceedance probability of a scenario. This function gives a sense of how the exceedance probability evolves over time but will generally underestimate the exceedance probability over all time.
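
The difference between the two calculations can be seen with a small, made-up ensemble. The sketch below uses plain Python rather than scmdata, assumes that "exceeds" means strictly greater than the threshold, and uses illustrative values only.

```python
# Hypothetical ensemble: each inner list is one member's values at three time points
ensemble = [
    [1.4, 1.6, 1.4],  # exceeds 1.5 at the second time point only
    [1.4, 1.4, 1.6],  # exceeds 1.5 at the third time point only
    [1.2, 1.3, 1.4],  # never exceeds 1.5
    [1.6, 1.7, 1.8],  # exceeds 1.5 at every time point
]
threshold = 1.5
n_members = len(ensemble)

# Probability at each point in time: fraction of members above the threshold then
prob_over_time = [
    sum(member[i] > threshold for member in ensemble) / n_members
    for i in range(len(ensemble[0]))
]

# Probability over all time: fraction of members above the threshold at any time
prob_all_time = sum(
    any(v > threshold for v in member) for member in ensemble
) / n_members
```

Here three of the four members exceed the threshold at some point, but they do so at different times, so no single time point sees more than half of the ensemble above the threshold.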

scmdata.processing.calculate_peak(scmrun, output_name=None)

Calculate peak i.e. maximum of each timeseries

Parameters:

scmrun (scmdata.ScmRun) – Ensemble of which to calculate the peak
output_name (str) – If supplied, the value to put in the “variable” columns of the output series. If not supplied, “Peak {variable}” will be used.

Returns:

Peak of each timeseries

Return type:

pd.Series

scmdata.processing.calculate_peak_time(scmrun, output_name=None, return_year=True)

Calculate peak time i.e. the time at which each timeseries reaches its maximum

Parameters:

scmrun (scmdata.ScmRun) – Ensemble of which to calculate the peak time
output_name (str) – If supplied, the value to put in the “variable” columns of the output series. If not supplied, “Peak {variable}” will be used.
return_year (bool) – If True, return the year instead of the datetime

Returns:

Peak time of each timeseries

Return type:

pd.Series
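
For a single timeseries, the peak and the peak time are related as in the plain-Python sketch below. The data are made up, and how scmdata itself resolves ties between equal maxima is not stated here; the sketch simply takes the earliest one.

```python
# Hypothetical annual timeseries (values are illustrative only)
years = [2040, 2050, 2060, 2070]
values = [1.6, 1.9, 1.8, 1.7]

# Peak: the maximum value of the timeseries
peak = max(values)

# Peak time: the time at which that maximum occurs.
# list.index returns the first match, so ties resolve to the earliest time.
peak_year = years[values.index(peak)]
```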

scmdata.processing.calculate_summary_stats(scmrun, index, exceedance_probabilities_thresholds=(1.5, 2.0, 2.5), exceedance_probabilities_variable='Surface Air Temperature Change', exceedance_probabilities_naming_base=None, peak_quantiles=(0.05, 0.17, 0.5, 0.83, 0.95), peak_variable='Surface Air Temperature Change', peak_naming_base=None, peak_time_naming_base=None, peak_return_year=True, categorisation_variable='Surface Air Temperature Change', categorisation_quantile_cols=('ensemble_member',), progress=False)

Calculate common summary statistics

Parameters:

scmrun (scmdata.ScmRun) – Data of which to calculate the stats
index (list[str]) – Columns to use in the index of the output (unit is added if not included)
exceedance_probabilities_thresholds (list[float]) – Thresholds to use for exceedance probabilities
exceedance_probabilities_variable (str) – Variable to use for exceedance probability calculations
exceedance_probabilities_naming_base (str) – String to use as the base for naming the exceedance probabilities. Each exceedance probability output column will have a name given by exceedance_probabilities_naming_base.format(threshold) where threshold is the exceedance probability threshold to use. If not supplied, the default output of scmdata.processing.calculate_exceedance_probabilities() will be used.
peak_quantiles (list[float]) – Quantiles to report in peak calculations
peak_variable (str) – Variable of which to calculate the peak
peak_naming_base (str) – Base to use for naming the peak outputs. This is combined with the quantile. If not supplied, "{} peak" is used so the outputs will be named e.g. “0.05 peak”, “0.5 peak”, “0.95 peak”.
peak_time_naming_base (str) – Base to use for naming the peak time outputs. This is combined with the quantile. If not supplied, "{} peak year" is used (unless peak_return_year is False in which case "{} peak time" is used) so the outputs will be named e.g. “0.05 peak year”, “0.5 peak year”, “0.95 peak year”.
peak_return_year (bool) – If True, return the year of the peak of peak_variable, otherwise return full dates
categorisation_variable (str) – Variable to use for categorisation. Note that this variable should point to timeseries that contain global-mean surface air temperatures (GSAT) relative to 1850-1900 (using another reference period will not break this function, but is inconsistent with the original algorithm).
categorisation_quantile_cols (list[str]) – Columns which represent individual ensemble members in the output (e.g. [“ensemble_member”]). The quantiles are taken over these columns before the data is passed to scmdata.processing.categorisation_sr15().
progress (bool) – Should a progress bar be shown whilst the calculations are done?

Returns:

Summary statistics, with each column being a statistic and the index being given by index

Return type:

pd.DataFrame
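
The naming-base parameters are ordinary format strings, so the output column names can be previewed with str.format. In the sketch below the custom exceedance base "P(T > {}K)" is a hypothetical choice made for illustration, while "{} peak" is the default described for peak_naming_base.

```python
thresholds = (1.5, 2.0, 2.5)
quantiles = (0.05, 0.5, 0.95)

# Hypothetical custom base for the exceedance probability columns
exceedance_names = ["P(T > {}K)".format(t) for t in thresholds]

# Default peak naming base from the peak_naming_base description
peak_names = ["{} peak".format(q) for q in quantiles]
```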

scmdata.processing.categorisation_sr15(scmrun, index)

Categorise using the algorithm employed in SR1.5

For more information, see the SR1.5 scenario analysis notebook.

Parameters:

scmrun – Data to use for the classification. This should contain global-mean surface air temperatures (GSAT) relative to 1850-1900 (using another reference period will not break this function, but is inconsistent with the original algorithm). The data must have a “quantile” column and it must have the 0.33, 0.5 and 0.66 quantiles calculated. This can be done with scmdata.ScmRun.quantiles_over().
index (list[str]) – Columns in scmrun.meta to use as the index of the output

Returns:

Categorisation of the timeseries

Return type:

pd.Series

Raises:

ValueError – scmrun contains more than one variable or more than one unit
DimensionalityError – The units cannot be converted to kelvin