scmdata.processing
Miscellaneous functions for processing scmdata.ScmRun
These functions are intended to be used directly with scmdata.ScmRun.process_over().
calculate_crossing_times
- calculate_crossing_times(scmrun, threshold, return_year=True)
Calculate the time at which each timeseries crosses a given threshold
- Parameters:
- Returns:
pd.Series – Crossing time for scmrun, using the meta of scmrun as the output’s index. If the threshold is not crossed, pd.NA is returned.
Notes
This function only returns times that are in the columns of scmrun. If you want a finer resolution then you should interpolate your data first. For example, if you have data on a ten-year timestep but want crossing times on an annual resolution, interpolate (or resample) to annual data before calling calculate_crossing_times.
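The effect of the timestep on the reported crossing time can be illustrated with a minimal sketch. It uses plain numpy and pandas rather than scmdata itself, so the data, the helper first_crossing_year and the strict "greater than" comparison are all assumptions made for illustration, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical decadal data: rows are timeseries, columns are years.
decadal = pd.DataFrame(
    [[1.0, 1.4, 2.2], [0.8, 1.0, 1.2]],
    index=["run_a", "run_b"],
    columns=[2000, 2010, 2020],
)


def first_crossing_year(wide, threshold):
    """Return the first column label at which each row is above threshold.

    Illustrative helper only; rows that never cross get NaN.
    """
    above = wide.to_numpy() > threshold
    years = wide.columns.to_numpy()
    first_idx = above.argmax(axis=1)  # index of the first True in each row
    crossed = above.any(axis=1)
    return pd.Series(np.where(crossed, years[first_idx], np.nan), index=wide.index)


# Linearly interpolate each row onto an annual grid before searching.
annual_years = np.arange(2000, 2021)
annual = pd.DataFrame(
    [np.interp(annual_years, decadal.columns.to_numpy(), row) for row in decadal.to_numpy()],
    index=decadal.index,
    columns=annual_years,
)

decadal_crossing = first_crossing_year(decadal, 1.5)
annual_crossing = first_crossing_year(annual, 1.5)
```

With the decadal data, run_a can only be reported as crossing 1.5 in 2020, whereas the interpolated annual data places the crossing in 2012. run_b never crosses in either case.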
calculate_crossing_times_quantiles
- calculate_crossing_times_quantiles(crossing_times, groupby, quantiles=(0.05, 0.5, 0.95), nan_fill_value=1000000, out_nan_threshold=100000, interpolation='linear')
Calculate quantiles of crossing times
This calculation is non-trivial because some timeseries may never cross a given threshold. As a result, some care is required to return sensible quantiles. In this function, the quantiles are calculated as follows:
1. all nan values in crossing_times are filled with nan_fill_value
2. quantiles are calculated using pd.groupby.quantile
3. quantiles which never crossed are inferred by examining whether the output values are greater than out_nan_threshold. If the calculated value is greater than out_nan_threshold then nan is returned for this quantile.
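The three steps above can be sketched in plain pandas. This is a minimal illustration written from the description, not the scmdata source: the function name crossing_time_quantiles_sketch is hypothetical, and grouping by index level is an assumption about how groupby is applied.

```python
import pandas as pd


def crossing_time_quantiles_sketch(
    crossing_times,
    groupby,
    quantiles=(0.05, 0.5, 0.95),
    nan_fill_value=1_000_000,
    out_nan_threshold=100_000,
    interpolation="linear",
):
    """Illustrative re-implementation of the fill / quantile / mask steps."""
    # Step 1: replace missing crossing times with a very large number.
    filled = pd.Series(
        [nan_fill_value if pd.isna(v) else float(v) for v in crossing_times],
        index=crossing_times.index,
        dtype=float,
    )
    # Step 2: calculate the quantiles within each group of index levels.
    out = filled.groupby(level=groupby).quantile(
        list(quantiles), interpolation=interpolation
    )
    out.index = out.index.set_names("quantile", level=-1)
    # Step 3: anything above the threshold was dominated by fill values,
    # so report it as "never crossed".
    return out.where(out <= out_nan_threshold)


crossing_times = pd.Series(
    [pd.NA, pd.NA, 2100, 2007, 2006, pd.NA, 2100, 2007, 2006, 2006],
    index=pd.MultiIndex.from_product(
        [["a_scenario"], ["z_model", "x_model"], range(5)],
        names=["scenario", "climate_model", "ensemble_member"],
    ),
)
result = crossing_time_quantiles_sketch(
    crossing_times, groupby=["climate_model", "scenario"]
)
```

Because the fill value is far larger than any plausible crossing time, any quantile that interpolates towards a filled entry ends up above out_nan_threshold and is masked, which is why the choice of the two values matters only in that nan_fill_value must comfortably exceed out_nan_threshold.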
- Parameters:
crossing_times (pd.Series) – Crossing times, can be calculated using scmdata.processing.calculate_crossing_times()
quantiles (float) – Quantiles to calculate
nan_fill_value (float) – Value to use to fill in nan values before calculating the quantiles
out_nan_threshold (float) – Threshold to decide whether a calculated quantile should be nan or not
interpolation (str) – Interpolation to use when calculating the quantiles, see pandas.Series.quantile()
- Returns:
pd.Series – Crossing time quantiles
- Raises:
NotImplementedError – crossing_times contains datetime objects, please raise an issue if this is your use case.
Examples
>>> crossing_times = pd.Series(
...     [pd.NA, pd.NA, 2100, 2007, 2006, pd.NA, 2100, 2007, 2006, 2006],
...     index=pd.MultiIndex.from_product(
...         [["a_scenario"], ["z_model", "x_model"], range(5)],
...         names=["scenario", "climate_model", "ensemble_member"],
...     ),
... )
>>> crossing_times
scenario    climate_model  ensemble_member
a_scenario  z_model        0                  <NA>
                           1                  <NA>
                           2                  2100
                           3                  2007
                           4                  2006
            x_model        0                  <NA>
                           1                  2100
                           2                  2007
                           3                  2006
                           4                  2006
dtype: object
>>> calculate_crossing_times_quantiles(
...     crossing_times, groupby=["climate_model", "scenario"]
... )
climate_model  scenario    quantile
x_model        a_scenario  0.05        2006.0
                           0.50        2007.0
                           0.95           NaN
z_model        a_scenario  0.05        2006.2
                           0.50        2100.0
                           0.95           NaN
dtype: float64
calculate_exceedance_probabilities
- calculate_exceedance_probabilities(scmrun, threshold, process_over_cols, output_name=None)
Calculate exceedance probability over all time
- Parameters:
scmrun (scmdata.ScmRun) – Ensemble of which to calculate the exceedance probability
threshold (float) – Value to use as the threshold for exceedance
process_over_cols (list[str]) – Columns to not use when grouping the timeseries (typically “run_id” or “ensemble_member” or similar)
output_name (str) – If supplied, the name of the output series. If not supplied, “{threshold} exceedance probability” will be used.
- Returns:
pd.Series – Exceedance probability over all time over all members of each group in scmrun
- Raises:
ValueError – scmrun has more than one variable or more than one unit (convert to a single unit before calling this function if needed)
Notes
See the notes of scmdata.processing.calculate_exceedance_probabilities_over_time() for an explanation of how the two calculations differ. For most purposes, this is the correct function to use.
calculate_exceedance_probabilities_over_time
- calculate_exceedance_probabilities_over_time(scmrun, threshold, process_over_cols, output_name=None)
Calculate exceedance probability at each point in time
- Parameters:
scmrun (scmdata.ScmRun) – Ensemble of which to calculate the exceedance probability over time
threshold (float) – Value to use as the threshold for exceedance
process_over_cols (list[str]) – Columns to not use when grouping the timeseries (typically “run_id” or “ensemble_member” or similar)
output_name (str) – If supplied, the value to put in the “variable” columns of the output pd.DataFrame. If not supplied, “{threshold} exceedance probability” will be used.
- Returns:
pd.DataFrame – Timeseries of exceedance probability over time
- Raises:
ValueError – scmrun has more than one variable or more than one unit (convert to a single unit before calling this function if needed)
Notes
This differs from scmdata.processing.calculate_exceedance_probabilities() because it calculates the exceedance probability at each point in time. That is different from calculating the exceedance probability by first determining the number of ensemble members which cross the threshold at any point in time and then dividing by the number of ensemble members. In general, this function will produce a maximum exceedance probability which is equal to or less than the output of scmdata.processing.calculate_exceedance_probabilities(). In our opinion, scmdata.processing.calculate_exceedance_probabilities() is the correct function to use if you want to know the exceedance probability of a scenario. This function gives a sense of how the exceedance probability evolves over time but, because ensemble members can cross the threshold at different times, its maximum will generally underestimate the exceedance probability over all time.
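The difference between the two calculations can be seen in a small numpy sketch. The ensemble values below are invented for illustration and the code mirrors the description in these notes, not the scmdata implementation.

```python
import numpy as np

# Hypothetical ensemble: rows are ensemble members, columns are time points.
values = np.array(
    [
        [1.0, 1.6, 1.2, 1.1],  # exceeds 1.5 at the second time point only
        [1.0, 1.2, 1.3, 1.7],  # exceeds 1.5 at the last time point only
        [0.9, 1.0, 1.1, 1.2],  # never exceeds 1.5
    ]
)
threshold = 1.5
exceeds = values > threshold

# Exceedance probability at each point in time:
# the fraction of members above the threshold in each column.
prob_over_time = exceeds.mean(axis=0)

# Exceedance probability over all time:
# the fraction of members that are above the threshold at any point.
prob_all_time = exceeds.any(axis=1).mean()
```

Here two of the three members exceed the threshold, so the probability over all time is 2/3, but they do so at different times, so the probability at any single time point never rises above 1/3.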
calculate_peak
- calculate_peak(scmrun, output_name=None)
Calculate peak i.e. maximum of each timeseries
- Parameters:
scmrun (scmdata.ScmRun) – Ensemble of which to calculate the peak
output_name (str) – If supplied, the value to put in the “variable” columns of the output series. If not supplied, “Peak {variable}” will be used.
- Returns:
pd.Series – Peak of each timeseries
calculate_peak_time
- calculate_peak_time(scmrun, output_name=None, return_year=True)
Calculate peak time i.e. the time at which each timeseries reaches its maximum
- Parameters:
scmrun (scmdata.ScmRun) – Ensemble of which to calculate the peak time
output_name (str) – If supplied, the value to put in the “variable” columns of the output series. If not supplied, “Peak {variable}” will be used.
return_year (bool) – If True, return the year instead of the datetime
- Returns:
pd.Series – Peak time of each timeseries
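As a rough illustration of what "peak" and "peak time" mean for a set of timeseries, the sketch below uses a plain pandas DataFrame with one row per timeseries and one column per year. The data are hypothetical and the code is not the scmdata implementation.

```python
import pandas as pd

# Hypothetical timeseries: rows are runs, columns are years.
wide = pd.DataFrame(
    [[1.0, 1.8, 1.5], [0.5, 0.9, 1.3]],
    index=["run_a", "run_b"],
    columns=[2030, 2050, 2100],
)

# Peak: the maximum value of each timeseries.
peak = wide.max(axis=1)

# Peak time: the column label (here, the year) at which that maximum occurs.
peak_year = wide.idxmax(axis=1)
```

run_a peaks at 1.8 in 2050 and then declines, while run_b is still rising at the end of the data, so its peak of 1.3 is reported at the final year, 2100.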
categorisation_sr15
- categorisation_sr15(scmrun, index)
Categorise using the algorithm employed in SR1.5
For more information, see the SR1.5 scenario analysis notebook.
- Parameters:
scmrun – Data to use for the classification. This should contain global-mean surface air temperatures (GSAT) relative to 1850-1900 (using another reference period will not break this function, but is inconsistent with the original algorithm). The data must have a “quantile” column and it must have the 0.33, 0.5 and 0.66 quantiles calculated. This can be done with scmdata.ScmRun.quantiles_over().
index (list[str]) – Columns in scmrun.meta to use as the index of the output
- Returns:
pd.Series – Categorisation of the timeseries
- Raises:
ValueError – More than one variable or more than one unit is in scmrun
DimensionalityError – The units cannot be converted to kelvin
calculate_summary_stats
- calculate_summary_stats(scmrun, index, exceedance_probabilities_thresholds=(1.5, 2.0, 2.5), exceedance_probabilities_variable='Surface Air Temperature Change', exceedance_probabilities_naming_base=None, peak_quantiles=(0.05, 0.17, 0.5, 0.83, 0.95), peak_variable='Surface Air Temperature Change', peak_naming_base=None, peak_time_naming_base=None, peak_return_year=True, categorisation_variable='Surface Air Temperature Change', categorisation_quantile_cols=('ensemble_member',), progress=False)
Calculate common summary statistics
- Parameters:
scmrun (scmdata.ScmRun) – Data of which to calculate the stats
index (list[str]) – Columns to use in the index of the output (unit is added if not included)
exceedance_probabilities_thresholds (list[float]) – Thresholds to use for exceedance probabilities
exceedance_probabilities_variable (str) – Variable to use for exceedance probability calculations
exceedance_probabilities_naming_base (str) – String to use as the base for naming the exceedance probabilities. Each exceedance probability output column will have a name given by exceedance_probabilities_naming_base.format(threshold) where threshold is the exceedance probability threshold to use. If not supplied, the default output of scmdata.processing.calculate_exceedance_probabilities() will be used.
peak_quantiles (list[float]) – Quantiles to report in peak calculations
peak_variable (str) – Variable of which to calculate the peak
peak_naming_base (str) – Base to use for naming the peak outputs. This is combined with the quantile. If not supplied, "{} peak" is used so the outputs will be named e.g. “0.05 peak”, “0.5 peak”, “0.95 peak”.
peak_time_naming_base (str) – Base to use for naming the peak time outputs. This is combined with the quantile. If not supplied, "{} peak year" is used (unless peak_return_year is False, in which case "{} peak time" is used) so the outputs will be named e.g. “0.05 peak year”, “0.5 peak year”, “0.95 peak year”.
peak_return_year (bool) – If True, return the year of the peak of peak_variable, otherwise return full dates
categorisation_variable (str) – Variable to use for categorisation. Note that this variable must point to timeseries that contain global-mean surface air temperatures (GSAT) relative to 1850-1900 (using another reference period will not break this function, but is inconsistent with the original algorithm).
categorisation_quantile_cols (list[str]) – Columns which represent individual ensemble members in the output (e.g. [“ensemble_member”]). The quantiles are taken over these columns before the data is passed to scmdata.processing.categorisation_sr15().
progress (bool) – Should a progress bar be shown whilst the calculations are done?
- Returns:
pd.DataFrame – Summary statistics, with each column being a statistic and the index being given by index
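The naming-base parameters are ordinary format strings, so the resulting column names can be previewed with str.format alone. The sketch below only demonstrates that string formatting. The custom base custom_exceedance_base is a hypothetical value chosen for illustration, and nothing here calls scmdata.

```python
# Default bases as described in the parameter documentation above.
peak_naming_base = "{} peak"
peak_time_naming_base = "{} peak year"

# Hypothetical custom base for the exceedance probability columns.
custom_exceedance_base = "P(exceed {}K)"

peak_quantiles = (0.05, 0.5, 0.95)
thresholds = (1.5, 2.0, 2.5)

# Each base is combined with a quantile or threshold via str.format.
peak_names = [peak_naming_base.format(q) for q in peak_quantiles]
peak_time_names = [peak_time_naming_base.format(q) for q in peak_quantiles]
exceedance_names = [custom_exceedance_base.format(t) for t in thresholds]
```

Previewing the names this way is a cheap check that a custom base contains exactly one replacement field before running the full summary calculation.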