scmdata.groupby

Functionality for grouping and filtering ScmRun objects

RunGroupBy

class RunGroupBy(run, groups, na_fill_value=-10000)[source]

Bases: ImplementsArrayReduce, Generic[GenericRun]

GroupBy object specialized to grouping ScmRun objects

all(dim=None, axis=None, **kwargs)

Reduce this RunGroupBy’s data by applying all along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply all.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply all. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then all is calculated over axes.

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating all on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with all applied to its data and the indicated dimension(s) removed.

any(dim=None, axis=None, **kwargs)

Reduce this RunGroupBy’s data by applying any along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply any.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply any. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then any is calculated over axes.

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating any on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with any applied to its data and the indicated dimension(s) removed.

apply(func, *args, **kwargs)[source]

Apply a function to each group and append the results

func is called like func(ar, *args, **kwargs) for each ScmRun group. If the result of this function call is None, then it is excluded from the results.

The results are appended together using run_append(). The function can change the size of the input ScmRun as long as run_append() can be applied to all results.

Examples

>>> from scmdata import ScmRun
>>> def show_var_and_convert_unit(arr: ScmRun) -> ScmRun:
...     variable = arr.get_unique_meta("variable", True)
...     unit = arr.get_unique_meta("unit", True)
...     print(f"{variable}'s original unit was {unit}")
...
...     return arr.convert_unit("MtC")

>>> df = ScmRun(
...     data=[[1, 2], [3, 4]],
...     index=[2010, 2020],
...     columns={
...         "variable": ["v1", "v2"],
...         "model": "model",
...         "scenario": "scenario",
...         "region": "World",
...         "unit": ["tC", "GtC"],
...     },
... )
>>> df.groupby("variable").apply(show_var_and_convert_unit)
v1's original unit was tC
v2's original unit was GtC
<ScmRun (timeseries: 2, timepoints: 2)>
Time:
    Start: 2010-01-01T00:00:00
    End: 2020-01-01T00:00:00
Meta:
       model region  scenario unit variable
    0  model  World  scenario  MtC       v1
    1  model  World  scenario  MtC       v2

Parameters:
  • func (Callable[Concatenate[GenericRun, P], GenericRun | (pd.DataFrame | None)]) – Callable to apply to each group.

  • *args (P.args) – Positional arguments passed to func.

  • **kwargs (P.kwargs) – Keyword arguments passed to func.

Returns:

GenericRun – The result of applying and combining.
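
The grouping, None-skipping and appending behaviour described for apply() can be pictured with a plain-Python analogue. This is only a sketch: apply_to_groups, keep_gtc_only and the dict-based rows are hypothetical stand-ins that are not part of scmdata, and list concatenation stands in for run_append().

```python
from collections import defaultdict


def apply_to_groups(rows, key, func, *args, **kwargs):
    """Hypothetical analogue of RunGroupBy.apply() for lists of dict rows."""
    # Split the rows into groups that share the same value for `key`
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    results = []
    for group in groups.values():
        res = func(group, *args, **kwargs)
        if res is None:
            # Groups for which func returns None are left out of the result
            continue
        results.extend(res)  # stands in for run_append()
    return results


def keep_gtc_only(group, factor=1.0):
    # Return None to drop groups that are not reported in GtC
    if group[0]["unit"] != "GtC":
        return None
    return [{**row, "value": row["value"] * factor} for row in group]


rows = [
    {"variable": "v1", "unit": "tC", "value": 1.0},
    {"variable": "v2", "unit": "GtC", "value": 3.0},
]
out = apply_to_groups(rows, "variable", keep_gtc_only, factor=2.0)
```

Here only the "v2" group survives, because keep_gtc_only returns None for the "v1" group.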

apply_parallel(func, parallel_processor=None, *args, **kwargs)[source]

Apply a function to each group in parallel and append the results

Provides the same functionality as apply() except that parallel processing can be used via the parallel_processor argument. By default, joblib is used to apply func to each group in parallel. Because setting up the processing pool has overhead, this can be slower than apply() when there are few groups or when func is fast.

See also

apply()

Parameters:
  • func (ApplyCallable[GenericRun, P]) – Callable to apply to each group.

  • parallel_processor (ParallelProcessor[GenericRun, P] | None) –

    Parallel processor to use to process the groups. If not provided, a default joblib parallel processor is used (for details, see

  • *args (P.args) – Positional arguments passed to func.

  • **kwargs (P.kwargs) – Keyword arguments passed to func.

Returns:

GenericRun – The result of applying and combining.
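
A custom parallel_processor might look like the sketch below. It assumes the calling convention implied by the return type documented for get_joblib_parallel_processor(), namely a callable that takes func, an iterable of groups, and func's extra arguments, and returns an iterable of results. The name thread_pool_processor is hypothetical, a standard-library thread pool is swapped in for joblib, and plain lists stand in for ScmRun groups.

```python
from concurrent.futures import ThreadPoolExecutor


def thread_pool_processor(func, groups, *args, **kwargs):
    """Hypothetical processor: run func on every group in a thread pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(func, group, *args, **kwargs) for group in groups]
        # Collect the results in the order the groups were submitted
        return [future.result() for future in futures]


def scale(group, factor=1):
    # Toy per-group function: multiply every value in the group
    return [value * factor for value in group]


results = thread_pool_processor(scale, [[1, 2], [3, 4]], factor=10)
```

Under that assumption, such a callable would be handed over as parallel_processor=thread_pool_processor; this has not been checked against scmdata here.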

count(dim=None, axis=None, **kwargs)

Reduce this RunGroupBy’s data by applying count along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply count.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply count. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then count is calculated over axes.

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating count on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with count applied to its data and the indicated dimension(s) removed.

map(func, *args, **kwargs)[source]

Apply a function to each group and append the results

Deprecated since version 0.14.2: map() will be removed in scmdata 1.0.0; it has been renamed to apply(), which has identical functionality.

See also

apply()

max(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying max along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply max.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply max. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then max is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating max on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with max applied to its data and the indicated dimension(s) removed.

mean(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying mean along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply mean.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply mean. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then mean is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating mean on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with mean applied to its data and the indicated dimension(s) removed.

median(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying median along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply median.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply median. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then median is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating median on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with median applied to its data and the indicated dimension(s) removed.

min(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying min along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply min.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply min. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then min is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating min on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with min applied to its data and the indicated dimension(s) removed.

prod(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying prod along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply prod.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply prod. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then prod is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • min_count (int, default: None) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. Only used if skipna is set to True or defaults to True for the array’s dtype. New in version 0.10.8: Added with the default being None. Changed in version 0.17.0: if specified on an integer array and skipna=True, the result will be a float array.

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating prod on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with prod applied to its data and the indicated dimension(s) removed.

reduce(func, dim=None, axis=None, *args, **kwargs)[source]

Reduce the items in this group by applying func along some dimension(s).

Parameters:
  • func (function) – Function which can be called in the form func(x, axis=axis, **kwargs) to return the result of collapsing an np.ndarray over an integer valued axis.

  • dim (str or sequence of str, optional) – Not used in this implementation.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply func. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then func is calculated over all dimensions for each group item.

  • **kwargs (dict) – Additional keyword arguments passed on to func.

Returns:

reduced (ScmRun) – Array with summarized data and the indicated dimension(s) removed.
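
A function suitable for reduce() only needs to accept an array and an axis keyword. The following is a minimal sketch, assuming NumPy is available; peak_to_peak is a hypothetical helper, and it is called directly on an array here rather than through reduce().

```python
import numpy as np


def peak_to_peak(x, axis=None, **kwargs):
    """Collapse `x` along `axis` by taking the max-min range."""
    # Extra keyword arguments are accepted but ignored in this sketch
    return np.max(x, axis=axis) - np.min(x, axis=axis)


data = np.array([[1.0, 2.0], [3.0, 6.0]])
# Collapse the first axis, leaving one value per column
collapsed = peak_to_peak(data, axis=0)
```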

std(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying std along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply std.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply std. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then std is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating std on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with std applied to its data and the indicated dimension(s) removed.

sum(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying sum along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply sum.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply sum. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then sum is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • min_count (int, default: None) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. Only used if skipna is set to True or defaults to True for the array’s dtype. New in version 0.10.8: Added with the default being None. Changed in version 0.17.0: if specified on an integer array and skipna=True, the result will be a float array.

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating sum on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with sum applied to its data and the indicated dimension(s) removed.
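
The interplay of skipna and min_count described above can be illustrated on a flat list of floats. This is a hypothetical plain-Python analogue (nan_aware_sum is not part of scmdata), intended only to show the documented semantics.

```python
import math


def nan_aware_sum(values, skipna=True, min_count=None):
    """Hypothetical 1-D analogue of sum(skipna=..., min_count=...)."""
    if not skipna:
        # Without skipping, any NaN propagates into the result
        return sum(values, 0.0)
    valid = [v for v in values if not math.isnan(v)]
    if min_count is not None and len(valid) < min_count:
        # Too few valid values: the result is NA
        return math.nan
    return sum(valid, 0.0)


values = [1.0, math.nan, 3.0]
skipped = nan_aware_sum(values)                  # NaN ignored
too_few = nan_aware_sum(values, min_count=3)     # only two valid values
```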

var(dim=None, axis=None, skipna=None, **kwargs)

Reduce this RunGroupBy’s data by applying var along some dimension(s).

Parameters:
  • dim (str or sequence of str, optional) – Dimension(s) over which to apply var.

  • axis (int or sequence of int, optional) – Axis(es) over which to apply var. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then var is calculated over axes.

  • skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).

  • keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.

  • **kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating var on this object’s data.

Returns:

reduced (RunGroupBy) – New RunGroupBy object with var applied to its data and the indicated dimension(s) removed.

get_joblib_parallel_processor

get_joblib_parallel_processor(n_jobs=-1, backend='loky', *args, **kwargs)[source]

Get parallel processor using joblib as the backend.

Parameters:
  • n_jobs (int) – Number of jobs to run in parallel. If -1, all CPUs are used.

  • backend (str) – Backend used for parallelisation. Defaults to ‘loky’ which uses separate processes for each worker. See joblib.Parallel for a more complete description of the available options.

  • *args (typing.Any) – Passed to initialiser of joblib.Parallel

  • **kwargs (typing.Any) – Passed to initialiser of joblib.Parallel

Returns:

typing.Callable[[typing.Callable[[typing.TypeVar(RunLike, bound= scmdata.run.BaseScmRun), typing.ParamSpec(Q)], typing.Union[typing.TypeVar(RunLike, bound= scmdata.run.BaseScmRun), pandas.core.frame.DataFrame, None]], collections.abc.Iterable[typing.TypeVar(RunLike, bound= scmdata.run.BaseScmRun)], typing.ParamSpec(Q)], collections.abc.Iterable[typing.Union[typing.TypeVar(RunLike, bound= scmdata.run.BaseScmRun), pandas.core.frame.DataFrame, None]]] – Function that can be used for parallel processing in RunGroupBy.apply_parallel()