scmdata.groupby
Functionality for grouping and filtering ScmRun objects
RunGroupBy
- class RunGroupBy(run, groups, na_fill_value=-10000)[source]
Bases: ImplementsArrayReduce, Generic[GenericRun]
GroupBy object specialized to grouping ScmRun objects
- all(dim=None, axis=None, **kwargs)
Reduce this RunGroupBy’s data by applying all along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply all.
axis (int or sequence of int, optional) – Axis(es) over which to apply all. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then all is calculated over axes.
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating all on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with all applied to its data and the indicated dimension(s) removed.
- any(dim=None, axis=None, **kwargs)
Reduce this RunGroupBy’s data by applying any along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply any.
axis (int or sequence of int, optional) – Axis(es) over which to apply any. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then any is calculated over axes.
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating any on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with any applied to its data and the indicated dimension(s) removed.
- apply(func, *args, **kwargs)[source]
Apply a function to each group and append the results
func is called like func(ar, *args, **kwargs) for each ScmRun group. If the result of this function call is None, then it is excluded from the results.
The results are appended together using run_append(). The function can change the size of the input ScmRun as long as run_append() can be applied to all results.
Examples
>>> from scmdata import ScmRun
>>> def show_var_and_convert_unit(arr: ScmRun) -> ScmRun:
...     variable = arr.get_unique_meta("variable", True)
...     unit = arr.get_unique_meta("unit", True)
...     print(f"{variable}'s original unit was {unit}")
...
...     return arr.convert_unit("MtC")
>>> df = ScmRun(
...     data=[[1, 2], [3, 4]],
...     index=[2010, 2020],
...     columns={
...         "variable": ["v1", "v2"],
...         "model": "model",
...         "scenario": "scenario",
...         "region": "World",
...         "unit": ["tC", "GtC"],
...     },
... )
>>> df.groupby("variable").apply(show_var_and_convert_unit)
v1's original unit was tC
v2's original unit was GtC
<ScmRun (timeseries: 2, timepoints: 2)>
Time:
    Start: 2010-01-01T00:00:00
    End: 2020-01-01T00:00:00
Meta:
       model region  scenario unit variable
    0  model  World  scenario  MtC       v1
    1  model  World  scenario  MtC       v2
- Parameters:
func (Callable[Concatenate[GenericRun, P], GenericRun | (pd.DataFrame | None)]) – Callable to apply to each group.
*args (P.args) – Positional arguments passed to func.
**kwargs (P.kwargs) – Keyword arguments passed to func.
- Returns:
GenericRun – The result of applying and combining.
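The group, apply, and append behaviour described for apply() can be illustrated without scmdata. The following is a minimal pure-Python sketch, not the library's implementation: plain lists of dicts stand in for ScmRun objects, list concatenation stands in for run_append(), and the names group_apply and keep_positive are hypothetical.

```python
from collections import defaultdict


def group_apply(rows, key, func, *args, **kwargs):
    """Group rows by `key`, call func on each group, and combine the results.

    Hypothetical stand-in for RunGroupBy.apply(); lists replace ScmRun objects.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    combined = []
    for group in groups.values():
        result = func(group, *args, **kwargs)
        if result is None:
            # A None result is excluded from the combined output
            continue
        # List concatenation stands in for run_append()
        combined.extend(result)
    return combined


def keep_positive(group, scale=1.0):
    """Return None for groups with no positive values; otherwise scale values."""
    if all(row["value"] <= 0 for row in group):
        return None
    return [{**row, "value": row["value"] * scale} for row in group]


rows = [
    {"variable": "v1", "value": 1.0},
    {"variable": "v1", "value": 2.0},
    {"variable": "v2", "value": -3.0},
]
result = group_apply(rows, "variable", keep_positive, scale=10.0)
print(result)
```

Here the "v2" group returns None and so does not appear in the combined result, while extra keyword arguments (scale) are forwarded to the function unchanged.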
- apply_parallel(func, parallel_processor=None, *args, **kwargs)[source]
Apply a function to each group in parallel and append the results
Provides the same functionality as apply() except that parallel processing can be used via the parallel_processor argument. By default, joblib is used to apply func to each group in parallel. This can be slower than using apply() for small numbers of groups or in the case where func is fast, as there is overhead in setting up the processing pool.
- Parameters:
func (ApplyCallable[GenericRun, P]) – Callable to apply to each group.
parallel_processor (ParallelProcessor[GenericRun, P] | None) –
Parallel processor to use to process the groups. If not provided, a default joblib parallel processor is used (for details, see
*args (P.args) – Positional arguments passed to func.
**kwargs (P.kwargs) – Keyword arguments passed to func.
- Returns:
GenericRun – The result of applying and combining.
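As a rough illustration of what a parallel processor might look like, the sketch below assumes the processor is called as processor(func, groups, *args, **kwargs) and returns an iterable of per-group results. That calling convention is inferred from the return type documented for get_joblib_parallel_processor() and is not confirmed here. The sketch swaps in a standard-library thread pool rather than joblib, and thread_parallel_processor and total are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def thread_parallel_processor(func, groups, *args, **kwargs):
    """Call func(group, *args, **kwargs) for every group using a thread pool.

    Assumed calling convention: (func, groups, *args, **kwargs) -> iterable
    of results. Uses threads from the standard library, not joblib.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(func, group, *args, **kwargs) for group in groups]
        # Collect results in submission order so they line up with the groups
        return [future.result() for future in futures]


def total(group, offset=0):
    """Toy per-group function; lists of numbers stand in for ScmRun groups."""
    return sum(group) + offset


groups = [[1, 2], [3, 4], [5]]
results = thread_parallel_processor(total, groups, offset=1)
print(results)
```

For functions as cheap as total, the pool setup costs more than it saves, which is the overhead the description above warns about.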
- count(dim=None, axis=None, **kwargs)
Reduce this RunGroupBy’s data by applying count along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply count.
axis (int or sequence of int, optional) – Axis(es) over which to apply count. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then count is calculated over axes.
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating count on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with count applied to its data and the indicated dimension(s) removed.
- map(func, *args, **kwargs)[source]
Apply a function to each group and append the results
Deprecated since version 0.14.2: map() will be removed in scmdata 1.0.0. It has been renamed to apply(), with identical functionality.
- max(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying max along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply max.
axis (int or sequence of int, optional) – Axis(es) over which to apply max. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then max is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating max on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with max applied to its data and the indicated dimension(s) removed.
- mean(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying mean along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply mean.
axis (int or sequence of int, optional) – Axis(es) over which to apply mean. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then mean is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating mean on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with mean applied to its data and the indicated dimension(s) removed.
- median(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying median along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply median.
axis (int or sequence of int, optional) – Axis(es) over which to apply median. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then median is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating median on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with median applied to its data and the indicated dimension(s) removed.
- min(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying min along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply min.
axis (int or sequence of int, optional) – Axis(es) over which to apply min. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then min is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating min on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with min applied to its data and the indicated dimension(s) removed.
- prod(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying prod along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply prod.
axis (int or sequence of int, optional) – Axis(es) over which to apply prod. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then prod is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
min_count (int, default: None) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. Only used if skipna is set to True or defaults to True for the array’s dtype. New in version 0.10.8: Added with the default being None. Changed in version 0.17.0: if specified on an integer array and skipna=True, the result will be a float array.
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating prod on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with prod applied to its data and the indicated dimension(s) removed.
- reduce(func, dim=None, axis=None, *args, **kwargs)[source]
Reduce the items in this group by applying func along some dimension(s).
- Parameters:
func (function) – Function which can be called in the form func(x, axis=axis, **kwargs) to return the result of collapsing an np.ndarray over an integer valued axis.
dim (…, str or sequence of str, optional) – Not used in this implementation
axis (int or sequence of int, optional) – Axis(es) over which to apply func. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither is supplied, then func is calculated over all dimensions for each group item.
**kwargs (dict) – Additional keyword arguments passed on to func.
- Returns:
reduced (ScmRun) – Array with summarized data and the indicated dimension(s) removed.
- std(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying std along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply std.
axis (int or sequence of int, optional) – Axis(es) over which to apply std. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then std is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating std on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with std applied to its data and the indicated dimension(s) removed.
- sum(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying sum along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply sum.
axis (int or sequence of int, optional) – Axis(es) over which to apply sum. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then sum is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
min_count (int, default: None) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. Only used if skipna is set to True or defaults to True for the array’s dtype. New in version 0.10.8: Added with the default being None. Changed in version 0.17.0: if specified on an integer array and skipna=True, the result will be a float array.
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating sum on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with sum applied to its data and the indicated dimension(s) removed.
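The interaction between skipna and min_count can be easier to see on a flat list of numbers. The following is a pure-Python sketch of the semantics described above, not the array function that is actually called; nan_aware_sum is a hypothetical name and math.nan stands in for NA.

```python
import math


def nan_aware_sum(values, skipna=True, min_count=None):
    """Sum floats, following the skipna / min_count rules described above."""
    if not skipna:
        # Without skipping, any NaN propagates into the result
        return sum(values)

    valid = [v for v in values if not math.isnan(v)]
    # min_count is only consulted when missing values are being skipped
    if min_count is not None and len(valid) < min_count:
        return math.nan
    return sum(valid)


data = [1.0, math.nan, 2.0]
print(nan_aware_sum(data))                # NaN skipped
print(nan_aware_sum(data, min_count=3))   # only 2 valid values, so NaN
print(nan_aware_sum(data, skipna=False))  # NaN propagates
```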
- var(dim=None, axis=None, skipna=None, **kwargs)
Reduce this RunGroupBy’s data by applying var along some dimension(s).
- Parameters:
dim (str or sequence of str, optional) – Dimension(s) over which to apply var.
axis (int or sequence of int, optional) – Axis(es) over which to apply var. Only one of the ‘dim’ and ‘axis’ arguments can be supplied. If neither are supplied, then var is calculated over axes.
skipna (bool, optional) – If True, skip missing values (as marked by NaN). By default, only skips missing values for float dtypes; other dtypes either do not have a sentinel missing value (int) or skipna=True has not been implemented (object, datetime64 or timedelta64).
keep_attrs (bool, optional) – If True, the attributes (attrs) will be copied from the original object to the new one. If False (default), the new object will be returned without attributes.
**kwargs (dict) – Additional keyword arguments passed on to the appropriate array function for calculating var on this object’s data.
- Returns:
reduced (RunGroupBy) – New RunGroupBy object with var applied to its data and the indicated dimension(s) removed.
get_joblib_parallel_processor
- get_joblib_parallel_processor(n_jobs=-1, backend='loky', *args, **kwargs)[source]
Get parallel processor using joblib as the backend.
- Parameters:
n_jobs (int) – Number of jobs to run in parallel. If -1 all CPUs are used.
backend (str) – Backend used for parallelisation. Defaults to ‘loky’ which uses separate processes for each worker. See joblib.Parallel for a more complete description of the available options.
*args (typing.Any) – Passed to initialiser of joblib.Parallel
**kwargs (typing.Any) – Passed to initialiser of joblib.Parallel
- Returns:
typing.Callable[[typing.Callable[[typing.TypeVar(RunLike, bound=scmdata.run.BaseScmRun), typing.ParamSpec(Q)], typing.Union[typing.TypeVar(RunLike, bound=scmdata.run.BaseScmRun), pandas.core.frame.DataFrame, None]], collections.abc.Iterable[typing.TypeVar(RunLike, bound=scmdata.run.BaseScmRun)], typing.ParamSpec(Q)], collections.abc.Iterable[typing.Union[typing.TypeVar(RunLike, bound=scmdata.run.BaseScmRun), pandas.core.frame.DataFrame, None]]] – Function that can be used for parallel processing in RunGroupBy.apply_parallel()
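The n_jobs=-1 default follows joblib's convention for negative values, where -1 means all CPUs and -2 means all but one. The sketch below mirrors that documented rule in plain Python for illustration only. resolve_n_jobs is a hypothetical helper and is not part of scmdata or joblib; rejecting 0 is a choice made in this sketch.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs setting into a concrete worker count.

    Hypothetical helper: negative values count back from the number of CPUs,
    so -1 gives all CPUs and -2 gives all but one.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must be non-zero")
    if n_jobs < 0:
        # Never resolve to fewer than one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-1, n_cpus=8))  # all 8 CPUs
print(resolve_n_jobs(-2, n_cpus=8))  # all but one
print(resolve_n_jobs(3, n_cpus=8))   # explicit count is used as given
```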