Python groupby agg function

Contents

Explaining the Grouper and Agg Functions in Pandas
Grouping Time Series Data
New and Improved Aggregate Function
pandas.core.groupby.DataFrameGroupBy.aggregate

Explaining the Grouper and Agg Functions in Pandas

Every once in a while it is useful to take a step back and look for new ways to solve old problems. While working on a problem recently, I noticed that pandas has a Grouper helper (pd.Grouper) that I had never called before. I looked into how it can be used, and it turns out to be useful for the type of summary analysis I tend to do.

In addition to its long-standing functions, pandas continues to provide new and improved capabilities with every release. The updated agg function, for example, is another very useful and intuitive tool for summarizing data.

This article walks through how you can use Grouper and agg on your own data. Along the way I will include a few tips and tricks on how to use them most effectively.

Grouping Time Series Data

Pandas has its origins in the financial industry, so it is no surprise that it has robust capabilities for handling time series data. Just look at the extensive time series documentation to get a feel for all the options.

Consider some sample sales data and a few simple operations for getting total sales by month, day, year, and so on.

import pandas as pd

# Load the sample sales workbook straight from GitHub
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df.head()

account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

Note the data types:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   account number  1500 non-null   int64
 1   name            1500 non-null   object
 2   sku             1500 non-null   object
 3   quantity        1500 non-null   int64
 4   unit price      1500 non-null   float64
 5   ext price       1500 non-null   float64
 6   date            1500 non-null   object
dtypes: float64(2), int64(2), object(3)
memory usage: 82.2+ KB

Convert the date column to the datetime type:

df["date"] = pd.to_datetime(df['date'])

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

Before going further, it is useful to get familiar with Offset Aliases. These strings are used to represent various common time frequencies such as days, weeks, and years.
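As a small illustration of what an offset alias stands for, the sketch below converts a few alias strings into offset objects with pandas.tseries.frequencies.to_offset. The aliases used here ("D", "W", "MS") are ones that, as far as I know, are spelled the same across pandas releases. Several others, the month-end and year-end aliases in particular, were renamed in recent versions, so check the documentation for the version you run.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Alias strings are shorthand for offset objects.
daily = to_offset("D")
weekly = to_offset("W")         # weekly, anchored on Sunday by default
month_start = to_offset("MS")   # first calendar day of each month

# Adding an offset to a timestamp moves it to the next boundary.
ts = pd.Timestamp("2014-01-15")  # a Wednesday
next_month_start = ts + month_start
next_week_end = ts + weekly

print(next_month_start.date(), next_week_end.date())
```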

For example, if you want to total all sales by month, you can use the resample function. One quirk of resample is that, by default, it operates on the index. In this data set the rows are not indexed by the date column, so resample will not work as-is without restructuring the data.

Use set_index to make the date column the index, and then resample:

df.set_index('date').resample('M')["ext price"].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64
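To try the same pattern without downloading the workbook, here is a minimal sketch on a made-up four-row frame (the sales variable and its values are hypothetical, not the article's data). It passes a pd.offsets.MonthEnd() object instead of an alias string, because the spelling of the month-end alias differs between pandas versions. It also shows the on= argument of DataFrame.resample, which resamples on a column and so avoids the set_index step.

```python
import pandas as pd

# Hypothetical miniature of the sales data; the values are made up.
sales = pd.DataFrame(
    {
        "name": ["Barton LLC", "Kulas Inc", "Barton LLC", "Kulas Inc"],
        "ext price": [100.0, 50.0, 25.0, 10.0],
        "date": pd.to_datetime(
            ["2014-01-05", "2014-01-20", "2014-02-03", "2014-02-17"]
        ),
    }
)

# Offset object equivalent to the month-end alias used in the article.
month_end = pd.offsets.MonthEnd()

# Approach from the article: move date into the index, then resample.
monthly = sales.set_index("date").resample(month_end)["ext price"].sum()

# Alternative: resample directly on the column via on=.
monthly_on = sales.resample(month_end, on="date")["ext price"].sum()

print(monthly)
```

Both calls should produce the same monthly totals, labelled with the last day of each month.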

This is a fairly straightforward way to summarize the data, but it gets more complicated if you also want to group the data.

For example, we can look at the monthly results for each customer:

df.set_index('date').groupby('name')["ext price"].resample("M").sum()

name        date
Barton LLC  2014-01-31     6177.57
            2014-02-28    12218.03
            2014-03-31     3513.53
            2014-04-30    11474.20
            2014-05-31    10220.17
                            ...
Will LLC    2014-08-31     1439.82
            2014-09-30     4345.99
            2014-10-31     7085.33
            2014-11-30     3210.44
            2014-12-31    12561.21
Name: ext price, Length: 240, dtype: float64

This works, but it feels a bit clunky.

Fortunately, Grouper makes this simpler.

Instead of juggling the index, we can use the normal groupby syntax and just provide a little more information about how to group the data in the date column:

df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()

name        date
Barton LLC  2014-01-31     6177.57
            2014-02-28    12218.03
            2014-03-31     3513.53
            2014-04-30    11474.20
            2014-05-31    10220.17
                            ...
Will LLC    2014-08-31     1439.82
            2014-09-30     4345.99
            2014-10-31     7085.33
            2014-11-30     3210.44
            2014-12-31    12561.21
Name: ext price, Length: 240, dtype: float64

Since groupby is one of my favorite functions, this approach seems simpler to me and is more likely to stick in my memory.
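The two approaches can be compared side by side on a small made-up frame. This is only a sketch: the sales data below is hypothetical, and pd.offsets.MonthEnd() is used in place of the alias string so that the example does not depend on how a particular pandas version spells the month-end alias.

```python
import pandas as pd

# Hypothetical sample; names echo the article, values are made up.
sales = pd.DataFrame(
    {
        "name": ["Barton LLC", "Barton LLC", "Barton LLC", "Kulas Inc", "Kulas Inc"],
        "ext price": [100.0, 50.0, 25.0, 10.0, 5.0],
        "date": pd.to_datetime(
            ["2014-01-05", "2014-01-20", "2014-02-03", "2014-01-10", "2014-02-17"]
        ),
    }
)

month_end = pd.offsets.MonthEnd()

# Grouper: ordinary groupby syntax, date stays a regular column.
by_grouper = sales.groupby(
    ["name", pd.Grouper(key="date", freq=month_end)]
)["ext price"].sum()

# Earlier approach: set the index, group, then resample.
by_resample = (
    sales.set_index("date")
    .groupby("name")["ext price"]
    .resample(month_end)
    .sum()
)

print(by_grouper)
```

On this sample both results carry the same (name, date) index and the same totals. One difference to keep in mind is that the Grouper version only lists months that actually contain rows for a given customer.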

A nice side benefit is that to summarize over a different time interval, you only need to change the freq parameter to one of the valid offset aliases.

For instance, an annual summary using December as the last month would look like this:

df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()

name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

If your annual sales are on a non-calendar (fiscal-year) basis, the grouping can easily be changed by passing a different freq value.
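As a hedged sketch of the fiscal-year case, suppose the year closes at the end of November (an assumption made purely for illustration). The example below uses the offset object pd.offsets.YearEnd(month=11) on a made-up frame. An alias string following the same pattern as the December one above should work as well, but its exact spelling depends on the pandas version.

```python
import pandas as pd

# Hypothetical sales for one customer; the values are made up.
sales = pd.DataFrame(
    {
        "name": ["Barton LLC", "Barton LLC", "Barton LLC"],
        "ext price": [100.0, 50.0, 25.0],
        "date": pd.to_datetime(["2014-03-15", "2014-11-15", "2014-12-01"]),
    }
)

# Assumed fiscal calendar: the year ends on November 30.
fiscal_year_end = pd.offsets.YearEnd(month=11)

fiscal = sales.groupby(
    ["name", pd.Grouper(key="date", freq=fiscal_year_end)]
)["ext price"].sum()

print(fiscal)
```

The March and November rows land in the fiscal year ending 2014-11-30, while the December row rolls into the year ending 2015-11-30.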

I encourage you to play around with different offsets to get a feel for how this works. When summarizing time series data, this is incredibly handy.

Try doing the same in Excel. It is certainly possible (using pivot tables and custom grouping), but I do not think it is nearly as intuitive as the pandas approach.

New and Improved Aggregate Function

In pandas 0.20.0, a new agg function was added that makes it much simpler to summarize data in a manner similar to the groupby API.

To illustrate the functionality, suppose we need the sum of the ext price and quantity columns as well as the average of the unit price column.
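One way this could look is sketched below on a made-up three-row frame (the sales variable and its values are hypothetical). Passing a dict to DataFrame.agg maps each column to the aggregation, or list of aggregations, to apply to it.

```python
import pandas as pd

# Hypothetical rows using the article's column names; values are made up.
sales = pd.DataFrame(
    {
        "ext price": [100.0, 50.0, 25.0],
        "quantity": [10, 5, 1],
        "unit price": [10.0, 10.0, 25.0],
    }
)

# One aggregation per column: the result is a Series indexed by column name.
summary = sales.agg(
    {"ext price": "sum", "quantity": "sum", "unit price": "mean"}
)

# Lists of aggregations: the result is a DataFrame with one row per
# aggregation, and NaN where an aggregation was not requested for a column.
detailed = sales.agg(
    {"ext price": ["sum", "mean"], "quantity": ["sum"], "unit price": ["mean"]}
)

print(summary)
print(detailed)
```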

pandas.core.groupby.DataFrameGroupBy.aggregate

Aggregate using one or more operations over the specified axis.

Parameters

func : function, str, list, dict or None

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
None, in which case **kwargs are used with Named Aggregation. Here the output has one column for each element in **kwargs . The name of the column is keyword, whereas the value determines the aggregation used to compute the values in the column.

Can also accept a Numba JIT function with engine='numba' specified. Only passing a single function is supported with this engine.

If the 'numba' engine is chosen, the function must be a user defined function with values and index as the first and second arguments respectively in the function signature. Each group's index will be passed to the user defined function and optionally available for use.

*args
Positional arguments to pass to func.

engine : str, default None

'cython' : Runs the function through C-extensions from cython.
'numba' : Runs the function through JIT compiled code from numba.
None : Defaults to 'cython' or globally setting compute.use_numba

engine_kwargs : dict, default None

For 'cython' engine, there are no accepted engine_kwargs
For 'numba' engine, the engine can accept nopython, nogil and parallel dictionary keys. The values must either be True or False. The default engine_kwargs for the 'numba' engine is {'nopython': True, 'nogil': False, 'parallel': False} and will be applied to the function

**kwargs

If func is None, **kwargs are used to define the output names and aggregations via Named Aggregation. See func entry.
Otherwise, keyword arguments to be passed into func.

See also

DataFrameGroupBy.apply
Apply function func group-wise and combine the results together.

DataFrameGroupBy.transform
Transforms the Series on each group based on the given function.

DataFrame.aggregate
Aggregate using one or more operations over the specified axis.

When using engine='numba', there will be no "fall back" behavior internally. The group data and group index will be passed as numpy arrays to the JITed user defined function, and no alternative execution attempts will be tried.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

Changed in version 1.3.0: The resulting dtype will reflect the return value of the passed func , see the examples below.

>>> df = pd.DataFrame(
...     {
...         "A": [1, 1, 2, 2],
...         "B": [1, 2, 3, 4],
...         "C": [0.362838, 0.227877, 1.267767, -0.562860],
...     }
... )

>>> df
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')
   B         C
A
1  1  0.227877
2  3 -0.562860

>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])
   min  max
A
1    1    2
2    3    4

User-defined function for aggregation

>>> df.groupby('A').agg(lambda x: sum(x) + 2)
   B         C
A
1  5  2.590715
2  9  2.704907

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
    B             C
  min max       sum
A
1   1   2  0.590715
2   3   4  0.704907

To control the output names with different aggregations per column, pandas supports “named aggregation”

>>> df.groupby("A").agg(
...     b_min=pd.NamedAgg(column="B", aggfunc="min"),
...     c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
   b_min     c_sum
A
1      1  0.590715
2      3  0.704907

The keywords are the output column names
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
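Because pandas.NamedAgg is a namedtuple, plain (column, aggfunc) tuples can be passed in its place. The short sketch below reuses the example frame from this page and checks that both spellings give the same result. The variable names with_tuples and with_namedagg are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2],
        "B": [1, 2, 3, 4],
        "C": [0.362838, 0.227877, 1.267767, -0.562860],
    }
)

# Keyword = output column name, value = (input column, aggregation).
with_tuples = df.groupby("A").agg(b_min=("B", "min"), c_sum=("C", "sum"))

# The same request spelled with pd.NamedAgg for readability.
with_namedagg = df.groupby("A").agg(
    b_min=pd.NamedAgg(column="B", aggfunc="min"),
    c_sum=pd.NamedAgg(column="C", aggfunc="sum"),
)

print(with_tuples.equals(with_namedagg))
```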

Changed in version 1.3.0: The resulting dtype will reflect the return value of the aggregating function.

>>> df.groupby("A")[["B"]].agg(lambda x: x.astype(float).min())
     B
A
1  1.0
2  3.0
