Pandas DataFrame groupby() Function

1. Pandas groupby() function

Pandas DataFrame groupby() function is used to group rows that have the same values. It’s mostly used with aggregate functions (count, sum, min, max, mean) to get the statistics based on one or more column values.
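
As a rough illustration of pairing groupby() with the aggregate functions listed above, here is a minimal sketch. The DataFrame is built inline and reuses the values from the CSV example later in this article; the variable name stats is just a choice made for this sketch.

```python
import pandas as pd

# Sample data built inline for this sketch (same values as the CSV used later).
df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Group rows that share a Role, then compute several statistics of Salary per group.
stats = df.groupby("Role")["Salary"].agg(["count", "sum", "min", "max", "mean"])
print(stats)
```

Each row of the result describes one group, and each column holds one of the requested statistics.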

The Pandas groupby() function is very similar to the SQL GROUP BY statement; after all, a DataFrame and an SQL table are both row-and-column structures. It is an intermediate step that creates the groups before an aggregate function produces the final result.
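
To make the SQL comparison concrete, the following sketch runs the same aggregation both ways. It assumes an in-memory SQLite database and a table name, records, chosen only for this illustration; the data is the same as in the CSV example later in this article.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Copy the rows into an in-memory SQLite table (table name is an assumption).
conn = sqlite3.connect(":memory:")
df.to_sql("records", conn, index=False)

# SQL version: GROUP BY with an aggregate function.
sql_result = pd.read_sql_query(
    "SELECT Role, AVG(Salary) AS Salary FROM records GROUP BY Role ORDER BY Role",
    conn,
)
conn.close()

# Pandas version: groupby() followed by the matching aggregate.
pandas_result = df.groupby("Role", as_index=False)["Salary"].mean()

print(sql_result)
print(pandas_result)
```

Both results should contain one row per role with the average salary of that role.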

2. Split Apply Combine

Grouping with groupby() is also called the split-apply-combine process. The groupby() function splits the data based on some criteria. The aggregate function is then applied to each group, and the per-group results are combined to create the result DataFrame. The diagram below illustrates this behavior with a simple example.

Split Apply Combine Example
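
The three steps can also be spelled out by hand. The sketch below is only an illustration of the idea, not how pandas implements it internally; the names pieces, means, manual, and direct are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Split: iterating over the groupby object yields (key, sub-DataFrame) pairs.
pieces = {role: group for role, group in df.groupby("Role")}

# Apply: compute one statistic for each piece.
means = {role: group["Salary"].mean() for role, group in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(means, name="Salary").rename_axis("Role")

# The usual one-line equivalent.
direct = df.groupby("Role")["Salary"].mean()

print(manual)
print(direct)
```

The manual version and the chained call should produce the same per-role averages; in practice the chained form is the one to use.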

3. Pandas DataFrame groupby() Syntax

The groupby() function syntax is:

groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

by: determines how the elements are grouped. Generally, column names are used to group the DataFrame rows.
axis: determines whether to group along rows (0) or columns (1).
level: used with a MultiIndex (hierarchical index) to group by a particular level or levels.
as_index: if True, the aggregated output uses the group labels as its index; if False, the labels are returned as regular columns.
sort: sorts the group keys. We can pass False for better performance with larger DataFrame objects.
group_keys: when calling apply, add group keys to the index to identify the pieces.
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type (deprecated in pandas 1.1 and removed in pandas 2.0).
observed: if True, only show observed values for categorical groupers. If False, show all values for categorical groupers.
**kwargs: only accepts the keyword argument ‘mutated’, which is passed to groupby.
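
A short sketch of how two of these parameters, as_index and sort, change the result. The data is the same as in the CSV example below, and the variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Defaults: group labels become the index and the groups are sorted by key.
default = df.groupby("Role")["Salary"].sum()

# as_index=False: group labels are kept as a regular column.
flat = df.groupby("Role", as_index=False)["Salary"].sum()

# sort=False: groups appear in order of first occurrence (Editor, then Author).
unsorted = df.groupby("Role", sort=False)["Salary"].sum()

print(default, flat, unsorted, sep="\n\n")
```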

The groupby() function returns DataFrameGroupBy or SeriesGroupBy depending on the calling object.
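
The return type can be checked directly. This is a minimal sketch with inline data; it only inspects the class names of the returned objects.

```python
import pandas as pd

df = pd.DataFrame({
    "Role": ["Editor", "Editor", "Author", "Author", "Author"],
    "Salary": [10000, 8000, 6000, 4000, 5000],
})

# Calling groupby() on a DataFrame gives a DataFrameGroupBy object.
by_role_frame = df.groupby("Role")

# Calling groupby() on a Series gives a SeriesGroupBy object.
by_role_series = df["Salary"].groupby(df["Role"])

# Selecting a single column from a DataFrameGroupBy also gives a SeriesGroupBy.
by_role_column = df.groupby("Role")["Salary"]

print(type(by_role_frame).__name__)
print(type(by_role_series).__name__)
print(type(by_role_column).__name__)
```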

4. Pandas groupby() Example

Let’s say we have a CSV file with the below content.

ID,Name,Role,Salary
1,Pankaj,Editor,10000
2,Lisa,Editor,8000
3,David,Author,6000
4,Ram,Author,4000
5,Anupam,Author,5000

We will use Pandas read_csv() function to read the CSV file and create the DataFrame object.

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('records.csv')
print(df)

Output:

   ID    Name    Role  Salary
0   1  Pankaj  Editor   10000
1   2    Lisa  Editor    8000
2   3   David  Author    6000
3   4     Ram  Author    4000
4   5  Anupam  Author    5000

4.1) Average Salary Group By Role

We want to know the average salary of the employees based on their role. So we will use groupby() function to create groups based on the ‘Role’ column. Then call the aggregate function mean() to calculate the average and produce the result. Since we don’t need ID and Name columns, we will remove them from the output.

# Group the rows by the Role column
df_groupby_role = df.groupby(['Role'])
# Select only the column to aggregate; Role is already the grouping key
df_groupby_role = df_groupby_role[["Salary"]]
# Get the average salary of each group
df_groupby_role_mean = df_groupby_role.mean()
print(df_groupby_role_mean)

Output:

        Salary
Role
Author  5000.0
Editor  9000.0

The group labels have become the index of the result, which is not always convenient. We can turn them back into a regular column by calling the reset_index() function.

df_groupby_role_mean = df_groupby_role_mean.reset_index()
print(df_groupby_role_mean)

Output:

     Role  Salary
0  Author  5000.0
1  Editor  9000.0

4.2) Total Salary Paid By Role

In this example, we will calculate the total salary paid for each role.

df_salary_by_role = df.groupby(['Role'])[["Salary"]].sum().reset_index()
print(df_salary_by_role)

Output:

     Role  Salary
0  Author   15000
1  Editor   18000

This example is shorter because all the calls are chained in a single statement. In the earlier example, I split the steps for clarity.

4.3) Total Number of Employees by Role

We can use the size() function, which counts the rows in each group, to get this data.

# size() returns a Series of group sizes; name= labels the resulting column
df_size_by_role = df.groupby(['Role']).size().reset_index(name='Count')
print(df_size_by_role)

Output:

     Role  Count
0  Author      3
1  Editor      2

Translated from: https://www.journaldev.com/33402/pandas-dataframe-groupby-function