Pandas Drop Duplicate Rows - drop_duplicates() Function

Pandas drop_duplicates() Function Syntax

Pandas drop_duplicates() function removes duplicate rows from the DataFrame. Its syntax is:

drop_duplicates(self, subset=None, keep="first", inplace=False)

subset: column label or sequence of labels to consider when identifying duplicate rows. By default, all the columns are used to find the duplicate rows.
keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', all duplicate rows except the first one are deleted. If 'last', all duplicate rows except the last one are deleted. If False, every duplicate row is deleted, including the first occurrence.
inplace: if True, the source DataFrame is changed and None is returned. By default, the source DataFrame remains unchanged and a new DataFrame instance is returned.
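
To make the keep values concrete, here is a small sketch that compares drop_duplicates() with filtering on DataFrame.duplicated(), which takes the same subset and keep arguments. The variable names (df, mask, filtered, dropped) are only illustrative and the data is a made-up sample.

```python
import pandas as pd

# Illustrative sample data: rows 0 and 1 are identical.
df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

for keep in ('first', 'last', False):
    # duplicated() marks the rows that drop_duplicates() would remove.
    mask = df.duplicated(keep=keep)
    filtered = df[~mask]
    dropped = df.drop_duplicates(keep=keep)
    # Print the surviving index labels and whether both approaches agree.
    print(keep, list(dropped.index), filtered.equals(dropped))
```

Both approaches should select the same rows for each keep value, so duplicated() is a convenient way to preview what drop_duplicates() is going to remove.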

Pandas Drop Duplicate Rows Examples

Let’s look into some examples of dropping duplicate rows from a DataFrame object.

1. Drop Duplicate Rows Keeping the First One

This is the default behavior when no arguments are passed.

import pandas as pd
d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}
source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)
# keep first duplicate row
result_df = source_df.drop_duplicates()
print('Result DataFrame:\n', result_df)

Output:

Source DataFrame:
    A  B  C
0  1  2  3
1  1  2  3
2  1  2  4
3  2  3  5
Result DataFrame:
    A  B  C
0  1  2  3
2  1  2  4
3  2  3  5

The source DataFrame rows 0 and 1 are duplicates. The first occurrence is kept and the rest of the duplicates are deleted.
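
Notice that the result keeps the original index labels 0, 2 and 3. If a contiguous index is preferred, one option is to renumber the rows afterwards. The sketch below assumes the same sample data; renumbered_df is just an illustrative name.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# The default result keeps the original labels 0, 2, 3.
result_df = source_df.drop_duplicates()

# Renumber the surviving rows as 0, 1, 2 and discard the old labels.
renumbered_df = result_df.reset_index(drop=True)
print(renumbered_df)
```

Newer pandas releases also accept drop_duplicates(ignore_index=True) for the same effect; check the version you are running before relying on it.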

2. Drop Duplicates and Keep Last Row

result_df = source_df.drop_duplicates(keep='last')
print('Result DataFrame:\n', result_df)

Output:

Result DataFrame:
    A  B  C
1  1  2  3
2  1  2  4
3  2  3  5

The row at index 0 is dropped and the last duplicate row, at index 1, is kept in the output.

3. Delete All Duplicate Rows from DataFrame

result_df = source_df.drop_duplicates(keep=False)
print('Result DataFrame:\n', result_df)

Output:

Result DataFrame:
    A  B  C
2  1  2  4
3  2  3  5

Both the duplicate rows ‘0’ and ‘1’ are dropped from the result DataFrame.
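
Since keep=False discards every copy, it can be useful to look at the affected rows before dropping them. A minimal sketch, assuming the same sample data, uses DataFrame.duplicated(keep=False) to select them; dup_mask and removed_df are illustrative names.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# keep=False marks every member of a duplicate group, not just the extras.
dup_mask = source_df.duplicated(keep=False)
removed_df = source_df[dup_mask]
print('Rows that keep=False would drop:\n', removed_df)
print('Number of rows affected:', int(dup_mask.sum()))
```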

4. Identify Duplicate Rows Based on Specific Columns

import pandas as pd
d1 = {'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]}
source_df = pd.DataFrame(d1)
print('Source DataFrame:\n', source_df)
result_df = source_df.drop_duplicates(subset=['A', 'B'])
print('Result DataFrame:\n', result_df)

Output:

Source DataFrame:
    A  B  C
0  1  2  3
1  1  2  3
2  1  2  4
3  2  3  5
Result DataFrame:
    A  B  C
0  1  2  3
3  2  3  5

The columns ‘A’ and ‘B’ are used to identify duplicate rows. Hence, rows 0, 1, and 2 are duplicates. So, rows 1 and 2 are removed from the output.
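
When subset covers only some of the columns, the choice of keep decides which values of the remaining columns survive. The following sketch, assuming the same sample data, combines subset with keep to show the difference in column 'C'; first_df and last_df are illustrative names.

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# Rows 0, 1 and 2 share the same (A, B) pair but do not all agree on C.
first_df = source_df.drop_duplicates(subset=['A', 'B'], keep='first')
last_df = source_df.drop_duplicates(subset=['A', 'B'], keep='last')

print('keep first, C values:', first_df['C'].tolist())
print('keep last, C values:', last_df['C'].tolist())
```

With keep='first' the surviving row for the pair (1, 2) is row 0, and with keep='last' it is row 2, so the two results carry different values in 'C'.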

5. Remove Duplicate Rows in Place

source_df.drop_duplicates(inplace=True)
print(source_df)

Output:

   A  B  C
0  1  2  3
2  1  2  4
3  2  3  5
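
One thing to watch with inplace=True is the return value. As described in the syntax section, the call returns None, so assigning its result back to the same variable would lose the DataFrame. A short sketch, assuming the same sample data (return_value is an illustrative name):

```python
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 2], 'B': [2, 2, 2, 3], 'C': [3, 3, 4, 5]})

# With inplace=True the method modifies source_df itself and returns None.
return_value = source_df.drop_duplicates(inplace=True)
print(return_value)
print(list(source_df.index))
```

If you want a new object instead, call drop_duplicates() without inplace and assign the result to a variable.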

Translated from: https://www.journaldev.com/33488/pandas-drop-duplicate-rows-drop_duplicates-function