The Ultimate Guide to the Pandas Library for Python Data Science

Pandas (which is a portmanteau of "panel data") is one of the most important packages to grasp when you’re starting to learn Python.

The package is known for a very useful data structure called the pandas DataFrame. Pandas also allows Python developers to easily deal with tabular data (like spreadsheets) within a Python script.

This tutorial will teach you the fundamentals of pandas that you can use to build data-driven Python applications today.

Table of Contents

You can skip to a specific section of this pandas tutorial using the table of contents below:

  • Introduction to Pandas

  • Pandas Series

  • Pandas DataFrames

  • How to Deal With Missing Data in Pandas DataFrames

  • The Pandas groupby Method

  • What is the Pandas groupby Feature?

  • The Pandas concat Method

  • The Pandas merge Method

  • The Pandas join Method

  • Other Common Operations in Pandas

  • Local Data Input and Output (I/O) in Pandas

  • Remote Data Input and Output (I/O) in Pandas

  • Final Thoughts & Special Offer

Introduction to Pandas

Pandas is a widely-used Python library built on top of NumPy. Much of the rest of this course will be dedicated to learning about pandas and how it is used in the world of finance.

What is Pandas?

Pandas is a Python library created by Wes McKinney, who built it to make working with datasets in Python easier for his job in finance.

According to the library’s website, pandas is “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”

Pandas stands for ‘panel data’. Note that pandas is typically stylized as an all-lowercase word, although it is considered a best practice to capitalize its first letter at the beginning of sentences.

Pandas is an open source library, which means that anyone can view its source code and make suggestions using pull requests. If you are curious about this, visit the pandas source code repository on GitHub.

The Main Benefit of Pandas

Pandas was designed to work with two-dimensional data (similar to Excel spreadsheets). Just as the NumPy library had a built-in data structure called an array with special attributes and methods, the pandas library has a built-in two-dimensional data structure called a DataFrame.
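
To make this concrete, here is a minimal sketch contrasting the two structures; the labels and values below are arbitrary examples, not taken from the original text:

```python
import numpy as np
import pandas as pd

# The same 2x2 data as a plain NumPy array (positional indices only)...
arr = np.array([[1, 2], [3, 4]])

# ...and as a DataFrame, which layers row and column labels on top.
df = pd.DataFrame(arr, index=['X', 'Y'], columns=['A', 'B'])

print(arr[0, 1])         # positional access: 2
print(df.loc['X', 'B'])  # label-based access to the same cell: 2
```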

What We Will Learn About Pandas

As we mentioned earlier in this course, advanced Python practitioners will spend much more time working with pandas than they spend working with NumPy.

Over the next several sections, we will cover the following information about the pandas library:

  • Pandas Series
  • Pandas DataFrames
  • How To Deal With Missing Data in Pandas
  • How To Merge DataFrames in Pandas
  • How To Join DataFrames in Pandas
  • How To Concatenate DataFrames in Pandas
  • Common Operations in Pandas
  • Data Input and Output in Pandas
  • How To Save Pandas DataFrames as Excel Files for External Users

Pandas Series

In this section, we’ll be exploring pandas Series, which are a core component of the pandas library for Python programming.

What Are Pandas Series?

Series are a special type of data structure available in the pandas Python library. Pandas Series are similar to NumPy arrays, except that we can give them a named or datetime index instead of just a numerical index.
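
As a quick sketch of the datetime-index case mentioned above (the dates and values here are made up for illustration):

```python
import pandas as pd

# Three readings indexed by date instead of by integer position.
dates = pd.date_range('2021-01-01', periods=3, freq='D')
s = pd.Series([10, 20, 30], index=dates)

# Elements can now be referenced by their date label.
print(s['2021-01-02'])  # 20
```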

The Imports You’ll Require To Work With Pandas Series

To work with pandas Series, you’ll need to import both NumPy and pandas, as follows:

import numpy as np
import pandas as pd

For the rest of this section, I will assume that both of those imports have been executed before running any code blocks.

How To Create a Pandas Series

There are a number of different ways to create a pandas Series. We will explore all of them in this section.

First, let’s create a few starter variables - specifically, we’ll create two lists, a NumPy array, and a dictionary.

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a':10, 'b':20, 'c':30}

The easiest way to create a pandas Series is by passing a vanilla Python list into the pd.Series() method. We do this with the my_list variable below:

pd.Series(my_list)

If you run this in your Jupyter Notebook, you will notice that the output is quite different than it is for a normal Python list:

0    10
1    20
2    30
dtype: int64

The output shown above is clearly designed to present as two columns. The second column is the data from my_list. What is the first column?

One of the key advantages of using pandas Series over NumPy arrays is that they allow for labeling. As you might have guessed, that first column is a column of labels.

We can add labels to a pandas Series using the index argument like this:

pd.Series(my_list, index=labels)
#Remember - we created the 'labels' list earlier in this section

The output of this code is below:

a    10
b    20
c    30
dtype: int64

Why would you want to use labels in a pandas Series? The main advantage is that it allows you to reference an element of the Series using its label instead of its numerical index. To be clear, once labels have been applied to a pandas Series, you can use either its numerical index or its label.

An example of this is below.

Series = pd.Series(my_list, index=labels)
Series[0]
#Returns 10
Series['a']
#Also returns 10

You might have noticed that the ability to reference an element of a Series using its label is similar to how we can reference the value of a key-value pair in a dictionary. Because of this similarity in how they function, you can also pass in a dictionary to create a pandas Series. We’ll use the dictionary d = {'a': 10, 'b': 20, 'c': 30} that we created earlier as an example:

pd.Series(d)

This code’s output is:

a    10
b    20
c    30
dtype: int64

It may not yet be clear why we have explored two new data structures (NumPy arrays and pandas Series) that are so similar. In the next section of this section, we’ll explore the main advantage of pandas Series over NumPy arrays.

The Main Advantage of Pandas Series Over NumPy Arrays

While we didn’t encounter it at the time, NumPy arrays are highly limited by one characteristic: every element of a NumPy array must have the same data type. Said differently, NumPy array elements must be all strings, or all integers, or all booleans - you get the point.

Pandas Series do not suffer from this limitation. In fact, pandas Series are highly flexible.

As an example, you can pass three of Python’s built-in functions into a pandas Series without getting an error:

pd.Series([sum, print, len])

Here’s the output of that code:

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

To be clear, the example above is highly impractical and not something we would ever execute in practice. It is, however, an excellent example of the flexibility of the pandas Series data structure.

Pandas DataFrames

NumPy allows developers to work with both one-dimensional NumPy arrays (sometimes called vectors) and two-dimensional NumPy arrays (sometimes called matrices). We explored pandas Series in the last section, which are similar to one-dimensional NumPy arrays.

In this section, we will dive into pandas DataFrames, which are similar to two-dimensional NumPy arrays - but with much more functionality. DataFrames are the most important data structure in the pandas library, so pay close attention throughout this section.

What Is A Pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that has labels for both its rows and columns. For those familiar with Microsoft Excel, Google Sheets, or other spreadsheet software, DataFrames are very similar.

Here is an example of a pandas DataFrame being displayed within a Jupyter Notebook.

We will now go through the process of recreating this DataFrame step-by-step.

First, you’ll need to import both the NumPy and pandas libraries. We have done this before, but in case you’re unsure, here’s another example of how to do that:

import numpy as np
import pandas as pd

We’ll also need to create lists for the row and column names. We can do this using vanilla Python lists:

rows = ['X','Y','Z']
cols = ['A', 'B', 'C', 'D', 'E']

Next, we will need to create a NumPy array that holds the data contained within the cells of the DataFrame. I used NumPy’s np.random.randn method for this. I also wrapped that method in the np.round method (with a second argument of 2), which rounds each data point to 2 decimal places and makes the data structure much easier to read.

Here’s the final statement that generates the data.

data = np.round(np.random.randn(3,5),2)

Once this is done, you can wrap all of the constituent variables in the pd.DataFrame method to create your first DataFrame!

pd.DataFrame(data, rows, cols)

There is a lot to unpack here, so let’s discuss this example in a bit more detail.

First, it is not necessary to create each variable outside of the DataFrame itself. You could have created this DataFrame in one line like this:

pd.DataFrame(np.round(np.random.randn(3,5),2), ['X','Y','Z'], ['A', 'B', 'C', 'D', 'E'])

With that said, declaring each variable separately makes the code much easier to read.

Second, you might be wondering whether it is necessary to pass the row labels into the DataFrame method before the column labels. It is indeed necessary. If you tried running pd.DataFrame(data, cols, rows), your Jupyter Notebook would generate the following error message:

ValueError: Shape of passed values is (3, 5), indices imply (5, 3)

Next, we will explore the relationship between pandas Series and pandas DataFrames.

The Relationship Between Pandas Series and Pandas DataFrames

Let’s take another look at the pandas DataFrame that we just created:

If you had to verbally describe a pandas DataFrame, one way to do so might be “a set of labeled columns containing data where each column shares the same row index.”

Interestingly enough, each of these columns is actually a pandas Series! So we can modify our definition of the pandas DataFrame to match its formal definition:

“A set of pandas Series that shares the same index.”

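
You can verify this definition directly: pulling a column out of a DataFrame yields a pandas Series, and every column shares the DataFrame’s row index. A minimal sketch, reusing the rows, cols, and data variables from above:

```python
import numpy as np
import pandas as pd

rows = ['X', 'Y', 'Z']
cols = ['A', 'B', 'C', 'D', 'E']
data = np.round(np.random.randn(3, 5), 2)
df = pd.DataFrame(data, rows, cols)

# Each column extracted from the DataFrame is itself a pandas Series...
print(type(df['A']))  # <class 'pandas.core.series.Series'>

# ...and all columns share the same row index.
print(df['A'].index.equals(df.index))  # True
```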
Indexing and Assignment in Pandas DataFrames

We can actually call a specific Series from a pandas DataFrame using square brackets, just like how we call an element from a list. A few examples are below:

df = pd.DataFrame(data, rows, cols)
df['A']
"""
Returns:
X   -0.66
Y   -0.08
Z    0.64
Name: A, dtype: float64
"""
df['E']
"""
Returns:
X   -1.46
Y    1.71
Z   -0.20
Name: E, dtype: float64
"""

What if you wanted to select multiple columns from a pandas DataFrame? You can pass in a list of columns, either directly in the square brackets - such as df[['A', 'E']] - or by declaring the variable outside of the square brackets like this:

columnsIWant = ['A', 'E']
df[columnsIWant]
#Returns the DataFrame, but only with columns A and E

You can also select a specific element of a specific row using chained square brackets. For example, if you wanted the element in column A at row X (which is the element in the top left cell of the DataFrame), you could access it with df['A']['X'].

A few other examples are below.

df['B']['Z']
#Returns 1.34
df['D']['Y']
#Returns -0.64

How To Create and Remove Columns in a Pandas DataFrame

You can create a new column in a pandas DataFrame by specifying the column as though it already exists, and then assigning it a new pandas Series.

As an example, in the following code block we create a new column called ‘A + B’ which is the sum of columns A and B:

df['A + B'] = df['A'] + df['B']
df #The last line prints out the new DataFrame

Here’s the output of that code block:

To remove this column from the pandas DataFrame, we need to use the pd.DataFrame.drop method.

Note that this method defaults to dropping rows, not columns. To switch the method settings to operate on columns, we must pass it in the axis=1 argument.

df.drop('A + B', axis = 1)

It is very important to note that this drop method does not actually modify the DataFrame itself. For evidence of this, print out the df variable again, and notice how it still has the A + B column:

df

The reason that drop (and many other DataFrame methods!) does not modify the data structure by default is to prevent you from accidentally deleting data.

There are two ways to make pandas automatically overwrite the current DataFrame.

The first is by passing in the argument inplace=True, like this:

df.drop('A + B', axis=1, inplace=True)

The second is by using an assignment operator that manually overwrites the existing variable, like this:

df = df.drop('A + B', axis=1)

Both options are valid but I find myself using the second option more frequently because it is easier to remember.

The drop method can also be used to drop rows. For example, we can remove the row Z as follows:

df.drop('Z')

How To Select A Row From A Pandas DataFrame

We have already seen that we can access a specific column of a pandas DataFrame using square brackets. We will now see how to access a specific row of a pandas DataFrame, with the similar goal of generating a pandas Series from the larger data structure.

DataFrame rows can be accessed by their row label using the loc attribute along with square brackets. An example is below.

df.loc['X']

Here is the output of that code:

A   -0.66
B   -1.43
C   -0.88
D    1.60
E   -1.46
Name: X, dtype: float64

DataFrame rows can be accessed by their numerical index using the iloc attribute along with square brackets. An example is below.

df.iloc[0]

As you would expect, this code has the same output as our last example:

A   -0.66
B   -1.43
C   -0.88
D    1.60
E   -1.46
Name: X, dtype: float64

How To Determine The Number Of Rows and Columns in a Pandas DataFrame

There are many cases where you’ll want to know the shape of a pandas DataFrame. By shape, I am referring to the number of columns and rows in the data structure.

Pandas has a built-in attribute called shape that allows us to easily access this:

df.shape
#Returns (3, 5)

Slicing Pandas DataFrames

We have already seen how to select rows, columns, and elements from a pandas DataFrame. In this section, we will explore how to select a subset of a DataFrame. Specifically, let’s select the elements from columns A and B and rows X and Y.

We can actually approach this in a step-by-step fashion. First, let’s select columns A and B:

df[['A', 'B']]

Then, let’s select rows X and Y:

df[['A', 'B']].loc[['X', 'Y']]

And we’re done!

Conditional Selection Using Pandas DataFrame

If you recall from our discussion of NumPy arrays, we were able to select certain elements of the array using conditional operators. For example, if we had a NumPy array called arr and we only wanted the values of the array that were larger than 4, we could use the command arr[arr > 4].

Pandas DataFrames follow a similar syntax. For example, if we wanted to know where our DataFrame has values that were greater than 0.5, we could type df > 0.5 to get the following output:

We can also generate a new pandas DataFrame that contains the normal values where the statement is True, and NaN - which stands for Not a Number - values where the statement is False. We do this by passing the statement into the DataFrame using square brackets, like this:

df[df > 0.5]

Here is the output of that code:

You can also use conditional selection to return a subset of the DataFrame where a specific condition is satisfied in a specified column.

To be more specific, let’s say that you wanted the subset of the DataFrame where the value in column C was less than 1. This is only true for row X.

You can get an array of the boolean values associated with this statement like this:

df['C'] < 1

Here’s the output:

X     True
Y    False
Z    False
Name: C, dtype: bool

You can also get the DataFrame’s actual values relative to this conditional selection command by typing df[df['C'] < 1], which outputs just the first row of the DataFrame (since this is the only row where the statement is true for column C):

You can also chain together multiple conditions while using conditional selection. We do this using pandas’ & operator. You cannot use Python’s normal and keyword here, because in this case we are not comparing two boolean values. Instead, we are comparing two pandas Series that contain boolean values, which is why the & character is used instead.

As an example of multiple conditional selection, you can return the DataFrame subset that satisfies df['C'] > 0 and df['A']> 0 with the following code:

df[(df['C'] > 0) & (df['A']> 0)]

How To Modify The Index of a Pandas DataFrame

There are a number of ways that you can modify the index of a pandas DataFrame.

The most basic is to reset the index to its default numerical values. We do this using the reset_index method:

df.reset_index()

Note that this creates a new column in the DataFrame called index that contains the previous index labels:

Note that like the other DataFrame operations that we have explored, reset_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.

You can also set an existing column as the index of the DataFrame using the set_index method. We can set column A as the index of the DataFrame using the following code:

df.set_index('A')

The values of A are now in the index of the DataFrame:

There are three things worth noting here:

  • set_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.

  • Unless you run reset_index first, performing a set_index operation with inplace=True or a forced = assignment operator will permanently overwrite your current index values.

  • If you want to rename your index to labels that are not currently contained in a column, you can do so by (1) creating a NumPy array with those values, (2) adding those values as a new column of the pandas DataFrame, and (3) running the set_index operation.

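
That three-step procedure can be sketched as follows; the new state labels below are arbitrary examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.round(np.random.randn(3, 5), 2),
                  ['X', 'Y', 'Z'],
                  ['A', 'B', 'C', 'D', 'E'])

# (1) Create a NumPy array with the new index values.
new_labels = np.array(['CA', 'NY', 'TX'])

# (2) Add those values as a new column of the DataFrame.
df['States'] = new_labels

# (3) Run the set_index operation.
df = df.set_index('States')

print(df.index)
```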
How To Rename Columns in a Pandas DataFrame

The last DataFrame operation we’ll discuss is how to rename a DataFrame’s columns.

Columns are an attribute of a pandas DataFrame, which means we can call them and modify them using a simple dot operator. For example:

df.columns
#Returns Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

The assignment operator is the best way to modify this attribute:

df.columns = [1, 2, 3, 4, 5]
df

How to Deal With Missing Data in Pandas DataFrames

In an ideal world we would always work with perfect data sets. However, this is never the case in practice. When working with quantitative data, there are many cases in which you will need to drop or modify missing values. We will explore strategies for handling missing data in pandas throughout this section.

The DataFrame We’ll Be Using In This Section

We will be using the np.nan attribute to generate NaN values throughout this section.

np.nan
#Returns nan

In this section, we will make use of the following DataFrame:

df = pd.DataFrame(np.array([[1, 5, 1], [2, np.nan, 2], [np.nan, np.nan, 3]]))
df.columns = ['A', 'B', 'C']
df

The Pandas dropna Method

Pandas has a built-in method called dropna. When applied against a DataFrame, the dropna method will remove any rows that contain a NaN value.

Let’s apply the dropna method to our df DataFrame as an example:

df.dropna()

Note that like the other DataFrame operations that we have explored, dropna does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.

We can also drop any columns that have missing values by passing in the axis=1 argument to the dropna method, like this:

df.dropna(axis=1)

The Pandas fillna Method

In many cases, you will want to replace missing values in a pandas DataFrame instead of dropping them completely. The fillna method is designed for this.

As an example, let’s fill every missing value in our DataFrame with a replacement value of our choosing.
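
A minimal sketch, assuming we fill each column’s NaN values with that column’s mean (the choice of fill value is an assumption here; any scalar or per-column value works with fillna):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 5, 1],
                            [2, np.nan, 2],
                            [np.nan, np.nan, 3]]))
df.columns = ['A', 'B', 'C']

# fillna replaces NaN values; passing df.mean() fills each column's
# gaps with that column's mean, computed from the non-missing entries.
filled = df.fillna(df.mean())

print(filled)
```

As with drop and dropna, fillna returns a new DataFrame: the original df is unchanged unless you reassign it or pass inplace=True.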
