Practical Pandas Guide


Pandas is a powerful and versatile Python data analysis library that speeds up data analysis and exploration. The best way to learn the functions and methods pandas offers is to practice with them.

Practice makes perfect.


In this post, we will work through many examples, both simple and advanced, to explore what pandas is capable of.

As always, we start by importing numpy and pandas.

import pandas as pd
import numpy as np

Let’s first create a sample dataframe to work on. We can pass a dictionary to the pandas DataFrame constructor.

df = pd.DataFrame({'num_1': np.random.random(100),
                   'num_2': np.random.random(100),
                   'num_3': np.random.randint(0, 5, 100),
                   'num_4': np.random.randint(0, 100, 100)})
df.head()

We’ve used numpy arrays to create the numerical columns. Let’s also add categorical columns to our dataframe.

from random import sample
name = ['Linda','John','Ashley','Xavi','Betty','Mike'] * 100
cat = ['A','B','C','D'] * 100
names = sample(name, 100)
cats = sample(cat, 100)

The lists “names” and “cats” each contain 100 randomly selected samples from the longer lists “name” and “cat”. We used the sample function from Python’s random module.
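Because `random.sample` draws without replacement, the source lists were repeated 100 times so that 100 draws are possible. As a hedged alternative (not the approach used in this post), `random.choices` draws with replacement and needs no padding. The seed value below is an arbitrary assumption, added only to make the sketch repeatable.

```python
import random

random.seed(42)  # arbitrary seed, only so the example is repeatable

name = ['Linda', 'John', 'Ashley', 'Xavi', 'Betty', 'Mike']
cat = ['A', 'B', 'C', 'D']

# random.choices samples with replacement, so the short lists suffice
names = random.choices(name, k=100)
cats = random.choices(cat, k=100)

print(len(names), len(cats))
```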

It is time to add these two categorical features to the dataframe.

df.insert(0, 'names', names)
df['cats'] = cats
df.head()

We added two new columns in two different ways. df['col'] = col adds the new column at the end. The insert function lets us specify the position of the new column, as we did with the “names” column.
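The difference between the two approaches is easiest to see on a tiny frame. The following is a minimal sketch; `df_demo` and its column names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'b': [1, 2], 'c': [3, 4]})

# insert(loc, column, value) places the new column at position loc
df_demo.insert(0, 'a', ['x', 'y'])

# plain assignment always appends the new column at the end
df_demo['d'] = [5.0, 6.0]

print(list(df_demo.columns))
```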

Suppose we are interested in the rows in which “num_1” is greater than “num_2”. Each of the following two lines of code accomplishes this task and displays the first five matching rows.

df[df.num_1 > df.num_2][:5]
df.query('num_1 > num_2')[:5]
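As a quick check that boolean indexing and `query` select the same rows, here is a small self-contained sketch. The frame `demo`, the seed, and the `threshold` variable are assumptions for illustration; the last line shows how `query` can express a compound filter and refer to a local variable with `@`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
demo = pd.DataFrame({'num_1': rng.random(10),
                     'num_2': rng.random(10),
                     'num_3': rng.integers(0, 5, 10)})

# The same filter written two ways
mask_result = demo[demo.num_1 > demo.num_2]
query_result = demo.query('num_1 > num_2')
print(mask_result.equals(query_result))

# Compound filter; @ refers to a local Python variable inside query
threshold = 2
complex_result = demo.query('num_1 > num_2 and num_3 >= @threshold')
```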

You can use either one, but I prefer the query function; I think it is simpler for more complex filters. If we want to compare “num_1” and “num_2” across the categories in “cats”, we can apply the groupby function to the filtered dataframe.

df1 = df.query('num_1 > num_2')[['cats','names']].groupby('cats').count().reset_index()
df1

The column “names” is irrelevant here; it is selected only so that rows can be counted. In this run, categories C and D appear to have more rows in which “num_1” is higher than “num_2”. However, these counts are not meaningful unless we know how many rows each category has in the entire dataframe.

ser = df['cats'].value_counts().rename('total')
# align on the category labels before concatenating column-wise
df2 = pd.concat((df1.set_index('cats'), ser), axis=1)
df2.rename(columns={'names':'num_1>num_2'}, inplace=True)
df2

We created a Series that holds the count of each category in the “cats” column using value_counts, combined it with df1 using the concat function, and renamed the columns. The final dataframe df2 shows the total number of rows for each category as well as the number of rows that satisfy the filter.
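The same comparison can also be produced in a single groupby by aggregating a boolean flag. This is an alternative sketch rather than the method used in this post; the small frame `demo` and the `is_greater` and `share` names are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({
    'cats':  ['A', 'A', 'B', 'B', 'B', 'C'],
    'num_1': [0.9, 0.1, 0.8, 0.7, 0.2, 0.4],
    'num_2': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
})

summary = (
    demo.assign(is_greater=demo['num_1'] > demo['num_2'])
        .groupby('cats')['is_greater']
        .agg(['sum', 'count'])  # sum counts the True values, count gives all rows
        .rename(columns={'sum': 'num_1>num_2', 'count': 'total'})
)

# Share of rows per category that satisfy the filter
summary['share'] = summary['num_1>num_2'] / summary['total']
print(summary)
```

Reporting the share alongside the raw counts makes categories of different sizes directly comparable.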

Assume we want to see the average value of “num_4” for each category in “cats”, but only for a few names. In this case, we can use the isin method for filtering and then apply groupby.

name = ['Ashley','Betty','Mike']
df[df.names.isin(name)][['cats','num_4']].groupby('cats').mean()

If the number of occurrences is also needed, we can apply multiple aggregate functions to the groupby.

name = ['Ashley','Betty','Mike']
df[df.names.isin(name)][['cats','num_4']].groupby('cats').agg(['mean','count'])
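Passing a list to `agg` produces a two-level column header. If flat, descriptive column names are preferred, pandas also supports named aggregation. The sketch below uses a small made-up frame, and the output column names `mean_num_4` and `n_rows` are arbitrary choices.

```python
import pandas as pd

demo = pd.DataFrame({
    'names': ['Ashley', 'Betty', 'Mike', 'John', 'Mike'],
    'cats':  ['A', 'A', 'B', 'B', 'B'],
    'num_4': [10, 30, 50, 70, 90],
})

selected = ['Ashley', 'Betty', 'Mike']
result = (
    demo[demo['names'].isin(selected)]  # John is filtered out here
    .groupby('cats')
    .agg(mean_num_4=('num_4', 'mean'), n_rows=('num_4', 'count'))
)
print(result)
```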

We have 4 different measurements stored in 4 different columns. We can combine them into one column and indicate the name of the measurement in another column.

The melt function can be used to achieve what I just described.

df_melted = df.melt(id_vars=['names','cats'])
df_melted.head()

Melt is especially useful when working with wide dataframes (i.e. those with many features). For instance, if we had 100 different measurements (num_1 to num_100), it would be much easier to analyze a melted dataframe.
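To make the reshaping concrete, here is a minimal sketch on a two-row frame. The `var_name` and `value_name` arguments are optional; the names `measurement` and `value` are chosen here for illustration (by default the columns are called `variable` and `value`).

```python
import pandas as pd

wide = pd.DataFrame({
    'names': ['Linda', 'John'],
    'cats':  ['A', 'B'],
    'num_1': [0.1, 0.2],
    'num_2': [0.3, 0.4],
})

# Each (row, measurement column) pair becomes its own row
long = wide.melt(id_vars=['names', 'cats'],
                 var_name='measurement', value_name='value')
print(long)

# With the long layout, one groupby summarizes every measurement at once
print(long.groupby('measurement')['value'].mean())
```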

Our dataframe contains measurements, so we are likely to update it by adding new ones. Let’s say we update the dataframe with the new_row below.

new_row = {'names':'Mike', 'num_1':0.678, 'num_2':0.345,
           'num_3':3, 'num_4':[68,80], 'cats':'C'}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_updated = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
df_updated.tail()

The new_row is added, but there is a problem: it contains two values in “num_4”. We should have them in separate rows. The explode function of pandas can be used for this task.

df_updated = df_updated.explode('num_4').reset_index(drop=True)
df_updated.tail()

We have separated the values in “num_4”. There might be cases in which a column contains combined values in many rows. The explode function may come in handy in those cases.
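A minimal sketch of what explode does, using a made-up two-row frame: list-like cells are split into one row per element, scalar cells are left alone, and the other columns are repeated. The resulting column has object dtype, so the final cast back to integers is an optional extra step.

```python
import pandas as pd

demo = pd.DataFrame({
    'names': ['Mike', 'Linda'],
    'num_4': [[68, 80], 15],  # one list-valued cell, one scalar cell
})

exploded = demo.explode('num_4').reset_index(drop=True)

# The exploded column is object dtype; cast it back to integers
exploded['num_4'] = exploded['num_4'].astype(int)
print(exploded)
```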

There might be cases in which we need to replace some values. The pandas replace function makes this very simple. We can even replace multiple values at once by passing a dictionary.

replacement = {'D':'F', 'C':'T'}
# assign the result back instead of calling replace in place on a selected column
df['cats'] = df['cats'].replace(replacement)
df.head()

Let’s assume we need to replace some values based on a condition. The condition is specified as “the values below 0.5 in num_1 are to be replaced with 0”. We can use the where function in this case.

df['num_1'] = df['num_1'].where(df['num_1'] >= 0.5, 0)
df.head()

The way “where” works is that values that satisfy the condition are kept and the remaining values are replaced with the specified value. where(df['num_1'] >= 0.5, 0) keeps all the values in “num_1” that are greater than or equal to 0.5 and replaces the remaining values with 0.
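Since `where` keeps values where the condition holds, it is easy to get the condition backwards. The sketch below, on a made-up three-element Series, contrasts it with `mask`, which replaces values where the condition holds.

```python
import pandas as pd

s = pd.Series([0.2, 0.5, 0.9])

kept = s.where(s >= 0.5, 0)  # keep values where the condition is True
same = s.mask(s < 0.5, 0)    # replace values where the condition is True

print(kept.tolist())
```

Both calls give the same result here; which one reads better depends on whether the rule is stated as what to keep or what to replace.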

The dataframe contains 4 numerical and 2 categorical features. We may need a quick summary of how the numerical values change across categories. The pandas pivot_table function can provide this kind of summary, and it is very flexible in terms of display options.

df.pivot_table(index='names', columns='cats', values='num_2', aggfunc='mean', margins=True)

This table shows how the num_2 values change across names-cats combinations. Setting the margins parameter to True adds an “All” row and column, which makes it possible to compare each cell with the overall value. The aggfunc parameter accepts many options, such as count, min, and max.
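To see what the margins contain, here is a small sketch on made-up data. The “All” entries are computed from the underlying rows, not by averaging the cell means, which is why the overall value below is the mean of all four num_2 values.

```python
import pandas as pd

demo = pd.DataFrame({
    'names': ['Linda', 'Linda', 'John', 'John'],
    'cats':  ['A', 'B', 'A', 'A'],
    'num_2': [1.0, 3.0, 2.0, 4.0],
})

table = demo.pivot_table(index='names', columns='cats', values='num_2',
                         aggfunc='mean', margins=True)
print(table)
```

Combinations that never occur in the data (John with category B here) appear as NaN.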

A very common task with dataframes is handling missing values. The dataframe we created does not have any missing values, so let’s first add some at random positions.

a = np.random.randint(0,99,20)
df.iloc[a, 3] = np.nan

We created an array of 20 random integers between 0 and 98 (the upper bound of randint is exclusive) and used it as the row index in iloc. The selected rows in the fourth column (column index 3) are replaced with np.nan, the standard missing-value marker in pandas.

df.isna().sum() returns the number of missing values in each column.


Although we passed in 20 indices, the number of missing values is 15 in this run, because array a contains duplicate values.

len(np.unique(a))
15
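If exactly 20 missing values are wanted, the row positions can be drawn without replacement. This is a hedged sketch rather than the code used in this post: it uses numpy's `Generator.choice` with `replace=False`, a made-up single-column frame, and an arbitrary seed. The column is created as float so that storing NaN needs no dtype change.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
demo = pd.DataFrame({'num_3': rng.integers(0, 5, 100).astype(float)})

# replace=False guarantees 20 distinct row positions
rows = rng.choice(len(demo), size=20, replace=False)
demo.iloc[rows, 0] = np.nan

print(demo['num_3'].isna().sum())
```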

We can replace the missing values with the fillna function, using either a constant or a statistic of the column such as the mean or median.

# assign the result back instead of calling fillna in place on a selected column
df['num_3'] = df['num_3'].fillna(df['num_3'].mean())
df.isna().sum()
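The fill options mentioned above can be compared side by side on a small made-up Series. Note that `mean` and `median` skip missing values by default, so the statistics are computed from the observed values only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

filled_const = s.fillna(0)            # a constant
filled_mean = s.fillna(s.mean())      # mean of the observed values (1, 3, 8)
filled_median = s.fillna(s.median())  # median of the observed values

print(filled_mean.tolist())
print(filled_median.tolist())
```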

This is just a small piece of what you can do with pandas. The more you work with it, the more useful and practical techniques you will find. I suggest approaching a problem in several different ways and never limiting yourself to one.

When you work hard to find a solution to a problem, you will almost always learn more than just the solution to the problem at hand. You will improve your skill set step by step and build a robust and efficient data analysis process.

Thank you for reading. Please let me know if you have any feedback.


Translated from: https://towardsdatascience.com/practical-pandas-guide-b3eedeb3e88
