
This article is a continuation of a previous article which kick-started the journey to learning Python for data analysis. You can check out the previous article here: Pandas for Newbies: An Introduction Part I.

For those just starting out in data science, Python is a prerequisite. If you aren’t familiar with it yet, get comfortable with the language first and then come back here to start on pandas.


You can start learning Python with a series of articles I just started called Minimal Python Required for Data Science.

As a reminder, what I’m doing here is a brief tour of just some of the things you can do with Pandas. It’s the deep-dive before the actual deep-dive.

Both the data and the inspiration for this series come from Ted Petrou’s excellent courses on Dunder Data.

Prerequisites

  1. Python
  2. pandas
  3. Jupyter

You’ll be ready to begin once you have these three things in order.


Aggregation

We left off last time with the pandas query method as an alternative to regular filtering via boolean conditional logic. While it does have its limits, query is much more readable.
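
As a quick refresher, here is a minimal sketch of the two filtering styles side by side. The small DataFrame is a made-up stand-in, not the data used in this series:

```python
import pandas as pd

# Hypothetical stand-in data, for illustration only.
students = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 69, 90, 47],
})

# Regular filtering with a boolean mask.
by_mask = students[(students["gender"] == "female") & (students["math score"] > 70)]

# The same filter with query; backticks quote column names that contain spaces.
by_query = students.query("gender == 'female' and `math score` > 70")

print(by_query)
```

Both approaches select the same rows; query simply reads closer to plain English.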

Today we continue with aggregation, which is the act of summarizing data with a single number. Examples include sum, mean, median, min and max.

Let’s try this on a different dataset.
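
The loading step is not reproduced here. If you want to follow along without the original file, a small hand-built stand-in with the same column names will do. This is only a sketch: the rows below are made up, so your numbers will differ from the outputs shown in this article.

```python
import pandas as pd

# Hypothetical stand-in for the students dataset. The column names match the
# examples in this article; the rows are invented for illustration.
students = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "parental level of education": [
        "some college", "high school", "some college", "master's degree",
    ],
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [72, 69, 47, 76],
    "reading score": [72, 90, 57, 78],
    "writing score": [74, 88, 44, 75],
})

print(students.head())

# An aggregation collapses a column to a single number.
print(students["math score"].max())
```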


Get the mean by calling the mean method.


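    # numeric_only=True skips the text columns, which recent pandas
    # versions refuse to average
    students.mean(numeric_only=True)

    math score       66.089
    reading score    69.169
    writing score    68.054
    dtype: float64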

Use the axis parameter to calculate the sum of all the scores (math, reading, and writing) for each row:


    scores = students[['math score', 'reading score', 'writing score']]
    scores.sum(axis=1).head(3)

    0    218
    1    247
    2    278
    dtype: int64

Non-aggregating methods

Some methods perform calculations on the data without aggregating it. One example is the round method, which here rounds every score to the nearest ten:

    scores.round(-1).head(3)

       math score  reading score  writing score
    0          70             70             70
    1          70             90             90
    2          90            100             90

Aggregating within groups

Let’s get the frequency of unique values in a single column.


    students['parental level of education'].value_counts()

    some college          226
    associate's degree    222
    high school           196
    some high school      179
    bachelor's degree     118
    master's degree        59
    Name: parental level of education, dtype: int64

Use the groupby method to create groups and then apply an aggregation. Here we get the mean math score for each gender:

    students.groupby('gender').agg(
        mean_math_score=('math score', 'mean')
    )

            mean_math_score
    gender
    female        63.633205
    male          68.728216

Multiple aggregation

Here we do multiple aggregations at the same time.


    students.groupby('gender').agg(
        mean_math_score=('math score', 'mean'),
        max_math_score=('math score', 'max'),
        count_math_score=('math score', 'count')
    )

We can create groups from more than one column.


    students.groupby(['gender', 'test preparation course']).agg(
        mean_math_score=('math score', 'mean'),
        max_math_score=('math score', 'max'),
        count_math_score=('math score', 'count')
    )

It looks like students of both sexes who prepped for the test scored higher than those who didn’t.


Pivot Table

A better way to present this information to readers is the pivot_table function, which does the same thing as groupby but uses one of the grouping columns as the new columns.
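
A minimal sketch of what that might look like, using a tiny made-up DataFrame (the column names follow the students dataset, the rows are invented):

```python
import pandas as pd

# Hypothetical rows, for illustration only.
students = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "test preparation course": [
        "none", "completed", "none", "completed", "none", "completed",
    ],
    "math score": [60, 70, 65, 75, 62, 77],
})

# gender stays on the rows; each value of 'test preparation course'
# becomes its own column, filled with the mean math score.
table = students.pivot_table(
    index="gender",
    columns="test preparation course",
    values="math score",
    aggfunc="mean",
)
print(table)
```

Each cell is the mean for one gender and prep-course combination, so the comparison the groupby output makes row by row can be read across a single line.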


Again, it’s the same information presented in a more readable and intuitive format.


Data Wrangling

Let’s bring in a new dataset to look at missing values.


Providing the na_values argument when reading the data will mark the NULL values in a dataset as NaN (Not a Number).
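
Here is a minimal sketch of the idea with read_csv. The CSV text and the placeholder strings "?" and "missing" are assumptions made up for this example; the dataset used in the original is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical CSV text in which missing values are written as "?" or "missing".
raw = io.StringIO(
    "city,temperature,humidity\n"
    "Austin,31.5,missing\n"
    "Boston,?,0.62\n"
    "Denver,24.0,0.40\n"
)

# na_values lists extra strings that pandas should treat as NaN.
weather = pd.read_csv(raw, na_values=["?", "missing"])

print(weather)
print(weather.isna().sum())
```

Because the placeholders become NaN, both numeric columns are read as floats rather than as text.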


You might also be confronted with a dataset where several columns should really be part of one column.


We can use the melt method to stack columns one after another.
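
A minimal sketch of melt on a made-up wide table. The names student, subject and score are hypothetical, chosen only for this example:

```python
import pandas as pd

# Hypothetical wide table: one column per subject.
wide = pd.DataFrame({
    "student": ["Ann", "Ben"],
    "math": [72, 69],
    "reading": [72, 90],
})

# Stack the subject columns into a single 'subject' column,
# with the values going into a 'score' column.
long = wide.melt(id_vars="student", var_name="subject", value_name="score")
print(long)
```

The two subject columns are stacked one after another, so the two-row wide table becomes a four-row long table.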


Merging Datasets

Knowing a little SQL will come in handy when studying this part of the pandas library.


There are multiple ways to join data in pandas, but the one method you should definitely get comfortable with is merge, which connects rows in DataFrames based on one or more keys. It’s basically an implementation of SQL joins.

Let’s say I had the following data from a movie rental database:


To perform an “INNER” join using merge :
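
The original tables are not reproduced here, so this is a minimal sketch with two small made-up tables. The names customer, payment, customer_id and first_name follow the rental database described above; the other columns and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
customer = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Mary", "Aaron", "Linda"],
})
payment = pd.DataFrame({
    "payment_id": [101, 102, 103, 104],
    "customer_id": [1, 1, 3, 4],
    "amount": [2.99, 0.99, 5.99, 4.99],
})

# An inner join keeps only the customer_id values present in both tables.
# Sorting and head() play the role of ORDER BY and LIMIT in SQL.
joined = (
    customer.merge(payment, on="customer_id", how="inner")
    .sort_values("first_name")
    .head(5)
)
print(joined)
```

Aaron has no payments and payment 104 has no matching customer, so neither shows up in the result.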

The SQL (PostgreSQL) equivalent would be something like:


    SELECT * FROM customer
    INNER JOIN payment
        ON payment.customer_id = customer.customer_id
    ORDER BY customer.first_name ASC
    LIMIT 5;

Time Series Analysis

The name pandas is actually derived from panel data, an econometrics term for datasets that combine cross-sectional data with time series. Panel data analysis is used most widely in medical research and economics.


Let’s say I had the following data, which I knew to be time-series data but which had no DatetimeIndex marking it as such:


           p      a
    0  0.749  28.96
    1  1.093  67.81
    2  0.920  55.15
    3  0.960  78.62
    4  0.912  60.15

I can simply set the index as a DatetimeIndex with:
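
The original call is not shown, so here is one possible sketch. It assumes one observation per year, dated at year end and starting in 1986, and it rebuilds only the first five rows printed above. The variable name bangla is borrowed from the plotting call at the end of this article.

```python
import pandas as pd

# First five rows of the data, rebuilt by hand for this sketch.
bangla = pd.DataFrame({
    "p": [0.749, 1.093, 0.920, 0.960, 0.912],
    "a": [28.96, 67.81, 55.15, 78.62, 60.15],
})

# Assumption: yearly observations dated 31 December, starting in 1986.
year_ends = [f"{1986 + i}-12-31" for i in range(len(bangla))]
bangla.index = pd.DatetimeIndex(year_ends)

print(bangla)
```

pd.date_range with a year-end frequency would be another way to build the same index, though the alias for that frequency differs between pandas versions.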


Which results in:


                    p      a
    1986-12-31  0.749  28.96
    1987-12-31  1.093  67.81
    1988-12-31  0.920  55.15
    1989-12-31  0.960  78.62
    1990-12-31  0.912  60.15
    1991-12-31  1.054  45.54
    1992-12-31  1.079  33.62
    1993-12-31  1.525  44.58
    1994-12-31  1.310  41.94

Here we have a dataset where p is the dependent variable and a is the independent variable. Before running an econometric model called AR(1), we’d have to lag the dependent variable to deal with autocorrelation, which we could do using:
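
The original snippet is not shown. One common way to build a lagged column is the shift method, sketched here on the first five rows of the data; the column name p_lagged is chosen for this example.

```python
import pandas as pd

# First five rows of the data with a year-end DatetimeIndex.
bangla = pd.DataFrame(
    {
        "p": [0.749, 1.093, 0.920, 0.960, 0.912],
        "a": [28.96, 67.81, 55.15, 78.62, 60.15],
    },
    index=pd.DatetimeIndex([f"{year}-12-31" for year in range(1986, 1991)]),
)

# shift(1) moves every value down one row, so each year sees the previous
# year's p. The first row has no predecessor and becomes NaN.
bangla["p_lagged"] = bangla["p"].shift(1)
print(bangla)
```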

                    p      a  p_lagged
    1986-12-31  0.749  28.96       NaN
    1987-12-31  1.093  67.81     0.749
    1988-12-31  0.920  55.15     1.093
    1989-12-31  0.960  78.62     0.920
    1990-12-31  0.912  60.15     0.960

Visualization

The combination of matplotlib and pandas allows us to make simple plots in the blink of an eye:


    # Using the previous dataset
    bangla.plot();

That concludes our brief bus tour of the pandas toolbox for data analysis. There’s a lot more that we’ll dive into in my next series of articles. So stay tuned!

What I do

I help people find mentors, code in Python, and write about life. If you’re thinking about switching careers into the tech industry or just want to talk, you can sign up for my Slack Channel via VegasBlu.

Translated from: https://towardsdatascience.com/pandas-for-newbies-an-introduction-part-ii-9f69a045dd95



