Exploring Datasets with Pandas Inbuilt Functionality

Nearly every data scientist uses the famous Pandas library for data manipulation, built on top of the Python programming language. It is a powerful Python package that makes importing, cleaning, analyzing, and exporting data easier.

In a nutshell, Pandas is like Excel for Python, with tables (which in pandas are called DataFrames), rows and columns (which in pandas are called Series), and many functionalities that make it an awesome library for data inspection, processing, and manipulation.

This article shares some of the great insights and hacks in pandas that make data analysis more fun and handy.

import pandas as pd
import numpy as np  # used below for np.nan

When displaying a dataframe, we often face the problem that the complete set of rows is not visible, which makes analyzing the data quite difficult. Pandas provides the set_option function, which lets us define the maximum number of rows and columns to be displayed.

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

1. Importing the data

Here are the different formats of data that can be imported using pandas read functions:

CSV, Excel, HTML, binary files, Pickle files, JSON files, SQL queries
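
As a quick sketch, each format has a corresponding reader function. The file names below are hypothetical, and read_excel and read_html additionally require an engine package such as openpyxl or lxml to be installed:

df_excel = pd.read_excel("data.xlsx")      # Excel workbook
tables = pd.read_html("page.html")         # returns a list of DataFrames, one per HTML table
df_pickle = pd.read_pickle("data.pkl")     # pickled DataFrame
df_json = pd.read_json("data.json")        # JSON file
# df_sql = pd.read_sql("SELECT * FROM cars", connection)  # needs a DB connection object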

The format most commonly used for machine learning is CSV (comma-separated values), and every data scientist encounters CSV files on a daily basis, so we will restrict ourselves to CSV files here.

Some of the important keyword arguments when reading a CSV file in pandas are:

  • delimiter: a blank space, comma, or other character or symbol that separates different cells in a row

  • header: the row to be used for the column names

  • index_col: the column(s) to be used as row labels

  • usecols: the names of the columns to read; if provided, only that subset of the file will be read

  • skiprows: the number of lines to skip, generally used when a file has blank lines or unnecessary content that needs to be skipped

# commonly used read_csv arguments:
# pd.read_csv(filepath, delimiter=, header=, names=, index_col=, usecols=, skiprows=, parse_dates=, keep_date_col=, chunksize=)
pd.read_csv("data.csv", header=0)  # header=0 uses the first row as column names
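
Of these arguments, chunksize deserves a quick illustration: it turns read_csv into an iterator of DataFrames, which is handy for files too large to fit in memory. A minimal sketch, with the per-chunk handling left up to you:

for chunk in pd.read_csv("data.csv", chunksize=1000):
    print(chunk.shape)  # each chunk is an ordinary DataFrame of up to 1000 rows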

Once the data is imported, it is known as a DataFrame in Pandas terminology.

2. Understanding the data

df = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Hyundai", "Hyundai", "Honda", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "premium", "budget", "premium", "budget", "budget", "premium", "luxury"],
    "Type": ["large", "small", "large", "small", "small", "large", "small", "small", "large", "large"],
    "CrashRating": [4.5, 2.5, 4, np.nan, 3, 4, 3, 4.2, 4.5, 4.2],
    "CustomerFeedback": [9, 7, 5, 5, 8, 5.6, np.nan, 9, 9, 4.8]})

We can check the first n or last n rows of the raw data by using the head(n) and tail(n) functions:

df.head(5)  # the first 5 rows of the dataframe
df.tail(5)  # the bottom 5 rows of the dataframe

This data is in a tabular format, the same as you would visualize in Excel or any other CSV reader. To interact more with the data, let's look at some useful inbuilt functions.

  • info: this function provides a summary of the dataframe: the number of rows, the number of columns, and the name of each column along with its count of non-null values and its data type

df.info()
  • describe: used to analyze the data statistically, and thus only returns results for the numerical columns in the dataframe. Returns a table comprising the count, mean, standard deviation, minimum, maximum, and quantile values, which are useful for detecting outliers and seeing the distribution of the data.

df.describe()
  • memory_usage: used to understand the memory usage of each column in bytes

df.memory_usage()
  • dtypes: used to analyze the data type of each column within the dataframe. Returns a Series with the data type of each column.

df.dtypes
  • isnull or isna: used to find the missing values in the dataframe. On its own it returns a boolean (True or False) for each cell indicating whether the value is NA; combined with aggregations, it can be used in multiple ways:

df.isnull().sum()       # count the number of missing values in each column
df.isnull().mean()*100  # percentage of missing values in each column
  • unique: used when we need the distinct values in one pandas Series (i.e., in one specific column of the dataframe). Generally used to analyze categorical columns.

df['Company'].unique()
  • shape: used to get the dimensionality of the dataframe as (number of rows, number of columns)

df.shape

3. Exploring the Data

Now that we have loaded our data into a DataFrame and understood its structure, let's select parts of the data and explore them. When it comes to selecting your data, you can do it either with indexes or based on certain conditions. In this section, let's go through each of these methods and do some exploratory analysis.

  • Selecting the Columns

The set of columns we need to analyze can be selected in the following ways:

df[['Company', 'Type']]
df.loc[:, ['Company', 'Type']]
df.iloc[:, [0, 1]]
  • Selecting the Rows

Specific rows can be selected for analysis in the following ways:

df.iloc[[1, 2], :]
df.loc[[1, 2], :]
  • Selecting columns of a specific type

Sometimes it is helpful to select the subset of columns having specific data types; for that, the select_dtypes function can be used:

df.select_dtypes(include=['object'])
  • Selecting both rows and columns

Some of you may wonder whether pandas is so weak that only one axis can be selected at a time, either a set of rows or of columns. No: we can select a subset of rows and columns at a single time.

df.iloc[0:2][['Segment', 'Type']]
df.iloc[0:2, 1:3]
  • Applying a filter

Now, in a real-world scenario, selecting a particular set of rows by index is quite limiting. The actual real-life requirement is to filter out the rows that satisfy a certain condition. With our dataset, we can filter by conditions such as the following:

df[df['Type']=='large']
df[(df['Type']=='large') & (df['Segment']=='luxury')]
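
Two further filter styles that are often convenient (both isin and query are standard pandas, shown here on our example df):

df[df['Company'].isin(['Kia', 'Honda'])]             # membership test
df.query("Type == 'large' and Segment == 'luxury'")  # string-expression filter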

4. Handling and Transforming the Data

After doing basic exploratory analysis on the data, it's time to handle missing values and transform the data to perform some more advanced exploration.

  • Missing data handling

Handling missing values is one of the trickiest and most crucial parts of data manipulation, because replacing the missing cells changes the distribution of the data. Depending on the characteristics of the dataset and the task, we can choose to:

  • Drop missing values: we can drop a row or column having missing values. In scenarios where more than 40% of a column has missing values, the whole column is usually dropped from the analysis. Dropping on the row axis eliminates the entire row from the observations, reducing the size of the dataframe (see the dropna sketch below).

  • Replace missing values: depending on the distribution of the column, we can replace the missing values with a special value or an aggregate such as the mean, the median, or another dynamic value such as the average of similar observations. For time-series data, missing values are generally filled from a window of values before and after the observation.
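
A minimal sketch of the dropping strategy, with the 40% column threshold from above expressed through dropna's thresh argument (thresh counts the required non-NA values, so the 0.6 factor is simply the complement of that 40% assumption):

df.dropna(axis=0)                             # drop every row that contains a missing value
df.dropna(axis=1, thresh=int(0.6 * len(df)))  # drop columns with more than 40% missing values

The replacement strategy from the second bullet looks like this: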

df['CrashRating'].fillna(df['CrashRating'].mean())  # fill with the column mean
df.fillna(axis=0, method='ffill', limit=1)          # forward-fill at most one consecutive gap
  • Dropping columns or rows

Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly.

Dropping a column is helpful with a labeled dataframe where we want to remove y_true from the training and test data.

df.drop(['CrashRating'], axis=1)
df.drop([0, 1], axis=0)
  • Group By

In many situations, we split the data into sets, such as grouping records into buckets by categorical values, and apply some function to each subset. In the apply step, we can perform the following operations (see the sketch after the code below):

  • Aggregation − computing a summary statistic
  • Transformation − performing some group-specific operation
  • Filtration − discarding the data based on some condition
df.groupby(['Company','Segment']).mean(numeric_only=True)  # numeric_only avoids errors on non-numeric columns in recent pandas
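
A minimal sketch of all three operations, grouping by Company alone for brevity (column names follow our example df):

grouped = df.groupby('Company')
grouped['CrashRating'].agg(['mean', 'count'])                  # aggregation
grouped['CustomerFeedback'].transform(lambda s: s - s.mean())  # transformation
grouped.filter(lambda g: g['CrashRating'].mean() > 3.5)        # filtration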
  • Pivot table

This is much like "groupby", which is also composed of counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you choose the rows and columns to aggregate on and the values for those rows and columns. It allows us to summarize data grouped by different values, including values in categorical columns.

The pivot table function in pandas takes certain arguments as input:

  • index, columns: the keys to group by along the rows and columns of the resulting table

  • values: the name of the column of values to be aggregated in the resulting table, grouped by the index and columns and aggregated according to the aggregation function

  • aggfunc (aggregation function): how rows are summarized, such as sum, mean, or count

df.pivot_table(index=['Company', 'Type'], columns=['Segment'], values=['CrashRating'], aggfunc='mean')
  • Merge and Concatenation

When importing data from multiple files into separate dataframes, it becomes necessary to concat, merge, or join those files into one.

  • concat() − performs all the concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes

df_new = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "luxury", "luxury"],
    "Type": ["large", "small", "large", "large", "large"],
    "CrashRating": [3.8, 3.5, 4, 4.2, 3],
    "CustomerFeedback": [8, 7, 7, 6, 7.5]})
df_result = pd.concat([df, df_new])
df_result = pd.concat([df, df_new], keys=['old', 'new'])
  • merge() − pandas has in-memory join operations very similar to relational databases like SQL, where we use "join" to combine two tables on a common key:

df_launchingyear = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "LaunchingYear": [2015, 2018, 2017, 2012, 2019]})
pd.merge(df, df_launchingyear, on='Company')
  • Creating dummy variables

Categorical variables whose type is 'object' cannot be used as-is for training an ML model; we need to create dummy variables for such columns using pandas' get_dummies function:

pd.get_dummies(df['Company'])
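
In practice, the dummy columns are usually joined back onto the dataframe in place of the original categorical column. A minimal sketch (the prefix and variable names here are just illustrative):

dummies = pd.get_dummies(df['Company'], prefix='Company')
df_encoded = pd.concat([df.drop(columns=['Company']), dummies], axis=1)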

Saving a dataframe

After performing exploratory analysis on the dataset, we may want to store the results as a new CSV file, for example a table returned by the pivot_table function, a dataframe with unnecessary details filtered out, or a new dataframe obtained from a concat or merge operation.

Exporting the results as a CSV file is a simple step: we just need to call the to_csv() function with some arguments similar to those we used while reading data from CSV.

df.to_csv('./data.csv', index_label=False)

Conclusion

In this article, we have listed some general pandas functions used to analyze the datasets we gather while working with Python and Jupyter Notebooks. We are sure these simple hacks will be of use to you and that you will take something away from this article. Till then, happy coding!

Let us know if you like the blog; please comment with any queries or suggestions, and follow us on LinkedIn and Instagram. Your love and support inspire us to share our learning in a much better way!

Translated from: https://medium.com/@datasciencewhoopees/exploring-datasets-with-pandas-inbuilt-functionality-6c322c0cdd7d
