Fluent Pandas

Let’s uncover the practical details of Pandas’ Series, DataFrame, and Panel.

Note to the readers: Paying attention to the comments in the examples will be more helpful than going through the theory itself.

· Series (1D data structure: Column-vector of DataTable)
· DataFrame (2D data structure: Table)
· Panel (3D data structure)

Pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Pandas Data Structures

Refer to Intro to Data Structures in the pandas docs.

The primary data structures in pandas are implemented as two classes: DataFrame and Series.

Import numpy and pandas into your namespace:

import numpy as np
import pandas as pd
import matplotlib as mpl

np.__version__
pd.__version__
mpl.__version__

Series (1D data structure: Column-vector of DataTable)

CREATING SERIES

Series is a one-dimensional array having elements with (possibly non-unique) labels, and is capable of holding any data type. The axis labels are collectively referred to as the index. The general way to create a Series is to call:

pd.Series(data, index=index)

Here, data can be a NumPy ndarray, a Python dict, or a scalar value (like 5). The passed index is a list of axis labels.

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.

If data is a list or ndarray (preferred way):

If data is an ndarray or list, then index must be of the same length as data. If no index is passed, one will be created having values [0, 1, 2, ... len(data) - 1].

If data is a scalar value:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

If data is a dict:

If data is a dict and index is passed, the values in data corresponding to the labels in the index will be pulled out; otherwise, an index will be constructed from the sorted keys of the dict, if possible.
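
These three creation paths can be sketched as follows (a minimal sketch; the variable names and values are illustrative):

```python
import pandas as pd

# From a list: a default index [0, 1, 2] is created
s1 = pd.Series([10, 20, 30])

# From a scalar: the value is repeated to match the passed index
s2 = pd.Series(5.0, index=['a', 'b', 'c'])

# From a dict: a passed index pulls out matching values; missing labels become NaN
s3 = pd.Series({'a': 1.0, 'b': 2.0}, index=['b', 'a', 'c'])
```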

SERIES IS LIKE NDARRAY AND DICT COMBINED

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations like slicing also slice the index. Series can be passed to most NumPy methods expecting a 1D ndarray.

A key difference between Series and ndarray is the automatic alignment of the data based on labels during Series operations. Thus, you can write computations without giving consideration to whether the Series objects involved have the same labels.

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN).
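
The code example for this paragraph appears to have been lost in extraction; a minimal sketch of label alignment (index labels illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# s[1:] has labels b, c, d; s[:-1] has labels a, b, c.
# The result holds the union of labels; 'a' and 'd' exist on only one side -> NaN.
result = s[1:] + s[:-1]
```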

Also note that when an index label is duplicated (say 'b'), s['b'] returns a pandas.core.series.Series rather than a scalar.
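
A quick sketch of that duplicate-label behavior (labels illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'b'])

unique_hit = s['a']   # unique label -> scalar
dup_hit = s['b']      # duplicated label -> Series holding both rows
```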

Series is also like a fixed-size dict on which you can get and set values by index label. If a label is not contained while reading a value, a KeyError exception is raised. Using the get method, a missing label will return None or a specified default.
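
A minimal sketch of the dict-like access and get:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

s['a']                  # existing label -> scalar
s.get('z')              # missing label -> None
s.get('z', default=-1)  # missing label with a default -> -1
```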

SERIES NAME ATTRIBUTE

Series can also have a name attribute (like DataFrame can have a columns attribute). This is important as a DataFrame can be seen as a dict of Series objects.

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of a DataFrame.

For example,

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
d = pd.DataFrame(d)
d
#    one  two
# 0  1.0  4.0
# 1  2.0  3.0
# 2  3.0  2.0
# 3  4.0  1.0
type(d)                 #=> pandas.core.frame.DataFrame
d.columns               #=> Index(['one', 'two'], dtype='object')
d.index                 #=> RangeIndex(start=0, stop=4, step=1)

s = d['one']
s
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# Name: one, dtype: float64
type(s)                 #=> pandas.core.series.Series
s.name                  #=> 'one'
s.index                 #=> RangeIndex(start=0, stop=4, step=1)

You can rename a Series with the pandas.Series.rename() method or by just assigning a new name to the name attribute. Note that rename() returns a copy by default; pass inplace=True to modify the Series in place, as below.

s = pd.Series(np.random.randn(5), name='something')
id(s)                             #=> 4331187280
s.name                            #=> 'something'

s.name = 'new_name'
id(s)                             #=> 4331187280
s.name                            #=> 'new_name'

s.rename("yet_another_name", inplace=True)
id(s)                             #=> 4331187280
s.name                            #=> 'yet_another_name'

COMPLEX TRANSFORMATIONS ON SERIES USING SERIES.APPLY

NumPy is a popular toolkit for scientific computing. Pandas’ Series can be used as an argument to most NumPy functions.

For complex single-column transformations, you can use Series.apply. Like Python’s map function, Series.apply accepts as an argument a lambda function which is applied to each value.
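
A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Apply a lambda to each value, like Python's map
squared = s.apply(lambda x: x ** 2)
```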

DataFrame (2D data structure: Table)

Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects. Like Series, DataFrame accepts many different kinds of inputs.

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. Note that index can have non-unique elements (like that of Series). Similarly, column names can also be non-unique.

If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index (similar to passing a dict as data to Series).

If axis labels (index) are not passed, they will be constructed from the input data based on common-sense rules.

CREATING DATAFRAME

From a dict of ndarrays/lists:

The ndarrays/lists must all be of the same length. If an index is passed, it must clearly also be of the same length as the data ndarrays. If no index is passed, the implicit index will be range(n), where n is the array length.

For example,
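
The original code example appears to have been lost in extraction; a minimal reconstruction (values illustrative):

```python
import pandas as pd

d = {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

df = pd.DataFrame(d)                               # implicit index: range(4)
df2 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])  # explicit index of the same length
```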

From a dict of Series (preferred way):

The resulting index will be the union of the indexes of the various Series (each Series may be of a different length and may have a different index). If there are nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the list of dict keys. For example,
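
The original code example appears to have been lost in extraction; a minimal reconstruction (values illustrative):

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

# The index is the union of the Series indexes; 'one' has no 'd' -> NaN
df = pd.DataFrame(d)
```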

The row and column labels can be accessed respectively by accessing the index and columns attributes.

From a list of dicts:

For example,

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)
#    a   b     c
# 0  1   2   NaN
# 1  5  10  20.0

pd.DataFrame(data2, index=['first', 'second'])
#         a   b     c
# first   1   2   NaN
# second  5  10  20.0

pd.DataFrame(data2, columns=['a', 'b'])
#    a   b
# 0  1   2
# 1  5  10

From a Series:

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).

For example,

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
type(s)                 #=> pandas.core.series.Series

df2 = pd.DataFrame(s)
df2
#      0
# a  1.0
# b  2.0
# c  3.0
type(df2)               #=> pandas.core.frame.DataFrame
df2.columns             #=> RangeIndex(start=0, stop=1, step=1)
df2.index               #=> Index(['a', 'b', 'c'], dtype='object')

From a Flat File

Using pandas.read_csv (preferred way):

You can read CSV files into a DataFrame using the pandas.read_csv() method. Refer to the official docs for its signature.

For example,
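
The original example was lost; a hedged sketch using an in-memory buffer in place of a file on disk (column names illustrative; with a real file you would call pd.read_csv('data.csv')):

```python
import pandas as pd
from io import StringIO

# Stand-in for a CSV file on disk
csv_data = StringIO("name,score\nalice,1\nbob,2\n")

df = pd.read_csv(csv_data)
```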

CONSOLE DISPLAY AND SUMMARY

Some helpful methods and attributes:

Wide DataFrames will be printed (print) across multiple rows by default. You can change how much to print on a single row by setting the display.width option. You can adjust the max width of the individual columns by setting display.max_colwidth.

pd.set_option('display.width', 40)         # default is 80
pd.set_option('display.max_colwidth', 30)

You can also disable the expand_frame_repr option. This will print the table in one block.

INDEXING ROWS AND SELECTING COLUMNS

The basics of DataFrame indexing and selecting are as follows:

For example,

d = {
  'one': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'a']),
  'two': pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'a'])
}
df = pd.DataFrame(d)
df
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df['one'])                   #=> pandas.core.series.Series
df['one']
# a    1.0
# b    2.0
# c    3.0
# a    4.0
# Name: one, dtype: float64

type(df[['one']])                 #=> pandas.core.frame.DataFrame
df[['one']]
#    one
# a  1.0
# b  2.0
# c  3.0
# a  4.0

type(df[['one', 'two']])          #=> pandas.core.frame.DataFrame
df[['one', 'two']]
#    one two
# a  1.0   A
# b  2.0   B
# c  3.0   C
# a  4.0   D

type(df.loc['a'])                 #=> pandas.core.frame.DataFrame
df.loc['a']
#    one two
# a  1.0   A
# a  4.0   D

type(df.loc['b'])                 #=> pandas.core.series.Series
df.loc['b']
# one    2
# two    B
# Name: b, dtype: object

type(df.loc[['a', 'c']])          #=> pandas.core.frame.DataFrame
df.loc[['a', 'c']]
#    one two
# a  1.0   A
# a  4.0   D
# c  3.0   C

type(df.iloc[0])                  #=> pandas.core.series.Series
df.iloc[0]
# one    1
# two    A
# Name: a, dtype: object

df.iloc[1:3]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 2]]
#    one two
# b  2.0   B
# c  3.0   C

df.iloc[[1, 0, 1, 0]]
#    one two
# b  2.0   B
# a  1.0   A
# b  2.0   B
# a  1.0   A

df.iloc[[True, False, True, False]]
#    one two
# a  1.0   A
# c  3.0   C

COLUMN ADDITION AND DELETION

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index (that is, only values whose index matches the DataFrame’s existing index will be added, and any missing index will get NaN, of the same dtype as that particular column, as value).

When inserting a column with a scalar value, it will naturally be propagated to fill the column.

When you insert an ndarray or list of the same length as the DataFrame, it simply uses the existing index of the DataFrame. But try not to use ndarrays or lists directly with DataFrames; instead, first convert them to Series as follows:

df['yet_another_col'] = array_of_same_length_as_df
# is same as
df['yet_another_col'] = pd.Series(array_of_same_length_as_df, index=df.index)

For example,
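
The original example appears to have been lost; a minimal sketch of getting, setting, and deleting columns:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]})

df['two'] = df['one'] * 2     # derived column
df['flag'] = df['one'] > 2.0  # boolean column
df['const'] = 5               # scalar is broadcast to fill the column

del df['flag']                # delete like a dict key
```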

By default, columns get inserted at the end. The insert() method is available to insert at a particular location in the columns.

Columns can be deleted using del, like keys of a dict.

DATA ALIGNMENT AND ARITHMETIC

Arithmetics between DataFrame objects:

Data between DataFrame objects automatically aligns on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels. For example,
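
A minimal sketch (column names illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
df2 = pd.DataFrame({'B': [10.0, 20.0], 'C': [30.0, 40.0]})

# The result has the union of columns; cells present on only one side are NaN
result = df1 + df2
```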

Important: You might like to try the above example with duplicate column names and index values in each individual data frame.

Boolean operators (for example, df1 & df2) work as well.

Arithmetics between DataFrame and Series:

When doing an operation between a DataFrame and a Series, the default behavior is to align the Series index on the DataFrame columns and broadcast row-wise, and then the arithmetic is performed. For example,
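
A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# The Series index is matched against the columns, then broadcast down the rows
result = df - df.iloc[0]
```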

DataFrame Series 之间进行操作时 ,默认行为是按 广播 Series 以匹配 DataFrame 中的行 ,然后执行算术运算。 例如,

In the special case of working with time-series data, where the DataFrame index also contains dates, the broadcasting will be column-wise.

Here pd.date_range() is used to create a fixed-frequency DatetimeIndex, which is then used as the index (rather than the default index of 0, 1, 2, ...) for a DataFrame.

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
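
As a hedged sketch of that explicit control: the flexible methods (sub, add, ...) take an axis argument to match on the row index and broadcast across columns (in modern pandas this explicit axis=0 form is the reliable way to get the column-wise behavior described above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]},
                  index=pd.date_range('2000-01-01', periods=3))

# Match df['A'] on the row (date) index and broadcast across the columns
result = df.sub(df['A'], axis=0)
```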

Arithmetics between DataFrame and Scalars:

Operations with scalars are just as you would expect: broadcast to each cell (that is, to all columns and rows).
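
For instance:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# The scalar operations are broadcast to every cell
result = df * 10 + 1
```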

DATAFRAME METHODS AND FUNCTIONS

Evaluating strings describing operations using the eval() method

Note: Rather use the assign() method.

The eval() method evaluates a string describing operations on DataFrame columns. It operates on columns only, not specific rows or elements. Note that eval() runs arbitrary code, which can make you vulnerable to code injection if you pass user input into this function.

df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df
#    A   B
# 0  1  10
# 1  2   8
# 2  3   6
# 3  4   4
# 4  5   2

df.eval('2*A + B')
# 0    12
# 1    12
# 2    12
# 3    12
# 4    12
# dtype: int64

Assignment is allowed, though by default the original DataFrame is not modified. Use inplace=True to modify the original DataFrame. For example,

df.eval('C = A + 2*B', inplace=True)
df
#    A   B   C
# 0  1  10  21
# 1  2   8  18
# 2  3   6  15
# 3  4   4  12
# 4  5   2   9

Assigning new columns to copies in method chains — the assign() method

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

The assign() method always returns a copy of the data, leaving the original DataFrame untouched.

Note: Also check the pipe() method.

df2 = df.assign(one_ratio = df['one']/df['out_of'])
df2
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df)                  #=> 4436438040
id(df2)                 #=> 4566906360

Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

df3 = df.assign(one_ratio = lambda x: (x['one']/x['out_of']))
df3
#    one  two  one_trunc  out_of  const  one_ratio
# a  1.0  1.0        1.0     100      1       0.01
# b  2.0  2.0        2.0     100      1       0.02
# c  3.0  3.0        NaN     100      1       0.03
# d  NaN  4.0        NaN     100      1        NaN

id(df)                  #=> 4436438040
id(df3)                 #=> 4514692848

This way you can remove a dependency by not having to use the name of the DataFrame.

Appending rows with the append() method

The append() method appends the rows of other_data_frame to the end of the current DataFrame, returning a new object. Columns not in the current DataFrame are added as new columns.

Its most useful syntax is:

<data_frame>.append(other_data_frame, ignore_index=False)

Here,

  • other_data_frame: Data to be appended, in the form of a DataFrame, a Series/dict-like object, or a list of these.

  • ignore_index: By default it is False. If it is True, then the index labels of other_data_frame are ignored.

Note: Also check the concat() function.

For example,

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
#    A  B
# 0  1  2
# 1  3  4

df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
#    A  B
# 0  5  6
# 1  7  8

df.append(df2)
#    A  B
# 0  1  2
# 1  3  4
# 0  5  6
# 1  7  8

df.append(df2, ignore_index=True)
#    A  B
# 0  1  2
# 1  3  4
# 2  5  6
# 3  7  8

The drop() method

Note: Rather use del as stated in the Column Addition and Deletion section, and indexing + re-assignment for keeping specific rows.

The drop() method removes rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level.

The values attribute and the copy() method

The values attribute

The values attribute returns a NumPy representation of a DataFrame’s data. Only the values in the DataFrame are returned; the axes labels are removed. A DataFrame with mixed-type columns (e.g. str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types.

Check the Console Display section for an example.

The copy() method

The copy() method makes a copy of the DataFrame object’s indices and data, as deep is True by default. So, modifications to the data or indices of the copy will not be reflected in the original object.

If deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Its syntax is:

df.copy(deep=True)

Transposing using the T attribute or the transpose() method

Refer to the section Arithmetic, matrix multiplication, and comparison operations.

To transpose, you can call the transpose() method, or use the attribute T, which is an accessor to the transpose() method.

The result is a DataFrame that reflects the original DataFrame over its main diagonal, writing rows as columns and vice versa. Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.

For example,
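
The original example was lost; a minimal sketch (note the mixed dtypes collapsing to object):

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': ['A', 'B']}, index=['a', 'b'])

dft = df.T   # same as df.transpose()
```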

Sorting (sort_values(), sort_index()), Grouping (groupby()), and Filtering (filter())

The sort_values() method

A DataFrame can be sorted by a column (or by multiple columns) using the sort_values() method.

For example,
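
The original example was lost; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A'], 'col2': [2, 1, 0]})

by_one = df.sort_values(by='col2')
by_two = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
```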

The sort_index() method

The sort_index() method can be used to sort by index.

For example,

The groupby() method

The groupby() method is used to group by a function, a label, or a list of labels.

For example,
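
The original example was lost; a minimal sketch of grouping by a label and aggregating:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

sums = df.groupby('key')['val'].sum()
```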

The filter() method

The filter() method returns a subset of rows or columns of a DataFrame according to labels in the specified index. Note that this method does not filter a DataFrame on its contents; the filter is applied to the labels of the index, or to the column names.

You can use the items, like, and regex parameters, but note that they are enforced to be mutually exclusive. The axis parameter defaults to the info axis that is used when indexing with [].

For example,
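
The original example was lost; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

by_items = df.filter(items=['one', 'three'])
by_regex = df.filter(regex='^t')   # column names starting with 't'
```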

Melting and Pivoting using melt() and pivot() methods

The idea of melt() is to keep a few given columns as id-columns and convert the rest of the columns (called variable-columns) into variable and value, where variable tells you the original column name and value is the corresponding value in the original column.

If there are n variable-columns which are melted, then the information from each row of the original table is spread across n rows of the result.

The idea of pivot() is to do just the reverse.

For example,
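
The original example was lost; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y'], 'a': [1, 2], 'b': [3, 4]})

# 'name' stays as the id-column; 'a' and 'b' melt into variable/value rows
melted = pd.melt(df, id_vars=['name'])

# pivot() reverses the reshaping
restored = melted.pivot(index='name', columns='variable', values='value')
```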

Piping (chaining) Functions using pipe() method

Suppose you want to apply a function to a DataFrame, Series, or groupby object, then apply another function to its output, and so on. One way would be to perform this operation in a “sandwich”-like fashion:

Note: Also check assign() method.

df = foo3(foo2(foo1(df, arg1=1), arg2=2), arg3=3)

In the long run, this notation becomes fairly messy. What you want to do here is to use pipe(). Pipe can be thought of as function chaining. This is how you would perform the same task as before with pipe():

df.pipe(foo1, arg1=1) \
  .pipe(foo2, arg2=2) \
  .pipe(foo3, arg3=3)

This is a cleaner way that helps keep track of the order in which the functions and their corresponding arguments are applied.

Rolling Windows using rolling() method

Use DataFrame.rolling() for rolling window calculation.
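
A minimal sketch of a rolling mean:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Mean over a window of 2; the first row has no complete window -> NaN
rolled = s.rolling(window=2).mean()
```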

Other DataFrame Methods

Refer to the Methods section in pd.DataFrame.

Refer to the Computational Tools User Guide.

Refer to the categorical listing in the Pandas API.

APPLYING FUNCTIONS

The apply() method: apply on columns/rows

The apply() method applies the given function along an axis (by default, on columns) of the DataFrame.

Its most useful form is:

df.apply(func, axis=0, args=(), **kwds)

Here:

  • func: The function to apply to each column or row. Note that it can be an element-wise function (in which case axis=0 or axis=1 doesn’t make any difference) or an aggregate function.

  • axis: Its value can be 0 (default, column) or 1. 0 means applying the function to each column, and 1 means applying the function to each row. Note that this axis is similar to how axes are defined in NumPy: for a 2D ndarray, 0 means columns.

  • args: A tuple of positional arguments to pass to func in addition to the array/series.

  • **kwds: Additional keyword arguments to pass to func.

It returns a Series or a DataFrame.

For example,
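
The original example was lost; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_sums = df.apply(np.sum, axis=0)   # aggregate each column
row_sums = df.apply(np.sum, axis=1)   # aggregate each row
```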

The applymap() method: apply element-wise

The applymap() method applies the given function element-wise. So the given func must accept and return a scalar for every element of a DataFrame.

Its general syntax is:

df.applymap(func)

For example,
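
The original example was lost; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# func receives and returns a scalar, applied to every element
squared = df.applymap(lambda x: x ** 2)
```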

When you need to apply a function element-wise, you might like to check first if a vectorized version is available. A vectorized version of func often exists and will be much faster. You could square each number element-wise using df.applymap(lambda x: x**2), but the vectorized version df**2 is better.

WORKING WITH MISSING DATA

Refer to SciKit-Learn’s Data Cleaning section.

Refer to the Missing Data Guide and the API Reference for Missing Data Handling: dropna, fillna, replace, interpolate.

Also check the Data Cleaning section of the tf.feature_column API for other options.

Also go through https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

NORMALIZING DATA

One way is to perform df / df.iloc[0], which is particularly useful while analyzing stock prices over a period of time for multiple companies.
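
A minimal sketch (ticker names illustrative):

```python
import pandas as pd

prices = pd.DataFrame({'AAA': [10.0, 12.0, 11.0],
                       'BBB': [100.0, 90.0, 120.0]})

# Normalize each column relative to its first row
normalized = prices / prices.iloc[0]
```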

THE CONCAT() FUNCTION

The concat() function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axis.

The default axis of concatenation is axis=0, but you can choose to concatenate data frames sideways by choosing axis=1.

Note: Also check the append() method.

For example,

df1 = pd.DataFrame(
    {
        'A': ['A0', 'A1', 'A2', 'A3'],
        'B': ['B0', 'B1', 'B2', 'B3'],
        'C': ['C0', 'C1', 'C2', 'C3'],
        'D': ['D0', 'D1', 'D2', 'D3']
    }, index=[0, 1, 2, 3])

df2 = pd.DataFrame(
    {
        'A': ['A4', 'A5', 'A6', 'A7'],
        'B': ['B4', 'B5', 'B6', 'B7'],
        'C': ['C4', 'C5', 'C6', 'C7'],
        'D': ['D4', 'D5', 'D6', 'D7']
    }, index=[4, 5, 6, 7])

df3 = pd.DataFrame(
    {
        'A': ['A8', 'A9', 'A10', 'A11'],
        'B': ['B8', 'B9', 'B10', 'B11'],
        'C': ['C8', 'C9', 'C10', 'C11'],
        'D': ['D8', 'D9', 'D10', 'D11']
    }, index=[1, 2, 3, 4])

frames = [df1, df2, df3]

df4 = pd.concat(frames)
df4
#      A    B    C    D
# 0   A0   B0   C0   D0
# 1   A1   B1   C1   D1
# 2   A2   B2   C2   D2
# 3   A3   B3   C3   D3
# 4   A4   B4   C4   D4
# 5   A5   B5   C5   D5
# 6   A6   B6   C6   D6
# 7   A7   B7   C7   D7
# 1   A8   B8   C8   D8
# 2   A9   B9   C9   D9
# 3  A10  B10  C10  D10
# 4  A11  B11  C11  D11

df5 = pd.concat(frames, ignore_index=True)
df5
#       A    B    C    D
# 0    A0   B0   C0   D0
# 1    A1   B1   C1   D1
# 2    A2   B2   C2   D2
# 3    A3   B3   C3   D3
# 4    A4   B4   C4   D4
# 5    A5   B5   C5   D5
# 6    A6   B6   C6   D6
# 7    A7   B7   C7   D7
# 8    A8   B8   C8   D8
# 9    A9   B9   C9   D9
# 10  A10  B10  C10  D10
# 11  A11  B11  C11  D11

df5 = pd.concat(frames, keys=['s1', 's2', 's3'])
df5
#         A    B    C    D
# s1 0   A0   B0   C0   D0
#    1   A1   B1   C1   D1
#    2   A2   B2   C2   D2
#    3   A3   B3   C3   D3
# s2 4   A4   B4   C4   D4
#    5   A5   B5   C5   D5
#    6   A6   B6   C6   D6
#    7   A7   B7   C7   D7
# s3 1   A8   B8   C8   D8
#    2   A9   B9   C9   D9
#    3  A10  B10  C10  D10
#    4  A11  B11  C11  D11

df5.index
# MultiIndex(levels=[['s1', 's2', 's3'], [0, 1, 2, 3, 4, 5, 6, 7]],
#            labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4]])

Like its sibling function on ndarrays, numpy.concatenate(), pandas.concat() takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of “what to do with the other axes”.

MERGING AND JOINING USING MERGE() AND JOIN() FUNCTIONS

使用 MERGE() JOIN() 函数 合并和加入

Refer Mergem Join, and Concatenate official guide.

请参阅 Mergem Join和Concatenate 官方指南。

The function merge() merges DataFrame object by performing a database-style join operation by columns or indexes.

函数 merge() 通过按列或索引执行数据库样式的 DataFrame 操作来 合并 DataFrame 对象。

The function join() joins columns with other DataFrame either on index or on a key column.

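A small sketch of both (the frames and keys below are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'rval': [4, 5, 6]})

# Database-style inner join on the 'key' column: only K0 and K1 match
merged = pd.merge(left, right, on='key', how='inner')

# join() aligns on the index by default, so move 'key' into the index first;
# a left join keeps K2 with a missing rval
joined = left.set_index('key').join(right.set_index('key'), how='left')
```

`merged` has two rows (K0 and K1), while `joined` keeps all three left-hand keys, with NaN in `rval` for K2.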

BINARY DUMMY VARIABLES FOR CATEGORICAL VARIABLES USING GET_DUMMIES() FUNCTION

A categorical variable can be converted into a “dummy” DataFrame using get_dummies():

df = pd.DataFrame({'char': list('bbacab'), 'data1': range(6)})
df
#   char  data1
# 0    b      0
# 1    b      1
# 2    a      2
# 3    c      3
# 4    a      4
# 5    b      5

dummies = pd.get_dummies(df['char'], prefix='key')
dummies
#    key_a  key_b  key_c
# 0      0      1      0
# 1      0      1      0
# 2      1      0      0
# 3      0      0      1
# 4      1      0      0
# 5      0      1      0
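A common follow-up step (sketched here; it is not part of the original example) is to attach the indicator columns back to the frame in place of the categorical column:

```python
import pandas as pd

df = pd.DataFrame({'char': list('bbacab'), 'data1': range(6)})
dummies = pd.get_dummies(df['char'], prefix='key')

# Replace the categorical column with its indicator columns
encoded = df.drop(columns='char').join(dummies)
```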

PLOTTING DATAFRAME USING PLOT() FUNCTION

The plot() function makes plots of a DataFrame using matplotlib/pylab.

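For example (a sketch using the non-interactive Agg backend so the script runs without a display; the data is just random):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; no display needed
import numpy as np
import pandas as pd

# Two cumulative random walks; plot() draws one line per DataFrame column
df = pd.DataFrame(np.random.randn(50, 2).cumsum(axis=0), columns=['A', 'B'])
ax = df.plot(title='Two random walks')
ax.figure.savefig('walks.png')
```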

Panel (3D data structure)

Panel is a container for 3D data. The term panel data is derived from econometrics and is partially responsible for the name: pan(el)-da(ta)-s.

The 3D structure of a Panel is much less common for many types of data analysis than the 1D Series or the 2D DataFrame. Oftentimes, one can simply use a MultiIndex DataFrame to work with higher-dimensional data. Refer to Deprecate Panel.

Series的1D或DataFrame的2D相比, Panel的3D结构在许多类型的数据分析中要少DataFrame 。 通常,人们可以简单地使用Multi-index DataFrame来轻松处理高维数据。 请参阅“ 不赞成使用面板” 。

Here are some related interesting stories that you might find helpful:

  • Fluent NumPy

  • Distributed Data Processing with Apache Spark

  • Apache Cassandra — Distributed Row-Partitioned Database for Structured and Semi-Structured Data

  • The Why and How of MapReduce

Translated from: https://medium.com/analytics-vidhya/fluent-pandas-22473fa3c30d
