Pandas Datasets

Tips and Tricks for Data Science

Pandas is a powerful and easy-to-use software library written in the Python programming language, and is used for data manipulation and analysis.

Installing pandas: https://pypi.org/project/pandas/

pip install pandas

What is a Pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that stores data in tabular form. Every row and column is labeled, and a DataFrame can hold data of any type.

Here is an example:

First 3 rows of the Titanic: Machine Learning from Disaster dataset

1. Creating a pandas DataFrame

The pandas.DataFrame constructor:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

data: This parameter serves as the input for the DataFrame and can be a NumPy ndarray, an iterable, a dict or another DataFrame. An ndarray is a multidimensional container of items of the same type and size. An iterable is any Python object capable of returning its members one at a time, so it can be iterated over in a for loop; lists, tuples and sets are examples. A dict can hold pandas Series, arrays, constants or list-like objects as its values.

index: This parameter can be an Index or an array-like object and provides the row labels of the resulting DataFrame. If the input data carries no index information and no index is provided, it defaults to a RangeIndex.

columns: This parameter can be an Index or an array-like object and provides the column labels of the resulting DataFrame. If no column labels are provided and none can be taken from the data (dict keys, for example), it defaults to a RangeIndex.

dtype: Each column in the DataFrame can only have a single data type. This parameter is used to force a certain data type. By default, the data type is inferred from the data.

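copy: When this parameter is set to True and the input data is a DataFrame or a 2D ndarray, the data is copied into the resulting DataFrame. By default, copy is set to False (recent pandas versions use a default of None, which behaves like False for these two input types).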
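
The parameters described above can be tried together in a minimal sketch. The array contents and the labels row_a, row_b, x, y and z are made up for illustration and do not come from any dataset in this article.

```python
import numpy as np
import pandas as pd

# A 2x3 integer array used as the input data (illustrative values).
values = np.array([[1, 2, 3], [4, 5, 6]])

# With no labels given, both axes fall back to a RangeIndex.
default_df = pd.DataFrame(values)

# index and columns supply the row and column labels,
# dtype forces every column to float64, copy=True copies the array.
df = pd.DataFrame(
    values,
    index=["row_a", "row_b"],
    columns=["x", "y", "z"],
    dtype="float64",
    copy=True,
)

# Because the data was copied, changing the array leaves df untouched.
values[0, 0] = 99

print(default_df)
print(df)
print(df.dtypes)
```

Passing copy=True is the safer choice whenever the source array will be modified later.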

Creating a Pandas DataFrame from a Python Dictionary

import pandas as pd

d = {'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]}
pd.DataFrame(d)

The index parameter can be used to change the default row index, and the columns parameter can be used to select which dict keys to include and to change their order:

d = {'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]}
# columns must match the dict keys; labels not found in d produce all-NaN columns
pd.DataFrame(d, index=[10, 20, 30], columns=['Age', 'Name'])

Creating a Pandas DataFrame from a List

l = [['John', 25], ['Adam', 18], ['Jane', 30]]
pd.DataFrame(l, columns=['Name', 'Age'])

Creating a Pandas DataFrame from a File

In most data science workflows, the dataset is stored in files with formats such as CSV (comma-separated values). pandas can load the data, together with its labels, from a CSV file using the function pandas.read_csv().
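
As a minimal sketch of pandas.read_csv(), the snippet below reads CSV text from an in-memory buffer so that it runs without a file on disk. The CSV content is made up for illustration; with a real file, a path such as a hypothetical 'data.csv' would be passed instead of the buffer.

```python
import io

import pandas as pd

# Illustrative CSV content; the first line holds the column labels.
csv_text = "Name,Age\nJohn,25\nAdam,18\nJane,30\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```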

2. Selecting Rows and Columns from a Pandas DataFrame

Selecting Columns from a Pandas DataFrame

Columns can be selected using their column names.

df[[column_1, column_2]]

Selecting column ‘Name’ from DataFrame df
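
Since the original screenshot is not reproduced here, the following is a minimal sketch of column selection, assuming a small DataFrame df with the Name and Age columns used earlier. A single label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# A single column label returns a pandas Series.
names = df["Name"]

# A list of labels (note the double brackets) returns a DataFrame.
subset = df[["Name", "Age"]]

print(names)
print(subset)
```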

Selecting Rows from a Pandas DataFrame

Pandas provides two attributes for selecting rows from a DataFrame: loc and iloc.

loc is label-based, which means that the row label has to be specified, and iloc is integer-based, which means that the integer position of the row has to be specified.

Using loc and iloc for selecting rows from DataFrame df
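
The difference between the two is easiest to see when the row labels are not 0, 1, 2. The sketch below assumes a DataFrame df with the made-up row labels 10, 20 and 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]},
    index=[10, 20, 30],
)

# loc selects by row label: the row labeled 20.
row_by_label = df.loc[20]

# iloc selects by integer position: the second row (position 1).
row_by_position = df.iloc[1]

# Both also accept lists and slices.
several = df.loc[[10, 30]]   # rows labeled 10 and 30
first_two = df.iloc[0:2]     # rows at positions 0 and 1

print(row_by_label)
print(row_by_position)
```

Here df.loc[20] and df.iloc[1] refer to the same row, while df.loc[1] would raise a KeyError because no row carries the label 1.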

3. Inserting Rows and Columns to a Pandas DataFrame

Inserting Rows to a Pandas DataFrame

One method of inserting a row into a DataFrame is to create a pandas.Series() object and insert it at the end of the DataFrame using the pandas.DataFrame.append() method. The column labels of the DataFrame serve as the index attribute for the Series object. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat() is its replacement.
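
Because DataFrame.append() is not available in pandas 2.0 and later, the sketch below swaps in pandas.concat() to add the Series as a new last row. The sample data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam"], "Age": [25, 18]})

# The Series index matches the DataFrame's column labels.
new_row = pd.Series(["Jane", 30], index=df.columns)

# Turn the Series into a one-row DataFrame and concatenate it at the end.
# ignore_index=True renumbers the rows 0, 1, 2.
df = pd.concat([df, new_row.to_frame().T], ignore_index=True)

# The one-row frame has object dtype, so ask pandas to re-infer
# better dtypes for the object columns.
df = df.infer_objects()

print(df)
```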

Inserting Columns to a Pandas DataFrame

One easy method of adding a column to a DataFrame is by just referring to the new column and assigning values.

Inserting columns ID, Score and Country to DataFrame df
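
A minimal sketch of this step follows. The column names ID, Score and Country come from the caption above; the values assigned to them are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Assigning to a new column label creates the column.
df["ID"] = [101, 102, 103]
df["Score"] = [88.5, 92.0, 79.5]

# A scalar is repeated for every row.
df["Country"] = "USA"

print(df)
```

A list assigned this way must have the same length as the DataFrame, otherwise pandas raises a ValueError.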

4. Deleting Rows and Columns from a Pandas DataFrame

Deleting Rows from a Pandas DataFrame

A row can be deleted using the method pandas.DataFrame.drop() with its row label.

Deleting row with label 1 from DataFrame df
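
As a minimal sketch, assuming a DataFrame df with the default row labels 0, 1 and 2 and made-up contents:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Kelly", "Jane"], "Age": [25, 18, 30]})

# drop() returns a new DataFrame without the row labeled 1;
# the original df is left unchanged.
without_row = df.drop(1)

print(without_row)
```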

To delete a row based on a column value, first obtain the index of the matching rows using the DataFrame.index attribute, then delete the rows with that index using the pandas.DataFrame.drop() method.

Deleting row with Name Kelly from DataFrame df
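
The two steps can be sketched as follows, assuming a DataFrame df with a Name column; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Kelly", "Jane"], "Age": [25, 18, 30]})

# Step 1: filter the rows where Name is 'Kelly' and take their index labels.
labels = df[df["Name"] == "Kelly"].index

# Step 2: drop the rows carrying those labels.
df_no_kelly = df.drop(labels)

print(df_no_kelly)
```

An alternative with the same result is a boolean filter, df[df["Name"] != "Kelly"], which skips the index lookup.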

Deleting Columns from a Pandas DataFrame

A column can be deleted from a DataFrame by its label or by its position using the method pandas.DataFrame.drop(). Since drop() takes labels, a position is first converted to a label through the DataFrame.columns attribute.

Deleting column with label ‘Country’ from DataFrame df

Deleting column with position 2 from DataFrame df

The axis argument is set to 1 when dropping columns, and 0 when dropping rows.
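
Both ways of dropping a column can be sketched as below, assuming a DataFrame df whose third column (position 2) is Country; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Adam", "Jane"],
        "Age": [25, 18, 30],
        "Country": ["USA", "UK", "Canada"],
    }
)

# Drop by label: axis=1 tells drop() to look in the columns.
by_label = df.drop("Country", axis=1)

# Drop by position: look up the label at position 2, then drop it.
by_position = df.drop(df.columns[2], axis=1)

print(by_label)
print(by_position)
```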

5. Sorting a Pandas DataFrame

A Pandas DataFrame can be sorted using the pandas.DataFrame.sort_values() method. The by parameter takes the label of the column to sort by, and ascending is set to True for ascending order or to False for descending order.

Sorting DataFrame df by Name in ascending order

Sorting DataFrame df by Age in descending order
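
Both sorts can be sketched as follows, assuming a DataFrame df with Name and Age columns; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Adam", "Jane"], "Age": [25, 18, 30]})

# Sort by Name in ascending (alphabetical) order.
by_name = df.sort_values(by="Name", ascending=True)

# Sort by Age in descending order.
by_age_desc = df.sort_values(by="Age", ascending=False)

print(by_name)
print(by_age_desc)
```

sort_values() returns a new DataFrame and keeps the original row labels; chaining reset_index(drop=True) renumbers the rows if that is preferred.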

Translated from: https://medium.com/ml-course-microsoft-udacity/5-fundamental-operations-on-a-pandas-dataframe-93b4384dff9d