
Machine Learning - Data Science Libraries - Day 4

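Learning pandas
- pandas basics
  - Common pandas data types
- Creating a pandas Series
  - Slicing and indexing a Series
  - Index and values of a Series
  - Exercise
- Reading external data with pandas
- pandas DataFrame
  - Exercise
  - Selecting rows or columns
  - loc and iloc
  - Boolean indexing
  - String methods
  - Handling missing data
  - Common statistical methods
  - Hands-on
- Day 4 summary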

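Learning pandas

pandas basics

Introduction:
numpy already lets us process data and, combined with matplotlib, solve data analysis problems. So what is the point of learning pandas?

numpy handles numeric data. pandas is built on numpy and, besides numbers, also handles other kinds of data, such as strings and time series.

Common pandas data types

Series: one-dimensional, a labeled array
DataFrame: two-dimensional, a container of Series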
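A minimal sketch of the two types, using made-up names and numbers: a Series is a one-dimensional labeled array, and selecting one column of a DataFrame gives back a Series that shares the row labels.

```python
import pandas as pd

# A Series: one-dimensional values plus one label (index entry) per value.
ages = pd.Series([23, 32, 30], index=["xiaoming", "xiaogang", "xiaohong"])

# A DataFrame: a two-dimensional table; every column is a Series
# and all columns share the same row index. The values are made up.
people = pd.DataFrame(
    {"age": [23, 32, 30], "tel": [10086, 10010, 10000]},
    index=["xiaoming", "xiaogang", "xiaohong"],
)

age_column = people["age"]          # one column of a DataFrame is a Series
print(type(ages).__name__)          # Series
print(type(people).__name__)        # DataFrame
print(type(age_column).__name__)    # Series
print(age_column["xiaogang"])       # 32
```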

Creating a pandas Series

A Series is essentially made of two arrays: one holds the keys (index) and the other holds the values (values), so each key maps to a value.
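The key-to-value structure can be checked with a short sketch: take a Series apart into its index and its values, then rebuild an equal Series from the two arrays. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series({"name": "xiaohong", "age": 30, "tel": 10086})

keys = s.index       # a pandas Index holding the labels
vals = s.values      # a NumPy ndarray holding the values

# Rebuilding a Series from the two arrays gives an equal Series.
rebuilt = pd.Series(vals, index=keys)

print(list(keys))                     # ['name', 'age', 'tel']
print(isinstance(vals, np.ndarray))   # True
print(rebuilt.equals(s))              # True
```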

Slicing and indexing a Series

Index and values of a Series

Exercise

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import string

import numpy as np
import pandas as pd

# Series: a one-dimensional labeled array
# Creating a Series
t1=pd.Series([1,2,3,4,12,3,4])
print(t1)
print(type(t1))
print("*"*50)

t2=pd.Series(np.arange(10),index=list(string.ascii_uppercase[:10]))
print(t2)
print(type(t2))
print("*"*50)

temp_dict={"name":"xiaohong","age":30,"tel":10086}
t3=pd.Series(temp_dict)
print(t3)
print(type(t3))
print("*"*50)

# Slicing and indexing a Series
print(t3["age"])
print(t3["tel"])
print(t3.iloc[0])        # first element by position (newer pandas warns about t3[0] on a string index)
print(t3[:2])            # first two rows
print(t3.iloc[[0,2]])    # first and third rows, by position
print(t3[["age","name"]])
print(t1[t1>3])          # boolean indexing

# A Series is essentially two arrays: the keys (index) and the values (values), key --> value
print(t2.index)
print(t2.values)
print(type(t2.index))
print(type(t2.values))

Output:

C:\ANACONDA\python.exe "C:/Users/Lenovo/PycharmProjects/Code/day04/test pandas_Series.py"
0     1
1     2
2     3
3     4
4    12
5     3
6     4
dtype: int64
<class 'pandas.core.series.Series'>
**************************************************
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int32
<class 'pandas.core.series.Series'>
**************************************************
name    xiaohong
age           30
tel        10086
dtype: object
<class 'pandas.core.series.Series'>
**************************************************
30
10086
xiaohong
name    xiaohong
age           30
dtype: object
name    xiaohong
tel        10086
dtype: object
age           30
name    xiaohong
dtype: object
3     4
4    12
6     4
dtype: int64
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')
[0 1 2 3 4 5 6 7 8 9]
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>

Process finished with exit code 0

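Reading external data with pandas

Suppose we have a set of statistics on dog names. How do we take a look at this data?

The data is stored in a CSV file, so we can read it directly with pd.read_csv.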

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd

# Read a CSV file with pandas
df=pd.read_csv("./dogNames2.csv")
print(df)

Output:

C:\ANACONDA\python.exe C:/Users/Lenovo/PycharmProjects/Code/day04/page108.py
      Row_Labels  Count_AnimalName
0              1                 1
1              2                 2
2          40804                 1
3          90201                 1
4          90203                 1
...          ...               ...
16215      37916                 1
16216      38282                 1
16217      38583                 1
16218      38948                 1
16219      39743                 1

[16220 rows x 2 columns]

Process finished with exit code 0
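For readers without dogNames2.csv at hand, the same call can be tried on an in-memory CSV. This is only a sketch: the io.StringIO buffer stands in for the file, and the three rows are copied from the counts that appear later in this post.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the column names follow dogNames2.csv.
csv_text = """Row_Labels,Count_AnimalName
BELLA,1195
MAX,1153
CHARLIE,856
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                        # (3, 2)
print(list(df.columns))                # ['Row_Labels', 'Count_AnimalName']
print(df["Count_AnimalName"].sum())    # 3204
```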

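pandas DataFrame

A DataFrame has both a row index and a column index.
Row index: identifies the rows, runs down the table, is called index, and is axis 0 (axis=0).
Column index: identifies the columns, runs across the table, is called columns, and is axis 1 (axis=1).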
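The two axes can be sketched on a small table. In a reduction such as sum, axis=0 collapses the rows and leaves one result per column, while axis=1 collapses the columns and leaves one result per row.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4),
                 index=list("ABC"), columns=list("WXYZ"))

# axis=0 works down the rows: one result per column.
col_sums = t.sum(axis=0)
# axis=1 works across the columns: one result per row.
row_sums = t.sum(axis=1)

print(col_sums.tolist())   # [12, 15, 18, 21]
print(row_sums.tolist())   # [6, 22, 38]
```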

t1=pd.DataFrame(np.arange(12).reshape((3,4)),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))
print(t1)
print("*"*50)

Output:

   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
**************************************************

For an ndarray, shape, ndim, and dtype tell us the basics. What can we use to inspect a DataFrame?

# Passing a dict to DataFrame as data
d1={"name":["xiaoming","xiaogang"],"age":[23,32],"tel":[10086,10010]}
d1=pd.DataFrame(d1)
print(d1)
print(type(d1))
print("*"*50)

# Passing a list of dicts; keys missing from a record become NaN
d2=[{"name":"xiaoming","age":23,"tel":10086},{"name":"xiaogang","age":32},{"name":"xiaohong","tel":10010}]
d2=pd.DataFrame(d2)
print(d2)
print(type(d2))
print("*"*50)
print(d2.index)
print(d2.columns)
print(d2.values)
print(d2.shape)
print(d2.dtypes)
print(d2.ndim)
print("*"*50)

Output:

       name  age    tel
0  xiaoming   23  10086
1  xiaogang   32  10010
<class 'pandas.core.frame.DataFrame'>
**************************************************
       name   age      tel
0  xiaoming  23.0  10086.0
1  xiaogang  32.0      NaN
2  xiaohong   NaN  10010.0
<class 'pandas.core.frame.DataFrame'>
**************************************************
RangeIndex(start=0, stop=3, step=1)
Index(['name', 'age', 'tel'], dtype='object')
[['xiaoming' 23.0 10086.0]
 ['xiaogang' 32.0 nan]
 ['xiaohong' nan 10010.0]]
(3, 3)
name     object
age     float64
tel     float64
dtype: object
2
**************************************************
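Besides the attributes printed above, head, tail, and describe give a quick look at a DataFrame. The sketch below reuses the same three records; describe summarizes only the numeric columns here.

```python
import pandas as pd

records = [
    {"name": "xiaoming", "age": 23, "tel": 10086},
    {"name": "xiaogang", "age": 32},
    {"name": "xiaohong", "tel": 10010},
]
d = pd.DataFrame(records)

print(d.head(2))    # first two rows
print(d.tail(1))    # last row

# count, mean, std, min, quartiles, and max of the numeric columns
summary = d.describe()
print(summary.loc["count"].tolist())   # [2.0, 2.0]  (NaN is not counted)
print(summary.loc["mean", "age"])      # 27.5
```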

Exercise

Back to the dog name statistics we read earlier: let's try the methods we have just seen.

That raises a question:
which names are used most often?

import pandas as pd

df=pd.read_csv("./dogNames2.csv")
print(df.head(5))
print("*"*50)
print(df.info())

# Sorting a DataFrame
df=df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head(5))

# Notes on selecting rows or columns in pandas
# - a slice of numbers in the square brackets selects rows
# - a string in the square brackets is a column label and selects a column
print(df[:3])
print("*"*50)
#print(df["Row_Labels"])
#print(type(df["Row_Labels"]))
# Selecting rows and a column together
print("*"*50)
print(df[:6]["Count_AnimalName"])
print("*"*50)
# Boolean indexing
print(df[df["Count_AnimalName"]>800])
print(df[(df["Count_AnimalName"]>800)&(df["Count_AnimalName"]<1000)])
print("*"*50)
# Find the dog names used more than 700 times whose string length is greater than 4
print(df[(df["Row_Labels"].str.len()>4)&(df["Count_AnimalName"]>700)])

Output:

C:\ANACONDA\python.exe C:/Users/Lenovo/PycharmProjects/Code/day04/page113.py
  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
4      90203                 1
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16220 entries, 0 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64
dtypes: int64(1), object(1)
memory usage: 253.6+ KB
None
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
     Row_Labels  Count_AnimalName
1156      BELLA              1195
9140        MAX              1153
2660    CHARLIE               856
**************************************************
**************************************************
1156     1195
9140     1153
2660      856
3251      852
12368     823
8417      795
Name: Count_AnimalName, dtype: int64
**************************************************
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
      Row_Labels  Count_AnimalName
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
**************************************************
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
12368      ROCKY               823
8552       LUCKY               723

Process finished with exit code 0
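The "top names" step can be sketched on a few rows copied from the output above. sort_values followed by head is the approach used in this post; nlargest is shown only as an alternative that gives the same rows here. Either way the original row labels are kept, which is why the sorted output above still shows labels such as 1156.

```python
import pandas as pd

df = pd.DataFrame({"Row_Labels": ["COCO", "BELLA", "ROCKY", "MAX", "CHARLIE"],
                   "Count_AnimalName": [852, 1195, 823, 1153, 856]})

# Sort in descending order, then take the first three rows.
by_sort = df.sort_values(by="Count_AnimalName", ascending=False).head(3)
# Alternative: ask directly for the three largest counts.
by_nlargest = df.nlargest(3, "Count_AnimalName")

print(by_sort["Row_Labels"].tolist())       # ['BELLA', 'MAX', 'CHARLIE']
print(by_nlargest["Row_Labels"].tolist())   # ['BELLA', 'MAX', 'CHARLIE']
print(by_sort.index.tolist())               # [1, 3, 4]
```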

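Selecting rows or columns

# Notes on selecting rows or columns in pandas
# - a slice of numbers in the square brackets selects rows
# - a string in the square brackets is a column label and selects a column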
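The two rules above can be sketched on four rows taken from the dog name counts; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO"],
                   "Count_AnimalName": [1195, 1153, 856, 852]})

first_two_rows = df[:2]               # a slice of numbers selects rows
names = df["Row_Labels"]              # a string selects one column (a Series)
both = df[:2]["Count_AnimalName"]     # rows first, then a column

print(first_two_rows.shape)   # (2, 2)
print(names.tolist())         # ['BELLA', 'MAX', 'CHARLIE', 'COCO']
print(both.tolist())          # [1195, 1153]
```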

loc and iloc

pandas also offers selection methods that it has optimized:

df.loc selects rows by label
df.iloc selects rows by position

Changing data by assignment:

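Boolean indexing

Back to the dog names: if we want all names used more than 800 times, how do we select them?

And if we want all names used more than 700 times whose string length is greater than 4, how do we select them?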
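Both questions can be answered with boolean indexing. The sketch below uses five rows copied from the counts shown earlier in this post. Each condition needs its own parentheses, and conditions are combined with & (and) or | (or) rather than the Python keywords.

```python
import pandas as pd

df = pd.DataFrame({"Row_Labels": ["BELLA", "MAX", "CHARLIE", "LUCKY", "COCO"],
                   "Count_AnimalName": [1195, 1153, 856, 723, 852]})

# Names used more than 800 times.
over_800 = df[df["Count_AnimalName"] > 800]

# Names longer than 4 characters that are used more than 700 times.
long_and_popular = df[(df["Row_Labels"].str.len() > 4)
                      & (df["Count_AnimalName"] > 700)]

print(over_800["Row_Labels"].tolist())          # ['BELLA', 'MAX', 'CHARLIE', 'COCO']
print(long_and_popular["Row_Labels"].tolist())  # ['BELLA', 'CHARLIE', 'LUCKY']
```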

String methods
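String methods are reached through the .str accessor of a Series and are applied to every element, as with .str.len() and .str.split() elsewhere in this post. The sketch below shows a few of them on made-up values; the None entry mimics the missing names in the dog data and simply stays missing.

```python
import pandas as pd

names = pd.Series(["Bella", "max", "Charlie Brown", None])

lengths = names.str.len()                      # length of each string
upper = names.str.upper()                      # upper-case copy of each string
has_ll = names.str.contains("ll", na=False)    # treat missing values as False
parts = names.str.split(" ")                   # each string becomes a list

print(lengths.tolist()[:3])    # lengths of the three real names: 5, 3, 13
print(upper.tolist()[:3])      # ['BELLA', 'MAX', 'CHARLIE BROWN']
print(has_ll.tolist())         # [True, False, False, False]
print(parts.tolist()[2])       # ['Charlie', 'Brown']
```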

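Handling missing data

How did we handle NaN data in numpy?
In pandas it is much easier.

Checking whether data is NaN: pd.isnull(df), pd.notnull(df)

Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in values: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)

Handling data that is 0: t[t==0]=np.nan
Of course, not every 0 needs this treatment.
When computing a mean and similar statistics, nan is left out of the calculation, but 0 is included.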
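The options above, and the difference between 0 and NaN in a mean, can be sketched on a small made-up table (the column names age and score are only illustrative).

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"age": [20.0, 0.0, 40.0], "score": [1.0, np.nan, 3.0]})

print(pd.isnull(t)["score"].tolist())     # [False, True, False]

dropped = t.dropna(axis=0, how="any")     # drop rows that contain any NaN
print(dropped.shape)                      # (2, 2)

filled = t.fillna(t.mean())               # fill NaN with each column's mean
print(filled["score"].tolist())           # [1.0, 2.0, 3.0]

# A 0 counts toward the mean, a NaN does not.
mean_with_zero = t["age"].mean()          # (20 + 0 + 40) / 3
t2 = t.copy()
t2[t2 == 0] = np.nan                      # treat zeros as missing
mean_without_zero = t2["age"].mean()      # (20 + 40) / 2
print(mean_with_zero, mean_without_zero)  # 20.0 30.0
```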

t3=pd.DataFrame(np.arange(12).reshape(3,4),index=list("ABC"),columns=list("WXYZ"))
print(t3)
print(t3.loc["A","Y"])
print(type(t3.loc["A","Y"]))
print(t3.loc["B",:])
print(type(t3.loc["B",:]))
print(t3.loc[:,"Z"])
print(type(t3.loc[:,"Z"]))
print(t3.loc[["A","B"],["W","Z"]])
print(t3.loc["A":"C",["W","Z"]])   # a label slice includes the end label "C"
print("*"*50)
print(t3)
print(t3.iloc[1])      # select one row
print(t3.iloc[:,2])    # select one column
print(t3.iloc[[0,2],[2,1]])
t3.iloc[1:2,0:2]=200
print(t3)
t3.loc["A":"C","Y"]=np.nan
print(t3)

# Handling missing data
# Check for nan
print("*"*50)
print(pd.isnull(t3))
print(pd.notnull(t3))
# Option 1: drop the rows or columns, dropna(axis=0,how="any" or "all",inplace=False)
print(t3.dropna(axis=1,how="any"))
print(t3.dropna(axis=0,how="any"))
# Option 2: fill in values, t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)
print(d2)
print(d2.fillna(d2.mean(numeric_only=True)))   # numeric_only skips the string column "name"
print("*"*50)
d2["age"]=d2["age"].fillna(d2["age"].mean())
print(d2)
d2["age"][2]=np.nan   # chained assignment, the source of the SettingWithCopyWarning in the output; d2.loc[2,"age"]=np.nan avoids it
print(d2)
# Handling zeros: t[t==0]=np.nan. Not every 0 needs this treatment. In a mean and similar statistics, nan is left out but 0 is included

Output:

   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
2
<class 'numpy.int32'>
W    4
X    5
Y    6
Z    7
Name: B, dtype: int32
<class 'pandas.core.series.Series'>
A     3
B     7
C    11
Name: Z, dtype: int32
<class 'pandas.core.series.Series'>
   W  Z
A  0  3
B  4  7
   W   Z
A  0   3
B  4   7
C  8  11
**************************************************
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
W    4
X    5
Y    6
Z    7
Name: B, dtype: int32
A     2
B     6
C    10
Name: Y, dtype: int32
    Y  X
A   2  1
C  10  9
     W    X   Y   Z
A    0    1   2   3
B  200  200   6   7
C    8    9  10  11
     W    X   Y   Z
A    0    1 NaN   3
B  200  200 NaN   7
C    8    9 NaN  11
**************************************************
       W      X     Y      Z
A  False  False  True  False
B  False  False  True  False
C  False  False  True  False
      W     X      Y     Z
A  True  True  False  True
B  True  True  False  True
C  True  True  False  True
     W    X   Z
A    0    1   3
B  200  200   7
C    8    9  11
Empty DataFrame
Columns: [W, X, Y, Z]
Index: []
       name   age      tel
0  xiaoming  23.0  10086.0
1  xiaogang  32.0      NaN
2  xiaohong   NaN  10010.0
       name   age      tel
0  xiaoming  23.0  10086.0
1  xiaogang  32.0  10048.0
2  xiaohong  27.5  10010.0
**************************************************
       name   age      tel
0  xiaoming  23.0  10086.0
1  xiaogang  32.0      NaN
2  xiaohong  27.5  10010.0
       name   age      tel
0  xiaoming  23.0  10086.0
1  xiaogang  32.0      NaN
2  xiaohong   NaN  10010.0
C:/Users/Lenovo/PycharmProjects/Code/day04/test pandas_DataFrame.py:73: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d2["age"][2]=np.nan

Process finished with exit code 0

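Common statistical methods

Suppose we have data on the 1000 most popular movies from 2006 to 2016 and want to know things like the average rating and the number of directors. How do we get them?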

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

# Do not limit the number of rows displayed
#pd.set_option('display.max_rows',None)
# (the last argument can instead cap the number of rows displayed)
# Do not limit the number of columns displayed
#pd.set_option('display.max_columns',None)
# (the last argument can instead cap the number of columns displayed)
# Show up to 80 characters per value; the default is 50
#pd.set_option('display.max_colwidth',80)

# File path
file_path="IMDB-Movie-Data.csv"
# Read the CSV file at that path
df=pd.read_csv(file_path)
#print(df)
#print(df.info())
#print(df.head(1))

# Average rating
#print(df["Rating"].mean())

# Number of directors
print(df.head())
#print(df["Director"])
# set() builds an unordered collection of unique elements, so it removes duplicates; it also supports intersection, difference, and union
print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))

# Number of actors
#print(df["Actors"])
# .str.split(", ") is applied to every element of the Series; ", " is the separator at which each string is cut
temp_actors_list=df["Actors"].str.split(", ").tolist()
#print(temp_actors_list)
print(len(temp_actors_list))
actors_list=[i for j in temp_actors_list for i in j]
#print(actors_list)
print(len(actors_list))
actors_num=len(set(actors_list))
print(actors_num)

# [i for j in temp_list for i in j] is a double loop: a flat list needs one for loop,
# a list of lists needs two (as with a multiplication table or a matrix). Unrolled:
#for j in temp_list:
#    for i in j:
#        print(i)
t1=["1",2,3,4,5,"6",7,{8},[9,10]]
t=[i for i in t1]
print(t)
t2=[[1,2,3],[4,5,6],["7","8",{9}]]
t=[i for j in t2 for i in j]
print(t)

# Small test
a=pd.Series(["aaa","bbb","ccc","123","890","aaa","bbb","i, love, you","890"])
print(a)
print("*"*50)
b=a.str.split(",")
print(b)
c=a.str.split(",").tolist()
print(c)
d=[i for j in c for i in j]
print(d)
print("*"*50)
e=set(d)
print(e)
print(len(e))

Output:

C:\ANACONDA\python.exe C:/Users/Lenovo/PycharmProjects/Code/day04/page122.py
   Rank                    Title  ... Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  ...             333.13      76.0
1     2               Prometheus  ...             126.46      65.0
2     3                    Split  ...             138.12      62.0
3     4                     Sing  ...             270.32      59.0
4     5            Suicide Squad  ...             325.02      40.0

[5 rows x 12 columns]
644
644
1000
3650
2015
['1', 2, 3, 4, 5, '6', 7, {8}, [9, 10]]
[1, 2, 3, 4, 5, 6, '7', '8', {9}]
0             aaa
1             bbb
2             ccc
3             123
4             890
5             aaa
6             bbb
7    i, love, you
8             890
dtype: object
**************************************************
0               [aaa]
1               [bbb]
2               [ccc]
3               [123]
4               [890]
5               [aaa]
6               [bbb]
7    [i,  love,  you]
8               [890]
dtype: object
[['aaa'], ['bbb'], ['ccc'], ['123'], ['890'], ['aaa'], ['bbb'], ['i', ' love', ' you'], ['890']]
['aaa', 'bbb', 'ccc', '123', '890', 'aaa', 'bbb', 'i', ' love', ' you', '890']
**************************************************
{'ccc', 'aaa', 'i', ' love', '123', ' you', 'bbb', '890'}
8

Process finished with exit code 0
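The counting steps above can also be sketched with pandas methods alone, on a small made-up table rather than IMDB-Movie-Data.csv (the third row and its rating are invented for illustration). nunique counts distinct values, and explode, used here in place of the list comprehension, gives each element of a list its own row.

```python
import pandas as pd

movies = pd.DataFrame({
    "Director": ["James Gunn", "Ridley Scott", "James Gunn"],
    "Actors": ["Chris Pratt, Vin Diesel",
               "Noomi Rapace, Michael Fassbender",
               "Chris Pratt, Zoe Saldana"],
    "Rating": [8.1, 7.0, 7.6],
})

director_count = movies["Director"].nunique()    # distinct directors

# Split each cell into a list, then put every list element on its own row.
actors = movies["Actors"].str.split(", ").explode()
actor_count = actors.nunique()                   # distinct actors

print(director_count)                       # 2
print(len(actors))                          # 6
print(actor_count)                          # 5
print(round(movies["Rating"].mean(), 2))    # 7.57
```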

Hands-on

For this movie data, if we want to see how rating and runtime are distributed, how should we present the data?

About runtime

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
from matplotlib import pyplot as plt

# Do not limit the number of rows displayed
pd.set_option('display.max_rows',None)
# (the last argument can instead cap the number of rows displayed)
# Do not limit the number of columns displayed
pd.set_option('display.max_columns',None)
# (the last argument can instead cap the number of columns displayed)
# Show up to 80 characters per value; the default is 50
pd.set_option('display.max_colwidth',80)

file_path="../day04/IMDB-Movie-Data.csv"
df=pd.read_csv(file_path)
#print(df.head(3))
print(df.info())

# Distribution of rating and runtime
# Chart choice: a histogram, the usual chart for continuous data
# Prepare the data
runtime_data=df["Runtime (Minutes)"].values   # values of the "Runtime (Minutes)" column (an ndarray)
max_runtime=runtime_data.max()
print(max_runtime)
min_runtime=runtime_data.min()
print(min_runtime)

# Number of bins, with a bin width of 5 minutes
num_bin=(max_runtime-min_runtime)//5

# Draw the histogram
# Set the figure size
plt.figure(figsize=(16,8),dpi=80)
plt.hist(runtime_data,num_bin)

# Set the x-axis ticks
plt.xticks(range(min_runtime,max_runtime+5,5))
plt.grid(True,linestyle="-",alpha=0.5)
plt.show()

Output:

C:\ANACONDA\python.exe C:/Users/Lenovo/PycharmProjects/Code/day05/page124.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Rank                1000 non-null   int64
 1   Title               1000 non-null   object
 2   Genre               1000 non-null   object
 3   Description         1000 non-null   object
 4   Director            1000 non-null   object
 5   Actors              1000 non-null   object
 6   Year                1000 non-null   int64
 7   Runtime (Minutes)   1000 non-null   int64
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
191
66

Process finished with exit code 0

About Rating

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
from matplotlib import pyplot as plt

# Do not limit the number of rows displayed
pd.set_option('display.max_rows',None)
# (the last argument can instead cap the number of rows displayed)
# Do not limit the number of columns displayed
pd.set_option('display.max_columns',None)
# (the last argument can instead cap the number of columns displayed)
# Show up to 80 characters per value; the default is 50
pd.set_option('display.max_colwidth',80)

file_path="../day04/IMDB-Movie-Data.csv"
df=pd.read_csv(file_path)
print(df.head(3))
print(df.info())

# Distribution of rating and runtime
# Chart choice: a histogram, the usual chart for continuous data
# Prepare the data
Rating_data=df["Rating"].values   # values of the Rating column (an ndarray)
#print(Rating_data)
max_Rating=Rating_data.max()
print(max_Rating)
min_Rating=Rating_data.min()
print(min_Rating)
print(max_Rating-min_Rating)

# Bin edges
# Use unequal bin widths. hist treats a bin as left-closed and right-open, e.g. [1.9,3.5),
# so a rating of 9.0 needs a bin that runs from 9.0 to 9.5
num_bin_list=[1.9,3.5]
i=3.5
while i<=max_Rating:
    i+=0.5
    num_bin_list.append(i)
print(num_bin_list)

# Draw the histogram
# Set the figure size
plt.figure(figsize=(16,8),dpi=80)
plt.hist(Rating_data,num_bin_list)

# Set the x-axis ticks
plt.xticks(num_bin_list)
plt.grid(True,linestyle="-",alpha=0.5)
plt.show()

Output:

C:\ANACONDA\python.exe C:/Users/Lenovo/PycharmProjects/Code/day05/page124_2.py
   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi
1     2               Prometheus  Adventure,Mystery,Sci-Fi
2     3                    Split           Horror,Thriller

                                                                       Description  \
0  A group of intergalactic criminals are forced to work together to stop a fan...
1  Following clues to the origin of mankind, a team finds a structure on a dist...
2  Three girls are kidnapped by a man with a diagnosed 23 distinct personalitie...

             Director  \
0          James Gunn
1        Ridley Scott
2  M. Night Shyamalan

                                                                    Actors  \
0                     Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana
1  Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron
2         James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula

   Year  Runtime (Minutes)  Rating   Votes  Revenue (Millions)  Metascore
0  2014                121     8.1  757074              333.13       76.0
1  2012                124     7.0  485820              126.46       65.0
2  2016                117     7.3  157606              138.12       62.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Rank                1000 non-null   int64
 1   Title               1000 non-null   object
 2   Genre               1000 non-null   object
 3   Description         1000 non-null   object
 4   Director            1000 non-null   object
 5   Actors              1000 non-null   object
 6   Year                1000 non-null   int64
 7   Runtime (Minutes)   1000 non-null   int64
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None
9.0
1.9
7.1
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]

Process finished with exit code 0
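The effect of these unequal bin edges can be checked without drawing anything. The sketch below uses np.histogram, which follows the same bin convention as plt.hist: every bin is left-closed and right-open, except the last one, which also includes its right edge. The seven ratings are a made-up sample, not the movie data.

```python
import numpy as np

ratings = np.array([1.9, 3.5, 6.4, 7.0, 7.3, 8.1, 9.0])   # made-up sample

# One wide first bin [1.9, 3.5), then edges every 0.5 from 3.5 up to 9.5.
edges = [1.9] + [3.5 + 0.5 * k for k in range(13)]
counts, _ = np.histogram(ratings, bins=edges)

print(edges[:3], edges[-1])   # [1.9, 3.5, 4.0] 9.5
print(counts.tolist())        # [1, 1, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1]
```

The rating 3.5 falls into the second bin rather than the first, and 9.0 lands in the final bin from 9.0 to 9.5, which is why the edge list has to extend past the maximum rating.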

Day 4 summary
