[CookBook pandas] Study Notes, Chapter 5: Exploratory Data Analysis
dive more into …: to discuss in more depth
Exploratory data analysis: the process of sifting through the data and trying to make sense of the individual columns and the relationships between them.
literally: virtually, almost
what is ‘parsing dates’……
divine more about the data
object types may be strings or categorical data, but they could also be numeric-like values that need to be nudged a little so that they are numeric.
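As a minimal sketch of that "nudging" (the values below are made up for illustration and are not taken from vehicles.csv), `pd.to_numeric` with `errors="coerce"` converts number-like strings and turns anything it cannot parse into NaN:

```python
import pandas as pd

# Hypothetical object column holding number-like strings plus one bad value.
raw = pd.Series(["12", "7.5", "n/a", "30"], dtype="object")

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
nums = pd.to_numeric(raw, errors="coerce")

print(nums.dtype)         # float64
print(nums.isna().sum())  # 1
```

Counting the NaN values afterwards is a quick way to see how many entries were not really numeric.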
In[2]: import pandas as pd
In[3]: import numpy as np
In[4]: fueleco=pd.read_csv("vehicles.csv",nrows=3)
In[5]: fueleco.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   barrels08        3 non-null      float64
 1   barrelsA08       3 non-null      float64
 2   charge120        3 non-null      float64
 3   charge240        3 non-null      float64
 4   city08           3 non-null      int64
 5   city08U          3 non-null      float64
 6   cityA08          3 non-null      int64
 7   cityA08U         3 non-null      float64
 8   cityCD           3 non-null      float64
 9   cityE            3 non-null      float64
 10  cityUF           3 non-null      float64
 11  co2              3 non-null      int64
 12  co2A             3 non-null      int64
 13  co2TailpipeAGpm  3 non-null      float64
 14  co2TailpipeGpm   3 non-null      float64
 15  comb08           3 non-null      int64
 16  comb08U          3 non-null      float64
 17  combA08          3 non-null      int64
 18  combA08U         3 non-null      float64
 19  combE            3 non-null      float64
 20  combinedCD       3 non-null      float64
 21  combinedUF       3 non-null      float64
 22  cylinders        3 non-null      int64
 23  displ            3 non-null      float64
 24  drive            3 non-null      object
 25  engId            3 non-null      int64
 26  eng_dscr         3 non-null      object
 27  feScore          3 non-null      int64
 28  fuelCost08       3 non-null      int64
 29  fuelCostA08      3 non-null      int64
 30  fuelType         3 non-null      object
 31  fuelType1        3 non-null      object
 32  ghgScore         3 non-null      int64
 33  ghgScoreA        3 non-null      int64
 34  highway08        3 non-null      int64
 35  highway08U       3 non-null      float64
 36  highwayA08       3 non-null      int64
 37  highwayA08U      3 non-null      float64
 38  highwayCD        3 non-null      float64
 39  highwayE         3 non-null      float64
 40  highwayUF        3 non-null      float64
 41  hlv              3 non-null      int64
 42  hpv              3 non-null      int64
 43  id               3 non-null      int64
 44  lv2              3 non-null      int64
 45  lv4              3 non-null      int64
 46  make             3 non-null      object
 47  model            3 non-null      object
 48  mpgData          3 non-null      object
 49  phevBlended      3 non-null      bool
 50  pv2              3 non-null      int64
 51  pv4              3 non-null      int64
 52  range            3 non-null      int64
 53  rangeCity        3 non-null      float64
 54  rangeCityA       3 non-null      float64
 55  rangeHwy         3 non-null      float64
 56  rangeHwyA        3 non-null      float64
 57  trany            3 non-null      object
 58  UCity            3 non-null      float64
 59  UCityA           3 non-null      float64
 60  UHighway         3 non-null      float64
 61  UHighwayA        3 non-null      float64
 62  VClass           3 non-null      object
 63  year             3 non-null      int64
 64  youSaveSpend     3 non-null      int64
 65  guzzler          1 non-null      object
 66  trans_dscr       1 non-null      object
 67  tCharger         0 non-null      float64
 68  sCharger         0 non-null      float64
 69  atvType          0 non-null      float64
 70  fuelType2        0 non-null      float64
 71  rangeA           0 non-null      float64
 72  evMotor          0 non-null      float64
 73  mfrCode          0 non-null      float64
 74  c240Dscr         0 non-null      float64
 75  charge240b       3 non-null      float64
 76  c240bDscr        0 non-null      float64
 77  createdOn        3 non-null      object
 78  modifiedOn       3 non-null      object
 79  startStop        0 non-null      float64
 80  phevCity         3 non-null      int64
 81  phevHwy          3 non-null      int64
 82  phevComb         3 non-null      int64
dtypes: bool(1), float64(41), int64(28), object(13)
memory usage: 2.0+ KB
In[6]: fueleco=pd.read_csv("vehicles.csv")
D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py:3155: DtypeWarning: Columns (70,71,72,73,74,76,79) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In[7]: fueleco=pd.read_csv("vehicles.csv",usecols=list(range(0,70,1)))
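Restricting `usecols` sidesteps the warning by not reading the mixed-type columns at all. The warning text itself names two other options, an explicit `dtype` or `low_memory=False`. A small sketch of both, using an in-memory CSV with made-up column names in place of vehicles.csv:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for vehicles.csv; 'code' mixes numbers and text.
csv_text = "id,code\n1,100\n2,ABC\n3,300\n"

# Option 1: state the dtype of the ambiguous column up front.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# Option 2: let pandas read the whole file before inferring dtypes.
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed.dtypes)
print(df_full.dtypes)
```

Which option is better depends on whether the mixed-type columns are needed later. These notes drop them instead.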
In[8]: fueleco.mean()
Out[8]:
barrels08 17.442712
barrelsA08 0.219276
charge120 0.000000
charge240 0.029630
city08 18.077799
city08U 5.040648
cityA08 0.569883
cityA08U 0.416097
cityCD 0.000560
cityE 0.225181
cityUF 0.000975
co2 72.538989
co2A 5.543950
co2TailpipeAGpm 17.826864
co2TailpipeGpm 470.704841
comb08 20.323828
comb08U 5.652724
combA08 0.631160
combA08U 0.453725
combE 0.230912
combinedCD 0.000459
combinedUF 0.000959
cylinders 5.729105
displ 3.309829
engId 8582.377382
feScore 0.122580
fuelCost08 2242.470781
fuelCostA08 91.335260
ghgScore 0.120866
ghgScoreA -0.923889
highway08 24.208588
highway08U 6.712736
highwayA08 0.736452
highwayA08U 0.523423
highwayCD 0.000343
highwayE 0.238526
highwayUF 0.000938
hlv 2.029539
hpv 10.411243
id 19662.541188
lv2 1.834812
lv4 6.155930
phevBlended 0.001458
pv2 13.649574
pv4 33.883711
range 0.500243
rangeCity 0.458375
rangeCityA 0.050978
rangeHwy 0.450392
rangeHwyA 0.046958
UCity 22.789421
UCityA 0.723139
UHighway 33.884375
UHighwayA 1.009562
year 2000.635406
youSaveSpend -3459.572645
dtype: float64
In[9]: fueleco.describe(include='object')
Out[9]:
                    drive eng_dscr fuelType  ... tCharger sCharger atvType
count               37912    23431    39101  ...     5816      738    3204
unique                  7      545       14  ...        1        1       8
top     Front-Wheel Drive    (FFS)  Regular  ...        T        S     FFV
freq                13653     8827    25620  ...     5816      738    1383
[4 rows x 14 columns]
In[10]: fueleco.make.value_counts()
Out[10]:
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
...
Shelby 1
Grumman Allied Industries 1
Qvale 1
Volga Associated Automobile 1
Goldacre 1
Name: make, Length: 134, dtype: int64
In[11]: fueleco.make.nunique()
Out[11]: 134
In[12]: fueleco.select_dtypes("int64")
Out[12]: city08 cityA08 co2 co2A comb08 ... pv2 pv4 range year youSaveSpend
0 19 0 -1 -1 21 ... 0 0 0 1985 -1750
1 9 0 -1 -1 11 ... 0 0 0 1985 -10500
2 23 0 -1 -1 27 ... 0 0 0 1985 250
3 10 0 -1 -1 11 ... 0 0 0 1985 -10500
4 17 0 -1 -1 19 ... 0 90 0 1993 -4750
... ... ... ... ... ... ... ... ... ... ...
39096 19 0 -1 -1 22 ... 0 90 0 1993 -1500
39097 20 0 -1 -1 23 ... 0 90 0 1993 -1000
39098 18 0 -1 -1 21 ... 0 90 0 1993 -1750
39099 18 0 -1 -1 21 ... 0 90 0 1993 -1750
39100 16 0 -1 -1 18 ... 0 90 0 1993 -5500
[39101 rows x 24 columns]
In[13]: fueleco.select_dtypes("int64").describe().T
Out[13]: count mean std ... 50% 75% max
city08 39101.0 18.077799 6.970672 ... 17.0 20.0 150.0
cityA08 39101.0 0.569883 4.297124 ... 0.0 0.0 145.0
co2 39101.0 72.538989 163.252019 ... -1.0 -1.0 847.0
co2A 39101.0 5.543950 55.956932 ... -1.0 -1.0 713.0
comb08 39101.0 20.323828 6.882807 ... 20.0 23.0 136.0
combA08 39101.0 0.631160 4.395797 ... 0.0 0.0 133.0
engId 39101.0 8582.377382 17606.675590 ... 202.0 4401.0 69102.0
feScore 39101.0 0.122580 2.516348 ... -1.0 -1.0 10.0
fuelCost08 39101.0 2242.470781 601.273869 ... 2250.0 2500.0 6850.0
fuelCostA08 39101.0 91.335260 479.485802 ... 0.0 0.0 3850.0
ghgScore 39101.0 0.120866 2.512612 ... -1.0 -1.0 10.0
ghgScoreA 39101.0 -0.923889 0.651017 ... -1.0 -1.0 8.0
highway08 39101.0 24.208588 7.128070 ... 24.0 27.0 122.0
highwayA08 39101.0 0.736452 4.694207 ... 0.0 0.0 121.0
hlv 39101.0 2.029539 5.959735 ... 0.0 0.0 49.0
hpv 39101.0 10.411243 28.167271 ... 0.0 0.0 195.0
id 39101.0 19662.541188 11413.329199 ... 19552.0 29555.0 39483.0
lv2 39101.0 1.834812 4.407887 ... 0.0 0.0 41.0
lv4 39101.0 6.155930 9.698101 ... 0.0 13.0 55.0
pv2 39101.0 13.649574 31.214466 ... 0.0 0.0 194.0
pv4 39101.0 33.883711 45.991687 ... 0.0 91.0 192.0
range 39101.0 0.500243 9.742080 ... 0.0 0.0 335.0
year 39101.0 2000.635406 10.690422 ... 2001.0 2010.0 2018.0
youSaveSpend 39101.0 -3459.572645 3010.284617 ... -3500.0 -1500.0 5250.0
[24 rows x 8 columns]
In[14]: # iinfo function in numpy will show the limit for integer types
In[15]: np.iinfo(np.int8)
Out[15]: iinfo(min=-128, max=127, dtype=int8)
In[16]: np.iinfo(int16)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-33fbc0c72155>", line 1, in <module>
    np.iinfo(int16)
NameError: name 'int16' is not defined
In[17]: np.iinfo(np.int16)
Out[17]: iinfo(min=-32768, max=32767, dtype=int16)
In[18]: # 'city08' and 'comb08' never go above 150, so they fit in int16
In[19]: fueleco[['city08','comb08']].astype('int16').info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
In[20]: fueleco['city08','comb08'].info(memory_usage='deep')
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('city08', 'comb08')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-e5f2d55949d1>", line 1, in <module>
    fueleco['city08','comb08'].info(memory_usage='deep')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('city08', 'comb08')
In[21]: fueleco[['city08','comb08']].info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int64
 1   comb08  39101 non-null  int64
dtypes: int64(2)
memory usage: 611.1 KB
In[22]: # so just modify the type of 'city08' and 'comb08' into 'int16'
In[23]: fueleco[['city08','comb08']].assign()
Out[23]: city08 comb08
0 19 21
1 9 11
2 23 27
3 10 11
4 17 19
... ... ...
39096 19 22
39097 20 23
39098 18 21
39099 18 21
39100 16 18
[39101 rows x 2 columns]
In[24]: (fueleco[['city08','comb08']]
    ...:  .assin(city08=fueleco.city08.astype(np.int16),
    ...:         comb08=fueleco.comb08.astype(np.in16)
    ...:  )
    ...:  .info(memory_usage='deep')
    ...: )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-15d636284b16>", line 1, in <module>
    (fueleco[['city08','comb08']]
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'assin'
In[25]: (fueleco[['city08','comb08']]
    ...:  .assign(city08=fueleco.city08.astype(np.int16),
    ...:          comb08=fueleco.comb08.astype(np.in16)
    ...:  )
    ...:  .info(memory_usage='deep')
    ...: )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-fccec6400831>", line 3, in <module>
    comb08=fueleco.comb08.astype(np.in16)
  File "D:\PyCharm2020\python2020\lib\site-packages\numpy\__init__.py", line 214, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'in16'
In[26]: (fueleco[['city08','comb08']]
    ...:  .assign(city08=fueleco.city08.astype('int16'),
    ...:          comb08=fueleco.comb08.astype('in16')
    ...:  )
    ...:  .info(memory_usage='deep')
    ...: )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-fe1ee5143ca8>", line 3, in <module>
    comb08=fueleco.comb08.astype('in16')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\blocks.py", line 626, in astype
    dtype = pandas_dtype(dtype)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\dtypes\common.py", line 1799, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'in16' not understood
In[27]: (fueleco[['city08','comb08']]
    ...:  .assign(city08=fueleco.city08.astype(np.int16),
    ...:          comb08=fueleco.comb08.astype(np.int16)
    ...:  )
    ...:  .info(memory_usage='deep')
    ...: )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
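The same downcast can be guarded by a range check. The sketch below uses a small made-up frame rather than the real data, and `fits` is a hypothetical helper name. It compares the column range with `np.iinfo` before casting, because a cast to a type that is too small may wrap around silently. It also shows `pd.to_numeric(..., downcast="integer")`, which picks the smallest signed integer type on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the city08/comb08 columns (values stay at or below 150).
df = pd.DataFrame({"city08": [19, 9, 23, 150], "comb08": [21, 11, 27, 136]})

def fits(series, dtype):
    """Return True if every value of `series` fits in the integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

# Check the range first, then cast both columns in one call.
if fits(df.city08, np.int16) and fits(df.comb08, np.int16):
    small = df.astype({"city08": "int16", "comb08": "int16"})

# pd.to_numeric can also choose the smallest signed integer type automatically.
auto = pd.to_numeric(df.city08, downcast="integer")

print(small.dtypes)
print(auto.dtype)
```

Here `int8` would not be enough, since its maximum is 127 and the column reaches 150.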
In[28]: fueleco['make','model'].nunique()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('make', 'model')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-bd185d23b85a>", line 1, in <module>
    fueleco['make','model'].nunique()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('make', 'model')
In[29]: fueleco[['make','model']].nunique()
Out[29]:
make 134
model 3816
dtype: int64
In[30]: # 'make' has a low cardinality, so convert it to 'category' to save memory
In[33]: (
    ...: fueleco[['make']]
    ...: .assign(make=fueleco.make.astype('category')
    ...: ).info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    39101 non-null  category
dtypes: category(1)
memory usage: 89.5 KB
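To see why a low-cardinality column benefits from `category`, here is a small sketch with made-up data (three distinct labels repeated many times, not the real `make` column) that compares the deep memory usage of the two representations:

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct makes repeated many times.
make = pd.Series(["Chevrolet", "Ford", "Dodge"] * 10_000, dtype="object")

as_object = make.memory_usage(deep=True)
as_category = make.astype("category").memory_usage(deep=True)

# The category version stores each distinct string once plus small integer codes.
print(as_object, as_category)
```

The saving shrinks as the number of distinct values grows, which fits the much larger figure reported for `model` (3816 distinct values) compared with `make` (134).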
In[34]: (fueleco[['model']]
    ...:  .assign(model=fueleco.model.astype('category'))
    ...:  .info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   model   39101 non-null  category
dtypes: category(1)
memory usage: 465.7 KB
In[36]: fueleco.make.value_counts()
Out[36]:
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
...
Shelby 1
Grumman Allied Industries 1
Qvale 1
Volga Associated Automobile 1
Goldacre 1
Name: make, Length: 134, dtype: int64
In[37]: # the summary above has too many values, so keep the top 6 and collapse the remaining values into one label
In[38]: top_n=fueleco.make.value_counts().index[:6]
In[39]: fueleco.value_counts()
Out[39]: Series([], dtype: int64)
In[40]: fueleco.make.value_counts()
Out[40]:
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
...
Shelby 1
Grumman Allied Industries 1
Qvale 1
Volga Associated Automobile 1
Goldacre 1
Name: make, Length: 134, dtype: int64
In[42]: top_n
Out[42]: Index(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Toyota', 'BMW'], dtype='object')
In[44]: (fueleco.assign(
    ...:     make=fueleco.make.where(fueleco.make.isin(top_n),"other"))
    ...:  .make.value_counts())
Out[44]:
other 23211
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
BMW 1807
Name: make, dtype: int64
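The collapse step relies on `.where`, which keeps a value where the condition is True and substitutes the second argument elsewhere. A tiny sketch with made-up labels makes the mechanics easier to follow:

```python
import pandas as pd

# Hypothetical 'make' column; the real one has 134 distinct values.
make = pd.Series(["Ford", "Ford", "Ford", "BMW", "BMW", "Qvale", "Shelby"])

# Keep the 2 most frequent labels and replace everything else with "Other".
top_n = make.value_counts().index[:2]
collapsed = make.where(make.isin(top_n), "Other")

print(collapsed.value_counts())
```

In this toy data "Qvale" and "Shelby" each occur once, so both end up in "Other".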
In[45]: # determine the number and percent of missing values
In[46]: fueleco.drive.isna().sum()
Out[46]: 1189
In[47]: fueleco.isna().mean()
Out[47]:
barrels08 0.000000
barrelsA08 0.000000
charge120 0.000000
charge240 0.000000
city08 0.000000
...
guzzler 0.940283
trans_dscr 0.615176
tCharger 0.851257
sCharger 0.981126
atvType 0.918058
Length: 70, dtype: float64
In[48]: fueleco.drive.isna().mean()
Out[48]: 0.030408429451932176
In[49]: fueleco.drive.isna().mean()*100
Out[49]: 3.0408429451932175
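Counts and percentages of missing values can be gathered into one table. This is only a sketch on a made-up frame, and the names `n_missing` and `pct_missing` are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({"drive": ["FWD", None, "RWD", None],
                   "year": [1985, 1993, np.nan, 2001]})

# isna() gives booleans; their sum is the count and their mean is the fraction missing.
summary = pd.DataFrame({"n_missing": df.isna().sum(),
                        "pct_missing": df.isna().mean() * 100})

print(summary.sort_values("pct_missing", ascending=False))
```

Sorting by the percentage puts the sparsest columns first, which is handy with a frame as wide as this one.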
In[50]: # use .nunique method to determine cardinality
In[51]: fueleco.drive.nunique()
Out[51]: 7
In[52]: # pick out the columns with data types that are object
In[53]: fueleco.select_dtypes('object').columns
Out[53]:
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
       'mpgData', 'trany', 'VClass', 'guzzler', 'trans_dscr', 'tCharger',
       'sCharger', 'atvType'],
      dtype='object')
In[55]: import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
top_n=fueleco.make.value_counts().index[:6]
(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),'Other')).make.value_counts().plot.bar(ax=ax))
Out[58]: <AxesSubplot:>
fig.savefig("c5-catpan.png",dpi=300)
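.cut: cuts a numeric column into equal-width bins, or into bins whose edges we specify.
.qcut (quantile cut): cuts a numeric column into quantile-based bins, so each bin holds roughly the same number of rows.
With these methods we can treat numeric columns as categories by binning them.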
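A short sketch of the difference between the two, on made-up mileage values with one outlier (the bin edges and labels are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical mileage values with one large outlier.
mpg = pd.Series([10, 12, 14, 16, 18, 20, 22, 100])

# cut: 3 bins of equal width over the range of the data.
equal_width = pd.cut(mpg, bins=3)

# cut with explicit edges that we choose ourselves.
custom = pd.cut(mpg, bins=[0, 15, 25, 150], labels=["low", "mid", "high"])

# qcut: 4 bins that each hold about the same number of rows.
equal_count = pd.qcut(mpg, q=4)

print(equal_width.value_counts().sort_index())
print(custom.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins the outlier leaves the middle bin empty and crowds seven values into the first one, while `qcut` places two values in each bin.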
continuous data
import seaborn as sns
fig,ax=plt.subplots(figsize=(10,8))
sns.countplot(y='make',data=(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),"Other"))))
Out[66]: <AxesSubplot:xlabel='count', ylabel='make'>
fig.savefig("c5-catsns.png",dpi=300)
# rows where 'drive' is missing
fueleco[fueleco.drive.isna()]
Out[69]: barrels08 barrelsA08 charge120 ... tCharger sCharger atvType
7138 0.240000 0.0 0.0 ... NaN NaN EV
8144 0.312000 0.0 0.0 ... NaN NaN EV
8147 0.270000 0.0 0.0 ... NaN NaN EV
18215 15.695714 0.0 0.0 ... NaN NaN NaN
18216 14.982273 0.0 0.0 ... NaN NaN NaN
... ... ... ... ... ... ...
23023 0.240000 0.0 0.0 ... NaN NaN EV
23024 0.546000 0.0 0.0 ... NaN NaN EV
23026 0.426000 0.0 0.0 ... NaN NaN EV
23031 0.426000 0.0 0.0 ... NaN NaN EV
23034 0.204000 0.0 0.0 ... NaN NaN EV
[1189 rows x 70 columns]
# by default, .value_counts does not show missing values; pass dropna=False to include them
fueleco.drive.value_counts(dropna=False)
Out[71]:
Front-Wheel Drive 13653
Rear-Wheel Drive 13284
4-Wheel or All-Wheel Drive 6648
All-Wheel Drive 2401
4-Wheel Drive 1221
NaN 1189
2-Wheel Drive 507
Part-time 4-Wheel Drive 198
Name: drive, dtype: int64
fig,ax=plt.subplots(figsize=(10,8))
fueleco.drive.value_counts(dropna=False).plot.bar(ax=ax)
Out[73]: <AxesSubplot:>
fig.savefig('c5-aa',dpi=300)
",".join('abcd')
Out[76]: 'a,b,c,d'
fueleco.city08.sample(5,random_state=40)
Out[80]:
4643 16
1483 15
34149 21
563 14
2364 19
Name: city08, dtype: int64
fueleco.city08.sample(5,random_state=42)
Out[81]:
4217 11
1736 21
36029 16
37631 16
1668 17
Name: city08, dtype: int64
# use pandas to plot a histogram
import matplotlib.pyplot as plt
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax)
Out[85]: <AxesSubplot:>
fig.savefig('hist.png',dpi=300)
# the plot looks very skewed, and the long tail makes each bin wide
# so increase the number of bins in the histogram to see whether the wide bins are hiding any behavior
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax,bins=30)
Out[90]: <AxesSubplot:>
fig.savefig("hist-bins-30.png",dpi=300)
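The effect of the bin count can also be checked numerically. The sketch below uses a made-up skewed sample and `np.histogram`, chosen here only so the counts can be printed without drawing a figure:

```python
import numpy as np

# Hypothetical right-skewed sample: most values near 18, one far out.
values = np.array([14, 15, 16, 17, 18, 18, 19, 20, 21, 23, 30, 150])

# With 10 bins the outlier stretches each bin to 13.6 units wide,
# so most observations pile up in the first bin.
counts_10, edges_10 = np.histogram(values, bins=10)

# With 30 bins the bins are narrower and the low end shows more detail.
counts_30, edges_30 = np.histogram(values, bins=30)

print(counts_10)
print(counts_30)
```

With 10 bins, ten of the twelve values land in the first bin. With 30 bins the same values split across the first two bins.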
# use seaborn to create a distribution plot, which includes a histogram, a kernel density estimate (KDE), and a rug plot
sns.distplot(fueleco.city08,rug=True,ax=ax)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2056: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
Out[96]: <AxesSubplot:xlabel='city08', ylabel='Density'>
fig.savefig('rugplot.png',dpi=300)
[160(181/627)]
An introduction to the seaborn plotting functions boxplot, boxenplot, and violinplot
[162(183/627)]
Checking graphically whether the data follows a normal distribution:
fig.savefig('rugplot.png',dpi=300)
fig,ax=plt.subplots(nrows=3,figsize=(10,8))
sns.boxplot(fueleco.city08,ax=ax[0])
sns.violinplot(fueleco.city08,ax=ax[1])
sns.boxenplot(fueleco.city08,ax=ax[2])
fig.savefig('subplots_nroes3.png',dpi=300)
from scipy import stats
stats.kstest(fueleco.city08,cdf='norm')
Out[104]: KstestResult(statistic=0.9999999990134123, pvalue=0.0)
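One thing worth noting about this test: `cdf='norm'` refers to the standard normal (mean 0, standard deviation 1), so unscaled data centered around 18 is rejected almost by construction. The sketch below uses a synthetic normal sample, not the city08 column, to compare the raw test with a test on z-scores. Estimating the mean and standard deviation from the same sample makes the resulting p-value only approximate:

```python
import numpy as np
from scipy import stats

# Hypothetical normal sample with mean 18 and standard deviation 7.
rng = np.random.default_rng(0)
sample = rng.normal(loc=18, scale=7, size=1000)

# Raw data is compared against a standard normal (mean 0, std 1), so it fails.
raw = stats.kstest(sample, "norm")

# Standardize first (z-scores), then compare against the standard normal.
z = (sample - sample.mean()) / sample.std()
standardized = stats.kstest(z, "norm")

print(raw.statistic, standardized.statistic)
```

For the synthetic sample the raw statistic is close to 1 even though the data is normal, while the standardized statistic is small. Whether city08 itself passes after standardizing is a separate question that this sketch does not answer.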
fig,ax=plt.subplots(figsize=(10,8))
stats.probplot(fueleco.city08,plot=ax)
Out[106]:
((array([-4.1352692 , -3.92687024, -3.81314873, ..., 3.81314873,3.92687024, 4.1352692 ]),array([ 6, 6, 6, ..., 137, 138, 150], dtype=int64)),(5.385946629915974, 18.077798521776934, 0.772587941459713))
fig.savefig('proplor.png',dpi=300)
Comparing continuous values across categories
# make a mask for the brands we want
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"])
mask
Out[112]:
0 False
1 False
2 False
3 False
4 False
...
39096 False
39097 False
39098 False
39099 False
39100 False
Name: make, Length: 39101, dtype: bool
type(mask)
Out[113]: pandas.core.series.Series
fueleco[mask]
Out[114]: barrels08 barrelsA08 charge120 ... tCharger sCharger atvType
20 20.600625 0.0 0.0 ... NaN NaN NaN
21 20.600625 0.0 0.0 ... NaN NaN NaN
22 25.354615 0.0 0.0 ... NaN NaN NaN
56 15.695714 0.0 0.0 ... NaN NaN NaN
57 17.347895 0.0 0.0 ... NaN NaN NaN
... ... ... ... ... ... ...
39016 13.733750 0.0 0.0 ... NaN NaN NaN
39017 17.347895 0.0 0.0 ... NaN NaN NaN
39018 15.695714 0.0 0.0 ... NaN NaN NaN
39023 14.982273 0.0 0.0 ... NaN NaN NaN
39025 13.733750 0.0 0.0 ... NaN NaN NaN
[5986 rows x 70 columns]
fueleco[mask].groupby("make").city08.agg(["mean","std"])
Out[115]: mean std
make
BMW 17.817377 7.372907
Ford 16.853803 6.701029
Honda 24.372973 9.154064
Tesla 92.826087 5.538970
# the groupby operation above shows the mean and std of the city08 column for each make
# visualize the city08 values for each make with seaborn
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box')
g.ax.figure.savefig("c5-catbox.png",dpi=300)
# one drawback of a boxplot is that while it indicates the spread of the data, it does not reveal how many samples are in each make
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"]) # Boolean Series: each element tells whether the make in that row of fueleco is one of these four values
fueleco[mask].groupby("make").city08.count()
Out[123]:
make
BMW 1807
Ford 3208
Honda 925
Tesla 46
Name: city08, dtype: int64
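The sample size can be reported alongside the mean and standard deviation by adding `"count"` to the aggregation list. A sketch on made-up rows standing in for `fueleco[mask]`:

```python
import pandas as pd

# Hypothetical rows standing in for fueleco[mask].
df = pd.DataFrame({"make": ["BMW", "BMW", "Ford", "Ford", "Ford", "Tesla"],
                   "city08": [17, 19, 15, 16, 20, 92]})

# One pass gives the center, the spread, and the sample size per make.
stats_by_make = df.groupby("make").city08.agg(["mean", "std", "count"])

print(stats_by_make)
```

A group with a single row, like "Tesla" in this toy data, gets NaN for the standard deviation, which is another hint that the group is too small to summarize.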
# facet the grid by another feature
# break each of these new plots into its own graph by using the col parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018])
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018],col_wrap=2)
# col is the column to facet on; col_order sets the order of the panels
# embed the new dimension in the same plot by using the hue parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',hue='year',hue_order=[2012,2014,2016,2018])
Comparing two continuous columns
# if you have two columns with a high correlation to one another, you can often drop one of them as a redundant column
# covariance is meaningful for comparing the two columns when they are on the same scale
fueleco.city08.cov(fueleco.highway08)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-be0843446ebc>", line 1, in <module>
    fueleco.city08.cov(fueleco.highway08)
NameError: name 'fueleco' is not defined
import pandas as pd
import numpy as np
fueleco=pd.read_csv("vehicles.csv",usecols=list(range(1,70,1)))
fueleco.city08.cov(fueleco.highway08)
Out[9]: 46.33326023673624
fueleco.city08.cov(fueleco.comb08)
Out[10]: 47.419946678190776
fueleco.city08.cov(fueleco.cylinders)
Out[11]: -5.931560263764768
# Pearson correlation between the two numbers
fueleco.city08.corr(fueleco.highway08)
Out[13]: 0.932494506228495
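Covariance and Pearson correlation are tied together by the two standard deviations: the correlation is the covariance divided by their product. A sketch on made-up city/highway pairs:

```python
import pandas as pd

# Hypothetical city/highway mileage pairs.
city = pd.Series([10.0, 15.0, 20.0, 25.0, 40.0])
highway = pd.Series([14.0, 22.0, 27.0, 33.0, 45.0])

cov = city.cov(highway)
# Pearson correlation is the covariance divided by both standard deviations.
corr_by_hand = cov / (city.std() * highway.std())

# Rescaling one column changes the covariance but not the correlation.
cov_scaled = (city * 10).cov(highway)
corr_scaled = (city * 10).corr(highway)

print(cov, corr_by_hand, city.corr(highway))
print(cov_scaled, corr_scaled)
```

This is why the covariance values above are only comparable for columns on the same scale, while the correlation always lies between -1 and 1.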
# use pandas to scatter plot the relationship
import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax)
Out[17]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[18]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fig.savefig('scatterplot_alpha0.01.png',dpi=300)
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[21]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.1)
# Pearson correlation is intended to show the strength of a linear relationship
# if the continuous columns do not have a linear relationship, another option is to use Spearman correlation
# this number also varies from -1 to 1
# it measures whether the relationship is monotonic and doesn't presume that it is linear
# it uses the rank of each number rather than the number itself
# if you are not sure whether there is a linear relationship between your columns, this is a better metric to use
fueleco.city08.corr(fueleco.barrelsA08,method='spearman')
Out[31]: -0.08476703673460519
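Since Spearman correlation works on ranks, it can be reproduced by ranking both columns and taking the Pearson correlation of the ranks. A sketch on made-up data with a monotonic but curved relationship:

```python
import pandas as pd

# Hypothetical monotonic but non-linear relationship: y grows as x cubed.
x = pd.Series(range(1, 11), dtype="float64")
y = x ** 3

pearson = x.corr(y)                 # below 1 because the relation is curved
spearman = x.rank().corr(y.rank())  # Pearson on the ranks is Spearman

print(pearson, spearman)
```

The ranks of x and y are identical here, so the rank-based value is 1 even though the Pearson value is lower.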
# Pearson correlation tells us how strongly two values move together, on a scale from -1 to 1
# covariance lets us know how these values vary together
# a heatmap is a great way to look at correlations in aggregate
# scatter plots are another way to visualize the relationships between continuous variables
# set alpha parameter to a value less than or equal to 0.5, which makes the points transparent
# now , add more dimension to a scatter plot
# using the relplot function, we can color the dots by year and size them by the number of barrels the vehicle consumes
# in this case, we go from 2 dimensions to 4 dimensions
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8)
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8,col='make',col_order=['Ford',"Tesla"])
Comparing categorical values with categorical values
# continuous columns can be converted into categorical columns by binning the values
[179(200/627)] I did not really understand what this part is doing...
# if you use seaborn, you can add multiple dimensions by setting 'hue' or 'col'
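The book's own steps for this section are not reproduced in these notes. As a sketch of one common way to compare two categorical columns (an assumption about the general idea, not the book's code), a continuous column can be binned with `pd.cut` and then cross-tabulated against another categorical column with `pd.crosstab`. The data, bin edges, and labels below are made up:

```python
import pandas as pd

# Hypothetical rows; 'city08' is continuous, 'make' is categorical.
df = pd.DataFrame({"make": ["Ford", "Ford", "BMW", "BMW", "Tesla", "Tesla"],
                   "city08": [12, 25, 16, 22, 90, 95]})

# Bin the continuous column so that it becomes categorical.
mpg_band = pd.cut(df.city08, bins=[0, 20, 40, 200], labels=["low", "mid", "high"])

# Count how often each pair of categories occurs together.
table = pd.crosstab(df.make, mpg_band)

print(table)
```

Each cell of the table is the number of rows with that make and that mileage band, which is the categorical counterpart of a scatter plot.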
Using the pandas profiling library
[187/(208/627)]