dive more into …: to discuss in greater depth
exploratory data analysis: the process of sifting through the data and trying to make sense of the individual columns and the relationships between them.
literally: virtually, nearly

what is 'parsing dates'...

divine more about the data

object types may be strings or categorical data, but they could also be numeric-like values that need to be nudged a little so that they become numeric.
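A minimal sketch of that "nudging", assuming a made-up column of numbers stored as text (the values below are hypothetical, not taken from vehicles.csv). `pd.to_numeric` with `errors="coerce"` converts what it can and leaves NaN where it cannot:

```python
import pandas as pd

# Hypothetical column: numbers stored as text, plus one entry that is not a number.
raw = pd.Series(["12", "7.5", " 30 ", "n/a"], dtype="object")

# Strip stray whitespace, then convert; unparseable entries become NaN instead of raising.
clean = pd.to_numeric(raw.str.strip(), errors="coerce")

print(raw.dtype)
print(clean.dtype)
print(clean.isna().sum())  # how many values could not be converted
```

Counting the NaN values afterwards is a quick way to see how much of the column was not really numeric.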

In[2]: import pandas as pd
In[3]: import numpy as np
In[4]: fueleco=pd.read_csv("vehicles.csv",nrows=3)
In[5]: fueleco.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   barrels08        3 non-null      float64
 1   barrelsA08       3 non-null      float64
 2   charge120        3 non-null      float64
 3   charge240        3 non-null      float64
 4   city08           3 non-null      int64
 5   city08U          3 non-null      float64
 6   cityA08          3 non-null      int64
 7   cityA08U         3 non-null      float64
 8   cityCD           3 non-null      float64
 9   cityE            3 non-null      float64
 10  cityUF           3 non-null      float64
 11  co2              3 non-null      int64
 12  co2A             3 non-null      int64
 13  co2TailpipeAGpm  3 non-null      float64
 14  co2TailpipeGpm   3 non-null      float64
 15  comb08           3 non-null      int64
 16  comb08U          3 non-null      float64
 17  combA08          3 non-null      int64
 18  combA08U         3 non-null      float64
 19  combE            3 non-null      float64
 20  combinedCD       3 non-null      float64
 21  combinedUF       3 non-null      float64
 22  cylinders        3 non-null      int64
 23  displ            3 non-null      float64
 24  drive            3 non-null      object
 25  engId            3 non-null      int64
 26  eng_dscr         3 non-null      object
 27  feScore          3 non-null      int64
 28  fuelCost08       3 non-null      int64
 29  fuelCostA08      3 non-null      int64
 30  fuelType         3 non-null      object
 31  fuelType1        3 non-null      object
 32  ghgScore         3 non-null      int64
 33  ghgScoreA        3 non-null      int64
 34  highway08        3 non-null      int64
 35  highway08U       3 non-null      float64
 36  highwayA08       3 non-null      int64
 37  highwayA08U      3 non-null      float64
 38  highwayCD        3 non-null      float64
 39  highwayE         3 non-null      float64
 40  highwayUF        3 non-null      float64
 41  hlv              3 non-null      int64
 42  hpv              3 non-null      int64
 43  id               3 non-null      int64
 44  lv2              3 non-null      int64
 45  lv4              3 non-null      int64
 46  make             3 non-null      object
 47  model            3 non-null      object
 48  mpgData          3 non-null      object
 49  phevBlended      3 non-null      bool
 50  pv2              3 non-null      int64
 51  pv4              3 non-null      int64
 52  range            3 non-null      int64
 53  rangeCity        3 non-null      float64
 54  rangeCityA       3 non-null      float64
 55  rangeHwy         3 non-null      float64
 56  rangeHwyA        3 non-null      float64
 57  trany            3 non-null      object
 58  UCity            3 non-null      float64
 59  UCityA           3 non-null      float64
 60  UHighway         3 non-null      float64
 61  UHighwayA        3 non-null      float64
 62  VClass           3 non-null      object
 63  year             3 non-null      int64
 64  youSaveSpend     3 non-null      int64
 65  guzzler          1 non-null      object
 66  trans_dscr       1 non-null      object
 67  tCharger         0 non-null      float64
 68  sCharger         0 non-null      float64
 69  atvType          0 non-null      float64
 70  fuelType2        0 non-null      float64
 71  rangeA           0 non-null      float64
 72  evMotor          0 non-null      float64
 73  mfrCode          0 non-null      float64
 74  c240Dscr         0 non-null      float64
 75  charge240b       3 non-null      float64
 76  c240bDscr        0 non-null      float64
 77  createdOn        3 non-null      object
 78  modifiedOn       3 non-null      object
 79  startStop        0 non-null      float64
 80  phevCity         3 non-null      int64
 81  phevHwy          3 non-null      int64
 82  phevComb         3 non-null      int64
dtypes: bool(1), float64(41), int64(28), object(13)
memory usage: 2.0+ KB
In[6]: fueleco=pd.read_csv("vehicles.csv")
D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py:3155: DtypeWarning: Columns (70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In[7]: fueleco=pd.read_csv("vehicles.csv",usecols=list(range(0,70,1)))
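The DtypeWarning names two remedies: pass `dtype` on import, or set `low_memory=False`. The notes sidestep the problem by reading only the first 70 columns. A small sketch of the `dtype` route, using an in-memory CSV with a hypothetical `code` column that mixes numbers and text (this is not the real file):

```python
import io
import pandas as pd

# Hypothetical miniature CSV: "code" mixes numbers, text and a blank, like the warned columns.
csv_text = "id,code\n1,100\n2,ABC\n3,\n"

# Declaring the dtype up front means pandas does not have to guess it chunk by chunk.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

print(df.dtypes)
print(df["code"].tolist())
```

With the real file the same idea would mean listing the warned columns in the `dtype` mapping; which type suits each of them is something to check against the data, not something this sketch decides.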
In[8]: fueleco.mean()
Out[8]:
barrels08             17.442712
barrelsA08             0.219276
charge120              0.000000
charge240              0.029630
city08                18.077799
city08U                5.040648
cityA08                0.569883
cityA08U               0.416097
cityCD                 0.000560
cityE                  0.225181
cityUF                 0.000975
co2                   72.538989
co2A                   5.543950
co2TailpipeAGpm       17.826864
co2TailpipeGpm       470.704841
comb08                20.323828
comb08U                5.652724
combA08                0.631160
combA08U               0.453725
combE                  0.230912
combinedCD             0.000459
combinedUF             0.000959
cylinders              5.729105
displ                  3.309829
engId               8582.377382
feScore                0.122580
fuelCost08          2242.470781
fuelCostA08           91.335260
ghgScore               0.120866
ghgScoreA             -0.923889
highway08             24.208588
highway08U             6.712736
highwayA08             0.736452
highwayA08U            0.523423
highwayCD              0.000343
highwayE               0.238526
highwayUF              0.000938
hlv                    2.029539
hpv                   10.411243
id                 19662.541188
lv2                    1.834812
lv4                    6.155930
phevBlended            0.001458
pv2                   13.649574
pv4                   33.883711
range                  0.500243
rangeCity              0.458375
rangeCityA             0.050978
rangeHwy               0.450392
rangeHwyA              0.046958
UCity                 22.789421
UCityA                 0.723139
UHighway              33.884375
UHighwayA              1.009562
year                2000.635406
youSaveSpend       -3459.572645
dtype: float64
In[9]: fueleco.describe(include='object')
Out[9]:
                    drive eng_dscr fuelType  ... tCharger sCharger atvType
count               37912    23431    39101  ...     5816      738    3204
unique                  7      545       14  ...        1        1       8
top     Front-Wheel Drive    (FFS)  Regular  ...        T        S     FFV
freq                13653     8827    25620  ...     5816      738    1383
[4 rows x 14 columns]
In[10]: fueleco.make.value_counts()
Out[10]:
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ...
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64
In[11]: fueleco.make.nunique()
Out[11]: 134
In[12]: fueleco.select_dtypes("int64")
Out[12]:
       city08  cityA08  co2  co2A  comb08  ...  pv2  pv4  range  year  youSaveSpend
0          19        0   -1    -1      21  ...    0    0      0  1985         -1750
1           9        0   -1    -1      11  ...    0    0      0  1985        -10500
2          23        0   -1    -1      27  ...    0    0      0  1985           250
3          10        0   -1    -1      11  ...    0    0      0  1985        -10500
4          17        0   -1    -1      19  ...    0   90      0  1993         -4750
...       ...      ...  ...   ...     ...  ...  ...  ...    ...   ...           ...
39096      19        0   -1    -1      22  ...    0   90      0  1993         -1500
39097      20        0   -1    -1      23  ...    0   90      0  1993         -1000
39098      18        0   -1    -1      21  ...    0   90      0  1993         -1750
39099      18        0   -1    -1      21  ...    0   90      0  1993         -1750
39100      16        0   -1    -1      18  ...    0   90      0  1993         -5500
[39101 rows x 24 columns]
In[13]: fueleco.select_dtypes("int64").describe().T
Out[13]:
                count          mean           std  ...      50%      75%      max
city08        39101.0     18.077799      6.970672  ...     17.0     20.0    150.0
cityA08       39101.0      0.569883      4.297124  ...      0.0      0.0    145.0
co2           39101.0     72.538989    163.252019  ...     -1.0     -1.0    847.0
co2A          39101.0      5.543950     55.956932  ...     -1.0     -1.0    713.0
comb08        39101.0     20.323828      6.882807  ...     20.0     23.0    136.0
combA08       39101.0      0.631160      4.395797  ...      0.0      0.0    133.0
engId         39101.0   8582.377382  17606.675590  ...    202.0   4401.0  69102.0
feScore       39101.0      0.122580      2.516348  ...     -1.0     -1.0     10.0
fuelCost08    39101.0   2242.470781    601.273869  ...   2250.0   2500.0   6850.0
fuelCostA08   39101.0     91.335260    479.485802  ...      0.0      0.0   3850.0
ghgScore      39101.0      0.120866      2.512612  ...     -1.0     -1.0     10.0
ghgScoreA     39101.0     -0.923889      0.651017  ...     -1.0     -1.0      8.0
highway08     39101.0     24.208588      7.128070  ...     24.0     27.0    122.0
highwayA08    39101.0      0.736452      4.694207  ...      0.0      0.0    121.0
hlv           39101.0      2.029539      5.959735  ...      0.0      0.0     49.0
hpv           39101.0     10.411243     28.167271  ...      0.0      0.0    195.0
id            39101.0  19662.541188  11413.329199  ...  19552.0  29555.0  39483.0
lv2           39101.0      1.834812      4.407887  ...      0.0      0.0     41.0
lv4           39101.0      6.155930      9.698101  ...      0.0     13.0     55.0
pv2           39101.0     13.649574     31.214466  ...      0.0      0.0    194.0
pv4           39101.0     33.883711     45.991687  ...      0.0     91.0    192.0
range         39101.0      0.500243      9.742080  ...      0.0      0.0    335.0
year          39101.0   2000.635406     10.690422  ...   2001.0   2010.0   2018.0
youSaveSpend  39101.0  -3459.572645   3010.284617  ...  -3500.0  -1500.0   5250.0
[24 rows x 8 columns]
In[14]: # iinfo function in numpy will show the limit for integer types
In[15]: np.iinfo(np.int8)
Out[15]: iinfo(min=-128, max=127, dtype=int8)
In[16]: np.iinfo(int16)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-33fbc0c72155>", line 1, in <module>
    np.iinfo(int16)
NameError: name 'int16' is not defined
In[17]: np.iinfo(np.int16)
Out[17]: iinfo(min=-32768, max=32767, dtype=int16)
In[18]: # 'city08' and 'comb08' never go above 150, so they fit in int16
In[19]: fueleco[['city08','comb08']].astype('int16').info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
In[20]: fueleco['city08','comb08'].info(memory_usage='deep')
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('city08', 'comb08')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-e5f2d55949d1>", line 1, in <module>
    fueleco['city08','comb08'].info(memory_usage='deep')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('city08', 'comb08')
In[21]: fueleco[['city08','comb08']].info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int64
 1   comb08  39101 non-null  int64
dtypes: int64(2)
memory usage: 611.1 KB
In[22]: # so just modify the type of 'city08' and 'comb08' into 'int16'
In[23]: fueleco[['city08','comb08']].assign()
Out[23]:
       city08  comb08
0          19      21
1           9      11
2          23      27
3          10      11
4          17      19
...       ...     ...
39096      19      22
39097      20      23
39098      18      21
39099      18      21
39100      16      18
[39101 rows x 2 columns]
In[24]: (fueleco[['city08','comb08']]
   ...:     .assin(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.in16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-15d636284b16>", line 1, in <module>
    (fueleco[['city08','comb08']]
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'assin'
In[25]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.in16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-fccec6400831>", line 3, in <module>
    comb08=fueleco.comb08.astype(np.in16)
  File "D:\PyCharm2020\python2020\lib\site-packages\numpy\__init__.py", line 214, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'in16'
In[26]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype('int16'),
   ...:            comb08=fueleco.comb08.astype('in16')
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-fe1ee5143ca8>", line 3, in <module>
    comb08=fueleco.comb08.astype('in16')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\blocks.py", line 626, in astype
    dtype = pandas_dtype(dtype)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\dtypes\common.py", line 1799, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'in16' not understood
In[27]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.int16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
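The int64 to int16 conversion above is safe only because both columns stay inside the int16 range reported by `np.iinfo`. A small sketch of making that check explicit before casting, on a hypothetical four-row stand-in for the two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the mileage columns: small positive integers.
df = pd.DataFrame({"city08": [19, 9, 23, 150], "comb08": [21, 11, 27, 136]})

limits = np.iinfo(np.int16)
for col in df.columns:
    lo, hi = df[col].min(), df[col].max()
    # Only narrow the column when every value fits inside the int16 range.
    if limits.min <= lo and hi <= limits.max:
        df[col] = df[col].astype(np.int16)

print(df.dtypes)
print(df.memory_usage(deep=True))
```

Skipping the range check would not raise an error for out-of-range values in every pandas version, so checking first is the cautious choice.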
In[28]: fueleco['make','model'].nunique()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('make', 'model')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-bd185d23b85a>", line 1, in <module>
    fueleco['make','model'].nunique()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('make', 'model')
In[29]: fueleco[['make','model']].nunique()
Out[29]:
make      134
model    3816
dtype: int64
In[30]: # 'make' has low cardinality, so convert it to 'category' to save memory
In[33]: (
   ...:     fueleco[['make']]
   ...:     .assign(make=fueleco.make.astype('category')
   ...: ).info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    39101 non-null  category
dtypes: category(1)
memory usage: 89.5 KB
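To see why a low-cardinality column shrinks so much as a category, here is a rough sketch on a hypothetical column with only three distinct labels. A categorical stores each label once plus a small integer code per row, whereas an object column stores a full Python string per row:

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels repeated many times.
make = pd.Series(["Ford", "Toyota", "BMW"] * 10_000, dtype="object")

as_object = make.memory_usage(deep=True)
as_category = make.astype("category").memory_usage(deep=True)

print(as_object, as_category)
```

The more distinct values a column has relative to its length, the smaller this saving becomes, which matches the weaker result the notes get for 'model' (3816 unique values) compared with 'make' (134).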
In[34]: (fueleco[['model']]
   ...:     .assign(model=fueleco.model.astype('category'))
   ...:  .info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   model   39101 non-null  category
dtypes: category(1)
memory usage: 465.7 KB
In[36]: fueleco.make.value_counts()
Out[36]:
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ...
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64
In[37]: # there are too many values in the summary above, so look at the top 6 and collapse the remaining values
In[38]: top_n=fueleco.make.value_counts().index[:6]
In[39]: fueleco.value_counts()
Out[39]: Series([], dtype: int64)
In[40]: fueleco.make.value_counts()
Out[40]:
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ...
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64
In[42]: top_n
Out[42]: Index(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Toyota', 'BMW'], dtype='object')
In[44]: (fueleco.assign(
   ...:     make=fueleco.make.where(fueleco.make.isin(top_n),"other"))
   ...:     .make.value_counts())
Out[44]:
other        23211
Chevrolet     3900
Ford          3208
Dodge         2557
GMC           2442
Toyota        1976
BMW           1807
Name: make, dtype: int64
In[45]: # determine the number and percent of missing values
In[46]: fueleco.drive.isna().sum()
Out[46]: 1189
In[47]: fueleco.isna().mean()
Out[47]:
barrels08     0.000000
barrelsA08    0.000000
charge120     0.000000
charge240     0.000000
city08        0.000000
                ...
guzzler       0.940283
trans_dscr    0.615176
tCharger      0.851257
sCharger      0.981126
atvType       0.918058
Length: 70, dtype: float64
In[48]: fueleco.drive.isna().mean()
Out[48]: 0.030408429451932176
In[49]: fueleco.drive.isna().mean()*100
Out[49]: 3.0408429451932175
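The same `isna().mean()` idea works for a whole frame at once. A short sketch on a hypothetical three-column frame, sorted so that the emptiest columns come first:

```python
import pandas as pd

# Hypothetical frame with gaps in two of its columns.
df = pd.DataFrame({
    "drive": ["FWD", None, "RWD", "FWD"],
    "guzzler": [None, None, None, "G"],
    "year": [1985, 1993, 2001, 2018],
})

# isna() gives booleans; their mean is the fraction of missing values per column.
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing)
```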
In[50]: # use .nunique method to determine cardinality
In[51]: fueleco.drive.nunique()
Out[51]: 7
In[52]: # pick out the columns with data types that are object
In[53]: fueleco.select_dtypes('object').columns
Out[53]:
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
       'mpgData', 'trany', 'VClass', 'guzzler', 'trans_dscr', 'tCharger',
       'sCharger', 'atvType'],
      dtype='object')
In[55]: import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
top_n=fueleco.make.value_counts().index[:6]
(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),'Other')).make.value_counts().plot.bar(ax=ax))
Out[58]: <AxesSubplot:>
fig.savefig("c5-catpan.png",dpi=300)

.cut and .qcut (quantile cut): .cut splits values into equal-width bins or into bins whose edges we specify, while .qcut splits them into bins that hold roughly equal numbers of rows. With these methods we can treat numeric columns as categories by binning them.
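A small sketch of the difference between the two, on a hypothetical skewed sample (ten made-up mileage values with one extreme entry):

```python
import pandas as pd

# Hypothetical city mileage values, skewed by one very efficient vehicle.
mpg = pd.Series([9, 11, 15, 16, 17, 18, 19, 21, 23, 150])

# cut: 3 bins of equal width across the range, so most rows land in the first bin.
equal_width = pd.cut(mpg, bins=3)

# qcut: 2 bins holding the same number of rows each, split at the median.
equal_count = pd.qcut(mpg, q=2)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

On skewed data like this, equal-width bins leave most rows in one bin, which is usually the reason to reach for `qcut` or for hand-picked edges.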

continuous data

import seaborn as sns
fig,ax=plt.subplots(figsize=(10,8))
sns.countplot(y='make',data=(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),"Other"))))
Out[66]: <AxesSubplot:xlabel='count', ylabel='make'>
fig.savefig("c5-catsns.png",dpi=300)
# rows where 'drive' is missing
fueleco[fueleco.drive.isna()]
Out[69]:
       barrels08  barrelsA08  charge120  ...  tCharger  sCharger  atvType
7138    0.240000         0.0        0.0  ...       NaN       NaN       EV
8144    0.312000         0.0        0.0  ...       NaN       NaN       EV
8147    0.270000         0.0        0.0  ...       NaN       NaN       EV
18215  15.695714         0.0        0.0  ...       NaN       NaN      NaN
18216  14.982273         0.0        0.0  ...       NaN       NaN      NaN
...          ...         ...        ...  ...       ...       ...      ...
23023   0.240000         0.0        0.0  ...       NaN       NaN       EV
23024   0.546000         0.0        0.0  ...       NaN       NaN       EV
23026   0.426000         0.0        0.0  ...       NaN       NaN       EV
23031   0.426000         0.0        0.0  ...       NaN       NaN       EV
23034   0.204000         0.0        0.0  ...       NaN       NaN       EV
[1189 rows x 70 columns]
# by default, .value_counts does not show missing values; pass dropna=False to include them
fueleco.drive.value_counts(dropna=False)
Out[71]:
Front-Wheel Drive             13653
Rear-Wheel Drive              13284
4-Wheel or All-Wheel Drive     6648
All-Wheel Drive                2401
4-Wheel Drive                  1221
NaN                            1189
2-Wheel Drive                   507
Part-time 4-Wheel Drive         198
Name: drive, dtype: int64
fig,ax=plt.subplots(figsize=(10,8))
fueleco.drive.value_counts(dropna=False).plot.bar(ax=ax)
Out[73]: <AxesSubplot:>
fig.savefig('c5-aa',dpi=300)
",".join('abcd')
Out[76]: 'a,b,c,d'
fueleco.city08.sample(5,random_state=40)
Out[80]:
4643     16
1483     15
34149    21
563      14
2364     19
Name: city08, dtype: int64
fueleco.city08.sample(5,random_state=42)
Out[81]:
4217     11
1736     21
36029    16
37631    16
1668     17
Name: city08, dtype: int64
# use pandas to plot a histogram
import matplotlib.pyplot as plt
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax)
Out[85]: <AxesSubplot:>
fig.savefig('hist.png',dpi=300)
# the plot looks very skewed, so increase the number of bins in the histogram to see whether the skew is hiding any behavior,
# since the skew makes the default bins wider
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax,bins=30)
Out[90]: <AxesSubplot:>
fig.savefig("hist-bins-30.png",dpi=300)
# use seaborn to create a distribution plot, which includes a histogram, a kernel density estimate (KDE), and a rug plot
# start a fresh figure so the plot is not drawn on top of the previous histogram
fig,ax=plt.subplots(figsize=(10,8))
sns.distplot(fueleco.city08,rug=True,ax=ax)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2056: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
Out[96]: <AxesSubplot:xlabel='city08', ylabel='Density'>
fig.savefig('rugplot.png',dpi=300)

[160(181/627)]

An introduction to the seaborn plotting functions boxplot, boxenplot, and violin plots

[162(183/627)]
Checking graphically whether the data follow a normal distribution:

fig,ax=plt.subplots(nrows=3,figsize=(10,8))
sns.boxplot(x=fueleco.city08,ax=ax[0])
sns.violinplot(x=fueleco.city08,ax=ax[1])
sns.boxenplot(x=fueleco.city08,ax=ax[2])
fig.savefig('subplots_nrows3.png',dpi=300)
from scipy import stats
stats.kstest(fueleco.city08,cdf='norm')
Out[104]: KstestResult(statistic=0.9999999990134123, pvalue=0.0)
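A note on reading that result: `kstest(..., 'norm')` compares the data with the standard normal N(0, 1). Mileage values start at 6, far to the right of that curve, so a statistic close to 1 is expected whether or not the shape is bell-like. The sketch below illustrates this with a synthetic sample (an assumption, not the real column) that truly is normal but is centred at 18:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sample that really is normal, but centred at 18 with spread 7.
x = rng.normal(loc=18, scale=7, size=1000)

# Compared with the standard normal N(0, 1), even normal data looks nothing like it.
raw = stats.kstest(x, "norm")

# Standardizing first compares the shape rather than the location and scale.
z = (x - x.mean()) / x.std(ddof=1)
standardized = stats.kstest(z, "norm")

print(raw.statistic, standardized.statistic)
```

Because the mean and standard deviation are estimated from the same sample, the p-value of the standardized test is optimistic; it is better treated as a rough guide next to the probability plot than as a strict test.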
fig,ax=plt.subplots(figsize=(10,8))
stats.probplot(fueleco.city08,plot=ax)
Out[106]:
((array([-4.1352692 , -3.92687024, -3.81314873, ...,  3.81314873,3.92687024,  4.1352692 ]),array([  6,   6,   6, ..., 137, 138, 150], dtype=int64)),(5.385946629915974, 18.077798521776934, 0.772587941459713))
fig.savefig('probplot.png',dpi=300)

Comparing continuous values across categories

# make a mask for the brands we want
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"])
mask
Out[112]:
0        False
1        False
2        False
3        False
4        False
         ...
39096    False
39097    False
39098    False
39099    False
39100    False
Name: make, Length: 39101, dtype: bool
type(mask)
Out[113]: pandas.core.series.Series
fueleco[mask]
Out[114]:
       barrels08  barrelsA08  charge120  ...  tCharger  sCharger  atvType
20     20.600625         0.0        0.0  ...       NaN       NaN      NaN
21     20.600625         0.0        0.0  ...       NaN       NaN      NaN
22     25.354615         0.0        0.0  ...       NaN       NaN      NaN
56     15.695714         0.0        0.0  ...       NaN       NaN      NaN
57     17.347895         0.0        0.0  ...       NaN       NaN      NaN
...          ...         ...        ...  ...       ...       ...      ...
39016  13.733750         0.0        0.0  ...       NaN       NaN      NaN
39017  17.347895         0.0        0.0  ...       NaN       NaN      NaN
39018  15.695714         0.0        0.0  ...       NaN       NaN      NaN
39023  14.982273         0.0        0.0  ...       NaN       NaN      NaN
39025  13.733750         0.0        0.0  ...       NaN       NaN      NaN
[5986 rows x 70 columns]
fueleco[mask].groupby("make").city08.agg(["mean","std"])
Out[115]:
            mean       std
make
BMW    17.817377  7.372907
Ford   16.853803  6.701029
Honda  24.372973  9.154064
Tesla  92.826087  5.538970
# the groupby operation above reports the mean and std of the city08 column for each make
# visualize the city08 values for each make with seaborn
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box')
g.ax.figure.savefig("c5-catbox.png",dpi=300)
# one drawback of a boxplot is that while it indicates the spread of the data, it does not reveal how many samples are in each make
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"]) # boolean Series with one entry per row of fueleco: True when that row's make is one of these four values
fueleco[mask].groupby("make").city08.count()
Out[123]:
make
BMW      1807
Ford     3208
Honda     925
Tesla      46
Name: city08, dtype: int64
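The count can also be requested alongside the mean and standard deviation in a single `agg` call, which keeps the sample size next to the statistics it qualifies. A sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical rows for two makes with very different sample sizes.
df = pd.DataFrame({
    "make": ["Ford", "Ford", "Ford", "Tesla"],
    "city08": [15, 17, 19, 92],
})

summary = df.groupby("make")["city08"].agg(["count", "mean", "std"])
print(summary)
```

A group with a single row gets NaN for std, which is itself a useful hint that the group is too small to summarize.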
# facet the grid by another feature
# break each of these new plots into its own graph by using the col parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018])
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018],col_wrap=2)
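# col is the column used to split the data into panels; col_order sets the order of the panels
# embed the new dimension in the same plot by using the hue parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',hue='year',hue_order=[2012,2014,2016,2018])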

Comparing two continuous columns

# Comparing two continuous columns
# if you have two columns with a high correlation to one another, you can often drop one of them as redundant
# covariance values are only comparable when the columns are on the same scale
fueleco.city08.cov(fueleco.highway08)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-be0843446ebc>", line 1, in <module>
    fueleco.city08.cov(fueleco.highway08)
NameError: name 'fueleco' is not defined
import pandas as pd
import numpy as np
fueleco=pd.read_csv("vehicles.csv",usecols=list(range(1,70,1)))
fueleco.city08.cov(fueleco.highway08)
Out[9]: 46.33326023673624
fueleco.city08.cov(fueleco.comb08)
Out[10]: 47.419946678190776
fueleco.city08.cov(fueleco.cylinders)
Out[11]: -5.931560263764768
# Pearson correlation between the two numbers
fueleco.city08.corr(fueleco.highway08)
Out[13]: 0.932494506228495
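The link between the two numbers is that the Pearson correlation is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and 1. A sketch on hypothetical paired values:

```python
import pandas as pd

# Hypothetical paired measurements.
city = pd.Series([10.0, 15.0, 18.0, 22.0, 30.0])
highway = pd.Series([14.0, 21.0, 24.0, 31.0, 38.0])

cov = city.cov(highway)
# Dividing by both standard deviations removes the units and bounds the result.
r_manual = cov / (city.std() * highway.std())
r_pandas = city.corr(highway)

print(cov, r_manual, r_pandas)
```

This is why covariances such as 46.3 and -5.9 above are hard to compare directly, while correlations can be compared across column pairs.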
# use pandas to scatter plot the relationship
import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax)
Out[17]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[18]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fig.savefig('scatterplot_alpha0.01.png',dpi=300)
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[21]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.1)
# Pearson correlation is intended to show the strength of a linear relationship.
# if the continuous columns do not have a linear relationship, another option is to use Spearman correlation
# this number also varies from -1 to 1
# it measures whether the relationship is monotonic and doesn't presume that it is linear
# it uses the rank of each number rather than the number itself; if you are not sure whether there is a linear relationship between your columns, this is a better metric to use
fueleco.city08.corr(fueleco.barrelsA08,method='spearman')
Out[31]: -0.08476703673460519
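A sketch of what "uses the rank" means in practice, on hypothetical data where y grows as the cube of x. The relationship is perfectly monotonic but not linear, so Spearman reports 1 while Pearson falls short of it, and Spearman matches Pearson computed on the ranks:

```python
import pandas as pd

# Hypothetical monotonic but non-linear relationship: y grows as x cubed.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

pearson = x.corr(y)                      # below 1: the curve is not a line
spearman = x.corr(y, method="spearman")  # 1: the ordering is preserved exactly
via_ranks = x.rank().corr(y.rank())      # Spearman is Pearson applied to the ranks

print(pearson, spearman, via_ranks)
```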
# Pearson correlation tells us how strongly one value moves with another
# covariance lets us know how these values vary together
# a heatmap is a great way to look at correlations in aggregate
# scatter plots are another way to visualize the relationships between continuous variables
# set the alpha parameter to a value less than or equal to 0.5, which makes the points transparent
# now, add more dimensions to a scatter plot
# using the relplot function, we can color the dots by year and size them by the number of barrels the vehicle consumes
# in this case, we go from 2 dimensions to 4 dimensions
import seaborn as sns  # needed again because the session was restarted
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8)
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8,col='make',col_order=['Ford',"Tesla"])

Comparing categorical values with categorical values

# continuous columns can be converted into categorical columns by binning the values
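One common way to compare two categorical columns is a cross-tabulation of counts. The sketch below, on a hypothetical five-row frame with made-up bin edges and labels, bins a continuous column with `pd.cut` and then counts make-by-bin pairs with `pd.crosstab`:

```python
import pandas as pd

# Hypothetical rows: one categorical column and one continuous column.
df = pd.DataFrame({
    "make": ["Ford", "Ford", "Honda", "Honda", "Tesla"],
    "city08": [14, 18, 24, 31, 95],
})

# Bin the continuous column with explicit edges, then count make-by-bin pairs.
mpg_band = pd.cut(df["city08"], bins=[0, 20, 40, 200], labels=["low", "mid", "high"])
table = pd.crosstab(df["make"], mpg_band)
print(table)
```

The edges 20 and 40 are illustrative only; sensible cut points depend on the data.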

[179(200/627)] I did not really understand what this part is doing...

# if you use seaborn, you can add multiple dimensions by setting 'hue' or 'col'

Using the pandas profiling library

[187(208/627)]

[CookBook pandas] Study Notes, Chapter 5: Exploratory Data Analysis