Common Data Preprocessing Methods in Sklearn
This article introduces commonly used data preprocessing methods in sklearn.
Data Preprocessing
1. Import the required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer  # removed in sklearn >= 0.22; use sklearn.impute.SimpleImputer there
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Binarizer
from sklearn.cluster import KMeans
2. Load the datasets
- “teenager_sns” is a social-network dataset of 30,000 US high-school students. Each sample has 40 variables: gradyear, gender, age, and friends record basic information (graduation year, gender, age, and number of friends), while the remaining 36 variables count occurrences of 36 words reflecting five broad interest categories.
- “accord_sedan_testing” is a used-car dataset containing each car's price, mileage, model year, trim level, engine cylinders, transmission type, and so on.
teenager_sns = pd.read_csv("./input/teenager_sns.csv")
teenager_sns.head(5)
gradyear | gender | age | friends | basketball | football | soccer | softball | volleyball | swimming | ... | blonde | mall | shopping | clothes | hollister | abercrombie | die | death | drunk | drugs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006 | M | 18.980 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2006 | F | 18.801 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2006 | M | 18.335 | 69 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 2006 | F | 18.875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2006 | NaN | 18.995 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 rows × 40 columns
test = pd.read_csv('./input/accord_sedan_testing.csv')
# Collect the mileage ("mileage") and price ("price") columns of the dataset
data = test[['mileage','price']].copy()  # copy so that assigning a new column later does not raise SettingWithCopyWarning
data.head()
mileage | price | |
---|---|---|
0 | 68265 | 12995 |
1 | 92778 | 9690 |
2 | 136000 | 8995 |
3 | 72765 | 11995 |
4 | 36448 | 17999 |
3. Missing-value handling
Missing value (NaN) handling: from sklearn.preprocessing import Imputer
"""
Use sklearn's Imputer to fill the missing values in the "age" column of "teenager_sns" with the column mean ("mean")
"""
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(teenager_sns[["age"]])
teenager_sns["age_imputed"] = imp.transform(teenager_sns[["age"]])
# Show the rows where age is missing, together with the imputed column "age_imputed"
teenager_sns[teenager_sns['age'].isnull()]
gradyear | gender | age | friends | basketball | football | soccer | softball | volleyball | swimming | ... | mall | shopping | clothes | hollister | abercrombie | die | death | drunk | drugs | age_imputed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2006 | F | NaN | 142 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 17.993949 |
13 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
15 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
16 | 2006 | NaN | NaN | 135 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
26 | 2006 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
38 | 2006 | F | NaN | 17 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 17.993949 |
41 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
49 | 2006 | M | NaN | 35 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
68 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
71 | 2006 | NaN | NaN | 31 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
74 | 2006 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
82 | 2006 | NaN | NaN | 70 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
93 | 2006 | F | NaN | 12 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
94 | 2006 | NaN | NaN | 36 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 17.993949 |
97 | 2006 | F | NaN | 13 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
102 | 2006 | F | NaN | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 17.993949 |
111 | 2006 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 17.993949 |
114 | 2006 | F | NaN | 36 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
130 | 2006 | NaN | NaN | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
137 | 2006 | NaN | NaN | 55 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
149 | 2006 | F | NaN | 15 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
155 | 2006 | NaN | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
157 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
160 | 2006 | NaN | NaN | 73 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
162 | 2006 | F | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
165 | 2006 | NaN | NaN | 6 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
173 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
179 | 2006 | NaN | NaN | 17 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
191 | 2006 | NaN | NaN | 54 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
228 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29843 | 2009 | F | NaN | 14 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29861 | 2009 | F | NaN | 38 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 4 | 0 | 1 | 1 | 1 | 0 | 2 | 1 | 17.993949 |
29880 | 2009 | NaN | NaN | 0 | 1 | 0 | 0 | 4 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29883 | 2009 | F | NaN | 12 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29884 | 2009 | F | NaN | 12 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29885 | 2009 | M | NaN | 142 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29891 | 2009 | F | NaN | 12 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 17.993949 |
29893 | 2009 | NaN | NaN | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 17.993949 |
29899 | 2009 | F | NaN | 53 | 1 | 2 | 0 | 0 | 5 | 1 | ... | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29903 | 2009 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29905 | 2009 | NaN | NaN | 45 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29910 | 2009 | F | NaN | 34 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 17.993949 |
29912 | 2009 | M | NaN | 8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29916 | 2009 | M | NaN | 52 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 17.993949 |
29921 | 2009 | F | NaN | 36 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29925 | 2009 | F | NaN | 32 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29927 | 2009 | F | NaN | 7 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29928 | 2009 | NaN | NaN | 36 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29931 | 2009 | NaN | NaN | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29934 | 2009 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29936 | 2009 | M | NaN | 63 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29937 | 2009 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 17.993949 |
29949 | 2009 | M | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29957 | 2009 | M | NaN | 14 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29970 | 2009 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29980 | 2009 | F | NaN | 3 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29989 | 2009 | F | NaN | 0 | 6 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29991 | 2009 | F | NaN | 229 | 3 | 0 | 0 | 0 | 0 | 2 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 17.993949 |
29992 | 2009 | M | NaN | 7 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
29993 | 2009 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
5086 rows × 41 columns
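Note that `Imputer` was removed in sklearn 0.22; its replacement is `sklearn.impute.SimpleImputer`. A minimal sketch of the same mean imputation (the `age` values here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [18.9, np.nan, 18.3, np.nan]})
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
df["age_imputed"] = imp.fit_transform(df[["age"]])
print(df["age_imputed"].tolist())  # NaNs replaced by the column mean (18.6)
```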
print(data.describe())
mileage price
count 100.000000 100.000000
mean 83790.840000 11865.200000
std 26538.304745 2233.809407
min 22110.000000 2612.000000
25% 64370.500000 10497.500000
50% 84196.500000 11992.500000
75% 100797.250000 13483.500000
max 140487.000000 17999.000000
4. Outlier detection
Outlier detection: from sklearn.neighbors import LocalOutlierFactor
"""
Use sklearn's LocalOutlierFactor to detect outliers in the "accord_sedan_testing" dataset
"""
lof = LocalOutlierFactor()
lof.fit(data)
data['LOF'] = -lof.negative_outlier_factor_
# Show the samples whose local outlier factor is greater than 1.5
data[data.LOF > 1.5]
mileage | price | LOF | |
---|---|---|---|
4 | 36448 | 17999 | 1.534739 |
52 | 22110 | 14399 | 2.235552 |
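Instead of thresholding the factor by hand, `LocalOutlierFactor.fit_predict` can label outliers directly; with the default `contamination='auto'`, points whose LOF exceeds 1.5 are flagged as -1. A sketch on made-up 1-D data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly clustered points plus one far-away point
X = np.array([[1.0], [1.1], [0.9], [1.05], [10.0]])
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier
print(labels)
```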
5. Data standardization
Data standardization:
- Z-score standardization:
from sklearn.preprocessing import StandardScaler
- Min-max standardization:
from sklearn.preprocessing import MinMaxScaler
"""
Use sklearn's StandardScaler to apply Z-score standardization to the "friends" column of "teenager_sns", so that the transformed data has zero mean and unit standard deviation
"""
scaler = StandardScaler(copy=True)
teenager_sns_zscore = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]), columns=["friends_StandardScaled"])
teenager_sns_zscore["friends"] = teenager_sns["friends"]
print("mean:", teenager_sns_zscore["friends_StandardScaled"].mean(axis=0))
print("std:", teenager_sns_zscore["friends_StandardScaled"].std(axis=0))
teenager_sns_zscore.head(5)
mean: -9.473903143468002e-18
std: 1.0000166670833448
friends_StandardScaled | friends | |
---|---|---|
0 | -0.634528 | 7 |
1 | -0.826150 | 0 |
2 | 1.062695 | 69 |
3 | -0.826150 | 0 |
4 | -0.552404 | 10 |
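Z-score standardization computes (x - mean) / std; the equivalence is easy to check by hand. A sketch using the five `friends` values shown above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[7.0], [0.0], [69.0], [0.0], [10.0]])
z = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()  # population std (ddof=0), as StandardScaler uses
print(np.allclose(z, manual))  # True
```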
"""
使用sklearn中的MinMaxScaler方法,对数据集“teenager_sns”中的“friends”列做Min-Max标准化,使得处理后的数据取值分布在[0,1]区间上
"""
filtered_columns = ["friends"]
scaler = MinMaxScaler(copy=False)
teenager_sns_minmaxscore = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]),
columns = ["friends_MinMaxScaled"])
teenager_sns_minmaxscore["friends"] = teenager_sns["friends"]
teenager_sns_minmaxscore.head(5)
均值: -9.473903143468002e-18
方差: 1.0000166670833448
friends_StandardScaled | friends | |
---|---|---|
0 | -0.634528 | 7 |
1 | -0.826150 | 0 |
2 | 1.062695 | 69 |
3 | -0.826150 | 0 |
4 | -0.552404 | 10 |
"""
使用sklearn中的MinMaxScaler方法,对数据集“teenager_sns”中的“friends”列做Min-Max标准化,使得处理后的数据取值分布在[0,1]区间上
"""
filtered_columns = ["friends"]
scaler = MinMaxScaler(copy=False)
teenager_sns_minmaxscore = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]),
columns = ["friends_MinMaxScaled"])
teenager_sns_minmaxscore["friends"] = teenager_sns["friends"]
teenager_sns_minmaxscore.head(5)
friends_MinMaxScaled | friends | |
---|---|---|
0 | 0.008434 | 7 |
1 | 0.000000 | 0 |
2 | 0.083133 | 69 |
3 | 0.000000 | 0 |
4 | 0.012048 | 10 |
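Min-max standardization is (x - min) / (max - min); a quick check of the equivalence on the same five toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[7.0], [0.0], [69.0], [0.0], [10.0]])
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min()) / (x.max() - x.min())
print(np.allclose(scaled, manual))  # True
```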
- Outlier-robust standardization:
from sklearn.preprocessing import RobustScaler
"""
Use sklearn's RobustScaler to standardize the "gradyear" and "friends" columns of "teenager_sns" in a way that is robust to outliers
"""
scaler = RobustScaler(copy=True)
# Add the two standardized columns "gradyear_RobustScaled" and "friends_RobustScaled"
teenager_sns_robustscaled = pd.DataFrame(scaler.fit_transform(teenager_sns[["gradyear","friends"]]), columns=['gradyear_RobustScaled', 'friends_RobustScaled'])
teenager_sns_robustscaled[["gradyear","friends"]] = teenager_sns[["gradyear","friends"]]
print(teenager_sns_robustscaled[['gradyear_RobustScaled', 'friends_RobustScaled']].describe())
teenager_sns_robustscaled.head(5)
gradyear_RobustScaled friends_RobustScaled
count 30000.000000 30000.000000
mean 0.000000 0.248280
std 0.745368 0.890997
min -1.000000 -0.487805
25% -0.500000 -0.414634
50% 0.000000 0.000000
75% 0.500000 0.585366
max 1.000000 19.756098
gradyear_RobustScaled | friends_RobustScaled | gradyear | friends | |
---|---|---|---|---|
0 | -1.0 | -0.317073 | 2006 | 7 |
1 | -1.0 | -0.487805 | 2006 | 0 |
2 | -1.0 | 1.195122 | 2006 | 69 |
3 | -1.0 | -0.487805 | 2006 | 0 |
4 | -1.0 | -0.243902 | 2006 | 10 |
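RobustScaler centers by the median and scales by the interquartile range, which makes it much less sensitive to outliers than Z-score standardization; the formula can be verified by hand on toy values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[7.0], [0.0], [69.0], [0.0], [10.0]])
scaled = RobustScaler().fit_transform(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - np.median(x)) / (q3 - q1)  # default quantile_range is (25, 75)
print(np.allclose(scaled, manual))  # True
```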
6. Feature encoding
- Label encoding:
from sklearn.preprocessing import LabelEncoder
— converts discrete, non-numeric features into numeric ones
"""
Use sklearn's LabelEncoder to encode the "gender" column of "teenager_sns"
"""
le = LabelEncoder()
# Print the gender of the first 4 students
print(teenager_sns["gender"][:4])
# Encode
print(le.fit_transform(teenager_sns["gender"][:4]))
print(le.classes_)
0    M
1    F
2    M
3    F
Name: gender, dtype: object
[1 0 1 0]
['F' 'M']
- One-hot encoding:
from sklearn.preprocessing import OneHotEncoder
— converts a discrete feature with k possible values into k binary features (each 0 or 1)
# Encode gender numerically
teenager_sns['gender'] = teenager_sns['gender'].map({'M':1, 'F':2, np.NaN:3})
enc = OneHotEncoder()
# Fit the OneHotEncoder on gender
enc.fit(teenager_sns[['gender']])
# Indices of the active features, i.e. the values actually seen in the training set
print(enc.active_features_)
# Transform using one-hot encoding
print(enc.transform([[2]]).toarray())
[1 2 3]
[[0. 1. 0.]]
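Note that `active_features_` was removed in newer sklearn versions; in the current API, `categories_` lists the values seen during fit, and string features can be encoded directly without the numeric mapping step. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(np.array([['M'], ['F'], ['M']]))
print(enc.categories_)               # values seen during fit, sorted: ['F', 'M']
print(enc.transform([['F']]).toarray())  # [[1. 0.]]
```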
7. Feature discretization
Feature discretization methods:
- Binarization
- Equal-width discretization
- Equal-frequency discretization
- Clustering-based discretization
Binarization: binarize the data against a threshold (setting each feature value to 0 or 1); values greater than the threshold map to 1, and values less than or equal to the threshold map to 0.
from sklearn.preprocessing import Binarizer
"""
使用sklearn中的Binarizer方法,对数据集“teenager_sns”中的“friends”列进行二值特征离散化
"""
# 阈值设置为3,大于3的映射为1,小于3的映射为0
scaler = Binarizer(threshold=3)
teenager_sns_binarizer = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]),columns = ["friends_Binarized"])
teenager_sns_binarizer["friends"] = teenager_sns["friends"]
teenager_sns_binarizer.head(5)
friends_Binarized | friends | |
---|---|---|
0 | 1 | 7 |
1 | 0 | 0 |
2 | 1 | 69 |
3 | 0 | 0 |
4 | 1 | 10 |
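Binarizer is equivalent to a plain threshold comparison; for the `friends` values shown above, the mapping can be reproduced directly:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

x = np.array([[7], [0], [69], [0], [10]])
b = Binarizer(threshold=3).fit_transform(x)
manual = (x > 3).astype(int)
print(np.array_equal(b, manual))  # True
```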
Equal-width discretization: divide the range of a continuous feature uniformly into k intervals of equal width, then assign each value to its interval.
"""
Use pandas' cut method to implement equal-width discretization
"""
data = teenager_sns['friends'].copy()
k = 4
# Equal-width discretization; the bins are labeled 0, 1, 2, 3
d1 = pd.cut(data, k, labels=range(k))
d1.head()
0    0
1    0
2    0
3    0
4    0
Name: friends, dtype: category
Categories (4, int64): [0 < 1 < 2 < 3]
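Passing `retbins=True` makes `pd.cut` return the computed boundaries as well, which is a convenient way to inspect the equal interval widths (a sketch on made-up values):

```python
import pandas as pd

s = pd.Series([0, 5, 10, 15, 20])
binned, edges = pd.cut(s, 4, labels=range(4), retbins=True)
# Four intervals of equal width 5; the lowest edge is extended slightly so the minimum is included
print(list(edges))
```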
Equal-frequency discretization: divide the values of a continuous feature into k intervals such that each interval contains the same number of samples; each interval's value range then defines a discretization bin.
"""
Use pandas' qcut method to implement equal-frequency discretization
"""
data = teenager_sns['friends'].copy()
k = 4
# Equal-frequency discretization; the bins are labeled 'A', 'B', 'C', 'D'
d1 = pd.qcut(data, k, labels=['A','B','C','D'])
d1.head()
0    B
1    A
2    D
3    A
4    B
Name: friends, dtype: category
Categories (4, object): [A < B < C < D]
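The defining property of equal-frequency binning is that each bin receives the same number of samples, which is easy to verify on toy data:

```python
import pandas as pd

s = pd.Series(range(1, 13))  # 12 distinct values
binned = pd.qcut(s, 4, labels=['A', 'B', 'C', 'D'])
print(binned.value_counts().sort_index().tolist())  # [3, 3, 3, 3]
```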
Clustering-based discretization:
- For the continuous feature to be discretized, apply a clustering algorithm (e.g. K-means or EM) to partition the samples into clusters according to the feature's distribution;
- Based on the clustering result, decide whether to further split or merge clusters according to a chosen strategy: a top-down strategy re-runs clustering within each cluster to split it into smaller sub-clusters, while a bottom-up strategy merges neighboring, similar clusters into new ones;
- Once the final clusters are fixed, determine the cut points and the number of intervals.
kmodel = KMeans(n_clusters=k)  # build the model
kmodel.fit(data.values.reshape((len(data), 1)))  # fit the model
c = pd.DataFrame(kmodel.cluster_centers_).sort_values(0)  # cluster centers, sorted
w = c.rolling(2).mean().iloc[1:]  # midpoints of adjacent centers serve as boundaries
w = [0] + list(w[0]) + [data.max()]  # prepend/append the outer boundaries (the 0 in w[0] is the column index)
d2 = pd.cut(data, w, labels=['A','B','C','D'])
d2.head(10)
0      A
1    NaN
2      C
3    NaN
4      A
5      D
6      C
7      A
8      B
9      B
Name: friends, dtype: category
Categories (4, object): [A < B < C < D]
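The NaN labels above arise because `pd.cut` uses right-closed intervals, so values equal to the lowest boundary (friends = 0 here) fall outside the first bin; passing `include_lowest=True` keeps them. A sketch with hypothetical boundaries:

```python
import pandas as pd

s = pd.Series([0, 5, 20, 60])
w = [0, 10, 30, 70]  # hypothetical cluster-derived boundaries
print(list(pd.cut(s, w, labels=['A', 'B', 'C'])))  # 0 falls outside the first bin and becomes NaN
print(list(pd.cut(s, w, labels=['A', 'B', 'C'], include_lowest=True)))  # ['A', 'A', 'B', 'C']
```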
Besides the methods above, there are other discretization approaches, such as information-gain-based and chi-square-based discretization.