Contents

  • 1. Introduction
  • 2. Missing Values
  • 3. Categorical Variables

from https://www.kaggle.com/learn/intermediate-machine-learning

Next: [Kaggle] Intermediate Machine Learning (Pipelines + Cross-Validation)

1. Introduction

  • Using the 7 features given in the tutorial, train random forest models with 5 different parameter settings, pick the one with the lowest validation MAE, and submit its predictions
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Obtain target and predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
from sklearn.ensemble import RandomForestRegressor

# Define the models: five random forests with different parameters
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
# 'absolute_error' was spelled 'mae' before scikit-learn 1.0
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

# Find the model with the lowest error
for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i+1, mae))

# model_3 (index 2) is selected as the best model, then refit on all training data
best_model = models[2]
my_model = best_model
my_model.fit(X, y)

# Generate test predictions
preds_test = my_model.predict(X_test)

# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

Score: MAE 20998.83780
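The loop above prints each MAE and `models[2]` is then picked by hand. Below is a minimal sketch of letting the code choose the winner instead. It assumes synthetic data from `make_regression` in place of the housing CSVs and a smaller, hypothetical set of candidate models; the names `candidates`, `validation_mae` and `best_name` are illustrative and do not come from the tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical candidates, kept small so the sketch runs quickly
candidates = {
    "n_estimators=10": RandomForestRegressor(n_estimators=10, random_state=0),
    "n_estimators=30": RandomForestRegressor(n_estimators=30, random_state=0),
    "max_depth=3": RandomForestRegressor(n_estimators=30, max_depth=3, random_state=0),
}

def validation_mae(model):
    """Fit on the training split and return MAE on the validation split."""
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

scores = {name: validation_mae(model) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)  # key with the smallest MAE

for name, mae in scores.items():
    print(f"{name}: MAE {mae:.2f}")
print("Best:", best_name)
```

Selecting by `min` over the score dictionary avoids hard-coding an index that silently goes stale when the candidate list changes.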

2. Missing Values

Ways to handle missing values:

  • Drop the entire column; the drawback is that a lot of information is lost
# Columns that contain at least one missing value in the training data
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop those columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
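One detail worth seeing on a small example: the list of columns to drop is built from the training data only, so a column that is complete in training but has gaps in validation survives the drop. The sketch below uses a made-up three-column frame (the values and the `toy_*` names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): LotFrontage is missing in training,
# GarageYrBlt is missing only in validation
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0],
})
toy_valid = pd.DataFrame({
    "LotArea": [9550, 14260],
    "LotFrontage": [60.0, 84.0],
    "GarageYrBlt": [np.nan, 2000.0],
})

# Same rule as above: decide from the training data only
missing_cols = [c for c in toy_train.columns if toy_train[c].isnull().any()]
reduced_train = toy_train.drop(columns=missing_cols)
reduced_valid = toy_valid.drop(columns=missing_cols)

print("Dropped:", missing_cols)
print("NaNs left in validation:", int(reduced_valid.isnull().sum().sum()))
```

If leftover gaps like this matter for the model in use, one option is to build the list from both frames; that is a design choice, not something the tutorial prescribes.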
  • Imputation, e.g. filling in the column mean
from sklearn.impute import SimpleImputer

# Imputation
help(SimpleImputer)
imp = SimpleImputer()  # imputes with the mean by default
# imp = SimpleImputer(strategy="median")  # impute with the median instead
imputed_X_train = pd.DataFrame(imp.fit_transform(X_train))  # fit on training data, then fill
imputed_X_valid = pd.DataFrame(imp.transform(X_valid))  # fill using the training statistics

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

For reference, the beginning of the SimpleImputer help text:

class SimpleImputer(_BaseImputer)
 |  SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
 |
 |  Imputation transformer for completing missing values.
 |
 |  Read more in the :ref:`User Guide <impute>`.
 |
 |  Parameters
 |  ----------
 |  missing_values : number, string, np.nan (default) or None
 |      The placeholder for the missing values. All occurrences of
 |      `missing_values` will be imputed.
 |
 |  strategy : string, default='mean'
 |      The imputation strategy.
 |
 |      - If "mean", then replace missing values using the mean along
 |        each column. Can only be used with numeric data.
 |      - If "median", then replace missing values using the median along
 |        each column. Can only be used with numeric data.
 |      - If "most_frequent", then replace missing using the most frequent
 |        value along each column. Can be used with strings or numeric data.
 |      - If "constant", then replace missing values with fill_value. Can be
 |        used with strings or numeric data.

Score: MAE 16619.07644
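To make the fit/transform split concrete, here is a small sketch on made-up numbers (the frames and the `filled_*` names are assumptions). The means are learned from the training frame and reused for validation, and passing `columns=` and `index=` when rebuilding the DataFrame restores both the feature names and the row labels in one step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical values and non-default row labels
toy_train = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]},
    index=[101, 102, 103],
)
toy_valid = pd.DataFrame({"a": [np.nan], "b": [40.0]}, index=[201])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training data only
filled_train = pd.DataFrame(
    imputer.fit_transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,
)
# transform reuses those training means for the validation data
filled_valid = pd.DataFrame(
    imputer.transform(toy_valid),
    columns=toy_valid.columns,
    index=toy_valid.index,
)

print(filled_train)
print(filled_valid)
```

Here the training means are 2.0 for `a` and 15.0 for `b`, so the gap in the validation row is filled with 2.0 rather than with a statistic computed from the validation data itself.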

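3. Categorical Variables

Ways to handle categorical variables:

  • Drop them outright, if they are not useful
  • Label Encoding: map each category string to an integer, e.g. for a frequency scale "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3). Suited to features with an inherent order (ordinal features)
  • One-Hot Encoding: for features with no inherent order; it adds a new column for each category value. This last approach generally works best, but when a feature has many distinct values it makes the dataset much larger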
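A caveat on the ordinal example above: `LabelEncoder` numbers categories in sorted (alphabetical) order, which does not match "Never" < "Rarely" < "Most days" < "Every day". The sketch below contrasts it with `OrdinalEncoder` given an explicit category order. The column name `breakfast` and the sample rows are hypothetical, and using `OrdinalEncoder` here is a swapped-in alternative, not what the code in this post does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical survey column using the frequency scale from the text
freq = pd.DataFrame({"breakfast": ["Never", "Every day", "Rarely", "Most days"]})

# LabelEncoder sorts the classes alphabetically:
# "Every day"=0, "Most days"=1, "Never"=2, "Rarely"=3
alphabetical = LabelEncoder().fit_transform(freq["breakfast"])

# OrdinalEncoder with an explicit order keeps the intended ranking:
# "Never"=0, "Rarely"=1, "Most days"=2, "Every day"=3
order = [["Never", "Rarely", "Most days", "Every day"]]
ordered = OrdinalEncoder(categories=order).fit_transform(freq[["breakfast"]]).ravel()

print(list(alphabetical))
print(list(ordered))
```

For tree-based models the exact integers often matter less, but when the order carries meaning it seems safer to state it explicitly than to rely on alphabetical luck.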
# Get list of categorical variables (columns with a non-numeric dtype)
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
Categorical variables:
['Type', 'Method', 'Regionname'] # feature names
  1. Drop the categorical columns
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
  2. Label Encoding
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
  3. One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (sparse_output was named sparse before scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding); only numeric columns remain
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Make all column names strings (newer scikit-learn versions reject mixed int/str names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
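To see what `handle_unknown='ignore'` does in practice, here is a small sketch on a made-up `Type` column (the values, the `toy_*` frames and the `Type_*` column names are assumptions). It keeps the encoder's default sparse output and calls `.toarray()`, which sidesteps the `sparse` / `sparse_output` naming difference between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "x" appears only in validation
toy_train = pd.DataFrame({"Type": ["h", "u", "t", "h"]}, index=[10, 11, 12, 13])
toy_valid = pd.DataFrame({"Type": ["u", "x"]}, index=[20, 21])

encoder = OneHotEncoder(handle_unknown="ignore")
train_matrix = encoder.fit_transform(toy_train[["Type"]]).toarray()
valid_matrix = encoder.transform(toy_valid[["Type"]]).toarray()

# Build readable column names from the categories learned during fit
oh_names = [f"Type_{c}" for c in encoder.categories_[0]]
train_oh = pd.DataFrame(train_matrix, columns=oh_names, index=toy_train.index)
valid_oh = pd.DataFrame(valid_matrix, columns=oh_names, index=toy_valid.index)

print(train_oh)
print(valid_oh)  # the row for the unseen "x" is all zeros
```

An unseen category therefore does not raise an error; it simply sets none of the indicator columns.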

When the training set and the validation (or test) set contain different category values:

  • Check which features have the same set of values in both datasets; encoding the ones that differ directly raises an error
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if set(X_train[col]) == set(X_valid[col])]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
  • The approach taken here is to drop the inconsistent columns and encode the consistent ones
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder
labEncoder = LabelEncoder()
for feature in set(good_label_cols):
    label_X_train[feature] = labEncoder.fit_transform(label_X_train[feature])
    label_X_valid[feature] = labEncoder.transform(label_X_valid[feature])
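The error mentioned above can be reproduced in a few lines. This sketch fits a `LabelEncoder` on two made-up street surface values and then asks it to transform a value it has never seen; the value names and the `failed` flag are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["Gravel", "Paved", "Paved"]   # hypothetical training column
valid_values = ["Paved", "Dirt"]              # "Dirt" never appears in training

encoder = LabelEncoder()
encoder.fit(train_values)

# Values seen during fit are encoded by their sorted position
known = list(encoder.transform(["Paved", "Gravel"]))
print(known)

# An unseen value makes transform raise ValueError
try:
    encoder.transform(valid_values)
    failed = False
except ValueError as err:
    failed = True
    print("transform failed:", err)

# A check in the spirit of good_label_cols: is every validation value known?
is_safe = set(valid_values) <= set(train_values)
print("safe to encode:", is_safe)
```

Note that the subset test `<=` is slightly more permissive than the set equality used above: a column whose validation values are all present in training can still be encoded even if training has extra values.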

Check how many distinct values each categorical feature has

# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
[('Street', 2), # Street has 2 distinct values
 ('Utilities', 2),
 ('CentralAir', 2),
 ...
 ('Exterior2nd', 16),
 ('Neighborhood', 25)] # features with many distinct values are poor candidates for one-hot encoding:
                       # the dataset grows considerably, so label-encode or drop them instead

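# Columns that will be one-hot encoded:
# categorical features with fewer than 10 distinct values
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset:
# the remaining ones (set difference)
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder

# One-hot encoder (sparse_output was named sparse before scikit-learn 1.2)
ohEnc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# One-hot encode the features with fewer than 10 distinct values
OH_X_train = pd.DataFrame(ohEnc.fit_transform(X_train[low_cardinality_cols]))
OH_X_valid = pd.DataFrame(ohEnc.transform(X_valid[low_cardinality_cols]))

# Encoding drops the index; put it back
OH_X_train.index = X_train.index
OH_X_valid.index = X_valid.index

# Numeric features (the original data minus all categorical columns)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Combine the numeric features with the one-hot encoded features
# (the high-cardinality columns have been dropped)
OH_X_train = pd.concat([OH_X_train, num_X_train], axis=1)
OH_X_valid = pd.concat([OH_X_valid, num_X_valid], axis=1)

# Make all column names strings (newer scikit-learn versions reject mixed int/str names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)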
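The cardinality split is easy to check on a tiny frame. The sketch below uses made-up data and a hypothetical threshold of 3 (the post uses 10) so that one column lands on each side; `select_dtypes(exclude="number")` is used here as a stand-in for the `dtype == "object"` test.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["A", "B", "C", "D"],
    "LotArea": [8450, 9600, 11250, 9550],
})

threshold = 3  # assumption for the demo; the post uses 10

# Non-numeric columns are treated as categorical
cat_cols = list(toy.select_dtypes(exclude="number").columns)
counts = {col: int(toy[col].nunique()) for col in cat_cols}

low_card = [col for col in cat_cols if counts[col] < threshold]
high_card = [col for col in cat_cols if col not in low_card]

print(sorted(counts.items(), key=lambda item: item[1]))
print("one-hot encode:", low_card)
print("drop or label-encode:", high_card)
```

Building `high_card` with a list comprehension rather than a set difference keeps the original column order, which makes the output reproducible from run to run.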

