100 Days of Machine Learning - Day 3: Multiple Linear Regression and the Dummy Variable Trap

1. Data Preprocessing

Importing the dataset

import pandas as pd
import numpy as np

dataset = pd.read_csv('D:\\100Days\\datasets\\50_Startups.csv')
X = dataset.iloc[:, :-1].values   # all columns except Profit
Y = dataset.iloc[:, 4].values     # Profit

The first five rows (dataset.head(5)):

  R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94

Encoding the categorical column

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder turns the State strings into integers 0/1/2;
# OneHotEncoder then expands column 3 into three dummy columns.
# Note: the categorical_features argument only exists in older scikit-learn (< 0.22).
labelencoder = LabelEncoder()
onehotencoder = OneHotEncoder(categorical_features=[3])
X[:, 3] = labelencoder.fit_transform(X[:, 3])
X = onehotencoder.fit_transform(X).toarray()
# print(X)

LabelEncoder first maps the State strings to 0, 1, 2; OneHotEncoder then turns that column into one-hot vectors (for newer scikit-learn versions, see the sketch after the output below).

X now looks like this:

[[0.0000000e+00 0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05 4.7178410e+05]
 [1.0000000e+00 0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05 4.4389853e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05 4.0793454e+05]
 [0.0000000e+00 0.0000000e+00 1.0000000e+00 1.4437241e+05 1.1867185e+05 3.8319962e+05]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 1.4210734e+05 9.1391770e+04 3.6616842e+05]
 ...]

(first 5 of 50 rows shown; the array has 6 columns: 3 state dummies plus the 3 numeric features)

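On newer scikit-learn releases the categorical_features argument no longer exists, and OneHotEncoder accepts strings directly, so the LabelEncoder step is unnecessary. A minimal sketch of an equivalent encoding, starting again from the raw X loaded in step 1 and assuming scikit-learn >= 0.20:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the State column (index 3) directly from the strings.
encoder = OneHotEncoder()
state_dummies = encoder.fit_transform(X[:, 3].reshape(-1, 1)).toarray()

# Put the dummy columns first, followed by the numeric columns,
# matching the layout printed above (categories are sorted alphabetically,
# so the column order is the same as with LabelEncoder).
X = np.hstack([state_dummies, X[:, :3].astype(float)])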

This encoding creates the so-called dummy variable trap. State takes only 3 values, so two binary columns are in principle enough to distinguish them, yet the one-hot encoding uses the three patterns 100, 010, 001. Dropping the first bit leaves 00, 10, 01, which still distinguish the three states. More importantly, the three dummy columns always sum to 1, so keeping all of them makes the design matrix perfectly collinear with the model's intercept. To avoid the trap, we drop the first dummy column:

X = X[:, 1:]
print(X)

[[0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05 4.7178410e+05]
 [0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05 4.4389853e+05]
 [1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05 4.0793454e+05]
 [0.0000000e+00 1.0000000e+00 1.4437241e+05 1.1867185e+05 3.8319962e+05]
 [1.0000000e+00 0.0000000e+00 1.4210734e+05 9.1391770e+04 3.6616842e+05]
 ...]

(first 5 of 50 rows shown; the array now has 5 columns after dropping the first dummy)

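As a side note, newer scikit-learn (>= 0.21) can avoid the trap at encoding time instead of slicing X afterwards: OneHotEncoder(drop='first') drops one dummy per categorical feature. A minimal sketch, assuming the raw dataset loaded above; it produces the same X as the manual slice:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes the first dummy (California here), which is
# exactly the column that X[:, 1:] removed above.
encoder = OneHotEncoder(drop='first')
state_dummies = encoder.fit_transform(dataset.iloc[:, [3]].values).toarray()
X = np.hstack([state_dummies, dataset.iloc[:, :3].values.astype(float)])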

Splitting the dataset

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

2. Training the Multiple Linear Regression Model on the Training Set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)   # fit the model on the training set

3. Predicting on the Test Set

# predict profits for the held-out test set
Y_pred = regressor.predict(X_test)

In other words, multiple linear regression is handled almost exactly like simple linear regression; the extra step here is avoiding the dummy variable trap by dropping the first one-hot column, i.e. X = X[:, 1:].
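The one visible difference in the fitted model is that there are now several coefficients instead of one. A small inspection sketch (not in the original post), using scikit-learn's standard attributes:

# One weight per column of X (2 state dummies + R&D, Administration, Marketing)
# plus an intercept.
print(regressor.coef_)
print(regressor.intercept_)

# R^2 on the held-out test set as a quick quality check.
print(regressor.score(X_test, Y_test))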

4. Visualization

# Visualize the prediction error: points on the diagonal line are perfect predictions.
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.plot((Y_test.min(), Y_test.max()), (Y_test.min(), Y_test.max()), color='blue')
plt.scatter(Y_test, Y_pred, color='red')
plt.xlabel("Y_test")
plt.ylabel("Y_pred")
plt.show()
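To complement the plot with numbers, scikit-learn's metrics module gives the usual regression errors (a small addition, not part of the original post):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

print(mean_absolute_error(Y_test, Y_pred))          # average absolute error in profit units
print(np.sqrt(mean_squared_error(Y_test, Y_pred)))  # root mean squared error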

The link below explains the cause of the dummy variable trap in more detail; come back to it when needed.

Link: https://blog.csdn.net/ssswill/article/details/86151933

Reposted from: https://www.cnblogs.com/1113127139aaa/p/10272347.html
