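[Study notes] Data Science Bootcamp, Part 5: Machine Learning
1. Machine learning overview
1.1 Categories
Supervised learning
Unsupervised learning
Reinforcement learning
1.2 Basic terminology
1. Independent variable
The variable that corresponds to an input or a cause in an experiment or model
Also called the input variable or explanatory variable
2. Dependent variable
The variable that corresponds to a result or an effect in an experiment or model, similar to the value of a function
Also called the output variable
3. Learning model
A simplified, abstract conceptual model used to represent a real-world cause-and-effect relationship on a computer
Its main purpose is to express the relationship between the dependent and independent variables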
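2. Supervised learning
2.1 Definition
A method for developing a learning model that uses the independent variables of labeled data to predict the dependent variable (the correct answer). In other words, supervised learning is the machine learning task of inferring a function from labeled training data.
2.2 Characteristics
1. A correct answer exists for every prediction
2. The learning model can therefore be evaluated and optimized
2.3 Representative methods
1. Regression
Large amounts of statistical data are processed mathematically to determine how the dependent variable relates to certain independent variables, and a well-fitting regression equation (a function) is built
Depending on the form of that function, regression analysis is either linear or nonlinear.
2. Classification
Classification is a basic form of data analysis: data objects are divided into parts and types according to their characteristics, and analyzing those groups further reveals more about the nature of the data
Classification algorithms:
1. Logistic Regression
2. KNN (k-Nearest Neighbors)
3. Naïve Bayes
4. Decision Tree
2.4 Examples
1. Height/weight model: predict weight from height data
2. Archery model: adjust the shooting force according to the arrow's length and weight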
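3. Unsupervised learning
3.1 Definition
Solving pattern-recognition problems from training samples whose classes are unknown (unlabeled) is called unsupervised learning
3.2 Characteristics
1. No correct answer exists for the expected result
2. The learning model is hard to evaluate and optimize
3.3 Representative methods
1. Clustering
2. Dimensionality reduction
3.4 Example
When a child first starts to recognize objects, the parents hand over some apples and oranges without saying which is which, and let the child judge from the features of the two kinds of fruit. The child sorts the apples and the oranges into two separate groups. When given another apple later, the child puts it in the apple group rather than the orange group.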
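4. Regression analysis
Python libraries:
numpy -- array processing
statsmodels -- statistical models, used here to build regressions
pandas -- tabular data structures built on numpy
scipy -- a large collection of scientific computing routines
sklearn -- machine learning
matplotlib -- 2D plotting, works directly with numpy arrays
seaborn -- statistical plotting and plot styling on top of matplotlib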
4.1 Simple linear regression example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the data
data = pd.read_csv("1.01. Simple linear regression.csv")

# Describe the data
print(data.describe())
"""
               SAT        GPA
count    84.000000  84.000000
mean   1845.273810   3.330238
std     104.530661   0.271617
min    1634.000000   2.400000
25%    1772.000000   3.190000
50%    1846.000000   3.380000
75%    1934.000000   3.502500
max    2050.000000   3.810000
"""
# 1. Assume a linear relationship between the two variables
# y = b0 + b1*x1
y = data["GPA"]
x1 = data["SAT"]

# 1.1 Plot the data
plt.scatter(x1, y)
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)
plt.show()  # see Figure 1

# 1.2 Fit the model
x = sm.add_constant(x1)  # add the intercept column
results = sm.OLS(y, x).fit()
print(results.summary())  # see Figure 2

# 1.3 Plot the line of best fit (coefficients rounded from the summary)
plt.scatter(x1, y)
yhat = 0.0017 * x1 + 0.275
fig = plt.plot(x1, yhat, lw=4, c="orange", label="regression line")
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)
plt.show()  # see Figure 3
Figure 1
Figure 2
Figure 3
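4.2 Decomposition of variability
1. Sum of squares total (SST)
The sum of squared differences between the observed values of the dependent variable and their mean
2. Sum of squares regression (SSR)
The sum of squared differences between the predicted values and the mean of the dependent variable
If SSR = SST, the model fits the data perfectly
3. Sum of squares error (SSE)
The sum of squared differences between the observed values and the predicted values
SSE = SST - SSR
The smaller the error, the better the model's estimates
4. R^2
R^2 = variability explained by the regression / total variability = SSR/SST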
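The decomposition can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set (the `x` and `y` values are hypothetical, not taken from the course CSV) and a least-squares line fitted with the closed-form formulas for simple regression. It computes SST, SSR and SSE and lets you confirm that SST = SSR + SSE and R^2 = SSR/SST.

```python
# Hypothetical data, used only to illustrate the decomposition
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares estimates for y = b0 + b1*x
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares
sst = sum((yi - y_mean) ** 2 for yi in y)               # total
ssr = sum((yh - y_mean) ** 2 for yh in y_hat)           # explained by the regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (error)
r_squared = ssr / sst

print(f"SST={sst:.4f} SSR={ssr:.4f} SSE={sse:.4f} R^2={r_squared:.4f}")
```

The identity SST = SSR + SSE holds for a least-squares fit that includes an intercept, which is the case here.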
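4.3 Multiple linear regression
A house price, for example, depends not only on the size of the house but possibly also on its surroundings and other factors.
4.3.1 Example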
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Apply seaborn's default plot styling
import seaborn as sns
sns.set()

# Load the data
data = pd.read_csv("1.02. Multiple linear regression.csv")
# print(data.describe())
"""
               SAT        GPA  Rand 1,2,3
count    84.000000  84.000000   84.000000
mean   1845.273810   3.330238    2.059524
std     104.530661   0.271617    0.855192
min    1634.000000   2.400000    1.000000
25%    1772.000000   3.190000    1.000000
50%    1846.000000   3.380000    2.000000
75%    1934.000000   3.502500    3.000000
max    2050.000000   3.810000    3.000000
"""
# 1. Assume a multiple regression relationship
# y = b0 + b1*SAT + b2*(Rand 1,2,3)
y = data["GPA"]
x1 = data[["SAT", "Rand 1,2,3"]]
x = sm.add_constant(x1)  # add the intercept column
result = sm.OLS(y, x).fit()
print(result.summary())
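The p-value of Rand 1,2,3 (about 0.76) is far above 0.05, so we cannot reject the null hypothesis that its coefficient is zero. Adding the variable does not improve the model; it makes the analysis worse.
4.3.2 F-Test
H0: all β are simultaneously 0
H1: at least one β is not 0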
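For a regression with an intercept, the F-statistic of this test can be computed from R^2, the number of observations n and the number of predictors p as F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The following is a minimal sketch; the function name `f_statistic` is illustrative, and the R^2 value is the one reported for the SAT-only regression in section 5.2 of these notes.

```python
def f_statistic(r_squared: float, n: int, p: int) -> float:
    """Overall F-statistic of an OLS regression with an intercept."""
    explained = r_squared / p                      # explained variability per predictor
    unexplained = (1.0 - r_squared) / (n - p - 1)  # unexplained variability per residual degree of freedom
    return explained / unexplained

# R^2, n and p of the SAT-only model (84 observations, 1 predictor)
f_sat = f_statistic(0.40600391479679765, n=84, p=1)
print(round(f_sat, 2))
```

A large F-statistic, with a correspondingly small p-value, is evidence against H0. For the SAT-only model the result is about 56.05, which agrees with the first value of the `f_regression` output quoted in section 5.2.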
4.3.3 Example (adding a relevant variable)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Apply seaborn's default plot styling
import seaborn as sns
sns.set()

# Load the data
data = pd.read_csv("1.03. Dummies.csv")
# Encode Attendance as a dummy variable (Yes -> 1, No -> 0)
data = data.copy()
data["Attendance"] = data["Attendance"].map({"Yes": 1, "No": 0})
# print(data.describe())
"""
               SAT        GPA  Attendance
count    84.000000  84.000000   84.000000
mean   1845.273810   3.330238    0.464286
std     104.530661   0.271617    0.501718
min    1634.000000   2.400000    0.000000
25%    1772.000000   3.190000    0.000000
50%    1846.000000   3.380000    0.000000
75%    1934.000000   3.502500    1.000000
max    2050.000000   3.810000    1.000000
"""
# 1. Assume a multiple regression relationship
# y = b0 + b1*SAT + b2*Attendance
y = data["GPA"]
x1 = data[["SAT", "Attendance"]]
x = sm.add_constant(x1)  # add the intercept column
result = sm.OLS(y, x).fit()
print(result.summary())
The adjusted R-squared is higher than before.
GPA = 0.6439 + 0.0014*SAT + 0.2226*Dummy
When Attendance is 1: GPA = 0.6439 + 0.0014*SAT + 0.2226*1 = 0.8665 + 0.0014*SAT
When Attendance is 0: GPA = 0.6439 + 0.0014*SAT + 0.2226*0 = 0.6439 + 0.0014*SAT
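The two parallel lines can be written as one small function. This is a sketch using the rounded coefficients from the equation above; the function name `predict_gpa` is illustrative, and because the coefficients are rounded the results differ slightly from what `result.predict` returns.

```python
def predict_gpa(sat: float, attendance: int) -> float:
    """GPA predicted by GPA = 0.6439 + 0.0014*SAT + 0.2226*Attendance."""
    return 0.6439 + 0.0014 * sat + 0.2226 * attendance

# Same SAT score, with and without the attendance dummy switched on
low = predict_gpa(1700, 0)
high = predict_gpa(1700, 1)
print(round(low, 4), round(high, 4))
print(round(high - low, 4))  # the gap equals the dummy coefficient
```

Whatever the SAT score, switching the dummy from 0 to 1 shifts the prediction by the same amount, which is why the two regression lines are parallel.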
Plotting the data
# Plot the data
plt.scatter(data["SAT"], y, c=data["Attendance"], cmap="YlGnBu")  # color the points by attendance
yhat_no = 0.6439 + 0.0014*data["SAT"]
yhat_yes = 0.8665 + 0.0014*data["SAT"]
fig = plt.plot(data["SAT"], yhat_no, lw=2, c="red")
fig = plt.plot(data["SAT"], yhat_yes, lw=2, c="green")
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)
plt.show()
Adding the original regression line
# Add the original (SAT-only) regression line
plt.scatter(data["SAT"], y, c=data["Attendance"], cmap="YlGnBu")  # color the points by attendance
yhat_no = 0.6439 + 0.0014*data["SAT"]
yhat_yes = 0.8665 + 0.0014*data["SAT"]
yhat = 0.0017*data["SAT"] + 0.275
fig = plt.plot(data["SAT"], yhat_no, lw=2, c="red", label="regression line1")
fig = plt.plot(data["SAT"], yhat_yes, lw=2, c="green", label="regression line2")
fig = plt.plot(data["SAT"], yhat, lw=2, c="blue", label="regression line")
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)
plt.show()
Prediction
# 1. Assume a multiple regression relationship
# y = b0 + b1*SAT + b2*Attendance
y = data["GPA"]
x1 = data[["SAT", "Attendance"]]
x = sm.add_constant(x1)
result = sm.OLS(y, x).fit()
# print(x)  # x consists of three columns
"""
    const   SAT  Attendance
0     1.0  1714           0
1     1.0  1664           0
2     1.0  1760           0
3     1.0  1685           0
4     1.0  1693           0
..    ...   ...         ...
79    1.0  1936           1
80    1.0  1810           1
81    1.0  1987           0
82    1.0  1962           1
83    1.0  2050           1

[84 rows x 3 columns]
"""
# Consider two students:
# A: SAT score of 1700, attended no more than 75% of classes
# B: SAT score of 1670, attended more than 75% of classes
new_data = pd.DataFrame({"const": 1, "SAT": [1700, 1670], "Attendance": [0, 1]})
new_data = new_data[["const", "SAT", "Attendance"]]  # same column order as x
print(new_data)
"""
   const   SAT  Attendance
0      1  1700           0
1      1  1670           1
"""
new_data = new_data.rename(index={0: "A", 1: "B"})
print(new_data)
"""
   const   SAT  Attendance
A      1  1700           0
B      1  1670           1
"""
# Predict
predictions = result.predict(new_data)
print(predictions)
# The result contains two predictions
"""
A    3.023513
B    3.204163
dtype: float64
"""
# Convert to a data frame; new_data is already indexed by A and B, so the join keeps those labels
predictions_df = pd.DataFrame({"Predictions": predictions})
joined = new_data.join(predictions_df)
print(joined)
"""
   const   SAT  Attendance  Predictions
A      1  1700           0     3.023513
B      1  1670           1     3.204163
"""
# Conclusion:
# A is predicted to graduate with a GPA of about 3.023513
# B is predicted to graduate with a GPA of about 3.204163
5. Scikit-learn: linear regression
5.1 Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression
5.2 Basics
# # # 2. Load the data
# # data = pd.read_csv("1.01. Simple linear regression.csv")
# # # 3. Create the regression
# # ## 3.1 Declare the independent and dependent variables
# # x = data["SAT"]
# # x_matrix = x.values.reshape(-1,1)
# # y = data["GPA"]
# # ## 3.2 Create the regression object
# # reg = LinearRegression()
# # ## 3.3 Fit the regression
# # reg.fit(x_matrix,y)
# # # LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# #
# # # R-squared
# # print(reg.score(x_matrix,y)) #0.40600391479679765
# # # Intercept
# # print(reg.intercept_) #0.27504029966028076
# # # Coefficient
# # print(reg.coef_) #[0.00165569]
# #
# # # Prediction: pass in x to get y
# # print(reg.predict(np.array(1740).reshape(-1, 1))) #[3.15593751]
# #
# # new_data = pd.DataFrame(data=[1740,1760], columns=["SAT"])
# # print(reg.predict(new_data)) #[3.15593751 3.18905127]
# #
# # # Add the predictions to the data frame
# # new_data["Predicted_GPA"] = reg.predict(new_data)
# # print(new_data)
# # """
# # SAT Predicted_GPA
# # 0 1740 3.155938
# # 1 1760 3.189051
# # """
# #
# # # Plot the regression line
# # plt.scatter(x,y)
# # yhat = reg.coef_*x_matrix + reg.intercept_
# # fig = plt.plot(x, yhat, lw=4, c="red", label="regression line")
# # plt.xlabel("SAT", fontsize=20)
# # plt.ylabel("GPA", fontsize=20)
# # plt.show()
#
#
# # Multiple linear regression
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.set()
#
# from sklearn.linear_model import LinearRegression
#
# # Load the data
# data = pd.read_csv("1.02. Multiple linear regression.csv")
# print(data.head())
#
# """
# SAT GPA Rand 1,2,3
# 0 1714 2.40 1
# 1 1664 2.52 3
# 2 1760 2.54 3
# 3 1685 2.74 3
# 4 1693 2.83 2
# """
# print(data.describe())
# """
# SAT GPA Rand 1,2,3
# count 84.000000 84.000000 84.000000
# mean 1845.273810 3.330238 2.059524
# std 104.530661 0.271617 0.855192
# min 1634.000000 2.400000 1.000000
# 25% 1772.000000 3.190000 1.000000
# 50% 1846.000000 3.380000 2.000000
# 75% 1934.000000 3.502500 3.000000
# max 2050.000000 3.810000 3.000000
# """
#
# # Create the regression
# x = data[["SAT","Rand 1,2,3"]]
# y = data["GPA"]
# reg = LinearRegression()
# # Fit the regression
# reg.fit(x,y)
#
# # Coefficients of the two variables
# print(reg.coef_) #[ 0.00165354 -0.00826982]
#
# # Intercept
# print(reg.intercept_) #0.29603261264909486
#
# # R-squared
# r = reg.score(x,y)
# print(r) #0.4066811952814282
#
# # Adjusted R-squared
# # R2_adj = 1 - (1-R2)*(n-1)/(n-p-1)
# """
# R2: reg.score(x,y)
# # print(x.shape) (84, 2)
# n: number of observations
# p: number of predictors
# """
# r2 = reg.score(x,y)
# n = x.shape[0]
# p = x.shape[1]
# adj_r2 = 1- (1-r2)*(n-1)/(n-p-1)
# print(adj_r2) #0.39203134825134
#
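The adjusted R-squared calculation can be wrapped in a small helper. This is a minimal sketch (the function name `adjusted_r2` is illustrative), using the R-squared, n and p values from the commented code above:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Values from the SAT + Rand 1,2,3 regression: R^2, 84 observations, 2 predictors
adj = adjusted_r2(0.4066811952814282, n=84, p=2)
print(adj)
```

Adding a predictor never lowers the plain R-squared, while the adjusted value penalizes every extra predictor, which makes it the better figure for comparing models with different numbers of variables.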
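# # Detect unnecessary variables in the model (a variable whose p-value is above 0.05 can be ignored)
# """
# A separate regression is fitted for each variable, which gives each variable its own p-value
# """
# from sklearn.feature_selection import f_regression
#
# f = f_regression(x, y)
# print(f) #(array([56.04804786, 0.17558437]), array([7.19951844e-11, 6.76291372e-01]))
# # p-values
# p_values = f_regression(x, y)[1]
# print(p_values) #[7.19951844e-11 6.76291372e-01]
# # Round to three decimal places
# print(p_values.round(3)) #[0. 0.676]
# # Conclusion: SAT is a useful variable, Rand 1,2,3 is not
#
# # Arrange the results in a table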
# reg_summary = pd.DataFrame(data=x.columns.values, columns=["Features"])
# print(reg_summary)
# """
# Features
# 0 SAT
# 1 Rand 1,2,3
# """
# reg_summary["Coefficients"] = reg.coef_
# reg_summary["p-values"] = p_values.round(3)
# print(reg_summary)
# """
# Features Coefficients p-values
# 0 SAT 0.001654 0.000
# 1 Rand 1,2,3 -0.008270 0.676
# """
#
# # Feature scaling (standardization)
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# result = scaler.fit(x)
# x_scaled = scaler.transform(x)
# print(x_scaled)
#
# # Fit the regression on the scaled features
# reg = LinearRegression()
# reg.fit(x_scaled,y)
# print(reg.coef_)
# print(reg.intercept_)
#
# # Put the results in a data frame
# reg_summary = pd.DataFrame([["Intercept"],["SAT"],["Rand 1,2,3"]], columns=["Features"])
# reg_summary["Weights"] = reg.intercept_, reg.coef_[0],reg.coef_[1]
# print(reg_summary)
# """
# Weights: the closer a weight is to 0, the smaller the variable's influence
# Features Weights
# 0 Intercept 3.330238
# 1 SAT 0.171814
# 2 Rand 1,2,3 -0.007030
# """
# # Predict values with the standardized model
# new_data = pd.DataFrame(data=[[1700,2],[1800,1]],columns=["SAT","Rand 1,2,3"])
# print(new_data)
# """
# SAT Rand 1,2,3
# 0 1700 2
# 1 1800 1
# """
# result = reg.predict(new_data)
# print(result) #[295.39979563 312.58821497], these predictions are far off because the inputs were not scaled
# # The data to predict on must be standardized as well
# new_data_scaled = scaler.transform(new_data)
# print(new_data_scaled)
#
# """
# [[-1.39811928 -0.07002087]
# [-0.43571643 -1.24637147]]
# """
# # Predict again with the standardized data
# result = reg.predict(new_data_scaled)
# print(result.round(2)) #[3.09 3.26]
#
# # We concluded earlier that Rand 1,2,3 has almost no effect on the prediction, so removing it leaves the rounded predictions unchanged
# reg_simple = LinearRegression()
# x_simple_matrix = x_scaled[:,0].reshape(-1,1)
# reg_simple.fit(x_simple_matrix,y)
# result = reg_simple.predict(new_data_scaled[:,0].reshape(-1,1))
# print(result.round(2)) #[3.09 3.26]
#
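The reason new data must pass through the same scaler can be seen in a few lines of plain Python. This is a minimal sketch, assuming hypothetical training values (not the course CSV) and a single feature; like `StandardScaler`, it uses the population standard deviation, and the helper names `scale` and `predict` are illustrative.

```python
from statistics import mean, pstdev

# Hypothetical training data, used only for illustration
sat_train = [1600, 1700, 1800, 1900, 2000]
gpa_train = [2.9, 3.1, 3.3, 3.4, 3.7]

# "Fit the scaler": remember the training mean and population standard deviation
mu = mean(sat_train)
sigma = pstdev(sat_train)

def scale(sat):
    """Standardize a SAT score with the training statistics."""
    return (sat - mu) / sigma

# Least-squares fit on the standardized feature
z_train = [scale(s) for s in sat_train]
intercept = mean(gpa_train)  # z has mean 0, so the intercept is the mean of y
slope = sum(z * (g - intercept) for z, g in zip(z_train, gpa_train)) / sum(z * z for z in z_train)

def predict(z):
    """Predict GPA from a standardized SAT score."""
    return intercept + slope * z

wrong = predict(1700)         # raw SAT fed into a model trained on scaled data
right = predict(scale(1700))  # new data scaled with the training statistics
print(round(wrong, 2), round(right, 2))
```

The raw input produces a value in the hundreds, the same kind of error as the `[295.39979563 312.58821497]` output above, while the scaled input gives a plausible GPA.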
#
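# # Common problems:
# # 1. Overfitting: the hypothesis is made overly strict in order to agree with the training data, so the model generalizes poorly
# # To detect it, split the data into two parts: training and testing
#
# # 2. Underfitting: the model fails to capture the underlying logic of the data
#
# # How to split the data into training and testing sets
# import numpy as np
# from sklearn.model_selection import train_test_split
# # Create two arrays
# a = np.arange(1,101)
# b = np.arange(501,601)
#
# # Split the data
# # a_train, a_test = train_test_split(a,test_size=0.2)
# a_train, a_test, b_train, b_test = train_test_split(a, b,test_size=0.2,random_state=42)
# # a_train, a_test = train_test_split(a,test_size=0.2,shuffle=False)
# # test_size: fraction held out for testing (0.2 gives an 80/20 split); shuffle: whether the data is shuffled before splitting; random_state: fixes the seed so every run produces the same split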
# print(a_test)
# print(b_test)
#
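What the three parameters do can be illustrated without scikit-learn. The following is a simplified sketch in plain Python, not scikit-learn's implementation: `split_indices` is a hypothetical helper, and the elements it selects will differ from what `train_test_split` returns for the same seed.

```python
import random

def split_indices(n, test_size=0.2, seed=42, shuffle=True):
    """Return (train_idx, test_idx) for n samples."""
    idx = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(idx)  # a fixed seed gives the same order on every run
    n_test = round(n * test_size)
    return idx[:n - n_test], idx[n - n_test:]

a = list(range(1, 101))
b = list(range(501, 601))

# Use the same indices for both arrays so that paired elements stay together
train_idx, test_idx = split_indices(len(a))
a_train = [a[i] for i in train_idx]
a_test = [a[i] for i in test_idx]
b_train = [b[i] for i in train_idx]
b_test = [b[i] for i in test_idx]

print(len(a_train), len(a_test))
print(a_test[:5], b_test[:5])
```

Splitting inputs and targets with one shared set of indices is the important detail: every element of `b_test` still corresponds to the element at the same position in `a_test`.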
5.3 Example: car sales, predicting the price of a used car from its specifications
#
# # 1. Import the libraries
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns
# sns.set()
#
# from sklearn.linear_model import LinearRegression
# # 2. Load the data
# raw_data = pd.read_csv("1.04. Real-life example.csv")
# print(raw_data.head())
# """
# Brand Price Body ... Registration Year Model
# 0 BMW 4200.0 sedan ... yes 1991 320
# 1 Mercedes-Benz 7900.0 van ... yes 1999 Sprinter 212
# 2 Mercedes-Benz 13300.0 sedan ... yes 2003 S 500
# 3 Audi 23000.0 crossover ... yes 2007 Q7
# 4 Toyota 18300.0 crossover ... yes 2011 Rav 4
#
# [5 rows x 9 columns]
# """
# # 3. Clean the data
# ## 3.1 Explore the data
# print(raw_data.describe(include="all"))
# """
# Brand Price Body ... Registration Year Model
# count 4345 4173.000000 4345 ... 4345 4345.000000 4345
# unique 7 NaN 6 ... 2 NaN 312
# top Volkswagen NaN sedan ... yes NaN E-Class
# freq 936 NaN 1649 ... 3947 NaN 199
# mean NaN 19418.746935 NaN ... NaN 2006.550058 NaN
# std NaN 25584.242620 NaN ... NaN 6.719097 NaN
# min NaN 600.000000 NaN ... NaN 1969.000000 NaN
# 25% NaN 6999.000000 NaN ... NaN 2003.000000 NaN
# 50% NaN 11500.000000 NaN ... NaN 2008.000000 NaN
# 75% NaN 21700.000000 NaN ... NaN 2012.000000 NaN
# max NaN 300000.000000 NaN ... NaN 2016.000000 NaN
#
# [11 rows x 9 columns]
# """
# ## 3.2 Drop variables
# data = raw_data.drop(["Model"], axis=1)
# print(data.describe(include="all"))
# """
# Brand Price Body ... Engine Type Registration Year
# count 4345 4173.000000 4345 ... 4345 4345 4345.000000
# unique 7 NaN 6 ... 4 2 NaN
# top Volkswagen NaN sedan ... Diesel yes NaN
# freq 936 NaN 1649 ... 2019 3947 NaN
# mean NaN 19418.746935 NaN ... NaN NaN 2006.550058
# std NaN 25584.242620 NaN ... NaN NaN 6.719097
# min NaN 600.000000 NaN ... NaN NaN 1969.000000
# 25% NaN 6999.000000 NaN ... NaN NaN 2003.000000
# 50% NaN 11500.000000 NaN ... NaN NaN 2008.000000
# 75% NaN 21700.000000 NaN ... NaN NaN 2012.000000
# max NaN 300000.000000 NaN ... NaN NaN 2016.000000
#
# [11 rows x 8 columns]
# """
# ## 3.2 Preprocess the data
# ### 3.2.1 Check for missing values
# print(data.isnull())
# """
# Brand Price Body Mileage EngineV Engine Type Registration Year
# 0 False False False False False False False False
# 1 False False False False False False False False
# 2 False False False False False False False False
# 3 False False False False False False False False
# 4 False False False False False False False False
# ... ... ... ... ... ... ... ... ...
# 4340 False False False False False False False False
# 4341 False False False False False False False False
# 4342 False False False False False False False False
# 4343 False False False False True False False False
# 4344 False False False False False False False False
#
# [4345 rows x 8 columns]
# """
# ### 3.2.2 Count the missing values
# print(data.isnull().sum())
# """
# Brand 0
# Price 172
# Body 0
# Mileage 0
# EngineV 150
# Engine Type 0
# Registration 0
# Year 0
# dtype: int64
# """
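# ### 3.2.3 Drop the observations that contain missing values
# # Rule of thumb:
# """
# 1. If the observations to be dropped make up less than 5% of the data, they can simply be removed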
# """
# data_no_mv = data.dropna(axis=0)
# print(data_no_mv.describe(include="all"))
# """
# Brand Price Body ... Engine Type Registration Year
# count 4025 4025.000000 4025 ... 4025 4025 4025.000000
# unique 7 NaN 6 ... 4 2 NaN
# top Volkswagen NaN sedan ... Diesel yes NaN
# freq 880 NaN 1534 ... 1861 3654 NaN
# mean NaN 19552.308065 NaN ... NaN NaN 2006.379627
# std NaN 25815.734988 NaN ... NaN NaN 6.695595
# min NaN 600.000000 NaN ... NaN NaN 1969.000000
# 25% NaN 6999.000000 NaN ... NaN NaN 2003.000000
# 50% NaN 11500.000000 NaN ... NaN NaN 2007.000000
# 75% NaN 21900.000000 NaN ... NaN NaN 2012.000000
# max NaN 300000.000000 NaN ... NaN NaN 2016.000000
#
# [11 rows x 8 columns]
#
# """
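# ### 3.2.4 Explore the probability distribution function (PDF)
# # sns.displot(data_no_mv["Price"])
# # plt.show() # we would hope for a normal distribution, but that is not what we get
#
# ## 3.2.5 Remove outliers
# # Outlier: an observation that lies abnormally far from the other observations and distorts the regression
# # Rule of thumb: drop the top 1% of observations (by percentile)
# # i.e. keep only the data below the 99th percentile
# q = data_no_mv["Price"].quantile(0.99)
# print(q) #129812.51999999981
# # Keep only the observations below the 99th percentile
# data1 = data_no_mv[data_no_mv["Price"]<q]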
# print(data1.describe())
# """
# Price Mileage EngineV Year
# count 3984.000000 3984.000000 3984.000000 3984.000000
# mean 17837.117460 165.116466 2.743770 2006.292922
# std 18976.268315 102.766126 4.956057 6.672745
# min 600.000000 0.000000 0.600000 1969.000000
# 25% 6980.000000 93.000000 1.800000 2002.750000
# 50% 11400.000000 160.000000 2.200000 2007.000000
# 75% 21000.000000 230.000000 3.000000 2011.000000
# max 129222.000000 980.000000 99.990000 2016.000000
# """
#
# ### 3.2.6 Check the probability distribution function again
# # sns.displot(data1["Price"])
#
# # Mileage is handled the same way
# q = data1["Mileage"].quantile(0.99)
# data2 = data1[data1["Mileage"]<q]
# # sns.displot(data2["Mileage"])
# # plt.show()
#
# # Engine volume
# EngV = pd.DataFrame(raw_data["EngineV"])
# EngV = EngV.dropna(axis=0)
# EngV.sort_values(by="EngineV")
# # print(EngV)
# # Flag outliers: a lookup shows that real-world engine volumes are at most 6.5
# data3 = data2[data2["EngineV"]<6.5]
# # sns.displot(data3["EngineV"])
#
# # Year: here the outliers are at the low end, so drop the bottom 1%
# q = data3["Year"].quantile(0.01)
# data4 = data3[data3["Year"]>q]
# # sns.displot(data4["Year"])
#
# # Final data
# data_cleaned = data4.reset_index(drop=True)
# print(data_cleaned.describe(include="all"))
# """
# Brand Price Body ... Engine Type Registration Year
# count 3867 3867.000000 3867 ... 3867 3867 3867.000000
# unique 7 NaN 6 ... 4 2 NaN
# top Volkswagen NaN sedan ... Diesel yes NaN
# freq 848 NaN 1467 ... 1807 3505 NaN
# mean NaN 18194.455679 NaN ... NaN NaN 2006.709853
# std NaN 19085.855165 NaN ... NaN NaN 6.103870
# min NaN 800.000000 NaN ... NaN NaN 1988.000000
# 25% NaN 7200.000000 NaN ... NaN NaN 2003.000000
# 50% NaN 11700.000000 NaN ... NaN NaN 2008.000000
# 75% NaN 21700.000000 NaN ... NaN NaN 2012.000000
# max NaN 129222.000000 NaN ... NaN NaN 2016.000000
#
# [11 rows x 8 columns]
#
# """
# ## 3.3 Check the OLS assumptions
# # Check linearity with scatter plots
# # f, (ax1,ax2,ax3) = plt.subplots(1,3,sharey=True, figsize=(15,3))
# # ax1.scatter(data_cleaned["Year"], data_cleaned["Price"])
# # ax1.set_title("Price&Year")
# # ax2.scatter(data_cleaned["EngineV"], data_cleaned["Price"])
# # ax2.set_title("Price&EngineV")
# # ax3.scatter(data_cleaned["Mileage"], data_cleaned["Price"])
# # ax3.set_title("Price&Mileage")
# # sns.displot(data_cleaned["Price"]) # Price is not normally distributed; it looks exponential
# # plt.show()
#
# # Take the logarithm of the price and plot again
# log_price = np.log(data_cleaned["Price"])
# data_cleaned["log_price"] = log_price
# print(data_cleaned.head())
# """
# Brand Price Body ... Registration Year log_price
# 0 BMW 4200.0 sedan ... yes 1991 8.342840
# 1 Mercedes-Benz 7900.0 van ... yes 1999 8.974618
# 2 Mercedes-Benz 13300.0 sedan ... yes 2003 9.495519
# 3 Audi 23000.0 crossover ... yes 2007 10.043249
# 4 Toyota 18300.0 crossover ... yes 2011 9.814656
#
# [5 rows x 9 columns]
# """
# # f, (ax1,ax2,ax3) = plt.subplots(1,3,sharey=True, figsize=(15,3))
# # ax1.scatter(data_cleaned["Year"], data_cleaned["log_price"])
# # ax1.set_title("log_price&Year")
# # ax2.scatter(data_cleaned["EngineV"], data_cleaned["log_price"])
# # ax2.set_title("log_price&EngineV")
# # ax3.scatter(data_cleaned["Mileage"], data_cleaned["log_price"])
# # ax3.set_title("log_price&Mileage")
# # sns.displot(data_cleaned["log_price"])
# # plt.show()
# # Drop the Price column
# data_cleaned = data_cleaned.drop(["Price"],axis=1)
#
# # 3.4 The data may exhibit multicollinearity (VIF: variance inflation factor)
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# variables = data_cleaned[["Mileage","EngineV", "Year"]]
# vif = pd.DataFrame()
# vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
# vif["features"] = variables.columns
# print(vif)
#
# """
# VIF features
# 0 3.791584 Mileage
# 1 7.662068 EngineV
# 2 10.354854 Year
# VIF = 1: no multicollinearity
# 1 < VIF < 5: perfectly acceptable
# VIF > 10: unacceptable
# """
# data_no_multicollinearity = data_cleaned.drop(["Year"],axis=1)
#
# # Add dummy variables
# data_with_dummies = pd.get_dummies(data_no_multicollinearity,drop_first=True)
# print(data_with_dummies.columns)
# # Get all column names
# print(data_with_dummies.columns.values)
# cols = ['log_price', 'Mileage', 'EngineV', 'Brand_BMW', 'Brand_Mercedes-Benz',
# 'Brand_Mitsubishi', 'Brand_Renault', 'Brand_Toyota', 'Brand_Volkswagen',
# 'Body_hatch', 'Body_other', 'Body_sedan', 'Body_vagon', 'Body_van',
# 'Engine Type_Gas', 'Engine Type_Other', 'Engine Type_Petrol',
# 'Registration_yes']
# data_preprocessed = data_with_dummies[cols]
# print(data_preprocessed.head())
#
# # 3.5 Create the regression
# targets = data_preprocessed["log_price"]
# inputs = data_preprocessed.drop(["log_price"],axis=1)
# ## 3.5.1 Scale the data
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# scaler.fit(inputs)
# inputs_scaled = scaler.transform(inputs)
# ## 3.5.2 Split the data
# from sklearn.model_selection import train_test_split
# x_train,x_test, y_train,y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=365)
# ## 3.5.3 Fit the regression
# reg = LinearRegression()
# reg.fit(x_train,y_train)
# # Check the fit on the training data
# y_hat = reg.predict(x_train)
# plt.scatter(y_train, y_hat)
# plt.xlabel("target y_train",size=18)
# plt.ylabel("predict y_hat",size=18)
# plt.xlim(6,13)
# plt.ylim(6,13)
#
# # Residual check: the residuals should be normally distributed
# sns.displot(y_train-y_hat)
# plt.title("PDF", size=18)
# # plt.show()
#
# # R-squared
# r2 = reg.score(x_train,y_train)
# print(r2) #0.744996578792662
#
# # Intercept
# print(reg.intercept_) #9.415239458021299
#
# # Coefficients
# print(reg.coef_)
# # [-0.44871341 0.20903483 0.0142496 0.01288174 -0.14055166 -0.17990912
# # -0.06054988 -0.08992433 -0.1454692 -0.10144383 -0.20062984 -0.12988747
# # -0.16859669 -0.12149035 -0.03336798 -0.14690868 0.32047333]
#
# reg_summary = pd.DataFrame(inputs.columns.values,columns=["Features"])
# reg_summary["Weights"] = reg.coef_
# print(reg_summary)
# """
# Features Weights
# 0 Mileage -0.448713
# 1 EngineV 0.209035
# 2 Brand_BMW 0.014250
# 3 Brand_Mercedes-Benz 0.012882
# 4 Brand_Mitsubishi -0.140552
# 5 Brand_Renault -0.179909
# 6 Brand_Toyota -0.060550
# 7 Brand_Volkswagen -0.089924
# 8 Body_hatch -0.145469
# 9 Body_other -0.101444
# 10 Body_sedan -0.200630
# 11 Body_vagon -0.129887
# 12 Body_van -0.168597
# 13 Engine Type_Gas -0.121490
# 14 Engine Type_Other -0.033368
# 15 Engine Type_Petrol -0.146909
# 16 Registration_yes 0.320473
# """
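# # The category whose dummy variables are all 0 is the baseline (benchmark)
# print(data_cleaned["Brand"].unique())
# # ['BMW' 'Mercedes-Benz' 'Audi' 'Toyota' 'Renault' 'Volkswagen' 'Mitsubishi']
# # Audi does not appear in the weights table, so it is the baseline. Brand_BMW = 0.014250, so a BMW is more expensive than an Audi; Brand_Renault = -0.179909, so a Renault is cheaper than an Audi
#
# # Testing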
# y_hat_test= reg.predict(x_test)
# plt.scatter(y_test, y_hat_test, alpha=0.2)
# plt.xlabel("target y_test",size=18)
# plt.ylabel("predict y_hat_test",size=18)
# plt.xlim(6,13)
# plt.ylim(6,13)
# # plt.show()
#
# df_pf = pd.DataFrame(np.exp(y_hat_test), columns=["Prediction"])
# y_test = y_test.reset_index(drop=True)
# df_pf["Target"] = np.exp(y_test)
# df_pf["Residual"] = df_pf["Target"] - df_pf["Prediction"]
# df_pf["Difference%"] = np.absolute(df_pf["Residual"]/df_pf["Target"]*100)
# # print(df_pf)
# # print(df_pf.describe())
# # pd.options.display.max_rows=100
# new_df = df_pf.sort_values(by=["Difference%"])
# print(new_df.head(100))
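Because the model was trained on log prices, its predictions have to be exponentiated before they can be compared with real prices. The last step above can be sketched in plain Python as follows; the log values are hypothetical stand-ins for `y_hat_test` and `y_test`, not actual model output.

```python
import math

# Hypothetical log-price predictions and targets, used only for illustration
log_pred = [9.2, 9.9, 8.7]
log_target = [9.3, 9.8, 8.7]

rows = []
for lp, lt in zip(log_pred, log_target):
    prediction = math.exp(lp)   # back from log price to price
    target = math.exp(lt)
    residual = target - prediction
    diff_pct = abs(residual / target * 100)
    rows.append((prediction, target, residual, diff_pct))

rows.sort(key=lambda r: r[3])  # smallest percentage error first
for prediction, target, residual, diff_pct in rows:
    print(f"{prediction:10.2f} {target:10.2f} {residual:10.2f} {diff_pct:6.2f}%")
```

An error of d on the log scale corresponds to a factor of exp(d) in price, so an error of 0.1 in log price is already a difference of roughly 10% in the price itself.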
6. Logistic regression (classification)
6.1 Basics
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.cluster import KMeans

# This first example groups countries by location with K-means clustering
# 1. Load the data
data = pd.read_csv("3.01. Country clusters.csv")
print(data)

# 2. Plot the data
plt.scatter(data["Longitude"], data["Latitude"])
plt.xlim(-180, 180)
plt.ylim(-90, 90)
# plt.show()

# 3. Cluster
# Slice out the Latitude and Longitude columns
x = data.iloc[:, 1:3]
print(x)
kmeans = KMeans(3)  # 3 is the number of clusters to produce
res = kmeans.fit(x)
# Cluster assignments
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)  # [0 0 2 2 2 1]
data_with_clusters = data.copy()
data_with_clusters["Cluster"] = identified_clusters
print(data_with_clusters)
"""
     Country  Latitude  Longitude Language  Cluster
0        USA     44.97    -103.77  English        0
1     Canada     62.40     -96.80  English        0
2     France     46.75       2.40   French        2
3         UK     54.01      -2.53  English        2
4    Germany     51.15      10.40   German        2
5  Australia    -25.45     133.11  English        1
"""
# Draw the scatter plot, colored by cluster
plt.scatter(data["Longitude"], data["Latitude"], c=data_with_clusters["Cluster"], cmap="rainbow")
plt.xlim(-180, 180)
plt.ylim(-90, 90)
# plt.show()
6.2 Example
# 1. Load the data
raw_data = pd.read_csv("2.01. Admittance.csv")
data = raw_data.copy()
data["Admitted"] = data["Admitted"].map({"Yes": 1, "No": 0})
print(data)
"""
      SAT  Admitted
0    1363         0
1    1792         1
2    1954         1
3    1653         0
4    1593         0
..    ...       ...
163  1722         1
164  1750         1
165  1555         0
166  1524         0
167  1461         0

[168 rows x 2 columns]
"""
# Define the dependent variable
y = data["Admitted"]
# Define the independent variable
x1 = data["SAT"]

# 2. Regression
x = sm.add_constant(x1)  # add the intercept column
reg_log = sm.Logit(y, x)
result_log = reg_log.fit()
print(result_log)
"""
Optimization terminated successfully.
         Current function value: 0.137766
         Iterations 10
<statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x000001E00A4ADBE0>
"""

# Summary
print(result_log.summary())
"""
                           Logit Regression Results
==============================================================================
Dep. Variable:               Admitted   No. Observations:                  168
Model:                          Logit   Df Residuals:                      166
Method:                           MLE   Df Model:                            1
Date:                Mon, 10 May 2021   Pseudo R-squ.:                  0.7992
Time:                        16:16:02   Log-Likelihood:                -23.145
converged:                       True   LL-Null:                       -115.26
Covariance Type:            nonrobust   LLR p-value:                 5.805e-42
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -69.9128     15.737     -4.443      0.000    -100.756     -39.070
SAT            0.0420      0.009      4.454      0.000       0.024       0.060
==============================================================================

Possibly complete quasi-separation: A fraction 0.27 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
"""
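The fitted coefficients turn a SAT score into an admission probability through the logistic function, P = 1 / (1 + exp(-(b0 + b1*SAT))). The following is a minimal sketch using the rounded `const` and `SAT` coefficients from the summary above; the function name `admission_probability` is illustrative, and because the coefficients are rounded the probabilities are approximate.

```python
import math

B0 = -69.9128  # const, from the summary above
B1 = 0.0420    # SAT coefficient, from the summary above

def admission_probability(sat: float) -> float:
    """P(Admitted = 1 | SAT) under the fitted logit model."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * sat)))

# SAT score at which the predicted probability is exactly 0.5
boundary = -B0 / B1

for sat in (1363, 1653, 1792):  # SAT scores taken from the printed data
    print(sat, round(admission_probability(sat), 4))
print(round(boundary, 1))
```

Scores well below the boundary give probabilities close to 0 and scores well above it give probabilities close to 1, which is how the regression acts as a classifier.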