Analysis and Prediction of Second-Hand Housing Prices in Beijing

Data Source

Housing Price of Beijing from 2011 to 2017 by Qichen Qiu
https://www.kaggle.com/ruiqurm/lianjia
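The main variables include Lng (longitude), Lat (latitude), tradeTime (transaction time), totalPrice (total price), price (price per square metre), square (floor area), livingRoom (number of living rooms), and constructionTime (construction time). See the link above for the full description.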

Data Preprocessing

1. Recoding categorical variables

Convert the numeric codes into text labels:

library(dplyr)

# Map the numeric codes of each categorical variable to labels and store them as factors.
# Codes not listed in a case_when() become NA.
df <- df %>% mutate(
  buildingType = case_when(
    buildingType == 1 ~ "Tower", buildingType == 2 ~ "Bungalow",
    buildingType == 3 ~ "Plate/Tower", buildingType == 4 ~ "Plate"
  ) %>% as.factor(),
  renovationCondition = case_when(
    renovationCondition == 1 ~ "Other", renovationCondition == 2 ~ "Rough",
    renovationCondition == 3 ~ "Simplicity", renovationCondition == 4 ~ "Hardcover"
  ) %>% as.factor(),
  buildingStructure = case_when(
    buildingStructure == 1 ~ "Unknown", buildingStructure == 2 ~ "Mixed",
    buildingStructure == 3 ~ "Brick/Wood", buildingStructure == 4 ~ "Brick/Concrete",
    buildingStructure == 5 ~ "Steel", buildingStructure == 6 ~ "Steel/Concrete"
  ) %>% as.factor(),
  elevator = case_when(elevator == 0 ~ "No_Elevator", elevator == 1 ~ "Has_Elevator") %>% as.factor(),
  fiveYearsProperty = case_when(
    fiveYearsProperty == 0 ~ "GreaterThan5Years", fiveYearsProperty == 1 ~ "LessThan5Years"
  ) %>% as.factor(),
  subway = case_when(subway == 0 ~ "No_subway", subway == 1 ~ "Near_subway") %>% as.factor(),
  # District labels are kept exactly as written in the source; "FaXing" and
  # "FangShang" are not standard spellings (the latter is presumably FangShan).
  district = case_when(
    district == 1 ~ "DongCheng", district == 2 ~ "FengTai", district == 3 ~ "DaXing",
    district == 4 ~ "FaXing", district == 5 ~ "FangShang", district == 6 ~ "ChangPing",
    district == 7 ~ "ChaoYang", district == 8 ~ "HaiDian", district == 9 ~ "ShiJingShan",
    district == 10 ~ "XiCheng", district == 11 ~ "TongZhou", district == 12 ~ "ShunYi",
    district == 13 ~ "MenTouGou"
  ) %>% as.factor()
)

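2. Removing unnecessary data

url (listing link), id (transaction id), and Cid (community id) have no bearing on price and can be dropped.
Remove records with totalPrice (total price) below 1 million yuan or price (unit price) below 10,000 yuan.
Remove records with ladderRatio (elevator-to-household ratio) >= 10.
Grouping the data by transaction year shows that there are very few records before 2011, so only records from 2011 onward are kept.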
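
The filters listed above could be sketched with dplyr roughly as follows. This is a minimal illustration on a made-up toy table, not the original script: the toy values are invented, and it assumes that totalPrice is recorded in units of 10,000 yuan (so 1 million yuan corresponds to 100) and that tradeTime can be parsed as a date.

```r
library(dplyr)

# Toy data standing in for the real table (all values are invented).
toy <- tibble(
  url = c("a", "b", "c", "d"),
  id = c("1", "2", "3", "4"),
  Cid = c(11, 12, 13, 14),
  tradeTime = as.Date(c("2009-05-01", "2012-03-10", "2015-07-21", "2016-01-02")),
  totalPrice = c(150, 80, 300, 420),      # assumed unit: 10,000 yuan
  price = c(20000, 9000, 45000, 60000),   # yuan per square metre
  ladderRatio = c(0.5, 0.3, 0.25, 12)
)

cleaned <- toy %>%
  select(-url, -id, -Cid) %>%                                # drop identifier columns
  filter(totalPrice >= 100, price >= 10000) %>%              # drop implausibly cheap records
  filter(ladderRatio < 10) %>%                               # drop extreme ladderRatio values
  filter(as.integer(format(tradeTime, "%Y")) >= 2011)        # keep 2011 and later

print(cleaned)
```

Keeping each rule in its own filter() call makes it easy to check how many rows each step removes.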

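3. Handling missing values

The dataset has 318,851 rows in total. Missing values are distributed as follows:

Apart from DOM (days on market), the missing values in the other columns are a very small share of the data, so the affected rows can simply be dropped.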
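
As an illustration of this step, the sketch below counts missing values per column and drops rows that have a missing value anywhere except DOM. It uses base R only and a small invented table; the column choice is an assumption made for the example.

```r
# Toy data standing in for the real table (all values are invented).
toy <- data.frame(
  DOM = c(NA, 12, NA, 30, 45),
  floor = c(3, NA, 10, 6, 2),
  price = c(30000, 42000, 51000, NA, 38000)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop rows with a missing value in any column other than DOM;
# missing DOM values are left in place to be imputed separately.
other_cols <- setdiff(names(toy), "DOM")
kept <- toy[complete.cases(toy[, other_cols, drop = FALSE]), ]
print(kept)
```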

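After sorting the data by tradeTime (transaction time), DOM values are small before row 220,000 and larger after it. Row 220,000 is therefore used as a split point, and each NA is replaced with the mean DOM of its own segment.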

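# df must already be sorted by tradeTime so that row positions follow time order.
dom <- df$DOM
dom_na <- which(is.na(dom))  # row positions of the missing DOM values
small_val <- mean(dom[1:220000], na.rm = TRUE)            # mean DOM of the earlier rows
large_val <- mean(dom[220001:length(dom)], na.rm = TRUE)  # mean DOM of the later rows
dom[dom_na[dom_na <= 220000]] <- small_val
dom[dom_na[dom_na > 220000]] <- large_val
df$DOM <- dom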

Data Visualization

1. Histogram of price per square metre

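library(ggplot2)

# Density-scaled histogram of price with a kernel density overlay.
ggplot(data = df, aes(x = price, y = after_stat(density))) +
  geom_histogram(alpha = .8) +
  geom_density(color = 'skyblue') +
  theme_light()

# The same plot for log(price).
ggplot(data = df, aes(x = log(price), y = after_stat(density))) +
  geom_histogram(alpha = .8) +
  geom_density(color = 'skyblue') +
  theme_light()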

Price per square metre is right-skewed; after a log transform it is approximately normal.

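2. Distributions of key features

Bungalow prices are far higher than those of the other building types; these listings may be siheyuan (traditional courtyard houses).
Second-hand homes near a subway have a higher unit price.
Second-hand homes with an elevator have a higher unit price.
Five-year ownership (fiveYearsProperty) appears to have little effect on unit price.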

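District has a strong effect on unit price: XiCheng is the highest and MenTouGou the lowest.
In XiCheng, DongCheng, and HaiDian, most listings are priced between 40,000 and 60,000 yuan per square metre, almost none fall below 20,000, and a sizable share exceeds 100,000. In the other districts most listings fall between 20,000 and 40,000, and apart from ChaoYang the share above 100,000 is very low.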
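
A comparison of this kind could be produced along the following lines. This is a hedged sketch on synthetic prices (the numbers are invented and do not reproduce the figures discussed here); the summary columns median_price and share_above_100k are names introduced for the example.

```r
library(dplyr)
library(ggplot2)

# Synthetic prices for three districts (invented values, for illustration only).
set.seed(42)
toy <- tibble(
  district = rep(c("XiCheng", "HaiDian", "MenTouGou"), each = 50),
  price = c(rlnorm(50, log(90000), 0.3),
            rlnorm(50, log(70000), 0.3),
            rlnorm(50, log(30000), 0.3))
)

# Median unit price and share of listings above 100,000 yuan, per district.
district_summary <- toy %>%
  group_by(district) %>%
  summarise(median_price = median(price),
            share_above_100k = mean(price > 100000)) %>%
  arrange(desc(median_price))
print(district_summary)

# Box plot of unit price by district, ordered by median price.
p <- ggplot(toy, aes(x = reorder(district, price, FUN = median), y = price)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "district", y = "price (yuan per square metre)") +
  theme_light()
print(p)
```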

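Model Building

1. Data split

Because transaction time affects the unit price, the training and test sets cannot be split at random: the model would then train on transactions from the same period as the test rows, and the measured test accuracy would be higher than what it achieves on genuinely later data.
The data are sorted by transaction time; the first 80% form the training set and the last 20% the test set.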
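
A chronological 80/20 split of this kind might look like the sketch below. It runs on a small synthetic table; the object names train and test match the code later in this post, but the data here are invented.

```r
# Synthetic table with distinct trade dates (invented values).
set.seed(1)
toy <- data.frame(
  tradeTime = as.Date("2011-01-01") + sample(0:2000, 10),
  price = round(runif(10, 20000, 80000))
)

# Sort by time, then cut at the 80% position instead of sampling rows at random.
toy <- toy[order(toy$tradeTime), ]
n_train <- floor(0.8 * nrow(toy))
train <- toy[seq_len(n_train), ]
test <- toy[(n_train + 1):nrow(toy), ]

# Every training transaction precedes every test transaction.
print(range(train$tradeTime))
print(range(test$tradeTime))
```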

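2. Linear regression

As noted above, unit price is not normally distributed, so log(price) is used as the response.
Because total price = area × unit price, totalPrice must be excluded when predicting unit price.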

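# Intercept-only model and full model (all predictors except totalPrice).
m.0 <- lm(log(price) ~ 1, data = train)
m.full <- lm(log(price) ~ . - totalPrice, data = train)
# Search range for stepwise selection: from the null model up to the full model.
scope <- list(lower = formula(m.0), upper = formula(m.full))
# Forward selection starts from the null model; backward elimination starts from the full model.
fs <- step(object = m.0, scope = scope, direction = "forward", trace = 0)
be <- step(object = m.full, scope = scope, direction = "backward", trace = 0)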

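Forward selection and backward elimination both retain every variable, so the full model is used for the predictions that follow.
Residual plot:

The residuals are roughly normally distributed, which suggests a good fit.
Training set: $R^{2}=0.8471$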

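# Predict log(price) on the test set, then back-transform with exp() to compute the MAE in yuan.
pred.full <- predict(m.full, test)
mean(abs(exp(pred.full) - test$price))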

$MAE=14594.79$
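
For reference, the quantity computed in the code above can be written out as follows. The symbols are introduced here for illustration: $\hat{y}_i$ is the predicted log(price) for test row $i$, $p_i$ is the observed price, and $n$ is the number of test rows.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\exp(\hat{y}_i) - p_i\right|
```

Because the predictions are exponentiated first, the error is measured in yuan per square metre rather than on the log scale, which makes it comparable with the tree-based models that predict price directly.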

3. Decision tree

A decision tree makes no normality assumption, so price can be modeled directly:

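library(rpart)

# price is numeric, so rpart grows a regression (anova) tree; maxdepth = 5 caps the tree depth.
# (The anova method has no splitting parameters, so parms is not expected to change the fit.)
tree <- rpart(price ~ . - totalPrice, data = train,
              parms = list(split = 'information'), control = list(maxdepth = 5))
pred.tree <- predict(tree, test)
mean(abs(pred.tree - test$price))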

$MAE=17774.07$

4. Bagging

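library(rpart)

B <- 15  # number of bootstrap resamples
tree_pred <- matrix(nrow = B, ncol = nrow(test))
for (i in 1:B) {
  # Draw a bootstrap sample of the training rows (with replacement).
  index <- sample(1:nrow(train), replace = TRUE)
  tree <- rpart(price ~ . - totalPrice, data = train[index, ],
                parms = list(split = 'information'), control = list(maxdepth = 5))
  tree_pred[i, ] <- predict(tree, test)
}
# Average the predictions of the B trees for each test row.
pred.bagging <- apply(tree_pred, 2, mean)
mean(abs(pred.bagging - test$price))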

$MAE=17802.33$

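Prediction Results

Linear regression achieves the lowest mean absolute error (MAE) on the test set. The full model's predictions for the last 97 rows of the test set are shown in the figure: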

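Directions for Improvement

Consider interaction effects between variables in the linear regression.
Because housing prices change over time, the data could be converted into a time series and modeled with ARIMA or similar methods.
Use machine learning algorithms such as random forests.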
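
As a rough illustration of the first suggestion, the sketch below fits a log-price model with and without an interaction term and compares the two with an F-test. The data are synthetic, and the choice of square and subway as the interacting pair is an assumption made purely for the example, not a finding from this analysis.

```r
# Synthetic data in which the effect of area differs by subway access (invented values).
set.seed(7)
n <- 200
toy <- data.frame(
  square = runif(n, 40, 150),
  subway = factor(sample(c("No_subway", "Near_subway"), n, replace = TRUE))
)
near <- as.numeric(toy$subway == "Near_subway")
toy$price <- exp(10.5 + 0.002 * toy$square + 0.2 * near +
                 0.001 * toy$square * near + rnorm(n, sd = 0.1))

# Main-effects model versus a model with a square-by-subway interaction.
m.main <- lm(log(price) ~ square + subway, data = toy)
m.inter <- lm(log(price) ~ square * subway, data = toy)

# Nested-model F-test: does the interaction term improve the fit?
print(anova(m.main, m.inter))
print(coef(m.inter))
```

In a formula, `a * b` expands to `a + b + a:b`, so the second model contains both main effects plus the interaction.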