Data Mining Principles and Algorithms: An Analysis of the Factors Influencing Forest Fires
I. Introduction
Forest Fire Area
Prediction of the burnt area by forest fires
Overview
The dataset contains 517 fires from the Montesinho natural park in Portugal. For each incident, the weekday, month, coordinates, and burnt area are recorded, along with meteorological measurements such as rain, temperature, humidity, and wind. The workflow reads the data and trains a regression model on the spatial, temporal, and weather variables.
II. Resources
Forest Fires Data Set
Forest Fires Data Set----predict the burned area of forest fires using meteorological and other data
Principles and Applications of the Canadian Forest Fire Weather Index (FWI) System
III. Code
1. Read the data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')
fires = pd.read_csv('forestfires.csv')
2. Data cleaning
2.1 Preprocess the data by encoding month and weekday as numbers
fires = fires.reset_index()
mapping_month = {'jan':1,'feb':2,'mar':3,'apr':4,'may':5,'jun':6,'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12,}
fires['month'] = fires['month'].map(mapping_month)
mapping_day = {'mon':1,'tue':2,'wed':3,'thu':4,'fri':5,'sat':6,'sun':0}
fires['day'] = fires['day'].map(mapping_day)
2.2 Inspect the summary statistics
fires.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
index | 517.0 | 258.000000 | 149.389312 | 0.0 | 129.0 | 258.00 | 387.00 | 516.00 |
X | 517.0 | 4.669246 | 2.313778 | 1.0 | 3.0 | 4.00 | 7.00 | 9.00 |
Y | 517.0 | 4.299807 | 1.229900 | 2.0 | 4.0 | 4.00 | 5.00 | 9.00 |
month | 517.0 | 7.475822 | 2.275990 | 1.0 | 7.0 | 8.00 | 9.00 | 12.00 |
day | 517.0 | 2.972921 | 2.143867 | 0.0 | 1.0 | 3.00 | 5.00 | 6.00 |
FFMC | 517.0 | 90.644681 | 5.520111 | 18.7 | 90.2 | 91.60 | 92.90 | 96.20 |
DMC | 517.0 | 110.872340 | 64.046482 | 1.1 | 68.6 | 108.30 | 142.40 | 291.30 |
DC | 517.0 | 547.940039 | 248.066192 | 7.9 | 437.7 | 664.20 | 713.90 | 860.60 |
ISI | 517.0 | 9.021663 | 4.559477 | 0.0 | 6.5 | 8.40 | 10.80 | 56.10 |
temp | 517.0 | 18.889168 | 5.806625 | 2.2 | 15.5 | 19.30 | 22.80 | 33.30 |
RH | 517.0 | 44.288201 | 16.317469 | 15.0 | 33.0 | 42.00 | 53.00 | 100.00 |
wind | 517.0 | 4.017602 | 1.791653 | 0.4 | 2.7 | 4.00 | 4.90 | 9.40 |
rain | 517.0 | 0.021663 | 0.295959 | 0.0 | 0.0 | 0.00 | 0.00 | 6.40 |
area | 517.0 | 12.847292 | 63.655818 | 0.0 | 0.0 | 0.52 | 6.57 | 1090.84 |
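2.3 Bin the burnt area into severity classes
The summary statistics show that the mean burnt area is 12.847292, that 99.613% of the records are below 279, 75% are below 6.57, 50% are below 0.52, and 47% are below 0.09.
It is therefore reasonable to treat a burnt area between 0.09 and 6.57 as a small fire, between 6.57 and 279 as a medium fire, and above 279 as a large fire; areas of 0.09 or less are assigned class 0.
# Use .loc instead of chained indexing so the assignment always writes to fires itself.
fires.loc[fires['area'] <= 0.09, 'area'] = 0
fires.loc[(fires['area'] > 0.09) & (fires['area'] <= 6.57), 'area'] = 1
fires.loc[(fires['area'] > 6.57) & (fires['area'] <= 279), 'area'] = 2
fires.loc[fires['area'] > 279, 'area'] = 3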
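The same binning can also be written in a single step with `pandas.cut`. The following is only a minimal sketch on made-up values: the `demo` frame and the `severity` column are illustrative names that do not appear in the original notebook, and the bin edges are the thresholds chosen above.

```python
import numpy as np
import pandas as pd

# Illustrative burnt areas, not taken from forestfires.csv.
demo = pd.DataFrame({'area': [0.0, 0.09, 0.52, 6.57, 12.8, 279.0, 1090.84]})

# Right-closed bins: (-inf, 0.09], (0.09, 6.57], (6.57, 279], (279, inf)
edges = [-np.inf, 0.09, 6.57, 279, np.inf]

# labels=False returns the integer position of the bin each value falls into.
demo['severity'] = pd.cut(demo['area'], bins=edges, labels=False)

print(demo)
```

Because the original values are left in `area` and the classes go into a separate column, the thresholds can be changed and the binning rerun without reloading the file.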
2.4 Inspect the correlations between features
attributes = ['month','day','FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH','wind','rain']
corr = fires[attributes].corr()
corr
| | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain |
---|---|---|---|---|---|---|---|---|---|---|
month | 1.000000 | -0.037469 | 0.291477 | 0.466645 | 0.868698 | 0.186597 | 0.368842 | -0.095280 | -0.086368 | 0.013438 |
day | -0.037469 | 1.000000 | 0.073597 | 0.028697 | 0.001913 | 0.035926 | 0.032233 | -0.083318 | -0.004013 | -0.024119 |
FFMC | 0.291477 | 0.073597 | 1.000000 | 0.382619 | 0.330512 | 0.531805 | 0.431532 | -0.300995 | -0.028485 | 0.056702 |
DMC | 0.466645 | 0.028697 | 0.382619 | 1.000000 | 0.682192 | 0.305128 | 0.469594 | 0.073795 | -0.105342 | 0.074790 |
DC | 0.868698 | 0.001913 | 0.330512 | 0.682192 | 1.000000 | 0.229154 | 0.496208 | -0.039192 | -0.203466 | 0.035861 |
ISI | 0.186597 | 0.035926 | 0.531805 | 0.305128 | 0.229154 | 1.000000 | 0.394287 | -0.132517 | 0.106826 | 0.067668 |
temp | 0.368842 | 0.032233 | 0.431532 | 0.469594 | 0.496208 | 0.394287 | 1.000000 | -0.527390 | -0.227116 | 0.069491 |
RH | -0.095280 | -0.083318 | -0.300995 | 0.073795 | -0.039192 | -0.132517 | -0.527390 | 1.000000 | 0.069410 | 0.099751 |
wind | -0.086368 | -0.004013 | -0.028485 | -0.105342 | -0.203466 | 0.106826 | -0.227116 | 0.069410 | 1.000000 | 0.061119 |
rain | 0.013438 | -0.024119 | 0.056702 | 0.074790 | 0.035861 | 0.067668 | 0.069491 | 0.099751 | 0.061119 | 1.000000 |
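Strongly correlated pairs are easier to spot when the matrix is flattened into a sorted list. The sketch below copies five rows and columns from the matrix above into a hypothetical `corr_demo` frame; the 0.6 cut-off is an assumption chosen for illustration, not a value used in the original analysis.

```python
import numpy as np
import pandas as pd

names = ['month', 'FFMC', 'DMC', 'DC', 'ISI']
# Values copied from the correlation matrix above.
corr_demo = pd.DataFrame(
    [[1.000000, 0.291477, 0.466645, 0.868698, 0.186597],
     [0.291477, 1.000000, 0.382619, 0.330512, 0.531805],
     [0.466645, 0.382619, 1.000000, 0.682192, 0.305128],
     [0.868698, 0.330512, 0.682192, 1.000000, 0.229154],
     [0.186597, 0.531805, 0.305128, 0.229154, 1.000000]],
    index=names, columns=names)

# Mask everything except the strict upper triangle so each pair appears once.
upper = corr_demo.where(np.triu(np.ones(corr_demo.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (row, column) and discard the masked cells.
pairs = upper.stack().dropna()

# Assumed cut-off of 0.6 on the absolute correlation.
strong = pairs[pairs.abs() > 0.6].sort_values(ascending=False)
print(strong)
```

On these values the list contains month/DC (0.868698) and DMC/DC (0.682192), the two pairs that stand out in the matrix.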
2.5 Inspect the relationships among the components of the Canadian Forest Fire Weather Index (FWI) system
from pandas.plotting import scatter_matrix
attributes = ['FFMC', 'DMC', 'DC', 'ISI']
scatter_matrix(fires[attributes],figsize=(15, 15))
2.6 Draw a scatter plot of DMC (Duff Moisture Code) against DC (Drought Code)
fires.plot(kind="scatter", x="DMC", y="DC", alpha=0.4, figsize=(10,8))
2.7 Fit an extremely randomized trees regressor (ExtraTreesRegressor)
from sklearn.ensemble import ExtraTreesRegressor
columns = ['X', 'Y','month','day','FFMC', 'DMC', 'DC', 'ISI', 'temp','RH', 'wind', 'rain']
X = fires[columns]
Y = fires[['area']].values.ravel()
model = ExtraTreesRegressor(n_estimators=100)
model.fit(X, Y)
ExtraTreesRegressor()
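2.8 Use the model's feature importances to find features with very low importance
cols_to_drop = []
# Flag every feature whose importance is below 0.01.
for name, importance in zip(columns, model.feature_importances_.round(4)):
    if importance < 0.01:
        cols_to_drop.append(name)
print('Columns to be dropped: ', cols_to_drop)
Columns to be dropped: ['rain']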
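Because extremely randomized trees pick split thresholds at random, the importances (and therefore which features fall under the 0.01 cut-off) can change from run to run unless `random_state` is fixed. The sketch below uses synthetic data; `X_demo`, `y_demo`, `signal`, and `noise` are made-up names, and fixing the seed is a suggestion rather than something the original notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends on 'signal' only; 'noise' is irrelevant.
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=200)

# A fixed random_state makes the fitted trees, and so the importances, repeatable.
model_demo = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1.
importances = pd.Series(model_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```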
2.9 From the correlation matrix, list the attributes in decreasing order of correlation with area
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)
area 1.000000
index 0.302303
month 0.123613
wind 0.070217
X 0.068824
DC 0.063159
FFMC 0.059142
Y 0.047538
DMC 0.046503
rain 0.043600
temp 0.042614
ISI 0.022006
day 0.004167
RH -0.054193
Name: area, dtype: float64
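2.10 Because rain has very low importance, and DC is highly correlated with DMC (0.68 in the matrix of 2.4) and was judged less informative than DMC, drop the rain and DC columns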
fires = fires.drop(cols_to_drop,axis=1)
fires.drop(labels=['DC'],axis=1,inplace=True)
2.11 Plot FFMC (Fine Fuel Moisture Code) and DMC (Duff Moisture Code) as line charts to compare them
import plotly.express as px
df_long=pd.melt(fires,id_vars=['index'], value_vars=['FFMC', 'DMC'])
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()
2.12 Plot line charts of three more attributes: ISI (Initial Spread Index), temperature (temp), and wind speed (wind)
df_long=pd.melt(fires,id_vars=['index'], value_vars=['ISI', 'temp', 'wind'])
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()
2.13 Standardize the numeric features to zero mean and unit standard deviation
fires_cat = fires[['month', 'day']]
fires_num = fires[['X', 'Y', 'FFMC', 'DMC', 'ISI', 'temp', 'RH','wind']]
target = fires[['area']]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scaler', StandardScaler()),])
fires_num_tr = num_pipeline.fit_transform(fires_num)
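Standardization replaces each value x with z = (x - mean) / std, computed per column. As a minimal sketch, the snippet below checks the manual formula against `StandardScaler` on five temperature values (the min, quartiles, and max from the summary table in 2.2); the array `x` is only an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One numeric column: temp min, 25%, 50%, 75%, and max from the table in 2.2.
x = np.array([[2.2], [15.5], [19.3], [22.8], [33.3]])

# Manual z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(), z_sklearn.std())
```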
2.14 Split the dataset into training and test sets and flatten the targets to 1-D arrays
from sklearn.model_selection import train_test_split
data = np.concatenate((fires_cat,fires_num_tr),axis=1)
X_train, X_test, y_train, y_test = train_test_split(data, target.values, test_size=0.3)
y_train = y_train.ravel()
y_test = y_test.ravel()
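In the steps above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation. One possible alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The arrays `X_demo` and `y_demo`, the fixed `random_state`, and the `stratify` argument are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=20.0, scale=5.0, size=(100, 3))  # stand-in numeric features
y_demo = np.repeat([0, 1, 2, 3], 25)                     # stand-in severity classes

# stratify keeps the class proportions similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)     # test rows reuse the training mean and std

print(X_tr_std.shape, X_te_std.shape)
```

Stratifying needs at least two rows in every class, which is worth checking on the real data because the largest fire class is very small.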
3. SVR modeling
3.1 Tune the SVR with grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
param_grid = [{'kernel': ['rbf', 'sigmoid'], 'C': [1,50, 100 ,300], 'epsilon': [0.2, 0.2,0.1]},]
svr_cv =SVR()
svr_grid_search = GridSearchCV(svr_cv, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)
svr_grid_search.fit(X_train,y_train)
GridSearchCV(cv=5, estimator=SVR(),param_grid=[{'C': [1, 50, 100, 300], 'epsilon': [0.2, 0.2, 0.1],'kernel': ['rbf', 'sigmoid']}],return_train_score=True, scoring='neg_mean_squared_error')
3.2 Show the best estimator found
svr_grid_search.best_estimator_
SVR(C=1, epsilon=0.2)
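Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_`. With `scoring='neg_mean_squared_error'` the score is a negated MSE, so the sign has to be flipped before taking a square root. The sketch below runs on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, and the grid values are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=60)

# Small illustrative grid, not the grid used in 3.1.
grid = GridSearchCV(
    SVR(),
    {'kernel': ['rbf'], 'C': [1, 50], 'epsilon': [0.1, 0.2]},
    cv=5,
    scoring='neg_mean_squared_error',
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
# best_score_ is the mean cross-validated negative MSE of the best setting.
print('CV RMSE:', np.sqrt(-grid.best_score_))
```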
3.3 Predict on the test set
final_model = svr_grid_search.best_estimator_
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print('Root mean squared error (RMSE): ', final_rmse)
print('Mean absolute error (MAE): {}'.format(mean_absolute_error(y_test, final_predictions)))
Root mean squared error (RMSE): 0.9193740894801092
Mean absolute error (MAE): 0.7773110570888362
final_predictions[final_predictions==0] = 0
final_predictions[(final_predictions>0) & (final_predictions<=1)] = 1
final_predictions[(final_predictions>1) & (final_predictions<=2)] = 2
final_predictions[final_predictions>2] = 3
right_num = 0
# Count the predictions that match the true class exactly.
for index in range(len(final_predictions)):
    if y_test[index] == final_predictions[index]:
        right_num = right_num + 1
right = right_num / len(final_predictions) * 100
print('Accuracy (%):', right)
Accuracy (%): 25.64102564102564
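The thresholds above send every prediction in (0, 1] to class 1, so class 0 is assigned only when the regressor outputs exactly 0, and negative predictions are left unchanged. One possible alternative, offered here as an assumption and not as the author's method, is to round to the nearest class and clip to the valid range. The arrays `raw` and `y_true` are made-up values.

```python
import numpy as np

raw = np.array([-0.2, 0.3, 0.6, 1.4, 2.7, 3.9])  # made-up regression outputs
y_true = np.array([0, 1, 1, 1, 3, 2])            # made-up true classes

# Round to the nearest integer, then force the result into the class range 0..3.
pred = np.clip(np.rint(raw), 0, 3).astype(int)

# Vectorized accuracy: share of exact matches, as a percentage.
accuracy = np.mean(pred == y_true) * 100
print(pred, accuracy)
```

The comparison `pred == y_true` also replaces the explicit counting loop with a single array operation.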
3.4 Random forest modeling
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},{'bootstrap': [False,True], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},]
forest_reg = RandomForestRegressor()
rfr_grid_search = GridSearchCV(forest_reg, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)
rfr_grid_search.fit(X_train,y_train)
GridSearchCV(cv=5, estimator=RandomForestRegressor(),param_grid=[{'max_features': [2, 4, 6, 8],'n_estimators': [3, 10, 30]},{'bootstrap': [False, True], 'max_features': [2, 3, 4],'n_estimators': [3, 10]}],return_train_score=True, scoring='neg_mean_squared_error')
3.5 Show the best estimator found
rfr_grid_search.best_estimator_
RandomForestRegressor(max_features=4, n_estimators=10)
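3.6 Predict on the test set
# Evaluate the tuned random forest; without this assignment final_model is still the SVR from 3.3.
# The figures recorded below are identical to the SVR results in 3.3, which suggests the original run reused the SVR model.
final_model = rfr_grid_search.best_estimator_
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print('Root mean squared error (RMSE): ', final_rmse)
print('Mean absolute error (MAE): {}'.format(mean_absolute_error(y_test, final_predictions)))
Root mean squared error (RMSE): 0.9193740894801092
Mean absolute error (MAE): 0.7773110570888362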
final_predictions[final_predictions==0] = 0
final_predictions[(final_predictions>0) & (final_predictions<=1)] = 1
final_predictions[(final_predictions>1) & (final_predictions<=2)] = 2
final_predictions[final_predictions>2] = 3
right_num = 0
# Count the predictions that match the true class exactly.
for index in range(len(final_predictions)):
    if y_test[index] == final_predictions[index]:
        right_num = right_num + 1
right = right_num / len(final_predictions) * 100
print('Accuracy (%):', right)
Accuracy (%): 25.64102564102564
3.7 Random forest modeling with h2o
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
3.8 Read and preprocess the data for h2o
h2oFires = pd.read_csv('forestfires.csv')
# Bin the burnt area into four labelled classes with the same thresholds as in 2.3.
h2oFires['area'] = pd.cut(h2oFires['area'], bins=[-np.inf, 0.09, 6.57, 279, np.inf], labels=['fire0', 'fire1', 'fire2', 'fire3'])
# Note: the two samples are drawn independently, so rows in testCsv can also appear in trainCsv.
trainCsv = h2oFires.sample(frac=0.7,axis=0)
testCsv = h2oFires.sample(frac=0.3,axis=0)
trainCsv = trainCsv[['X','Y','month','day','FFMC','DMC','ISI','temp','RH','wind','area']]
testCsv = testCsv[['X','Y','month','day','FFMC','DMC','ISI','temp','RH','wind','area']]
trainCsv.to_csv('h2oTrain.csv')
testCsv.to_csv('h2oTest.csv')
train = h2o.import_file("h2oTrain.csv")
test = h2o.import_file("h2oTest.csv")
# Drop the first column, which holds the pandas index written by to_csv.
train = train[1:]
test = test[1:]
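Calling `sample` twice draws the training and test rows independently, so some test rows can also be in the training set, and an accuracy measured on such a split is likely to be optimistic. A disjoint split can be sketched in pandas as follows; `frame`, `train_part`, and `test_part` are illustrative names, the data is a stand-in, and `random_state` is an added assumption for repeatability.

```python
import pandas as pd

# Stand-in data with ten rows; not the forest fires file.
frame = pd.DataFrame({'X': range(10), 'area': ['fire0', 'fire1'] * 5})

# Draw 70% of the rows for training.
train_part = frame.sample(frac=0.7, random_state=0)

# The test part is every row that was not drawn for training.
test_part = frame.drop(train_part.index)

print(len(train_part), len(test_part))
```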
3.9 Train the model
model1 = H2ORandomForestEstimator()
model1.train(x = train.names[0:-1],y = 'area',training_frame = train)
3.10 Predict with the trained model
predict = model1.predict(test[test.names[0:-1]])
predict
out = test.concat(predict)
h2o.download_csv(out,"predict.csv")
3.11 Compute the accuracy
test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right/test.nrow
print('Accuracy (%):', accuracy*100)
Accuracy (%): 82.58064516129032
3.12 Tune the hyperparameters with grid search
rf_params = {'ntrees': list(range(100, 200)), 'max_depth': [50]}
rf_grid = H2OGridSearch(model = H2ORandomForestEstimator, hyper_params=rf_params)
rf_grid.train(x = train.names[0:-1],y = 'area',training_frame = train)
model4 = H2ORandomForestEstimator(ntrees=100,max_depth=50)
model4.train(x = train.names[0:-1],y = 'area',training_frame = train)
predict = model4.predict(test[test.names[0:-1]])
test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right/test.nrow
print('Accuracy (%):', accuracy*100)
Accuracy (%): 85.16129032258064