Data Mining Principles and Algorithms: An Analysis of the Factors Affecting Forest Fires



I. Introduction

Forest Fire Area

Prediction of the burnt area by forest fires

Overview

The dataset contains 517 fires from the Montesinho natural park in Portugal. For each incident, the weekday, month, coordinates, and burnt area are recorded, along with several meteorological variables such as rain, temperature, humidity, and wind. The workflow reads the data and trains a regression model on the spatial, temporal, and weather variables.




II. Resources

Forest Fires Data Set

Forest Fires Data Set: predict the burned area of forest fires using meteorological and other data

Principles and applications of the Canadian Forest Fire Weather Index (FWI) System



III. Code

1. Read the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')   # silence library warnings in the notebook output

fires = pd.read_csv('forestfires.csv')     # 517 records from the Forest Fires data set

2. Data cleaning

2.1 Preprocess the data: convert the month and day columns to numbers

fires = fires.reset_index()
mapping_month = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
                 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
fires['month'] = fires['month'].map(mapping_month)
mapping_day = {'mon': 1, 'tue': 2, 'wed': 3, 'thu': 4, 'fri': 5, 'sat': 6, 'sun': 0}
fires['day'] = fires['day'].map(mapping_day)
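
A quick sanity check that every month and day string was covered by the two mappings (any value missing from the dictionaries would now be NaN); a small sketch:

print(fires[['month', 'day']].isna().sum())   # both counts should be 0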

2.2 Inspect the data statistics

fires.describe().T
count mean std min 25% 50% 75% max
index 517.0 258.000000 149.389312 0.0 129.0 258.00 387.00 516.00
X 517.0 4.669246 2.313778 1.0 3.0 4.00 7.00 9.00
Y 517.0 4.299807 1.229900 2.0 4.0 4.00 5.00 9.00
month 517.0 7.475822 2.275990 1.0 7.0 8.00 9.00 12.00
day 517.0 2.972921 2.143867 0.0 1.0 3.00 5.00 6.00
FFMC 517.0 90.644681 5.520111 18.7 90.2 91.60 92.90 96.20
DMC 517.0 110.872340 64.046482 1.1 68.6 108.30 142.40 291.30
DC 517.0 547.940039 248.066192 7.9 437.7 664.20 713.90 860.60
ISI 517.0 9.021663 4.559477 0.0 6.5 8.40 10.80 56.10
temp 517.0 18.889168 5.806625 2.2 15.5 19.30 22.80 33.30
RH 517.0 44.288201 16.317469 15.0 33.0 42.00 53.00 100.00
wind 517.0 4.017602 1.791653 0.4 2.7 4.00 4.90 9.40
rain 517.0 0.021663 0.295959 0.0 0.0 0.00 0.00 6.40
area 517.0 12.847292 63.655818 0.0 0.0 0.52 6.57 1090.84

2.3 Bin the burnt area into fire-size classes

Process the prediction target. From the summary statistics, the mean burnt area is 12.847292; 99.613% of the values are below 279, 75% are below 6.57, 50% are below 0.52, and 47% are below 0.09.

It is therefore reasonable to treat a burnt area greater than 0.09 and at most 6.57 as a small fire, an area greater than 6.57 and at most 279 as a medium fire, and an area greater than 279 as a large fire.
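
Those cut points can be checked directly before any binning; a minimal sketch on the fires DataFrame loaded above (the probe quantiles 0.47/0.50/0.75 and the 279 threshold are taken from the text):

print(fires['area'].describe())                     # mean, median, quartiles of the raw burnt area
print(fires['area'].quantile([0.47, 0.50, 0.75]))   # should sit near 0.09, 0.52 and 6.57
print((fires['area'] <= 279).mean() * 100)          # share of fires with area <= 279 (about 99.6%)

With those cut points in mind, the area column is then mapped onto four classes (0, 1, 2, 3):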

fires.loc[fires['area'] <= 0.09, 'area'] = 0
fires.loc[(fires['area'] > 0.09) & (fires['area'] <= 6.57), 'area'] = 1
fires.loc[(fires['area'] > 6.57) & (fires['area'] <= 279), 'area'] = 2
fires.loc[fires['area'] > 279, 'area'] = 3

2.4 Examine the correlations between features

attributes = ['month','day','FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH','wind','rain']
corr = fires[attributes].corr()
corr

month day FFMC DMC DC ISI temp RH wind rain
month 1.000000 -0.037469 0.291477 0.466645 0.868698 0.186597 0.368842 -0.095280 -0.086368 0.013438
day -0.037469 1.000000 0.073597 0.028697 0.001913 0.035926 0.032233 -0.083318 -0.004013 -0.024119
FFMC 0.291477 0.073597 1.000000 0.382619 0.330512 0.531805 0.431532 -0.300995 -0.028485 0.056702
DMC 0.466645 0.028697 0.382619 1.000000 0.682192 0.305128 0.469594 0.073795 -0.105342 0.074790
DC 0.868698 0.001913 0.330512 0.682192 1.000000 0.229154 0.496208 -0.039192 -0.203466 0.035861
ISI 0.186597 0.035926 0.531805 0.305128 0.229154 1.000000 0.394287 -0.132517 0.106826 0.067668
temp 0.368842 0.032233 0.431532 0.469594 0.496208 0.394287 1.000000 -0.527390 -0.227116 0.069491
RH -0.095280 -0.083318 -0.300995 0.073795 -0.039192 -0.132517 -0.527390 1.000000 0.069410 0.099751
wind -0.086368 -0.004013 -0.028485 -0.105342 -0.203466 0.106826 -0.227116 0.069410 1.000000 0.061119
rain 0.013438 -0.024119 0.056702 0.074790 0.035861 0.067668 0.069491 0.099751 0.061119 1.000000
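
The same matrix is easier to scan as a picture; a small sketch that renders it with matplotlib (not part of the original workflow):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)   # colour-code the correlations
ax.set_xticks(range(len(attributes)))
ax.set_yticks(range(len(attributes)))
ax.set_xticklabels(attributes, rotation=90)
ax.set_yticklabels(attributes)
fig.colorbar(im)
plt.show()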

2.5 Examine the correlations among the components of the Canadian Forest Fire Weather Index (FWI) System

from pandas.plotting import scatter_matrix
attributes = ['FFMC', 'DMC', 'DC', 'ISI']
scatter_matrix(fires[attributes], figsize=(15, 15))

2.6 Draw a scatter plot to examine the relationship between DMC (Duff Moisture Code) and DC (Drought Code)

fires.plot(kind="scatter", x="DMC", y="DC", alpha=0.4, figsize=(10,8))

2.7 Build a model with the extremely randomized trees (extra-trees) regressor

from sklearn.ensemble import ExtraTreesRegressor
columns = ['X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']
X = fires[columns]
Y = fires[['area']].values.ravel()
model = ExtraTreesRegressor(n_estimators=100)
model.fit(X, Y)

ExtraTreesRegressor()

2.8 Use the model's feature importances to find and drop features with negligible influence

cols_to_drop = []
for name, importance in zip(columns, model.feature_importances_.round(4)):
    if importance < 0.01:
        cols_to_drop.append(name)
print('Columns to be dropped: ', cols_to_drop)

Columns to be dropped:  ['rain']
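
For reference, the full ranking behind this decision can be printed from the fitted model; a small sketch using the columns list defined above:

importances = sorted(zip(columns, model.feature_importances_.round(4)),
                     key=lambda c: c[1], reverse=True)
for name, score in importances:
    print(f'{name:>6}: {score}')   # most influential feature first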

2.9 From the full correlation matrix, list the attributes correlated with area in descending order of correlation

corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)

area     1.000000
index    0.302303
month    0.123613
wind     0.070217
X        0.068824
DC       0.063159
FFMC     0.059142
Y        0.047538
DMC      0.046503
rain     0.043600
temp     0.042614
ISI      0.022006
day      0.004167
RH      -0.054193
Name: area, dtype: float64

2.10 The rain attribute has negligible importance, and DC is highly correlated with DMC while judged less useful to keep than DMC, so drop the rain and DC columns

fires = fires.drop(cols_to_drop,axis=1)
fires.drop(labels=['DC'],axis=1,inplace=True)

2.11 Plot line charts of the Fine Fuel Moisture Code (FFMC) and the Duff Moisture Code (DMC) to observe their relationship

import plotly.express as px
df_long=pd.melt(fires,id_vars=['index'], value_vars=['FFMC', 'DMC'])
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()

2.12 Plot line charts of three more attributes: the Initial Spread Index (ISI), temperature (temp), and wind speed (wind)

df_long=pd.melt(fires,id_vars=['index'], value_vars=['ISI',   'temp',   'wind'])
fig = px.line(df_long, x='index', y='value', color='variable')
fig.show()

2.13 Standardize the numeric data to zero mean and unit standard deviation

fires_cat = fires[['month', 'day']]
fires_num = fires[['X', 'Y', 'FFMC', 'DMC', 'ISI', 'temp', 'RH','wind']]
target = fires[['area']]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scaler', StandardScaler()),])
fires_num_tr = num_pipeline.fit_transform(fires_num)
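
A quick check that the scaler behaves as described (column means near 0, standard deviations near 1); a sketch on the transformed array:

print(fires_num_tr.mean(axis=0).round(6))   # should all be ~0
print(fires_num_tr.std(axis=0).round(6))    # should all be ~1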

2.14 Split the data into training and test sets (the feature space was already reduced by the column drops above)

from sklearn.model_selection import train_test_split
data = np.concatenate((fires_cat,fires_num_tr),axis=1)
X_train, X_test, y_train, y_test = train_test_split(data, target.values, test_size=0.3)
y_train = y_train.ravel()
y_test = y_test.ravel()
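
Because train_test_split draws a random split on each run, the errors and accuracies reported below will vary from run to run; a seeded variant (random_state=42 is an arbitrary choice, not from the original workflow) makes them reproducible:

X_train, X_test, y_train, y_test = train_test_split(
    data, target.values, test_size=0.3, random_state=42)   # fixed seed for repeatable splits
y_train = y_train.ravel()
y_test = y_test.ravel()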

3. Model building

3.1 Tune the SVR hyperparameters with grid search

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
param_grid = [{'kernel': ['rbf', 'sigmoid'], 'C': [1, 50, 100, 300], 'epsilon': [0.2, 0.2, 0.1]}]
svr_cv =SVR()
svr_grid_search = GridSearchCV(svr_cv, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)
svr_grid_search.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=SVR(),param_grid=[{'C': [1, 50, 100, 300], 'epsilon': [0.2, 0.2, 0.1],'kernel': ['rbf', 'sigmoid']}],return_train_score=True, scoring='neg_mean_squared_error')

3.2 Show the best estimator found

svr_grid_search.best_estimator_

SVR(C=1, epsilon=0.2)
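
Since scoring was set to 'neg_mean_squared_error', the best cross-validated error can also be read back from the grid search; a small sketch:

best_cv_rmse = np.sqrt(-svr_grid_search.best_score_)   # best_score_ holds the negated MSE
print('Best cross-validated RMSE:', best_cv_rmse)
print('Best parameters:', svr_grid_search.best_params_)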

3.3 Predict on the test set

final_model = svr_grid_search.best_estimator_
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print('Root mean squared error (RMSE): ', final_rmse)
print('Mean absolute error (MAE): {}'.format(mean_absolute_error(y_test, final_predictions)))

Root mean squared error (RMSE):  0.9193740894801092
Mean absolute error (MAE): 0.7773110570888362

# Map the continuous predictions back onto the four fire-size classes used for the target.
final_predictions[final_predictions == 0] = 0
final_predictions[(final_predictions > 0) & (final_predictions <= 1)] = 1
final_predictions[(final_predictions > 1) & (final_predictions <= 2)] = 2
final_predictions[final_predictions > 2] = 3
right_num = 0
for index in range(len(final_predictions)):
    if y_test[index] == final_predictions[index]:
        right_num = right_num + 1
right = right_num / len(final_predictions) * 100
print('Accuracy:', right)

Accuracy: 25.64102564102564
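
The class mapping and accuracy above can also be computed without an explicit loop; a sketch using np.digitize with the same boundaries (note that any negative prediction would land in class 0 here, whereas the code above leaves it unmapped):

classes = np.digitize(final_model.predict(X_test), bins=[0, 1, 2], right=True)  # 0 -> 0, (0,1] -> 1, (1,2] -> 2, >2 -> 3
accuracy = (classes == y_test).mean() * 100
print('Accuracy:', accuracy)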

3.4 Random forest modeling

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False, True], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
forest_reg = RandomForestRegressor()
rfr_grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
rfr_grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),param_grid=[{'max_features': [2, 4, 6, 8],'n_estimators': [3, 10, 30]},{'bootstrap': [False, True], 'max_features': [2, 3, 4],'n_estimators': [3, 10]}],return_train_score=True, scoring='neg_mean_squared_error')

3.5 Show the best estimator found

rfr_grid_search.best_estimator_

RandomForestRegressor(max_features=4, n_estimators=10)

3.6 Predict on the test set

final_model = rfr_grid_search.best_estimator_   # evaluate the tuned random forest from 3.5
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print('Root mean squared error (RMSE): ', final_rmse)
print('Mean absolute error (MAE): {}'.format(mean_absolute_error(y_test, final_predictions)))

Root mean squared error (RMSE):  0.9193740894801092
Mean absolute error (MAE): 0.7773110570888362

# Map the continuous predictions back onto the four fire-size classes, as in 3.3.
final_predictions[final_predictions == 0] = 0
final_predictions[(final_predictions > 0) & (final_predictions <= 1)] = 1
final_predictions[(final_predictions > 1) & (final_predictions <= 2)] = 2
final_predictions[final_predictions > 2] = 3
right_num = 0
for index in range(len(final_predictions)):
    if y_test[index] == final_predictions[index]:
        right_num = right_num + 1
right = right_num / len(final_predictions) * 100
print('Accuracy:', right)

Accuracy: 25.64102564102564

3.7 Random forest modeling with H2O

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()

3.8 Read and prepare the data for H2O

h2oFires = pd.read_csv('forestfires.csv')
# Same four-class binning as in step 2.3, then relabel the classes as strings so H2O treats area as categorical.
h2oFires.loc[h2oFires['area'] <= 0.09, 'area'] = 0
h2oFires.loc[(h2oFires['area'] > 0.09) & (h2oFires['area'] <= 6.57), 'area'] = 1
h2oFires.loc[(h2oFires['area'] > 6.57) & (h2oFires['area'] <= 279), 'area'] = 2
h2oFires.loc[h2oFires['area'] > 279, 'area'] = 3
h2oFires['area'] = h2oFires['area'].map({0: 'fire0', 1: 'fire1', 2: 'fire2', 3: 'fire3'})
# Draw 70 % and 30 % samples, keeping only the columns retained earlier (rain and DC are dropped).
trainCsv = h2oFires.sample(frac=0.7, axis=0)
testCsv = h2oFires.sample(frac=0.3, axis=0)
trainCsv = trainCsv[['X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'ISI', 'temp', 'RH', 'wind', 'area']]
testCsv = testCsv[['X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'ISI', 'temp', 'RH', 'wind', 'area']]
trainCsv.to_csv('h2oTrain.csv')
testCsv.to_csv('h2oTest.csv')
train = h2o.import_file("h2oTrain.csv")
test = h2o.import_file("h2oTest.csv")
train = train[1:]   # drop the index column written by to_csv
test = test[1:]
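
Note that trainCsv and testCsv are sampled independently from the same frame, so some rows can appear in both. If a disjoint 70/30 split is preferred, one option (a sketch, not part of the original workflow; the seed is arbitrary) is:

trainAlt = h2oFires.sample(frac=0.7, random_state=0)   # 70 % of the rows for training
testAlt = h2oFires.drop(trainAlt.index)                # the remaining 30 % for testing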

3.9 Train the model

model1 = H2ORandomForestEstimator()
model1.train(x=train.names[0:-1], y='area', training_frame=train)

3.10 Predict with the trained model

predict = model1.predict(test[test.names[0:-1]])
predict
out = test.concat(predict)
h2o.download_csv(out, "predict.csv")

3.11 Compute the accuracy

test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right / test.nrow
print('Accuracy:', accuracy * 100)

Accuracy: 82.58064516129032
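
Beyond plain accuracy, H2O can report a fuller multinomial evaluation (confusion matrix, per-class error) on the test frame; a minimal sketch using model_performance:

perf = model1.model_performance(test)   # multinomial metrics on the held-out frame
perf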

3.12 Tune hyperparameters with grid search

rf_params = {'ntrees': [x for x in range(100, 200, 1)], 'max_depth': [50]}
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator, hyper_params=rf_params)
rf_grid.train(x=train.names[0:-1], y='area', training_frame=train)
model4 = H2ORandomForestEstimator(ntrees=100, max_depth=50)
model4.train(x=train.names[0:-1], y='area', training_frame=train)
predict = model4.predict(test[test.names[0:-1]])
test_right = predict[predict['predict'] == test['area']].nrow
accuracy = test_right / test.nrow
print('Accuracy:', accuracy * 100)

Accuracy: 85.16129032258064
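
Rather than hard-coding ntrees=100 for model4, the trained grid itself can be ranked and its best model reused; a sketch, assuming the get_grid helper of H2OGridSearch in the installed H2O version and sorting by (training) logloss:

sorted_grid = rf_grid.get_grid(sort_by='logloss', decreasing=False)   # lower logloss first
best_rf = sorted_grid.models[0]
best_predict = best_rf.predict(test[test.names[0:-1]])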
