The Road to Deep Learning (Part 1): Time-Series Prediction with an LSTM Network

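Introduction

The problem: given a one-dimensional series, which might be the sales volume of a product or the price of a stock, use a deep learning model to forecast it, for example by using the previous 50 values to predict the next one.

First, here is the dataset:

[Figure: first 10 rows of the data]

Next, we process the data, then build and train a model, and end up with the prediction model we want.

Reading and processing the data:

Reading the data: load_data(filename, time_step)

The CSV file is read with pandas. Pay attention to the path: write filename with '/' rather than '\' as the separator, since a backslash starts an escape sequence in a Python string literal. The time_step argument sets how many past values are used as the basis for predicting the next one. Following the problem statement, we use the previous 50 values, so time_step is 50.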

import time
import keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Activation, Dropout

def load_data(filename, time_step):
    '''
    filename: str
        path to the CSV file; use '/' as the separator
    time_step: int
        how many previous samples are used to predict the next sample
    '''
    df = pd.read_csv(filename, header=None)
    data = df.values
    data = data.astype('float32')  # make sure the dtype is float32
    data = data.reshape(data.shape[0], )  # flatten (N, 1) to (N,)

    # plt.title('original data')
    # plt.plot(data)
    # plt.savefig('original data.png')
    # plt.show()

    # build rows that hold time_step previous samples followed by the sample to predict
    result = []
    for index in range(len(data) - time_step):
        result.append(data[index:index + time_step + 1])

    # the returned array has shape (len(data) - time_step, time_step + 1);
    # the last column is the sample to predict
    return np.array(result)

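Here a list, result, collects rows that each hold 50 past values followed by the one value to predict, so it ends up with len(data) - time_step rows of 51 values each. It is converted to a NumPy array on return, which makes the later slicing easier.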
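To make the windowing concrete, here is a minimal sketch that applies the same slicing to a toy series of ten numbers with a window of 3. The helper name make_windows and the toy values are illustrative assumptions, not part of the original script.

```python
import numpy as np

def make_windows(data, time_step):
    """Hypothetical helper mirroring the loop in load_data."""
    rows = [data[i:i + time_step + 1] for i in range(len(data) - time_step)]
    return np.array(rows)

series = np.arange(10, dtype='float32')   # toy data: 0, 1, ..., 9
windows = make_windows(series, time_step=3)

print(windows.shape)    # (7, 4): 10 - 3 rows, 3 inputs + 1 target per row
print(windows[0])       # [0. 1. 2. 3.]
print(windows[-1])      # [6. 7. 8. 9.]
```

Each row overlaps the previous one by all but one value, so a series of N points yields N - time_step training rows.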

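Normalizing the data and splitting into training and test sets

First the data is normalized with MinMaxScaler from sklearn.preprocessing. It is then split into a training set and a test set in a 7:3 ratio.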

data = load_data('sp500.csv', 50)

# normalize the data and split it into train and test sets
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(data)

# number of training rows: 70% of the total
train_count = int(0.7 * len(dataset))
x_train_set, x_test_set = dataset[:train_count, :-1], dataset[train_count:, :-1]
y_train_set, y_test_set = dataset[:train_count, -1], dataset[train_count:, -1]

# reshape the inputs to the 3-D layout the LSTM requires: (samples, time_steps, features)
x_train_set = x_train_set.reshape(x_train_set.shape[0], 1, x_train_set.shape[1])
x_test_set = x_test_set.reshape(x_test_set.shape[0], 1, x_test_set.shape[1])

# turn the targets from shape (M,) into columns of shape (M, 1)
y_train_set = y_train_set.reshape(y_train_set.shape[0], 1)
y_test_set = y_test_set.reshape(y_test_set.shape[0], 1)

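Note that without the reshape, y_train_set would have shape (M,), a rank-1 array, rather than an (M, 1) column. Arrays of shape (M,) are one of the most common sources of bugs in machine learning code.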
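A small NumPy sketch of why the (M,) shape bites: mixed with an (M, 1) column, it silently broadcasts into an (M, M) matrix instead of raising an error. The array names and values are made up for illustration.

```python
import numpy as np

y_flat = np.array([1.0, 2.0, 3.0])   # shape (3,), like y_train_set before the reshape
y_col = y_flat.reshape(-1, 1)        # shape (3, 1), like y_train_set after it

# Subtracting a column from a rank-1 array broadcasts to a full matrix,
# so an intended element-wise difference quietly becomes 3 x 3 values.
diff_mixed = y_flat - y_col
diff_matched = y_col - y_col

print(diff_mixed.shape)     # (3, 3)
print(diff_matched.shape)   # (3, 1)
```

Keeping targets and predictions in the same explicit (M, 1) layout avoids this class of mistake.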

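Building the model

Here we build a four-layer network whose layer sizes are taken from the layer argument. The arguments of the LSTM layer, especially the various sizes, confuse many newcomers. As I understand it, there are two basic arguments to set: units and input_shape.

[Figure: LSTM cell (left) and its unfolded form (right)]

units: the size of the layer's output, that is, the number of output neurons of the first hidden layer, which is also the number of inputs to the second hidden layer.

input_shape: the official documentation describes the LSTM input as a 3-D tensor of shape (samples, time_steps, features). features is the dimensionality of each sample at a single time step. With time_steps = t, the cell is in effect unfolded over the inputs x0 to xt-1. samples can be omitted, so input_shape itself is just (time_steps, features).

In this example the processed X has shape (m, 50), where m is the number of samples and 50 is treated as the number of features (strictly speaking it ought to be time_steps). So here time_steps = 1 and features = 50.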
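The two possible readings of the 50 values can be sketched with plain NumPy reshapes. The first layout is the one this post uses; the second, which treats each window as 50 time steps of one feature, is shown only as an alternative for comparison and is not what the code here does. The random array stands in for the real windows.

```python
import numpy as np

m = 8                                  # toy number of samples
x = np.random.rand(m, 50).astype('float32')

# Layout used in this post: 1 time step carrying 50 features.
x_one_step = x.reshape(m, 1, 50)

# Alternative layout (an assumption for comparison): 50 time steps, 1 feature each.
x_many_steps = x.reshape(m, 50, 1)

print(x_one_step.shape)     # (8, 1, 50) -> input_shape=(1, 50)
print(x_many_steps.shape)   # (8, 50, 1) -> input_shape=(50, 1)

# Both hold the same numbers; only the axis that the LSTM iterates over differs.
print(np.array_equal(x_one_step[:, 0, :], x_many_steps[:, :, 0]))   # True
```

With a single time step the recurrent connections are never exercised across the window, which is the caveat behind the "ought to be time_steps" remark above.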

def build_model(layer):
    '''
    layer: list
        the number of neurons in each layer
    '''
    model = Sequential()

    # first hidden layer; also sets the input dimensions
    model.add(LSTM(
        input_shape=(1, layer[0]), units=layer[1], return_sequences=True
    ))
    model.add(Dropout(0.2))

    # second hidden layer
    model.add(LSTM(
        units=layer[2], return_sequences=False
    ))
    model.add(Dropout(0.2))

    # output layer: a Dense layer
    model.add(Dense(units=layer[3], activation='linear'))

    model.compile(loss='mse', optimizer='adam')
    return model
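To see how units and features drive the size of the model, the sketch below counts the trainable weights of a standard LSTM layer by hand: four gates, each with an input kernel, a recurrent kernel and a bias. The layer sizes [50, 100, 100, 1] are an assumption chosen for illustration; the post does not state the hidden sizes it used.

```python
def lstm_param_count(features, units):
    """Trainable weights in a standard LSTM layer with biases.

    Each of the 4 gates has a (features x units) input kernel,
    a (units x units) recurrent kernel and a bias of length units.
    """
    return 4 * (features * units + units * units + units)

def dense_param_count(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

# Hypothetical layer sizes; only the 50 inputs and the single output
# follow from the data layout described in the post.
layer = [50, 100, 100, 1]

first_lstm = lstm_param_count(layer[0], layer[1])
second_lstm = lstm_param_count(layer[1], layer[2])
output = dense_param_count(layer[2], layer[3])

print(first_lstm)    # 60400
print(second_lstm)   # 80400
print(output)        # 101
```

The number of time steps does not appear in the formula, because the same weights are reused at every step. Dropout layers add no trainable weights.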

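I'm new to this and still hazy on many of the concepts; if an expert is kind enough to offer corrections, I'd be very grateful.

Training the model and predicting

Once the model is built, it is trained on the training set and evaluated on the test set.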

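# build the model; the 50 inputs and 1 output follow from the data layout,
# the hidden sizes (100, 100) are an assumption since the original does not give them
model = build_model([50, 100, 100, 1])

# train the model, holding out part of the training set for validation
model.fit(x_train_set, y_train_set, batch_size=128, epochs=20, validation_split=0.2)

# do the prediction
y_predicted = model.predict(x_test_set)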

Here validation_split carves a validation set out of the training set, which gives an early warning of overfitting. For more on validation_split, see https://www.jianshu.com/p/0c7af5fbcf72
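As a rough sketch of what validation_split=0.2 does to the rows: Keras takes the validation samples from the end of the arrays passed to fit, before any shuffling. The NumPy slicing below imitates that behaviour on toy data; it illustrates the documented behaviour rather than reproducing Keras's own code, and the helper name is hypothetical.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hypothetical helper: hold out the last fraction of rows for validation."""
    split_at = int(len(x) * (1 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10).reshape(10, 1)   # toy inputs, rows 0..9 in time order
y = np.arange(10)

(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)

print(len(train_x), len(val_x))   # 8 2
print(val_y)                      # [8 9]
```

Because the windows here are stored in time order, the held-out rows are the most recent ones in the training span.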

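Plotting

The last step is to show the predictions as a plot. To compare them with the original data, the predictions are first converted back to the units of the original data, using the inverse_transform method of the scaler defined earlier.

The problem I ran into is that inverse_transform complains that y_test_set does not have the same size as the data the scaler was fitted on, which makes sense: scaler.fit_transform was applied to data with 51 columns. So y_test_set has to be padded, using hstack to stack it with an array of zeros.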

# plot the predicted curve and the original curve
# pad with zeros to get a (len, 51) array
temp = np.zeros((len(y_test_set), 50))
origin_temp = np.hstack((temp, y_test_set))
predict_temp = np.hstack((temp, y_predicted))

# transform the data back to the original scale
origin_test = scaler.inverse_transform(origin_temp)
predict_test = scaler.inverse_transform(predict_temp)

plot_curve(origin_test[:, -1], predict_test[:, -1])

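If y_test_set had not been reshaped earlier into an array with one column, this step would fail with a dimension mismatch, because before the reshape it is a vector of shape (M,).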
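The zero padding works because min-max scaling is applied to each column independently, so the values placed in the first 50 columns have no effect on how the last column is restored. The sketch below shows this with a hand-written column-wise min-max transform that stands in for MinMaxScaler; the function names and toy data are assumptions for illustration.

```python
import numpy as np

def fit_min_max(data):
    """Per-column minimum and range, a stand-in for fitting MinMaxScaler."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    return col_min, col_range

def inverse_min_max(scaled, col_min, col_range):
    """Undo the scaling column by column."""
    return scaled * col_range + col_min

rng = np.random.default_rng(0)
data = rng.uniform(100.0, 200.0, size=(20, 4))   # toy table, last column is the target
col_min, col_range = fit_min_max(data)
scaled = (data - col_min) / col_range

# Pad the scaled target column with zeros, as the post does with hstack.
target = scaled[:, -1].reshape(-1, 1)
padded = np.hstack((np.zeros((len(target), 3)), target))
restored_padded = inverse_min_max(padded, col_min, col_range)[:, -1]

# Equivalent shortcut: invert only the last column with its own min and range.
restored_direct = target[:, 0] * col_range[-1] + col_min[-1]

print(np.allclose(restored_padded, data[:, -1]))      # True
print(np.allclose(restored_padded, restored_direct))  # True
```

A fitted MinMaxScaler exposes the same per-column values as data_min_ and data_range_, so the shortcut could in principle replace the padding, though the post's approach gives the same result.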

The plot_curve function is as follows:

def plot_curve(true_data, predicted_data):
    '''
    true_data: float32
        the true test data
    predicted_data: float32
        the predicted data from the model
    '''
    plt.plot(true_data, label='True data')
    plt.plot(predicted_data, label='Predicted data')
    plt.legend()
    plt.savefig('result.png')
    plt.show()

The result is shown below:

[Figure: prediction result]
