The previous post surveyed models including ARMA, RL, SVM, and LSTM; this post tackles the same problem with reinforcement learning. Reinforcement learning is about the interaction between an agent and its environment: the agent "learns" by repeatedly interacting with it, and this has succeeded in many settings, AlphaGo being a well-known example. In the same spirit, an agent can interact with the stock market, and the hope is that this interaction captures characteristics of the market that can guide the trading process.

Q-Learning
Without further ado, we apply the Q-Learning idea (already covered on this blog, so not repeated here): the action space becomes [buy, sell, hold], the state is built from the data inside a sliding time window, and learning is guided by maximizing the return. A brief tabular sketch of the underlying update rule comes first, followed by the code notes.
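As a quick refresher (this sketch is not part of the original notes), the tabular form of the Q-Learning update that the trading agent below approximates with a small network looks roughly as follows; the gym-style env.reset()/env.step() interface and the discretized integer state are assumptions made purely for illustration:

import random
import numpy as np

# minimal tabular Q-Learning sketch (illustrative only; the trading agent below
# replaces the table with a small neural network and a replay memory)
def tabular_q_learning(env, n_states, n_actions, episodes = 500,
                       alpha = 0.1, gamma = 0.95, epsilon = 0.1):
    Q = np.zeros((n_states, n_actions))            # Q-table: one row per state
    for _ in range(episodes):
        state = env.reset()                        # assumed gym-like interface
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)   # assumed return signature
            # core update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

The agent in the notes follows the same target rule inside replay(), only with the Q-table replaced by a dense network trained by gradient descent.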

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random

df = pd.read_csv('stockdata_year.csv')
# columns: timestamp, open, high, low, close
print(df.head())

class Agent:
    def __init__(self, state_size, window_size, trend, skip, batch_size):
        self.state_size = state_size            # state space size
        self.window_size = window_size          # sliding window size
        self.half_window = window_size // 2
        self.trend = trend                      # price data
        self.skip = skip                        # step between actions; 1 means act at every time step
        self.action_size = 3                    # action space: buy, sell, hold
        self.batch_size = batch_size
        self.memory = deque(maxlen = 1000)      # replay memory (double-ended queue)
        self.inventory = []                     # positions held
        self.gamma = 0.95                       # reward discount factor
        self.epsilon = 0.5                      # exploration rate (epsilon-greedy)
        self.epsilon_min = 0.01                 # lower bound for epsilon
        self.epsilon_decay = 0.999              # decay applied while epsilon is above the lower bound
        tf.reset_default_graph()
        self.sess = tf.InteractiveSession()     # interactive session
        self.X = tf.placeholder(tf.float32, [None, self.state_size])   # states
        self.Y = tf.placeholder(tf.float32, [None, self.action_size])  # action-value targets
        feed = tf.layers.dense(self.X, 256, activation = tf.nn.relu)
        self.logits = tf.layers.dense(feed, self.action_size)          # Q-values for the 3 actions
        self.cost = tf.reduce_mean(tf.square(self.Y - self.logits))    # loss function
        self.optimizer = tf.train.GradientDescentOptimizer(1e-5).minimize(self.cost)  # optimizer
        self.sess.run(tf.global_variables_initializer())

    def act(self, state):
        # choose an action: explore randomly with probability epsilon
        if random.random() <= self.epsilon:
            return random.randrange(self.action_size)
        # otherwise exploit the current best action
        return np.argmax(self.sess.run(self.logits, feed_dict = {self.X: state})[0])

    def get_state(self, t):
        # state at time t
        window_size = self.window_size + 1
        d = t - window_size + 1
        # early days do not fill the window, so pad with the value at t = 0
        block = self.trend[d : t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0 : t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])   # price change at each step
        return np.array([res])                    # used as the state encoding

    def replay(self, batch_size):
        mini_batch = []
        l = len(self.memory)
        for i in range(l - batch_size, l):
            mini_batch.append(self.memory[i])     # most recent transitions from memory
        replay_size = len(mini_batch)
        X = np.empty((replay_size, self.state_size))
        Y = np.empty((replay_size, self.action_size))
        # compute Q-values for the old and new states
        # each record is [state, action, reward, next_state, done], so indices 0 and 3
        states = np.array([a[0][0] for a in mini_batch])
        new_states = np.array([a[3][0] for a in mini_batch])
        Q = self.sess.run(self.logits, feed_dict = {self.X: states})
        Q_new = self.sess.run(self.logits, feed_dict = {self.X: new_states})
        # update the Q targets
        for i in range(len(mini_batch)):
            state, action, reward, next_state, done = mini_batch[i]
            target = Q[i]
            target[action] = reward
            if not done:   # episode not finished: add the discounted future value
                target[action] += self.gamma * np.amax(Q_new[i])
            # if finished there is no further action, so the target is just the reward
            X[i] = state
            Y[i] = target
        cost, _ = self.sess.run([self.cost, self.optimizer], feed_dict = {self.X: X, self.Y: Y})
        # anneal the exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        return cost

    def buy(self, initial_money):
        starting_money = initial_money            # starting capital
        states_sell = []
        states_buy = []
        inventory = []                            # positions held
        state = self.get_state(0)                 # initial state
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self.act(state)              # choose an action from the state
            next_state = self.get_state(t + 1)    # next state
            # action 1 = buy, if there is enough money and enough data left
            if action == 1 and initial_money >= self.trend[t] and t < (len(self.trend) - self.half_window):
                inventory.append(self.trend[t])   # buy one unit
                initial_money -= self.trend[t]    # pay for it
                states_buy.append(t)              # record the trade
                print('day %d: buy 1 unit at price %f, total balance %f'
                      % (t, self.trend[t], initial_money))
            # action 2 = sell
            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)   # sell the earliest unit
                initial_money += self.trend[t]    # receive the proceeds
                states_sell.append(t)             # record the trade
                # return on this trade
                try:
                    invest = ((close[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print('day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                      % (t, close[t], invest, initial_money))
            state = next_state                    # move to the next state
        # overall return
        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

    def train(self, iterations, checkpoint, initial_money):
        # run several training iterations
        for i in range(iterations):
            total_profit = 0                      # cumulative profit
            inventory = []
            state = self.get_state(0)
            starting_money = initial_money
            for t in range(0, len(self.trend) - 1, self.skip):
                action = self.act(state)
                next_state = self.get_state(t + 1)
                if action == 1 and starting_money >= self.trend[t] and t < (len(self.trend) - self.half_window):
                    inventory.append(self.trend[t])
                    starting_money -= self.trend[t]
                elif action == 2 and len(inventory) > 0:
                    bought_price = inventory.pop(0)
                    total_profit += self.trend[t] - bought_price
                    starting_money += self.trend[t]
                invest = ((starting_money - initial_money) / initial_money)
                self.memory.append((state, action, invest, next_state, starting_money < initial_money))
                state = next_state
                batch_size = min(self.batch_size, len(self.memory))
                cost = self.replay(batch_size)
            if (i + 1) % checkpoint == 0:
                print('epoch: %d, total rewards: %.3f, cost: %f, total money: %f'
                      % (i + 1, total_profit, cost, starting_money))

close = df.Close.values.tolist()   # use the closing prices for the experiment
initial_money = 10000
window_size = 30
skip = 1
batch_size = 32
agent = Agent(state_size = window_size, window_size = window_size, trend = close, skip = skip, batch_size = batch_size)
agent.train(iterations = 200, checkpoint = 10, initial_money = initial_money)
states_buy, states_sell, total_gains, invest = agent.buy(initial_money = initial_money)
fig = plt.figure(figsize = (15,5))
plt.plot(close, color='r', lw=2.)
plt.plot(close, '^', markersize=10, color='m', label = 'buying signal', markevery = states_buy)
plt.plot(close, 'v', markersize=10, color='k', label = 'selling signal', markevery = states_sell)
plt.title('total gains %f, total investment %f%%'%(total_gains, invest))
plt.legend()
plt.show()


DQN
Without further ado: DQN (already covered on this blog, so not repeated here) uses a neural network to estimate the Q-values and is essentially built on Q-Learning, so much of the code mirrors the code above. The main functions that differ are:

# store a transition in replay memory
def _memorize(self, state, action, reward, new_state, done):
    self.MEMORIES.append((state, action, reward, new_state, done))
    if len(self.MEMORIES) > self.MEMORY_SIZE:
        self.MEMORIES.popleft()
# build training targets from a sampled batch of transitions
def _construct_memories(self, replay):
    states = np.array([a[0] for a in replay])
    new_states = np.array([a[3] for a in replay])
    Q = self.predict(states)
    Q_new = self.predict(new_states)
    Q_new_negative = self.sess.run(self.model_negative.logits,
                                   feed_dict = {self.model_negative.X: new_states})
    replay_size = len(replay)
    X = np.empty((replay_size, self.state_size))
    Y = np.empty((replay_size, self.OUTPUT_SIZE))
    # update the Q targets: the online network picks the best next action,
    # the target network evaluates it
    for i in range(replay_size):
        state_r, action_r, reward_r, new_state_r, done_r = replay[i]
        target = Q[i]
        target[action_r] = reward_r
        if not done_r:
            target[action_r] += self.GAMMA * Q_new_negative[i, np.argmax(Q_new[i])]
        X[i] = state_r
        Y[i] = target
    return X, Y

# copy the online network's weights into the target network
def _assign(self):
    # the first half of the trainable variables belongs to the online network,
    # the second half to the target network
    for i in range(len(self.trainable) // 2):
        assign_op = self.trainable[i + len(self.trainable) // 2].assign(self.trainable[i])
        self.sess.run(assign_op)

def train(self, iterations, checkpoint, initial_money):
    for i in range(iterations):
        total_profit = 0
        inventory = []
        state = self.get_state(0)
        starting_money = initial_money
        for t in range(0, len(self.trend) - 1, self.skip):
            # periodically copy the online network into the target network
            if (self.T_COPY + 1) % self.COPY == 0:
                self._assign()
            action = self._select_action(state)   # choose an action from the state
            next_state = self.get_state(t + 1)    # next state
            if action == 1 and starting_money >= self.trend[t]:
                inventory.append(self.trend[t])
                starting_money -= self.trend[t]
            elif action == 2 and len(inventory) > 0:
                bought_price = inventory.pop(0)
                total_profit += self.trend[t] - bought_price
                starting_money += self.trend[t]
            invest = ((starting_money - initial_money) / initial_money)
            self._memorize(state, action, invest, next_state, starting_money < initial_money)
            # batch size is capped by the number of stored transitions
            batch_size = min(len(self.MEMORIES), self.BATCH_SIZE)
            # sample a batch from replay memory
            replay = random.sample(self.MEMORIES, batch_size)
            # move to the next time step
            state = next_state
            # build the training targets and update the online network
            X, Y = self._construct_memories(replay)
            cost, _ = self.sess.run([self.model.cost, self.model.optimizer],
                                    feed_dict = {self.model.X: X, self.model.Y: Y})
            self.T_COPY += 1   # increment the copy counter
            # decay the exploration rate
            self.EPSILON = self.MIN_EPSILON + (1.0 - self.MIN_EPSILON) * np.exp(-self.DECAY_RATE * i)
        if (i + 1) % checkpoint == 0:
            print('epoch: %d, total rewards: %.3f, cost: %f, total money: %f'
                  % (i + 1, total_profit, cost, starting_money))
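The DQN snippets above refer to self.model, self.model_negative, self.predict and self._select_action, which are not shown. Below is a minimal sketch of what those pieces might look like, assuming the same TF 1.x placeholder style used above; the names Model and AgentSketch and the default hyperparameters are illustrative assumptions, not the author's exact code:

import numpy as np
import tensorflow as tf

# assumed sketch of the online/target Q-networks used by the DQN snippets
class Model:
    def __init__(self, input_size, output_size, layer_size, learning_rate):
        self.X = tf.placeholder(tf.float32, (None, input_size))    # batch of states
        self.Y = tf.placeholder(tf.float32, (None, output_size))   # Q-value targets
        feed = tf.layers.dense(self.X, layer_size, activation = tf.nn.relu)
        self.logits = tf.layers.dense(feed, output_size)            # one Q-value per action
        self.cost = tf.reduce_mean(tf.square(self.Y - self.logits))
        self.optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(self.cost)

class AgentSketch:
    def __init__(self, state_size, output_size = 3, layer_size = 256,
                 learning_rate = 1e-5, epsilon = 0.5):
        self.state_size = state_size
        self.OUTPUT_SIZE = output_size
        self.EPSILON = epsilon
        tf.reset_default_graph()
        self.model = Model(state_size, output_size, layer_size, learning_rate)           # online net
        self.model_negative = Model(state_size, output_size, layer_size, learning_rate)  # target net
        # variables are created in order, so the first half belongs to the online net
        self.trainable = tf.trainable_variables()
        self.sess = tf.InteractiveSession()
        self.sess.run(tf.global_variables_initializer())

    def predict(self, inputs):
        # Q-values from the online network for a batch of states
        return self.sess.run(self.model.logits, feed_dict = {self.model.X: inputs})

    def _select_action(self, state):
        # epsilon-greedy over the online network's Q-values
        if np.random.rand() < self.EPSILON:
            return np.random.randint(self.OUTPUT_SIZE)
        return int(np.argmax(self.predict(np.array([state]))[0]))

With these pieces in place, the train() loop above ties everything together: act, memorize, sample a batch, build targets with _construct_memories, and periodically sync the target network via _assign.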

Actor Critic
Without further ado: Actor-Critic (already covered on this blog, so not repeated here) no longer derives actions from a value function alone, but learns the action policy directly. In practice it uses two neural networks. The parts of the code worth noting are:

# define the Actor and Critic network classes
class Actor:
    def __init__(self, name, input_size, output_size, size_layer):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            feed_actor = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            self.logits = tf.layers.dense(feed_actor, output_size)

class Critic:
    def __init__(self, name, input_size, output_size, size_layer, learning_rate):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            self.Y = tf.placeholder(tf.float32, (None, output_size))
            self.REWARD = tf.placeholder(tf.float32, (None, 1))
            feed_critic = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            feed_critic = tf.layers.dense(feed_critic, output_size, activation = tf.nn.relu) + self.Y
            feed_critic = tf.layers.dense(feed_critic, size_layer // 2, activation = tf.nn.relu)
            self.logits = tf.layers.dense(feed_critic, 1)
            self.cost = tf.reduce_mean(tf.square(self.REWARD - self.logits))
            self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.cost)

# copy weights from one network scope to another
def _assign(self, from_name, to_name):
    from_w = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = from_name)
    to_w = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = to_name)
    for i in range(len(from_w)):
        assign_op = to_w[i].assign(from_w[i])
        self.sess.run(assign_op)

def _construct_memories_and_train(self, replay):
    states = np.array([a[0] for a in replay])
    new_states = np.array([a[3] for a in replay])
    Q = self.sess.run(self.actor.logits, feed_dict = {self.actor.X: states})
    Q_target = self.sess.run(self.actor_target.logits, feed_dict = {self.actor_target.X: states})
    # actor update: feed the critic's gradient with respect to the action back through the actor
    grads = self.sess.run(self.grad_critic,
                          feed_dict = {self.critic.X: states, self.critic.Y: Q})[0]
    self.sess.run(self.optimizer, feed_dict = {self.actor.X: states, self.actor_critic_grad: grads})
    # critic update: build the reward targets with the target critic
    rewards = np.array([a[2] for a in replay]).reshape((-1, 1))
    rewards_target = self.sess.run(self.critic_target.logits,
                                   feed_dict = {self.critic_target.X: new_states,
                                                self.critic_target.Y: Q_target})
    for i in range(len(replay)):
        if not replay[i][-1]:   # add the discounted future value unless the episode is done
            rewards[i] += self.GAMMA * rewards_target[i]
    cost, _ = self.sess.run([self.critic.cost, self.critic.optimizer],
                            feed_dict = {self.critic.X: states, self.critic.Y: Q,
                                         self.critic.REWARD: rewards})
    return cost
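_construct_memories_and_train also relies on graph plumbing that is not shown: self.grad_critic (the critic's gradient with respect to its action input) and an optimizer that pushes that gradient back through the actor via self.actor_critic_grad. The following is a rough, DDPG-style sketch of how that wiring might be set up in the agent's __init__, reusing the Actor and Critic classes above; the scope names, class name and hyperparameters are illustrative assumptions, not the author's exact code:

import tensorflow as tf

class ACAgentSketch:
    def __init__(self, state_size, action_size, layer_size = 256,
                 learning_rate = 1e-4, gamma = 0.95):
        self.GAMMA = gamma
        tf.reset_default_graph()
        # online and target copies of both networks
        self.actor = Actor('actor-original', state_size, action_size, layer_size)
        self.actor_target = Actor('actor-target', state_size, action_size, layer_size)
        self.critic = Critic('critic-original', state_size, action_size, layer_size, learning_rate)
        self.critic_target = Critic('critic-target', state_size, action_size, layer_size, learning_rate)
        # gradient of the critic's value with respect to its action input Y (dQ/da)
        self.grad_critic = tf.gradients(self.critic.logits, self.critic.Y)
        # placeholder used to feed dQ/da back into the actor update
        self.actor_critic_grad = tf.placeholder(tf.float32, [None, action_size])
        weights_actor = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = 'actor-original')
        # gradient ascent on the critic's value: chain -dQ/da through the actor's weights
        grad_actor = tf.gradients(self.actor.logits, weights_actor, -self.actor_critic_grad)
        self.optimizer = tf.train.AdamOptimizer(learning_rate).apply_gradients(
            list(zip(grad_actor, weights_actor)))
        self.sess = tf.InteractiveSession()
        self.sess.run(tf.global_variables_initializer())

The target networks can then be refreshed periodically with calls like _assign('actor-original', 'actor-target') and _assign('critic-original', 'critic-target'), matching the _assign shown above.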

Full reference code for this post: https://github.com/huseinzol05/Stock-Prediction-Models
