1. Introduction to Reinforcement Learning
The process of reinforcement learning can be understood as an Agent interacting with an Environment, learning from it, and improving. In Tic-Tac-Toe, one side can simply be regarded as the Agent and the other side as the Environment. The interaction involves the following four elements (a minimal sketch mapping them to code follows the list):

State: a possible situation or board position. In Tic-Tac-Toe, this is the configuration of pieces on the board together with which side moves next.
Action: the transition from one state to another. In Tic-Tac-Toe, this is the next move, i.e., where to place a piece.
Value: a measure of how good a state is. In Tic-Tac-Toe, this is the chance of winning from the current position. The mapping from states to values is called the value function.
Reward: the change in value brought about by an action. In Tic-Tac-Toe, this reflects how good a given move is.
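To make these four elements concrete, here is a minimal sketch of how they could be represented in Python. All names here are illustrative assumptions for the example; the full program later in this post uses its own State and Player classes.

import numpy as np

# state: the board, 0 = empty, 1 = first player, -1 = second player
state = np.zeros((3, 3))

# action: placing a piece; here the first player takes the center
state[1, 1] = 1

# value function: a table mapping (hashed) states to estimated values
value_table = {}   # e.g. value_table[state_hash] = 0.5

# reward: the change in value an action causes; in Tic-Tac-Toe only terminal
# states carry intrinsic value (win = 1, tie = 0.5, loss = 0)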
Reinforcement learning essentially mimics how humans learn. Imagine a child learning Tic-Tac-Toe. At first he cannot play at all; he only knows that winning positions have the highest value and losing positions the lowest, so he can only place pieces at random (explore). After a few random games, he learns that the states a few moves before a win have relatively high value, and the states a few moves before a loss have relatively low value; exploration has given him new estimates of each state's value. He will also use the knowledge he already has (exploit): before each move, he considers all possible board positions (states) that could result from his move, and then plays on the square that is most favorable to him (max value). Through exploitation, he uses his current knowledge to maximize his probability of winning. An ε-greedy sketch of this explore/exploit choice is shown below.
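As an illustration, here is a minimal, self-contained sketch of the ε-greedy choice just described. The board representation, the next_state helper, and the default epsilon are assumptions made for this example only.

import random

def next_state(state, move, symbol):
    # hypothetical helper: the board is a tuple of 9 cells
    # (0 = empty, 1 / -1 = the two players' pieces)
    board = list(state)
    board[move] = symbol
    return tuple(board)

def choose_move(state, symbol, value_table, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    legal_moves = [i for i, cell in enumerate(state) if cell == 0]
    if random.random() < epsilon:
        return random.choice(legal_moves)   # explore: random legal move
    # exploit: pick the move whose successor state has the highest estimated
    # value; states not yet in the table default to a neutral 0.5
    return max(legal_moves,
               key=lambda m: value_table.get(next_state(state, m, symbol), 0.5))

# usage: empty board, first player, an (assumed) empty value table
print(choose_move((0,) * 9, 1, {}))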

2. The Tic-Tac-Toe Algorithm
2.1 Training
The training procedure for reinforcement-learning Tic-Tac-Toe is as follows:

repeat for epochs iterations:
    while true:
        if the game is won, lost, or tied:
            return the result and break
        randomly choose between explore and exploit
        if explore was chosen:
            play a random legal move
        else (exploit was chosen):
            look up value_table and play the move leading to the state with the maximum value
            update the value of the previous state in value_table based on the new state's value

The step "update the value of the previous state in value_table based on the new state's value" is the crucial part: it defines the learning rule, i.e., whether the agent can actually acquire knowledge. Since the state logic of Tic-Tac-Toe is very simple, the following simple update expression suffices:
V(S) = V(S) + α(V(S′) − V(S))

where V denotes the value function, S the current state, S′ the new state, V(S) the value of S, and α the learning rate, a tunable hyperparameter.
The other parameters to control are the number of training iterations, epochs, and the probability ϵ of choosing to explore.
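As a worked example (with numbers chosen purely for illustration): take α = 0.1, suppose the current state has value V(S) = 0.5, and the chosen move leads to a state with V(S′) = 1.0 (a sure win). The update gives V(S) = 0.5 + 0.1 × (1.0 − 0.5) = 0.55, nudging the earlier state's value toward that of its successor. In code:

alpha = 0.1                 # learning rate
v_s, v_s_next = 0.5, 1.0    # V(S) and V(S')
v_s += alpha * (v_s_next - v_s)
print(v_s)                  # 0.55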

# -*- coding: utf-8 -*-
"""
Created on Tue Jun 11 13:51:32 2019

@author: judy.yuan
"""
import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS


class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check rows
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))
        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')


def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)


def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()


class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)
            i, j, symbol = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner


# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []
        self.symbol = 0

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]
        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state])
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle before the stable sort so that equally valued actions
        # are selected at random
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# input a key to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (
                i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()


def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (
        turns, player1_win / turns, player2_win / turns))


# The game is a zero sum game. If both players are playing with an optimal
# strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
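One implementation detail worth noting: State.hash() encodes the board as a base-3 number, reading the nine cells in row-major order and mapping the cell values -1/0/1 to the digits 0/1/2, so every board configuration gets a unique key into all_states and estimations. A standalone sketch (not part of the program) to check this:

import numpy as np

def board_hash(data):
    # same scheme as State.hash(): base-3 over cells, digit = value + 1
    h = 0
    for v in data.flatten():
        h = h * 3 + int(v) + 1
    return h

empty = np.zeros((3, 3))
print(board_hash(empty))   # 9841 == (3**9 - 1) // 2: nine base-3 digits of 1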
