1. Introduction to Reinforcement Learning
The process of reinforcement learning can be understood as an Agent interacting with an Environment, learning from it, and improving. In Tic-Tac-Toe, one side can simply be regarded as the Agent and the other side as the Environment. The interaction involves the following four elements (a minimal sketch mapping them to code follows the list):

State: a possible situation or board position. In Tic-Tac-Toe, this is the configuration of pieces on the board together with which side moves next.
Action: the transition from one state to another. In Tic-Tac-Toe, this is the next move, i.e., where to place a piece.
Value: a measure of how good a state is. In Tic-Tac-Toe, this is the chance of winning from the current position. The mapping from states to values is called the value function.
Reward: the change in value brought about by an action. In Tic-Tac-Toe, this reflects how good a given move is.
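To make these four elements concrete, here is a minimal sketch of how they could be represented in Python. All names here are illustrative assumptions for the example; the full program later in this post uses its own State and Player classes.

import numpy as np

# state: the board, 0 = empty, 1 = first player, -1 = second player
state = np.zeros((3, 3))

# action: placing a piece; here the first player takes the center
state[1, 1] = 1

# value function: a table mapping (hashed) states to estimated values
value_table = {}   # e.g. value_table[state_hash] = 0.5

# reward: the change in value an action causes; in Tic-Tac-Toe only terminal
# states carry intrinsic value (win = 1, tie = 0.5, loss = 0)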
Reinforcement learning essentially mimics how humans learn. Imagine a child learning Tic-Tac-Toe. At first he cannot play at all; he only knows that winning positions have the highest value and losing positions the lowest, so he can only place pieces at random (explore). After a few random games, he learns that the states a few moves before a win have relatively high value, and the states a few moves before a loss have relatively low value; exploration has given him new estimates of each state's value. He will also use the knowledge he already has (exploit): before each move, he considers all possible board positions (states) that could result from his move, and then plays on the square that is most favorable to him (max value). Through exploitation, he uses his current knowledge to maximize his probability of winning. An ε-greedy sketch of this explore/exploit choice is shown below.
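As an illustration, here is a minimal, self-contained sketch of the ε-greedy choice just described. The board representation, the next_state helper, and the default epsilon are assumptions made for this example only.

import random

def next_state(state, move, symbol):
    # hypothetical helper: the board is a tuple of 9 cells
    # (0 = empty, 1 / -1 = the two players' pieces)
    board = list(state)
    board[move] = symbol
    return tuple(board)

def choose_move(state, symbol, value_table, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    legal_moves = [i for i, cell in enumerate(state) if cell == 0]
    if random.random() < epsilon:
        return random.choice(legal_moves)   # explore: random legal move
    # exploit: pick the move whose successor state has the highest estimated
    # value; states not yet in the table default to a neutral 0.5
    return max(legal_moves,
               key=lambda m: value_table.get(next_state(state, m, symbol), 0.5))

# usage: empty board, first player, an (assumed) empty value table
print(choose_move((0,) * 9, 1, {}))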

2. The Tic-Tac-Toe Algorithm
2.1 Training
The training procedure for reinforcement-learning Tic-Tac-Toe is as follows:

repeat for epochs iterations:
    while true:
        if the game is won, lost, or tied:
            return the result and break
        randomly choose between explore and exploit
        if explore was chosen:
            play a random legal move
        else (exploit was chosen):
            look up value_table and play the move leading to the state with the maximum value
            update the value of the previous state in value_table based on the new state's value

The step "update the value of the previous state in value_table based on the new state's value" is the crucial part: it defines the learning rule, i.e., whether the agent can actually acquire knowledge. Since the state logic of Tic-Tac-Toe is very simple, the following simple update expression suffices:
V(S) = V(S) + α(V(S′) − V(S))

where V denotes the value function, S the current state, S′ the new state, V(S) the value of S, and α the learning rate, a tunable hyperparameter.
The other parameters to control are the number of training iterations, epochs, and the probability ϵ of choosing to explore.
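As a worked example (with numbers chosen purely for illustration): take α = 0.1, suppose the current state has value V(S) = 0.5, and the chosen move leads to a state with V(S′) = 1.0 (a sure win). The update gives V(S) = 0.5 + 0.1 × (1.0 − 0.5) = 0.55, nudging the earlier state's value toward that of its successor. In code:

alpha = 0.1                 # learning rate
v_s, v_s_next = 0.5, 1.0    # V(S) and V(S')
v_s += alpha * (v_s_next - v_s)
print(v_s)                  # 0.55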

# -*- coding: utf-8 -*-
"""
Created on Tue Jun 11 13:51:32 2019

@author: judy.yuan
"""
import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS


class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check rows
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))
        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')


def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)


def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()


class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)
            i, j, symbol = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner


# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []
        self.symbol = 0

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]
        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state])
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle before the stable sort so that equally valued actions
        # are selected at random
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# input a key to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (
                i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()


def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (
        turns, player1_win / turns, player2_win / turns))


# The game is a zero sum game. If both players are playing with an optimal
# strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
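One implementation detail worth noting: State.hash() encodes the board as a base-3 number, reading the nine cells in row-major order and mapping the cell values -1/0/1 to the digits 0/1/2, so every board configuration gets a unique key into all_states and estimations. A standalone sketch (not part of the program) to check this:

import numpy as np

def board_hash(data):
    # same scheme as State.hash(): base-3 over cells, digit = value + 1
    h = 0
    for v in data.flatten():
        h = h * 3 + int(v) + 1
    return h

empty = np.zeros((3, 3))
print(board_hash(empty))   # 9841 == (3**9 - 1) // 2: nine base-3 digits of 1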
