中英AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法(附PDF公号发“AlphaGo论文”下载论文双语对照版)

原创: 秦陇纪 数据简化DataSimp 今天

数据简化DataSimp导读:谷歌人工智能DeepMind围棋团队2016.1.28在《自然》杂志发表nature16961号论文《精通围棋博弈的深层神经网络和树搜索算法》,以英文讲述AlphaGo算法,是人工智能(AI机器学习深度学习)的经典论文,值得收藏学习。本文将AlphaGo算法论文译为汉译原文中英文对照版本,以后搭建平台实现并分析。关注本公号“数据简化DataSimp”后,在输入栏回复关键字“AlphaGo论文/谷歌围棋论文”获取本文PDF下载链接,欢迎赞赏支持。

数据DataSimp社区分享:信息与数据处理分析、数据科学研究前沿、数据资源现状和数据简化基础的科学知识、技术应用、产业活动、人物机构等信息。欢迎大家参与投稿,为数据科学技术做贡献,使国人尽快提高数据能力,提高社会信息能效。做事要平台,思路要跟进;止步吃住行,无力推文明。要推进人类文明,不可止步于敲门呐喊;设计空想太多,无法实现就虚度一生;工程能力至关重要,秦陇纪与君共勉之。祝大家学习愉快~~

中英AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法(51654字)

目录

A 谷歌AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法(42766字)

1.导言Introduction

2.策略网络的监督学习Supervised Learning of Policy Networks

3.策略网络的强化学习Reinforcement Learning of Policy Networks

4.估值网络的强化学习Reinforcement Learning of Value Networks

5.策略网络和估值网络搜索Searching with Policy and Value Networks

6.AlphaGo弈棋算力评估Evaluating the Playing Strength of AlphaGo

7.讨论Discussion

8.方法Methods

9.参考文献References

12.扩展数据Extended data

10.致谢Acknowledgements

11.作者信息Author Contributions

13.补充信息Supplementary information

14.评论Comments

B 深度神经网络经典算法AlphaGo论文指标和译后(8122字)

1.自然研究期刊文章指标A Nature Research Journal Article metrics

2.深度神经网络经典算法AlphaGo论文译后花絮

参考文献(3287字)Appx(1236字).数据简化DataSimp社区简介


A谷歌AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法(42766)

精通围棋博弈的深层神经网络和树搜索算法

文|谷歌DeepMind组,译|秦陇纪等,数据简化DataSimp20160316Wed-1105Mon

Mastering the game of Go with deep neural networks and tree search

名称:精通围棋博弈的深层神经网络和树搜索算法

Author: David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1 & Demis Hassabis1

作者:①戴维·斯尔弗1*,②黄士杰1*,③克里斯·J.·麦迪逊1,④亚瑟·格斯1,⑤劳伦特·西弗瑞1,⑥乔治·范登·德里施1,⑦朱利安·施立特威泽1,⑧扬尼斯·安东诺娄1,⑨吠陀·潘聂施尔万1,⑩马克·兰多特1,⑪伞德·迪勒曼1,⑫多米尼克·格鲁1,⑬约翰·纳姆2,⑭纳尔 卡尔克布伦纳1,⑮伊利亚·萨茨基弗2,⑯蒂莫西·李烈克莱普1,⑰马德琳·里奇1,⑱科瑞·卡瓦口格鲁1,⑲托雷·格雷佩尔1,和⑳戴密斯·哈萨比斯1

单位:1.谷歌DeepMind,英国伦敦EC4A 3TW,新街广场5号。2.谷歌,美国加利福尼亚州94043,山景城,圆形剧场大道1600号。*这些作者对这项工作作出了同等贡献。

摘要:由于海量的搜索空间以及评估棋局和落子的难度,围棋长期以来被视为人工智能领域最具挑战性的经典游戏。这里,我们介绍一种新的电脑围棋方法:使用“估值网络”评估棋局、“策略网络”选择落子。这些深层神经网络,由人类专家对局的监督学习和电脑自我博弈的强化学习,以一种新颖的组合方式训练而成。在不进行任何预先搜索的情况下,这些神经网络下围棋的水平,达到了需要模拟数千局随机自我博弈的最先进蒙特卡洛树搜索程序的水平。我们还介绍一种新的搜索算法,将蒙特卡洛模拟与估值网络和策略网络结合起来。使用这种搜索算法,我们的程序AlphaGo对其它围棋程序取得了99.8%的胜率,并以5比0击败了人类欧洲围棋冠军。这是计算机程序第一次在标准围棋比赛中击败人类职业棋手——此前这被认为是至少十年以后才能实现的伟业。

Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

(译注1:论文有15部分:0摘要Abstact、1导言Introduction、2策略网络的监督学习Supervised learningof policy networks、3策略网络的强化学习ReinforcementLearning of Policy Networks、4估值网络的强化学习ReinforcementLearning of Value Networks、5基于策略网络和估值网络的搜索算法Searching with Policyand Value Networks、6AlphaGo博弈算力评估Evaluating theplaying strength of AlphaGo、7讨论Discussion(参考文献References1-38)、8方法METHODS(9参考文献References39-62)、10致谢Acknowledgements、11作者信息Author Information(作者贡献Author Contributions)、12扩展数据Extended data(扩展数据图像和表格Extended Data Table)、13补充信息Supplementaryinformation(权力和许可Rights andpermissions、文章相关About this article、延伸阅读Further reading)和14评论Comments。其中,9参考资料References正文和讨论部分38篇,方法部分24篇,合计有62篇。另外,自然期刊在线资料还包括论文PDF、6张大图、6个对弈PPT和1个补充信息压缩包,论文网址http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html。新媒体文章省略第8部分等,全部资料ZIP压缩包,在本社区可下载。)

1.导言Introduction

完美信息类游戏都有一种最优值函数v*(s),可以在所有博弈者完美对弈的前提下,从每一个棋盘局面(状态)s判断出博弈结果。这类游戏可以通过递归计算该最优值函数来求解,其搜索树约包含b^d种可能的落子序列。这里,b是游戏广度(每个局面可合法落子的数量),d是游戏深度(对弈步数)。在国际象棋(b≈35,d≈80)1,特别是围棋(b≈250,d≈150)1等大型游戏中,穷举搜索并不可行2,3,但有两种常规方法可以减少有效搜索空间。第一种方法,搜索深度可以通过局面评估来降低:在状态s处截断搜索树,将s的下级子树用预测状态s结果的近似值函数v(s)≈v*(s)代替。这种做法在国际象棋4、跳棋5和黑白棋(奥赛罗)6中取得了超越人类的性能;但由于围棋的复杂性,这种做法据信在围棋中难以奏效7。第二种方法,搜索广度可以通过对弈法采样来降低:从策略p(a|s)——即局面s中可能落子a上的一个概率分布——中采样。例如,蒙特卡洛走子算法8搜索到最大深度且完全不分支,而是用策略p为对弈双方采样出很长的弈法序列。对大量这类走子取平均,可以提供一种有效的局面评估,在西洋双陆棋8和拼字游戏9中实现了超越人类水平的性能,在围棋中则只能达到较弱的业余水平10。

蒙特卡洛树搜索(MCTS)11,12用蒙特卡洛走子来估算搜索树中每个状态的值。随着模拟次数的增多,搜索树越长越大,相关估值也变得更加准确。搜索过程中用来选择弈法的策略,也会随时间推移、通过选择那些估值更高的子节点而得到提高。该策略渐进地收敛于最优弈法,对应的估值也收敛于最优值函数12。当前最强的围棋程序都基于MCTS,并用预测人类高手落子而训练的策略加以增强13。这些策略用于把搜索收窄到一束高概率弈法上,并在走子时对弈法采样。这种方法已经达到了很强的业余水平13–15。然而,以前的做法仅限于浅层策略13–15,或基于输入特征线性组合的估值函数16。

近来,深度卷积神经网络在视觉领域达到了前所未有的性能:例如图像分类17、人脸识别18和雅达利游戏19。它们使用多层神经元(每层按重叠的图块排列),逐步构建出图像越来越抽象的局部化表征20。我们在围棋中采用类似的架构:把棋局视为一幅19×19的图像,并用若干卷积层构造该局面的表征。我们用这些神经网络来减少搜索树的有效深度和广度:用一个估值网络评估棋局,用一个策略网络对弈法采样。

我们用一条由机器学习的若干阶段组成的管道来训练这些神经网络(图1)。开始阶段,我们直接用人类高手的落子,训练一个监督学习(SL)策略网络pσ。此阶段提供快速、高效、带有即时反馈和高质量梯度的学习更新。与以前的做法13,15类似,我们还训练了一个快速走子策略pπ,能在走子时对弈法快速采样。接下来的阶段,我们训练一个强化学习(RL)策略网络pρ,通过优化自我博弈的最终结果来改进前面的SL策略网络。此阶段把策略调校到赢得比赛这一正确目标上,而不是最大化预测准确率。最后阶段,我们训练一个估值网络vθ,用来预测RL策略网络自我博弈的赢家。我们的程序AlphaGo用MCTS有效地结合了策略网络和估值网络。

图1:神经网络训练管道和架构。图1a,训练一个快速走子策略pπ和一个监督学习(SL)策略网络pσ,用来预测局面数据集中人类高手的落子。一个强化学习(RL)策略网络pρ按SL策略网络初始化,然后通过策略梯度学习,最大化对以前版本策略网络的博弈结果(即赢得更多比赛)。用RL策略网络自我博弈,产生一个新数据集。最后,通过回归训练一个估值网络vθ,用来预测该自我博弈数据集中局面的预期结果(即当前棋手是否获胜)。图1b,AlphaGo所用神经网络架构的示意图。策略网络以棋局s的表征作为输入,让它通过带参数σ(SL策略网络)或ρ(RL策略网络)的许多卷积层,输出合法落子a上的概率分布pσ(a|s)或pρ(a|s),以棋盘上的概率图呈现。估值网络同样使用许多带参数θ的卷积层,但输出一个标量值vθ(sʹ),用来预测局面sʹ的预期结果。

Figure 1: Neural network training pipeline and architecture. a A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialised to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (i.e. winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (i.e. whether the current player wins) in positions from the self-play data set. b Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(sʹ) that predicts the expected outcome in position sʹ.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b≈35; d≈80)1 and especially Go (b≈250; d≈150)1, exhaustive search is infeasible2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s)≈v*(s) that predicts the outcome from state s. This approach has led to super-human performance in chess4, checkers5 and othello6, but it was believed to be intractable in Go due to the complexity of the game7. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts8 search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon8 and Scrabble9, and weak amateur level play in Go10.
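(补充示例:为直观感受上文引用的搜索空间量级b^d——国际象棋b≈35、d≈80,围棋b≈250、d≈150——下面用几行Python做一个数量级估算;数值只是粗略近似,并非合法棋局序列的精确计数。)

```python
import math

# Rough order-of-magnitude comparison of the game-tree sizes b^d quoted above.
# These are approximations, not exact counts of legal game sequences.
for game, b, d in [("chess", 35, 80), ("Go", 250, 150)]:
    log10_size = d * math.log10(b)   # log10(b^d) = d * log10(b)
    print(f"{game}: b^d is roughly 10^{log10_size:.0f}")
# Prints roughly 10^124 for chess and 10^360 for Go.
```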

Monte Carlo tree search (MCTS)11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function12. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves13. These policies are used to narrow the search to a beam of high probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play13–15. However, prior work has been limited to shallow policies13–15 or value functions16 based on a linear combination of input features.

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example image classification17, face recognition18, and playing Atari games19. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image20. We employ a similar architecture for the game of Go. We pass in the board position as a 19×19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network, pσ, directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work13,15, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network, pρ, that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

2.策略网络的监督学习Supervised Learning ofPolicy Networks

训练管道的第一阶段,我们沿用以前用监督学习预测围棋高手落子的做法13,21–24。SL策略网络pσ(a|s)在带权重σ的卷积层与整流器非线性层之间交替。最终的softmax层输出所有合法落子a上的一个概率分布。策略网络的输入s是棋盘状态的一个简单表征(见扩展数据表2)。策略网络基于随机采样的状态-弈法对(s,a)训练:采用随机梯度上升法,最大化在状态s下选中人类落子a的似然,

Δσ ∝ ( ∂log pσ(a|s) / ∂σ )   (1)

我们用KGS围棋服务器上的3000万个局面,训练了一个13层的策略网络,称之为SL策略网络。该网络在一个保留测试集上预测高手落子:使用全部输入特征时精度达57.0%,仅用原始棋盘局面和落子历史作为输入时为55.7%;相比之下,其他研究团队在本文提交时的最高水平为44.4%(全部结果见扩展数据表3)24。准确率上的小幅改进,可带来博弈算力的大幅提高(图2a);更大的网络可以达到更高的精度,但在搜索过程中的评估会更慢。我们还训练了一个更快但精度较低的走子策略pπ(a|s),它采用带权重π的小型局部特征模式的线性softmax(见扩展数据表4);该策略选择一步弈法仅需2微秒,达到24.2%的准确率,而策略网络则需要3毫秒。

图2:策略网络和估值网络的算力与准确性。图2a,该图展示策略网络的博弈算力随其训练精度变化的情况。每层分别有128、192、256和384个卷积过滤器的策略网络在训练期间被定期评估;图中显示的是AlphaGo使用各个策略网络时,对阵比赛版AlphaGo的胜率。图2b,估值网络与不同策略走子之间的评估精度比较。局面和结果从人类高手对局中采样。每个局面或者由估值网络vθ做一次前向传递评估,或者由100次走子模拟的平均结果评估,走子过程分别采用均匀随机走子、快速走子策略pπ、SL策略网络pσ或RL策略网络pρ。预测值与实际对局结果之间的均方误差,按对局阶段(给定局面时已落子数)绘制。

Figure 2: Strength and accuracy of policy and value networks. a Plot showing the playing strength of policy networks as a function of their training accuracy. Policy networks with 128, 192, 256 and 384 convolutional filters per layer were evaluated periodically during training; the plot shows the winning rate of AlphaGo using that policy network against the match version of AlphaGo. b Comparison of evaluation accuracy between the value network and rollouts with different policies. Positions and outcomes were sampled from human expert games. Each position was evaluated by a single forward pass of the value network vθ, or by the mean outcome of 100 rollouts, played out using either uniform random rollouts, the fast rollout policy pπ, the SL policy network pσ or the RL policy network pρ. The mean squared error between the predicted value and the actual game outcome is plotted against the stage of the game (how many moves had been played in the given position).

For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning13,21–24. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier non-linearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s,

Δσ ∝ ( ∂log pσ(a|s) /∂σ )   (1)

We trained a 13 layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves with an accuracy of 57.0% on a held out test set, using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission24 (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Figure 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.
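(补充示例:下面是公式(1)所示监督学习目标的一个最小可运行示意,用随机数据和一个普通的线性softmax代替论文中的13层卷积网络与KGS棋谱;其中的特征维度、批大小等均为演示用的假设值,并非论文设置。)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 19*19 = 361 candidate moves; plain feature vectors instead of
# convolutional input planes; random data instead of the ~30M KGS positions.
N_MOVES, N_FEATURES, N_SAMPLES = 361, 48, 1024
X = rng.normal(size=(N_SAMPLES, N_FEATURES))   # board-state features s
y = rng.integers(0, N_MOVES, size=N_SAMPLES)   # "expert" move a chosen in s

W = np.zeros((N_FEATURES, N_MOVES))            # policy weights (sigma)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.1
for step in range(201):
    idx = rng.integers(0, N_SAMPLES, size=32)  # random state-action minibatch (s, a)
    s, a = X[idx], y[idx]
    p = softmax(s @ W)                         # p_sigma(.|s)
    onehot = np.zeros_like(p)
    onehot[np.arange(len(a)), a] = 1.0
    grad = s.T @ (onehot - p) / len(a)         # gradient of mean log p_sigma(a|s) w.r.t. W
    W += lr * grad                             # stochastic gradient ASCENT, as in eq. (1)
    if step % 50 == 0:
        ll = np.mean(np.log(p[np.arange(len(a)), a] + 1e-12))
        print(f"step {step}: mean log-likelihood {ll:.3f}")
```

顺带一提,正文所述的快速走子策略pπ本身就是小型模式特征上的线性softmax,与这里的线性模型在形式上相近,只是特征与训练数据完全不同。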

3.策略网络的强化学习Reinforcement Learningof Policy Networks

训练管道的第二阶段,旨在用策略梯度强化学习(RL)来改进策略网络25,26。RL策略网络pρ在结构上与SL策略网络相同,其权重ρ被初始化为相同的值:ρ=σ。我们让当前策略网络pρ与随机选出的某个以前迭代版本的策略网络对弈。从对手池中随机选取,可以防止对当前策略的过拟合,从而稳定训练。我们使用报酬函数r(s),对所有非终端时间步t<T取值为0。结果zt = ±r(sT)是对局结束时的终端奖励,从时间步t时当前棋手的视角来看:胜方为+1、败方为−1。然后在每个时间步t,按最大化预期结果的方向,用随机梯度上升更新权重25,

Δρ ∝ ( ∂log pρ(at|st) / ∂ρ ) zt   (2)

我们在对弈中评估了RL策略网络的性能,每一步都从它输出的弈法概率分布中采样:at ~ pρ(·|st)。正面对弈时,RL策略网络在与SL策略网络的对局中赢得了80%以上。我们还测试了最强的开源围棋程序Pachi14——一个复杂的蒙特卡洛搜索程序,在KGS上排名业余二段,每步要执行10万次模拟。在完全不用搜索的情况下,RL策略网络在与Pachi的对弈中赢得了85%。相比之下,以前的顶尖水平——仅基于卷积网络的监督学习——与Pachi对弈只能赢11%23,与稍弱的程序Fuego对弈赢12%24。

The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL)25,26. The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialised to the same values, ρ=σ. We play games between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomising from a pool of opponents stabilises training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time-steps t<T. The outcome zt=±r(sT) is the terminal reward at the end of the game from the perspective of the current player at time-step t: +1 for winning and -1 for losing. Weights are then updated at each time-step t by stochastic gradient ascent in the direction that maximizes expected outcome25,

Δρ ∝ ( ∂log pρ(at|st) / ∂ρ)zt   (2)

We evaluated the performance of the RL policy network in game play, sampling each move at ~ pρ(·|st) from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi14, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state-of-the-art, based only on supervised learning of convolutional networks, won 11% of games against Pachi23 and 12% against a slightly weaker program Fuego24.
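(补充示例:公式(2)是标准的REINFORCE式策略梯度更新。下面用一个玩具化的线性softmax策略和随机生成的±1结果演示该更新的形式;真实训练中st、at来自RL策略网络的自我博弈,zt是真实对局胜负,这里全部用随机占位数据代替。)

```python
import numpy as np

rng = np.random.default_rng(1)

N_MOVES, N_FEATURES = 361, 48
W = rng.normal(scale=0.01, size=(N_FEATURES, N_MOVES))  # rho; in the paper initialised from sigma

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def play_toy_game(W, length=30):
    """Placeholder self-play: random features, actions sampled from the policy, random outcome."""
    states, actions = [], []
    for _ in range(length):
        s = rng.normal(size=N_FEATURES)
        p = softmax(s @ W)                   # p_rho(.|s_t)
        a = rng.choice(len(p), p=p)          # a_t ~ p_rho(.|s_t)
        states.append(s)
        actions.append(a)
    z = rng.choice([-1.0, 1.0])              # stand-in for the terminal reward z_t = ±r(s_T)
    return states, actions, z

lr = 0.01
for game in range(100):
    states, actions, z = play_toy_game(W)
    for s, a in zip(states, actions):
        p = softmax(s @ W)
        grad_logp = np.outer(s, -p)          # d log p(a|s) / dW, all columns ...
        grad_logp[:, a] += s                 # ... plus the chosen-action column
        W += lr * z * grad_logp              # eq. (2): ascend grad log p_rho(a_t|s_t) * z_t
```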

4.估值网络的强化学习Reinforcement Learningof Value Networks

训练管道的最后阶段聚焦于局面评估:估计一个估值函数vp(s),用于预测双方都采用策略p时从局面s出发的对弈结果27–29,

vp(s) = Ε[zt|st = s, at...T~p].   (3)

理想情况下,我们想知道完美对弈下的最优值函数v*(s);实践中,我们转而估算最强策略即RL策略网络pρ对应的值函数vpρ。我们用带权重θ的估值网络vθ(s)来近似该值函数:vθ(s)≈vpρ(s)≈v*(s)。这个神经网络具有与策略网络类似的结构,但输出一个单一的预测值,而不是一个概率分布。我们在状态-结果对(s,z)上用回归训练估值网络的权重,使用随机梯度下降来最小化预测值vθ(s)与相应结果z之间的均方误差(MSE),

Δθ ∝ ( ∂vθ(s) / ∂θ ) (z − vθ(s))   (4)

用包含完整对局的数据来预测对弈结果的朴素做法,会导致过拟合。问题在于:相邻局面是高度相关的,只差一枚棋子,但回归目标却被整盘对局共用。用这种方式在KGS数据集上训练时,估值网络记住了对局结果,而没有推广到新局面:测试集上的最小均方误差为0.37,而训练集上为0.19。为缓解这个问题,我们生成了一个新的自我博弈数据集,包含3000万个互不相同的局面,每个局面采样自一盘单独的对局。每盘对局都在RL策略网络与它自身之间进行,直到对局结束。在该数据集上训练后,训练集和测试集上的均方误差分别为0.226和0.234,表明过拟合极小。图2b显示了估值网络的局面评估精度:与使用快速走子策略pπ的蒙特卡洛走子相比,估值函数一贯更加准确。vθ(s)的单次评估也接近使用RL策略网络pρ的蒙特卡洛走子的精度,而所用计算量少了1.5万倍。

The final stage of the training pipeline focuses on position evaluation, estimating a value function vp(s) that predicts the outcome from position s of games played by using policy p for both players27–29,

vp(s) = Ε[zt|st = s, at...T~p].   (3)

Ideally, we would like to know the optimal value function under perfect play v*(s); in practice, we instead estimate the value function vpρ for our strongest policy, using the RL policy network pρ. We approximate the value function using a value network vθ(s) with weights θ, vθ(s)≈vpρ(s)≈v*(s). This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution. We train the weights of the value network by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value vθ(s), and the corresponding outcome z,

Δθ ∝ ( ∂vθ(s) / ∂θ ) (z-vθ(s)).   (4)

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the KGS dataset in this way, the value network memorised the game outcomes rather than generalising to new positions, achieving a minimum MSE of 0.37 on the test set, compared to 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network, compared to Monte Carlo rollouts using the fast rollout policy pπ; the value function was consistently more accurate. A single evaluation of vθ(s) also approached the accuracy of Monte Carlo rollouts using the RL policy network pρ, but using 15,000 times less computation.
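(补充示例:公式(4)本质上是对均方误差做梯度下降。下面用一个玩具化的线性-tanh值函数和合成的(s,z)数据演示这一更新;“每盘对局只取一个局面”对应正文中为避免相邻局面强相关而构造自我博弈数据集的做法。这里的数据与模型结构均为演示假设,并非论文的卷积估值网络。)

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic (state, outcome) pairs; in the paper each of the 30M positions
# comes from a distinct self-play game to avoid correlated regression targets.
N_FEATURES, N_SAMPLES = 48, 4096
S = rng.normal(size=(N_SAMPLES, N_FEATURES))
true_w = rng.normal(size=N_FEATURES)
z = np.tanh(S @ true_w) + 0.1 * rng.normal(size=N_SAMPLES)   # noisy outcomes in [-1, 1]

theta = np.zeros(N_FEATURES)
lr = 0.05
for step in range(501):
    idx = rng.integers(0, N_SAMPLES, size=64)
    s, target = S[idx], z[idx]
    v = np.tanh(s @ theta)                    # v_theta(s), a single scalar prediction per state
    dv_dtheta = (1.0 - v**2)[:, None] * s     # derivative of tanh(s @ theta) w.r.t. theta
    # Eq. (4): move theta along dv/dtheta * (z - v), i.e. gradient descent on the MSE.
    theta += lr * ((target - v)[:, None] * dv_dtheta).mean(axis=0)
    if step % 100 == 0:
        mse = np.mean((np.tanh(S @ theta) - z) ** 2)
        print(f"step {step}: MSE {mse:.3f}")
```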

5.策略网络和估值网络搜索Searching with Policyand Value Networks

AlphaGo在一个MCTS算法中结合策略网络和估值网络(图3),通过前向搜索来选择弈法。搜索树的每条边(s,a)存储:弈法值Q(s,a)、访问计数N(s,a)和前驱概率P(s,a)。从当前的根状态出发,通过模拟(即不做备份地沿树下行完成整局)来遍历该搜索树。在每次模拟的每个时间步t,从状态st中选出一个弈法at,

at = argmaxa ( Q(st,a) + u(st,a) )   (5)

以最大化弈法值加上一个奖励值

u(s,a) ∝ P(s,a) / (1 + N(s,a))

该奖励值与前驱概率成正比、但随重复访问而衰减,以鼓励对搜索空间的探索。当遍历在第L步到达一个叶节点sL时,该叶节点可能被展开。叶节点局面sL仅由SL策略网络pσ处理一次,其输出概率被存储为每个合法弈法a的前驱概率P(s,a) = pσ(a|s)。叶节点以两种截然不同的方式评估:一是通过估值网络vθ(sL);二是通过一次随机走子的结果zL——用快速走子策略pπ一直下到终局步T。这两种评估用一个混合参数λ组合成叶节点估值V(sL):

V(sL) = (1−λ)vθ(sL) + λzL   (6)

第n次模拟结束时,所有被遍历过的边的弈法值和访问计数都会更新。每条边累加其访问计数,以及所有经过该边的模拟的平均估值:

N(s,a) = ∑i=1..n 1(s,a,i)   (7)

Q(s,a) = (1/N(s,a)) ∑i=1..n 1(s,a,i) V(siL)   (8)

式中siL是第i次模拟的叶节点,1(s,a,i)表示第i次模拟是否经过了边(s,a)。搜索结束时,算法从根局面选择访问次数最多的弈法落子。

图3:AlphaGo中的蒙特卡洛树搜索。图3a,每次模拟通过选择弈法值Q最大、再加上奖励值u(P)的那条边来遍历搜索树,u(P)取决于该边存储的前驱概率P。图3b,叶节点可能被展开;新节点由策略网络pσ处理一次,其输出概率作为每个弈法的前驱概率P存储。图3c,模拟结束时,以两种方式评估叶节点:使用估值网络vθ;以及用快速走子策略pπ走子到终局,再用函数r计算赢家。图3d,更新弈法值Q,以追踪该弈法之下子树中所有评估值r(·)和vθ(·)的平均值。

Figure 3: Monte Carlo tree search in AlphaGo. a Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge. b The leaf node may be expanded; the new node is processed once by the policy network pσ and the output probabilities are stored as prior probabilities P for each action. c At the end of a simulation, the leaf node is evaluated in two ways: using the value network vθ; and by running a rollout to the end of the game with the fast rollout policy pπ, then computing the winner with function r. d Action-values Q are updated to track the mean value of all evaluations r(·) and vθ(·) in the subtree below that action.

值得注意的是,SL策略网络pσ在AlphaGo中的表现优于更强的RL策略网络pρ,可能是因为人类会选择一束多样化的有前景落子,而RL只针对单一最佳落子做优化。然而,由更强的RL策略网络导出的估值函数vθ(s)≈vpρ(s),在AlphaGo中的表现优于由SL策略网络导出的估值函数vθ(s)≈vpσ(s)。

评估策略网络和估值网络所需的计算量,比传统的搜索启发式方法高出几个数量级。为了把MCTS与深度神经网络有效结合,AlphaGo采用异步多线程搜索,在CPU上执行模拟,在GPU上并行计算策略网络和估值网络。最终版AlphaGo使用了40个搜索线程、48个CPU和8个GPU。我们还实现了一个分布式AlphaGo版本,利用多台机器、40个搜索线程、1202个CPU和176个GPU。方法一节提供了异步和分布式MCTS的全部细节。

AlphaGo combines the policy and value networks in an MCTS algorithm (Figure 3) that selects actions by lookahead search. Each edge (s,a) of the search tree stores an action value Q(s,a), visit count N(s,a), and prior probability P(s,a). The tree is traversed by simulation (i.e. descending the tree in complete games without backup), starting from the root state. At each time-step t of each simulation, an action at is selected from state st,

at = argmaxa ( Q(st,a) + u(st,a) )   (5)

so as to maximize action value plus a bonus

u(s,a) ∝ P(s,a) / (1 + N(s,a))

that is proportional to the prior probability but decays with repeated visits to encourage exploration. When the traversal reaches a leaf node sL at step L, the leaf node may be expanded. The leaf position sL is processed just once by the SL policy network pσ. The output probabilities are stored as prior probabilities P for each legal action a, P(s,a) = pσ(a|s). The leaf node is evaluated in two very different ways: first, by the value network vθ(sL); and second, by the outcome zL of a random rollout played out until terminal step T using the fast rollout policy pπ; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(sL),

V(sL) = (1−λ)vθ(sL) + λzL   (6)

At the end of simulation n, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge,

N(s,a) = ∑i=1..n 1(s,a,i)   (7)

Q(s,a) = (1/N(s,a)) ∑i=1..n 1(s,a,i) V(siL)   (8)

where siL is the leaf node from the ith simulation, and 1(s,a,i) indicates whether an edge (s,a) was traversed during the ith simulation. Once the search is complete, the algorithm chooses the most visited move from the root position.
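(补充示例:下面是公式(5)–(8)所述搜索循环的一个单线程简化示意:按Q(s,a)+u(s,a)选择、展开时存入先验概率P(s,a)、用混合参数λ组合叶节点估值、再沿路径回传平均值。为了让代码可以直接运行,这里用一个每次取1–3颗棋子的Nim类小游戏代替围棋,用均匀先验和零值函数代替策略/估值网络;它只演示搜索骨架,完全不涉及AlphaGo的网络与异步实现。)

```python
import random

random.seed(0)

# --- toy two-player game (Nim-like): take 1-3 stones, whoever takes the last stone wins ---
# A state is (stones_left, player_to_move); values are from the perspective of the player to move.

def legal_moves(state):
    stones, _ = state
    return [m for m in (1, 2, 3) if m <= stones]

def apply_move(state, move):
    stones, player = state
    return (stones - move, -player)

def terminal_value(state):
    stones, _ = state
    if stones == 0:
        return -1.0          # no stones left: the opponent just took the last one and won
    return None

# --- stand-ins for AlphaGo's networks ---
def policy_priors(state):
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}   # uniform P(s,a); p_sigma(a|s) in the paper

def value_estimate(state):
    return 0.0                                    # dummy v_theta(s); a trained net goes here

def rollout(state):
    """Random playout (stand-in for the fast rollout policy p_pi); value for the player to move."""
    v = terminal_value(state)
    if v is not None:
        return v
    return -rollout(apply_move(state, random.choice(legal_moves(state))))

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}    # move -> Node
        self.N = {}           # visit counts N(s,a)
        self.Q = {}           # mean action values Q(s,a)
        self.P = {}           # prior probabilities P(s,a)

def mcts(root_state, n_simulations=2000, c_puct=1.0, lam=0.5):
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, []
        # selection, eq. (5): argmax_a Q(s,a) + u(s,a), with u(s,a) proportional to P/(1+N)
        while node.children:
            move = max(node.children,
                       key=lambda a: node.Q[a] + c_puct * node.P[a] / (1 + node.N[a]))
            path.append((node, move))
            node = node.children[move]
        v_term = terminal_value(node.state)
        if v_term is None:
            # expansion: store priors P(s,a) and create child nodes
            for m, p in policy_priors(node.state).items():
                node.P[m], node.Q[m], node.N[m] = p, 0.0, 0
                node.children[m] = Node(apply_move(node.state, m))
            # leaf evaluation, eq. (6): V(sL) = (1 - lambda) * v_theta(sL) + lambda * zL
            leaf_value = (1 - lam) * value_estimate(node.state) + lam * rollout(node.state)
        else:
            leaf_value = v_term
        # backup, eqs. (7)-(8): update visit counts and the running mean of evaluations,
        # flipping the sign at each level so Q stays in the perspective of the player to move
        value = leaf_value
        for parent, move in reversed(path):
            value = -value
            parent.N[move] += 1
            parent.Q[move] += (value - parent.Q[move]) / parent.N[move]
    return max(root.N, key=root.N.get)            # play the most-visited move from the root

print(mcts((7, +1)))   # optimal play from 7 stones is to take 3, leaving a multiple of 4
```

其中λ=0.5的混合方式,对应正文后面报告的“混合评估表现最佳”的设置。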

The SL policy network pσ performed better in AlphaGo than the stronger RL policy network pρ, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move. However, the value function vθ(s)≈vpρ(s) derived from the stronger RL policy network performed better in AlphaGo than a value function vθ(s)≈vpσ(s) derived from the SL policy network.

Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
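(补充示例:正文提到AlphaGo在CPU上异步执行模拟、在GPU上并行计算网络。下面只演示“把大量走子模拟分发给多个工作线程”这一最简化的并行思路;AlphaGo实际使用的是在多线程间共享同一棵搜索树、并异步批量调用神经网络的复杂实现,且CPython的GIL使这里的线程主要起示意作用,并不能对纯Python计算带来真正的加速。)

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Minimal illustration of farming independent random rollouts out to worker threads.
# This is only a sketch of the idea of running simulations on many CPU workers;
# it is not AlphaGo's shared-tree, asynchronous, GPU-batched search.

def random_rollout(stones):
    """Random playout of the Nim-like toy game; +1 if the player to move at the start wins."""
    player = 1
    while True:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return player          # the player who took the last stone wins
        player = -player

def parallel_value(stones, n_rollouts=10000, n_workers=8):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(random_rollout, [stones] * n_rollouts))
    return sum(results) / n_rollouts   # Monte Carlo estimate of the position's value

print(parallel_value(7))               # a small positive number: random play slightly favours the side to move
```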

6.AlphaGo弈棋算力评估Evaluating the PlayingStrength of AlphaGo

为了评估AlphaGo,我们在AlphaGo的多个版本和其它几种围棋程序之间进行了一场内部锦标赛,包括最强的商业软件Crazy Stone13和Zen,以及最强的开源程序Pachi14和Fuego15。所有这些程序都基于高性能的MCTS算法。此外,我们还纳入了开源程序GnuGo,它使用的是MCTS出现之前最先进的搜索方法。比赛中,所有程序每步棋只有5秒的计算时间。

比赛结果如图4所示:单机版AlphaGo比以往任何围棋程序都强出许多段位,在与其它围棋程序的495局中赢了494局(99.8%)。为了加大难度,我们还让AlphaGo让4子(即让对手免费先落4子)对弈;AlphaGo对Crazy Stone、Zen和Pachi的让子棋胜率分别为77%、86%和99%。分布式版本的AlphaGo明显更强,对阵单机版AlphaGo的胜率为77%,对阵其他程序则全胜。

图4:AlphaGo的锦标赛评估。图4a,不同围棋程序之间锦标赛的结果(参见扩展数据表6至11)。每个程序每步使用大约5秒的计算时间。为了给AlphaGo更大的挑战,一些程序(浅色的上方条柱)在对阵所有对手时被让4子(即每局开始时免费落4子)。各程序按Elo等级分评估30:230分的差距相当于79%的获胜概率,大致相当于KGS上一个业余段位的优势31;图中还给出了与人类段位的大致对应关系,水平线表示该程序在线达到的KGS段位。与人类欧洲冠军樊麾的对局也包括在内;这些对局使用更长的用时规则。图中显示95%置信区间。图4b,单台机器上AlphaGo在不同组件组合下的性能。仅使用策略网络的版本不执行任何搜索。图4c,AlphaGo中蒙特卡洛树搜索随搜索线程和GPU数量的可扩展性研究,分别使用异步搜索(浅蓝色)或分布式搜索(深蓝色),每步2秒。

Figure 4: Tournament evaluation of AlphaGo. a Results of a tournament between different Go programs (see Extended Data Tables 6 to 11). Each program used approximately 5 seconds computation time per move. To provide a greater challenge to AlphaGo, some programs (pale upper bars) were given 4 handicap stones (i.e. free moves at the start of every game) against all opponents. Programs were evaluated on an Elo scale30: a 230 point gap corresponds to a 79% probability of winning, which roughly corresponds to one amateur dan rank advantage on KGS31; an approximate correspondence to human ranks is also shown, horizontal lines show KGS ranks achieved online by that program. Games against the human European champion Fan Hui were also included; these games used longer time controls. 95% confidence intervals are shown. b Performance of AlphaGo, on a single machine, for different combinations of components. The version solely using the policy network does not perform any search. c Scalability study of Monte Carlo tree search in AlphaGo with search threads and GPUs, using asynchronous search (light blue) or distributed search (dark blue), for 2 seconds per move.
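(补充示例:图4说明中“230分Elo差距约对应79%胜率”的换算,可用标准的Elo逻辑斯蒂公式验证——这里假定常用的400分刻度:)

```python
def elo_win_probability(rating_gap, scale=400):
    """Expected score of the higher-rated player under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))

print(f"{elo_win_probability(230):.2f}")   # ~0.79, matching the 79% quoted in the caption
```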

我们还评估了仅使用估值网络(λ=0)或仅使用走子(λ=1)评估局面的AlphaGo变体(见图4b)。即使不用走子,AlphaGo的性能也超过所有其他围棋程序,这表明估值网络为围棋中的蒙特卡洛评估提供了一种可行的替代方案。然而,混合评估(λ=0.5)的表现最佳,对其他变体的胜率达到95%以上。这表明两种局面评估机制是互补的:估值网络近似了由强大但慢得不实用的pρ所下对局的结果,而走子则能精确地评估由较弱但更快的走子策略pπ所下对局的结果。图5展示了AlphaGo对一个真实对局局面的评估。

最后,我们评估了分布式AlphaGo对阵樊麾的表现。樊麾是职业二段,2013、2014和2015年欧洲围棋冠军。2015年10月5日至9日,AlphaGo与樊麾进行了五局正式比赛,AlphaGo以5比0获胜(见图6和扩展数据表1)。这是计算机围棋程序首次在不让子的完整围棋对局中击败人类职业棋手——这一壮举此前被认为至少还要十年才能实现3,7,32。

图5:AlphaGo(执黑,轮到落子)在与樊麾的非正式对局中如何选择落子。对于以下每项统计,最大值的位置用橙色圆圈表示。图5a,用估值网络vθ(sʹ)评估根局面s的所有后继局面sʹ,给出评估最高的几处的估计获胜百分比。图5b,从根局面s出发,树中每条边(s,a)的弈法值Q(s,a),仅对估值网络评估取平均(λ=0)。图5c,弈法值Q(s,a),仅对走子评估取平均(λ=1)。图5d,直接来自SL策略网络的落子概率pσ(a|s),以百分比报告(若高于0.1%)。图5e,模拟期间各弈法从根节点被选中的百分比频率。图5f,AlphaGo搜索树的主要变化(访问次数最多的路径),落子按编号顺序呈现。AlphaGo选择了红色圆圈表示的落子;樊麾以白色方块表示的落子回应;在赛后复盘中,他表示更喜欢AlphaGo预测的落子(标号1)。

Figure 5: How AlphaGo (black, to play) selected its move in an informal game against Fan Hui. For each of the following statistics, the location of the maximum value is indicated by an orange circle. a Evaluation of all successors sʹ of the root position s, using the value network vθ(sʹ); estimated winning percentages are shown for the top evaluations. b Action-values Q(s,a) for each edge (s,a) in the tree from root position s, averaged over value network evaluations only (λ=0). c Action-values Q(s,a), averaged over rollout evaluations only (λ=1). d Move probabilities directly from the SL policy network, pσ(a|s); reported as a percentage (if above 0.1%). e Percentage frequency with which actions were selected from the root during simulations. f The principal variation (path with maximum visit count) from AlphaGo’s search tree. The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; in his post-game commentary he preferred the move (1) predicted by AlphaGo.

To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone13 and Zen, and the strongest open source programs Pachi14 and Fuego15. All of these programs are based on high-performance MCTS algorithms. In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 seconds of computation time per move.

The results of the tournament (see Figure 4a) suggest that single machine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with 4 handicap stones (i.e. free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single machine AlphaGo and 100% of its games against other programs.

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ=0) or just rollouts (λ=1) (see Figure 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation (λ=0.5) performed best, winning ≥95% against other variants. This suggests that the two position evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow pρ, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy pπ. Figure 5 visualises AlphaGo’s evaluation of a real game position.

Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. On 5–9 October 2015 AlphaGo and Fan Hui competed in a formal five game match. AlphaGo won the match 5 games to 0 (see Figure 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go; a feat that was previously believed to be at least a decade away3,7,32.

7.讨论Discussion

在这项工作中,我们开发了一个基于深度神经网络和树搜索组合的围棋程序,达到了最强人类棋手的水平,从而实现了人工智能的“重大挑战”之一31–33。我们首次为围棋开发了有效的落子选择和局面评估函数,它们基于由监督学习和强化学习以新颖方式组合训练的深度神经网络。我们引入了一种新的搜索算法,成功地将神经网络评估与蒙特卡洛走子相结合。我们的程序AlphaGo把这些组件大规模地集成到一个高性能树搜索引擎中。

在与樊麾的比赛中,AlphaGo评估的局面数量比深蓝在与卡斯帕罗夫的国际象棋比赛中少了数千倍4;它通过用策略网络更智能地选择要评估的局面、用估值网络更精确地评估这些局面来加以补偿——这种方式也许更接近人类的下棋方式。此外,深蓝依靠人工设计的评估函数,而AlphaGo的神经网络则纯粹通过通用的监督学习和强化学习方法,直接从对弈中训练得到。

图6:AlphaGo对弈人类欧洲冠军樊麾的比赛。落子按对弈顺序编号显示。在同一交叉点上的重复落子,在棋盘下方成对显示。每对中的第一个数字表示重复落子是第几手,其落点为第二个数字所标识的交叉点。

Figure 6: Games from the match between AlphaGo and the human European champion, Fan Hui. Moves are shown in a numbered sequence corresponding to the order in which they were played. Repeated moves on the same intersection are shown in pairs below the board. The first move number in each pair indicates when the repeat move was played, at an intersection identified by the second move number.

围棋在很多方面集中体现了人工智能面临的困难33,34:一个具有挑战性的决策任务;一个难以处理的搜索空间;以及一个复杂到看起来无法用策略函数或估值函数直接逼近的最优解。计算机围棋上一次的重大突破——MCTS的引入——带动了许多其他领域的相应进展,例如一般游戏博弈、经典规划、部分可观测规划、调度和约束满足35,36。通过将树搜索与策略网络、估值网络相结合,AlphaGo终于在围棋上达到了职业水平,也为在其他看似棘手的人工智能领域实现人类水平的性能带来了希望。

在线内容:方法以及任何其他扩展数据展示项目和源数据,均可在本文的在线版本中找到;仅这些部分引用的参考文献只出现在在线论文中。

2015年11月11日收到;2016年1月5日接受。

In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence’s “grand challenges”31–33. We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning. We have introduced a new search algorithm that successfully combines neural network evaluations with Monte Carlo rollouts. Our program AlphaGo integrates these components together, at scale, in a high-performance tree search engine.

During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov4; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods.

Go is exemplary in many ways of the difficulties faced by artificial intelligence33,34: a challenging decision-making task, an intractable search space, and an optimal solution so complex it appears infeasible to directly approximate using a policy or value function. The previous major breakthrough in computer Go, the introduction of MCTS, led to corresponding advances in many other domains; for example, general game-playing, classical planning, partially observed planning, scheduling, and constraint satisfaction35,36. By combining tree search with policy and value networks, AlphaGo has finally reached a professional level in Go, providing hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.

Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

Received 11 November 2015; accepted 5 January 2016.

(余略,请赞赏后下载PDF完整版)

8.方法Methods

(余略,请赞赏后下载PDF完整版)

9.参考文献References

1.        Allis, L. V. Searching for Solutions inGames and Artificial Intelligence. PhD thesis, Univ. Limburg, Maastricht, TheNetherlands (1994)

2.        van den Herik, H., Uiterwijk, J. W. &van Rijswijck, J. Games solved: now and in the future. Artif. Intell. 134,277–311 (2002)

3.        Schaeffer, J. The games computers (andpeople) play. Advances in Computers 52, 189–266 (2000)

(余略,请赞赏后下载PDF完整版)

61.     Dean,J. et al. Large scale distributed deep networks. In Advances in NeuralInformation Processing Systems, 1223–1231 (2012)

62.     Goratings. http://www.goratings.org

12.扩展数据Extended data

10.致谢Acknowledgements

11.作者信息Author Contributions

这些作者在这项工作作出了同等贡献。

戴维·斯尔弗(DavidSilver)和黄士杰;

(余略,请赞赏后下载PDF完整版)

13.4延伸阅读Further reading

《X力学——一个无尽的前沿》,杨伟、王宏涛、李铁峰和曲少新,《中国科学:物理学 力学 天文学》(2019)。

X-Mechanics—An endless frontier, https://doi.org/10.1007/s11433-018-9274-6

Wei Yang, HongTao Wang, TieFeng Li & ShaoXing Qu

Science China Physics, Mechanics& Astronomy (2019)

14.评论Comments

提交评论即表示您同意遵守我们的条款和社区准则。如果您发现滥用或不符合我们的条款或准则,请将其标记为不当。

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

(译注2:感谢翻译中Dr何万青Dr余凯ETS颜为民等人建议。欢迎大家关注译文质量。)


B深度神经网络经典算法AlphaGo论文指标和译后(8122)

深度神经网络经典算法AlphaGo论文指标和译后

文|秦陇纪,数据简化DataSimp20181013Sat-1103Sat

1.自然研究期刊文章指标A Nature ResearchJournal Article metrics

自然研究期刊为《精通围棋博弈的深层神经网络和树搜索算法》一文统计的文章指标(最后更新时间:周一,2018年12月24日07:33:36 GMT)。(译注3:Article metrics for: Mastering the game of Go with deep neural networks and tree search, https://www.nature.com/articles/nature16961/metrics [21])

1.1文章指标Article metrics

本文章指标(Articlemetrics)表明:

1. 此文总引用次数:•1076次Web of Science引用 •1245次CrossRef引用;

2. 文章指标得分:•推特2083条 •博客61条 •27个Facebook页面 •在41个Google+帖子中提及 •被260家新闻媒体报道 •3条Reddit •1条F1000 •3个视频 •15个维基百科条目 •6809位Mendeley读者 •28位CiteULike读者。

图1:总引用次数 •1076次Web of Science引用 •1245次CrossRef引用

图2:在线关注

此文章指标分数表示该文章是:

•所有期刊中341,140个相似年龄的跟踪文章的第99个百分位(排名第7)

•在《自然》期刊中相似年龄的988篇被跟踪文章的第99百分位(排名第3)

3. 提及新闻,博客和Google+(请看英文表格)

4. 新闻文章(260):

•NVIDIA推出强大新芯片,大举押注人工智能(The Verge)

•计算机在古老的围棋游戏中击败人类(CNN Money)

•谷歌的人工智能破解了据称计算机无法取胜的游戏(Quartz)

•Un ordenador de Google vence al hombre en un complicado juego de estrategia(ABC.es)

•2016年1月28日星期四的新闻报道(Le Figaro)

•为什么计算机在围棋中获胜是一件大事(Buzzfeed)

•Milepæl: Computer slår europamester i brætspillet go(Videnskab.dk)

•Felmentették a vörösiszap-tragédia vádlottjait(Origo)

科学博客(61):略;

Google+的帖子(41):略;

Twitter人口统计数据:略;

图3:全球推文(Twitter)用户热力图

1.2术语和方法的解释

1. 来源

Web of Science,CrossRef和Altmetric

2. 引文

来自各个服务数据库的文章引用计数可能因服务而异。引用次数依赖于Web of Science和CrossRef各自API的可用性。这些计数一旦可用,就会每天更新。引用计数可用后,点击该引用来源的圆圈即可查看引用本文的文章列表。

3. 新闻,博客和Google+信息

各个主流新闻来源,博客文章或Google+成员引用文章的次数以及原始文章或帖子的链接。新闻文章,博客文章和Google+信息并不总是以可由Altmetric使用的聚合器选取的方式链接到文章,因此列出的链接不一定反映整个媒体,博客或Google+兴趣范围。此外,所涵盖的博客和新闻来源列表由Altmetric手动策划,因此可以自行决定是否包含在科学博客或媒体来源中。新闻,博客和Google+信息由Altmetric提供,每小时更新一次。

4. 指标得分

本指标根据文章收到的在线关注度计算得分。圆圈中的每种彩色线条代表一种不同类型的在线关注,中心的数字是本指标分数。该分数根据两类主要的在线关注来源计算:社交媒体和主流新闻媒体。本指标还跟踪在线文献管理器(如Mendeley和CiteULike)的使用情况,但这些不计入分数。较旧的文章通常得分更高,因为它们有更多时间获得关注。为此,指标还给出了“相似年龄”文章(即在本文发布日期前后6周内发布的文章)的对照数据。

有关Altmetric,本指标分数和使用的源的更详细说明,请参阅本指标信息页面。

5. Twitter人口统计数据

提供按Twitter帐户来源国家/地区细分的推文数量。Twitter来源的地理分类由Altmetric提供,每小时更新一次。


A Nature Research Journal

Article metrics for: Mastering the game of Go with deep neural networks and tree search

Last updated: Mon, 24 Dec 2018 07:33:36 GMT


Total citations

·      1076 Web of Science

·      1245 CrossRef

Online attention

Altmetric score

·      Tweeted by 2083

·      Blogged by 61

·      On 27 Facebook pages

·      Mentioned in 41 Google+ posts

·      Picked up by 260 news outlets

·      3 Reddit

·      1 F1000

·      3 Video

·      15 Wikipedia

·      6809 readers on Mendeley

·      28 readers on CiteULike


This Altmetric score means that the article is:

·      in the 99th percentile (ranked 7th) of the 341,140 tracked articles of a similar age in all journals

·      in the 99th percentile (ranked 3rd) of the 988 tracked articles of a similar age in Nature

Mentions in news, blogs & Google+

News articles (260)

·      NVIDIA bets big on AI with powerful new chip (The Verge)

·      Computer defeats human in ancient game of Go (CNN Money)

·      Google’s AI just cracked the game that supposedly no computer could beat (Quartz)

·      Un ordenador de Google vence al hombre en un complicado juego de estrategia (ABC.es)

·      News story from Le Figaro on Thursday 28 January 2016 (Le Figaro)

·      This Is Why A Computer Winning At Go Is Such A Big Deal (Buzzfeed)

·      Milepæl: Computer slår europamester i brætspillet go (Videnskab.dk)

·      Felmentették a vörösiszap-tragédia vádlottjait (Origo)

Scientific blogs (61)

Google+ posts (41)

Twitter demographics

+-0 tweets in Antarctica

Country  Tweets  % of Tweets
United States  331  15.89%
Japan  178  8.55%
United Kingdom  114  5.47%
France  47  2.26%
South Korea  46  2.21%
Germany  39  1.87%
Spain  28  1.34%
Netherlands  27  1.30%
Canada  27  1.30%
India  14  0.67%
Australia  13  0.62%
Mexico  13  0.62%
China  12  0.58%
Switzerland  11  0.53%
Uruguay  9  0.43%
Austria  8  0.38%
Brazil  8  0.38%
Italy  8  0.38%
Ireland  7  0.34%
Sweden  7  0.34%
Argentina  7  0.34%
Belgium  7  0.34%
Finland  5  0.24%
Singapore  4  0.19%
Norway  4  0.19%
Poland  3  0.14%
Belarus  3  0.14%
Colombia  3  0.14%
Thailand  3  0.14%
Venezuela  3  0.14%
South Africa  2  0.10%
Côte d’Ivoire  2  0.10%
Latvia  2  0.10%
Russia  2  0.10%
Greece  2  0.10%
Bosnia & Herzegovina  2  0.10%
Indonesia  2  0.10%
Denmark  2  0.10%
Maldives  2  0.10%
North Korea  2  0.10%
Chile  2  0.10%
Hungary  2  0.10%
Paraguay  2  0.10%
Hong Kong SAR China  2  0.10%
Taiwan  2  0.10%
Nigeria  2  0.10%
Moldova  2  0.10%
Nepal  1  0.05%
Papua New Guinea  1  0.05%
Luxembourg  1  0.05%
Libya  1  0.05%
Slovenia  1  0.05%
Czech Republic  1  0.05%
Cyprus  1  0.05%
Estonia  1  0.05%
Ukraine  1  0.05%
Andorra  1  0.05%
Portugal  1  0.05%
Pakistan  1  0.05%
Equatorial Guinea  1  0.05%
Honduras  1  0.05%
Turkey  1  0.05%
Kenya  1  0.05%
Laos  1  0.05%
Guatemala  1  0.05%
Palestinian Territories  1  0.05%
El Salvador  1  0.05%
New Zealand  1  0.05%
Samoa  1  0.05%
Uzbekistan  1  0.05%
Mauritania  1  0.05%
Madagascar  1  0.05%
Peru  1  0.05%
Malaysia  1  0.05%
Slovakia  1  0.05%
Uganda  1  0.05%
Togo  1  0.05%
Saudi Arabia  1  0.05%
Ecuador  1  0.05%
Panama  1  0.05%
No location data  1,027  49.30%

Explanation of terms and methodology

Sources

Web of Science, CrossRef and Altmetric

Citations

Single number count for article citations from each service's database may vary by service. The citations counts are reliant on the availability of the individual APIs from Web of Science and CrossRef. These counts are updated daily once they become available. Once a citation count is available, the list of articles citing this one is accessible by clicking on the circle for that citation source.

News, blogs and Google+ posts

The number of times an article has been cited by individual mainstream news sources, blog post, or member of Google+ along with a link to the original article or post. News articles, blog posts and Google+ posts do not always link to articles in a way that can be picked up by aggregators used by Altmetric, so the listed links are not necessarily a reflection of the entire scope of media, blog or Google+ interest. Further, the list of blogs and news sources covered is manually curated by Altmetric and thus is subject to their discretion for inclusion as a scientific blog or media source. The news, blog, and Google+ posts are provided by Altmetric and are updated hourly.

Altmetric score

Altmetric calculates a score based on the online attention an article receives. Each coloured thread in the circle represents a different type of online attention and the number in the centre is the Altmetric score. The score is calculated based on two main sources of online attention: social media and mainstream news media. Altmetric also tracks usage in online reference managers such as Mendeley and CiteULike, but these do not contribute to the score. Older articles will typically score higher because they have had more time to get noticed. To account for this, Altmetric has included the context data for articles of a "similar age" (published within 6 weeks of either side of the publication date of this article).

For a more detailed description of Altmetric, the Altmetric score, and sources used, please see Altmetric's information page.

Twitter demographics

Provides the number of tweets broken down by country of origin for the Twitter account. The geographic breakdown for the twitter sources is provided by Altmetric and is updated hourly.


2.深度神经网络经典算法AlphaGo论文译后花絮

数据简化社区第一篇经典论文、最先译成中文的深度神经网络经典算法论文《Mastering the game of Go with deep neural networks and tree search》,此次再审读一遍,并配上英语原文做成双语对照版(附全文PDF下载链接),以飨读者。该论文由英国麦克米伦出版公司旗下《自然》杂志2016年1月28日第529卷484-489页刊出,通讯作者为戴维·斯尔弗(davidsilver@google.com)和戴密斯·哈萨比斯(demishassabis@google.com),原文见《自然》官网http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html。其汉语译本2016年3月16日首发于“数据精简DataSimp”(现名:数据简化DataSimp)公众号;秦陇纪这一个月来将其重新翻译一遍,于2018年11月8日重排。原英文包含20图7表8个注释42参考文献,合计18页58k字母(8702单词);本中英文对照版本包含25页58902字(含标点66751字)687段,外加4个译注和2个附录,为Word中英文字符文件。

为了避免低质量的识字汉译媒体误导大众,帮助大数据/人工智能领域圈内学子,数据简化DataSimp社区以提供高质量专业性科普作品,方便大家共同学习BD/AI/NLP/KG领域科学技术为目标。欢迎圈内研究员加入。水平有限、错误在所难免,请直接留言或电邮至QinDragon2010@qq.com指正。

(译注1*原文见《自然》杂志官网http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html,《自然》2016年1月28日第529卷484-489页,©英国麦克米伦出版公司2016保留所有权利。网站提示:邮件可发至DavidSilver戴维·斯尔弗(davidsilver@google.com)Demis Hassabis戴密斯·哈萨比斯(demishassabis@google.com)。

doi:10.1038/nature16961中文译本2016年3月16日首发于“数据精简DataSimp”(现名:数据简化DataSimp)公众号。译者基于“忠于原文”原则翻译,中文译者:秦陇纪-数据简化DataSimp(主要),姬向军-陕西师范大学,杨武霖-中国空间技术研究院,池绍杰-北京工业大学。时间仓促,疏漏之处难免,敬请提出宝贵意见。


AlphaGo算法论文《精通围棋博弈的深层神经网络和树搜索算法》汉译中英对照 (《自然》nature16961原文翻译,机器学习经典)

2016-03-16数据精简DataSimp英译组秦陇纪等人译 数据简化DataSimp

(点击标题下「数据简化DataSimp」文字链接后继续点击关注接收本号信息)

语音、文字、图片可对话聊天提问,挑战比阿尔法狗AlphaGo更厉害的语音图文多媒体数据人工智能。欢迎关注,回复,点赞,分享朋友圈,转发,转载本公众号文章。(请注明作者、出处、时间等信息,如“此文转自:数据简化DataSimp英译组秦陇纪等人;微信公号:数据简化DataSimp2016.3.15Tue译著。”字样,详情邮件咨询QinDragon2010@qq.com)本公号文章保留一切权利,如有引文出处不明或遗漏、版权问题等请给公号回复消息留言;投稿邮箱QinDragon2010@qq.com,欢迎数据科学和人工智能学界、业界同仁赐稿。

公号文章目前涉及数据科学相关产学研论文和新闻,如数据产业现状、数据分析处理、人工智能对数据的抽象、信息和数据的流程简化、数据标准化、小数据和大数据关联简化等方面。未来推送数据科学和人工智能、大数据技术顶级团队和技术信息;推往全球主要数据科学家所在地,中英文同步直播最新产学研信息。谋求尽快达到创业层次做数据行业实业。

转载本公号文章请注明作者、出处、时间等信息,如“此文转自©微信公号:数据简化DataSimp,作者:秦陇纪等,时间:2016.3.15Tue译编2018.11.08重排。”字样,详情邮件咨询QinDragon2010@qq.com,转载请保留本信息。本公号文章保留一切权利,如有引文/译注/出处不明或遗漏、版权问题等,请给公号回复消息留言,或发邮件到DataSimp@126.com。欢迎数据科学和人工智能学界、产业界同仁赐稿,投稿邮箱DataSimp@126.com,范围:AI、语言处理、数据分析科学技术论文。

-End-


参考文献(3287字)

1. Silver D, Huang A, Maddison C J, et al. Mastering the game of Gowith deep neural networks and tree search[J]. nature, 2016, 529(7587): 484.

2. Silver, David, et al. "Mastering the game of Go with deepneural networks and tree search." nature 529.7587 (2016): 484.

21. 秦陇纪,数据精简DataSimp.AlphaGo算法论文《精通围棋博弈的深层神经网络和树搜索算法》汉译(《自然》nature16961原文翻译,机器学习经典版).[EB/OL];数据简化DataSimp,https://mp.weixin.qq.com/s?__biz=MzIwMTQ4MzQwNQ==&mid=405199501&idx=1&sn=53164113c932ee0aa9d1060f282bfbdf,2016-03-16,访问日期:2018-12-24.

22. A Nature Research Journal. Article metrics for: Mastering thegame of Go with deep neural networks and tree search.[EB/OL]; nature, https://www.nature.com/articles/nature16961/metrics,Last updated: Mon, 24 Dec 2018 07:33:36 GMT, visit data: 2018-12-25.

x. 秦陇纪.数据简化社区Python官网Web框架概述;数据简化社区2018年全球数据库总结及18种主流数据库介绍;数据科学与大数据技术专业概论;人工智能研究现状及教育应用;信息社会的数据资源概论;纯文本数据溯源与简化之神经网络训练;大数据简化之技术体系.[EB/OL];数据简化DataSimp(微信公众号),http://www.datasimp.org,2017-06-06.

中英AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法(51654字)

(标题下「数据简化DataSimp」文字链接,点击后继续点关注接收推送)

秦陇纪©2010-2018数据简化DataSimp

简介:中英AlphaGo论文:精通围棋博弈的深层神经网络和树搜索算法。作者:谷歌DeepMind围棋组。来源:自然期刊/ CC BY-NC非商业授权/数据简化社区/秦陇纪微信群聊公众号,参考文献附引文出处。公号输入栏回复关键字“AlphaGo论文”或文末链接“阅读原文”可下载本文83k21625PDF资料;标题下蓝链接“数据简化DataSimp”关注后,菜单项有文章分类页。

主编译者:秦陇纪,数据简化DataSimp/科学Sciences/知识简化新媒体创立者,数据简化社区创始人,数据简化OS设计师/架构师,ASM/Cs/Java/Python/Prolog程序员,英语/设计/IT教师。每天大量中英文阅读/设计开发调试/文章汇译编简化,时间精力人力有限,欢迎支持加入社区。版权声明:科普文章仅供学习研究,公开资料©版权归原作者,请勿用于商业非法目的。秦陇纪2018数据简化DataSimp综合汇译编,投稿合作、转载授权/侵权、原文引文错误等请联系DataSimp@126.com沟通。社区媒体:“数据简化DataSimp、科学Sciences、知识简化”新媒体聚集专业领域一线研究员;研究技术时也传播知识、专业视角解释和普及科学现象和原理,展现自然社会生活之科学面。秦陇纪发起,期待您参与各领域;欢迎分享、赞赏、支持科普~~


Appx(1236字).数据简化DataSimp社区简介

信息社会之数据、信息、知识、理论持续累积,远超个人认知学习的时间、精力和能力。必须行动起来,解决这个问题。应对大数据时代的数据爆炸、信息爆炸、知识爆炸,解决之道重在数据简化(Data Simplification):简化减少知识、媒体、社交数据,使信息、数据、知识越来越简单,符合人与设备的负荷。(秦陇纪,2010)

数据简化DataSimp年度会议(DS2010-2019),聚焦数据简化技术(Data Simplification Techniques)对各类数据从采集、处理、存储、阅读、分析、逻辑、形式等方面做简化,应用于信息及数据系统、知识工程、各类数据库、物理空间表征、生物医学数据,数学统计、自然语言处理、机器学习技术、人工智能等领域。欢迎数据科学技术、简化实例相关论文投稿加入数据简化社区,参加会议出版专著。请投会员邮箱DataSimp@163.com,详情访问社区网站www.datasimp.org。填写申请表加入数据简化DataSimp社区成员,应至少有一篇数据智能、编程开发IT文章:①原创数据智能科技论文;②数据智能工程技术开源程序代码;③翻译美欧数据智能科技论文;④社区网站发帖人管理员版主志愿者义工;⑤完善黑白静态和三彩色动态社区S圈型LOGO图标DataSimplification/Sciences/Knowledge Simplification Public Accounts——DataSimp@163.com, 2018.12.12Wed,Xi'an, Shaanxi, China.

LIFE

Life begins at the end of your comfort zone. ——Neale Donald Walsch

THE DAY

The strength of purpose and the clarity of your vision, along with the tenacity to pursue it, is your underlying driver of success. ——Ragy Tomas

投稿QQ223518938数据简化DataSimp社区;技术公众号“数据简化DataSimp”留言,或(备注:姓名/单位-职务/学校-专业/手机号)加微信账号QinlongGEcai,进“数据简化DataSimp社区投稿群科学Sciences学术文献读者群等群聊。关注如下三个公众号(搜名称也行),关注后底部菜单有文章分类页链接。

数据技术公众号“数据简化DataSimp”:

科普公众号“科学Sciences”:

社会教育知识公众号“知识简化”:

(转载请写出处:©数据简化DataSimp2010-2018汇译编,欢迎技术、传媒伙伴投稿、加入数据简化社区!“数据简化DataSimp、科学Sciences、知识简化”投稿反馈邮箱DataSimp@126.com。)

普及科学知识,分享到朋友圈

转发/留言/打赏后“阅读原文”下载PDF

