[NLP] [ChatGPT Series] Chain of Thought: Eliciting Reasoning from Large Models

Chain-of-Thought Prompting: eliciting reasoning from large models. Paper: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"

Paper link: https://arxiv.org/pdf/2201.11903.pdf

I. Introduction

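Language models have revolutionized natural language processing, and scaling them up brings a range of benefits such as better downstream task performance and improved sample efficiency. However, scaling up model size alone has not proved sufficient for strong performance on arithmetic, commonsense, and symbolic reasoning. This paper explores a simple method for unlocking the reasoning ability of large language models, motivated by two ideas: (1) Arithmetic reasoning can benefit from natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch or by finetuning a pretrained model. (2) Large language models offer few-shot learning through prompting. That is, instead of finetuning a separate language model checkpoint for each task, one can simply prompt the model with a few task-specific input-output exemplars.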

Both ideas, however, have limitations. Rationale-augmented training and finetuning require a large set of high-quality rationales, which are much more costly to produce than simple input-output pairs. Traditional few-shot prompting performs poorly on tasks that require reasoning, and often does not improve substantially as the language model scales up. In this paper, the authors combine the strengths of the two ideas in a way that avoids these limitations. Specifically, given a prompt made of triples ⟨input, chain of thought, output⟩, they explore how well large models perform few-shot prompting on reasoning tasks. A chain of thought is a series of intermediate natural language reasoning steps that leads to the final answer, and the approach is called chain-of-thought prompting. Figure 1 shows an example prompt.
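The triple structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `build_prompt` function, the "Q:/A:" layout, and the "The answer is ..." suffix are all assumptions made for this sketch, and the tennis-ball exemplar is included purely as an illustrative math word problem.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot exemplar: <input, chain of thought, output>."""
    question: str
    chain_of_thought: str
    answer: str


def build_prompt(exemplars, test_question, use_chain_of_thought=True):
    """Concatenate the exemplars and append the unanswered test question.

    The "Q:/A:" layout and the "The answer is" suffix are assumed
    formatting choices for this sketch.
    """
    blocks = []
    for ex in exemplars:
        if use_chain_of_thought:
            # Reasoning steps come first, followed by the final answer.
            target = f"{ex.chain_of_thought} The answer is {ex.answer}."
        else:
            # Standard prompting: the exemplar shows only the final answer.
            target = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {target}")
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar (a simple two-step math word problem).
exemplars = [
    Exemplar(
        question=("Roger has 5 tennis balls. He buys 2 more cans of tennis "
                  "balls. Each can has 3 tennis balls. How many tennis balls "
                  "does he have now?"),
        chain_of_thought=("Roger started with 5 balls. 2 cans of 3 tennis "
                          "balls each is 6 tennis balls. 5 + 6 = 11."),
        answer="11",
    )
]

cot_prompt = build_prompt(exemplars, "How many apples are left?")
standard_prompt = build_prompt(
    exemplars, "How many apples are left?", use_chain_of_thought=False
)
```

The only difference between the two prompts is whether each exemplar's answer is preceded by its reasoning steps; the model and the test question are unchanged.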

The paper presents an evaluation on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. The figure above shows the results on the GSM8K benchmark of math word problems, where chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves a new state of the art. A prompting-only approach matters because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality.

II. Chain-of-Thought Prompting

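Consider how we ourselves solve a complicated reasoning task such as a multi-step math word problem. The typical approach is to decompose the problem into intermediate steps and solve each before giving the final answer: "After Jane gives 2 flowers to her mom she has 10 … then after she gives 3 to her dad she will have 7 … so the answer is 7." The goal of this paper is to endow language models with the ability to generate a similar chain of thought, a series of intermediate reasoning steps that lead to the final answer for a problem. We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.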

Figure 1 shows an example of a model producing a chain of thought to solve a math word problem. In this example, the chain of thought resembles a solution that works through the problem step by step before giving the final answer.

As an approach for drawing out reasoning in language models, chain-of-thought prompting has several attractive properties:

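(1) Chain of thought allows a model to decompose a multi-step problem into intermediate steps, which means additional computation can be allocated to problems that require more reasoning steps;
(2) A chain of thought provides an interpretable window into the behavior of the model, and offers an opportunity to debug where the reasoning path went wrong;
(3) Chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable to any task that humans can solve via language;
(4) Chain-of-thought reasoning can be elicited from sufficiently large language models simply by including examples of chains of thought in the exemplars of few-shot prompting.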

III. Arithmetic Reasoning

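We begin with math word problems, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle. When used with a 540B-parameter language model, chain-of-thought prompting performs comparably with task-specific finetuned models on several tasks, and even achieves a new state of the art on the challenging GSM8K benchmark.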

1. Experimental Setup

Benchmarks

Five math word problem benchmarks are considered: (1) GSM8K, a benchmark of math word problems; (2) SVAMP, a dataset of math word problems with varying structures; (3) ASDiv, a dataset of diverse math word problems; (4) AQuA, a dataset of algebraic word problems; (5) the MAWPS benchmark.

Standard prompting

The baseline is standard few-shot prompting, in which the model is given in-context exemplars of input-output pairs before it produces a prediction for a test-time example.

Chain-of-thought prompting

The proposed approach augments each exemplar in few-shot prompting with a chain of thought for the associated answer. Because most of the datasets only have an evaluation split, the authors manually composed a set of eight few-shot exemplars with chains of thought; the right side of Figure 1 shows one such exemplar. This single set of eight exemplars is used for all benchmarks except AQuA. For AQuA, four exemplars from the training set are used.

Language models

The paper evaluates five large language models.

(1) GPT-3, using text-ada-001, text-babbage-001, text-curie-001, and text-davinci-002, which presumably correspond to InstructGPT models of 350M, 1.3B, 6.7B, and 175B parameters;

(2) LaMDA, which has models of 422M, 2B, 8B, 68B, and 137B parameters;

(3) PaLM, which has models of 8B, 62B, and 540B parameters;

(4) UL2 20B;

(5) Codex.

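Outputs are sampled with greedy decoding. For LaMDA, the paper reports results averaged over five random seeds, where each seed uses a different randomly shuffled order of exemplars. Because the LaMDA experiments did not show large variance across seeds, results for all other models are reported for a single exemplar order to save compute.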
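The seed-averaged evaluation described above could be organized roughly as follows. This is a sketch under stated assumptions: `generate` is a hypothetical callable standing in for greedy decoding from a real model, the "The answer is ..." answer format and the regular expression are choices made for this illustration, and exact string match is only one possible scoring rule.

```python
import random
import re
from statistics import mean

# Assumed answer format: the generation contains "The answer is X."
ANSWER_PATTERN = re.compile(r"The answer is ([^\n]+?)\.?(?:\n|$)")


def extract_answer(generation):
    """Return the text after the last "The answer is", or None if absent."""
    matches = ANSWER_PATTERN.findall(generation)
    return matches[-1].strip() if matches else None


def format_prompt(exemplars, question):
    """exemplars is a list of (question, target_text) pairs."""
    shots = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


def evaluate_over_seeds(generate, exemplars, test_set, seeds=(0, 1, 2, 3, 4)):
    """Mean accuracy over seeds; each seed shuffles the exemplar order.

    `generate` is a hypothetical function mapping a prompt string to the
    model's (greedily decoded) continuation.
    """
    accuracies = []
    for seed in seeds:
        order = list(exemplars)
        random.Random(seed).shuffle(order)  # a different order per seed
        correct = 0
        for question, gold in test_set:
            output = generate(format_prompt(order, question))
            correct += extract_answer(output) == gold
        accuracies.append(correct / len(test_set))
    return mean(accuracies), accuracies


# Usage with a stand-in "model" that always produces the same text.
def fake_generate(prompt):
    return "2 + 2 = 4. The answer is 4."


demo_exemplars = [("1 + 1?", "1 + 1 = 2. The answer is 2."),
                  ("1 + 2?", "1 + 2 = 3. The answer is 3.")]
demo_test_set = [("2 + 2?", "4"), ("2 + 3?", "5")]
mean_accuracy, per_seed = evaluate_over_seeds(
    fake_generate, demo_exemplars, demo_test_set
)
```

Reporting the per-seed accuracies alongside the mean makes it easy to check whether variance across exemplar orders is small enough to justify using a single order for the other models.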

2. Results

The strongest results of chain-of-thought prompting are summarized in the figure above. There are three key takeaways.

First, the figure shows that chain-of-thought prompting is an emergent ability of model scale. That is, it does not positively affect performance for small models, and only yields performance gains with models of roughly 100B parameters. The authors found qualitatively that smaller models produce fluent but illogical chains of thought, leading to lower performance than standard prompting.

Second, chain-of-thought prompting brings larger gains on more complicated problems. For instance, on GSM8K performance more than doubled for the largest GPT and PaLM models. On SingleOp, the easiest subset of MAWPS, where problems require only a single step to solve, the improvements were either negative or very small.

Third, chain-of-thought prompting with GPT-3 175B and PaLM 540B compares favorably to the prior state of the art, which typically finetunes a task-specific model on a labeled training dataset. The figure above shows how PaLM 540B uses chain-of-thought prompting to achieve a new state of the art on GSM8K, SVAMP, and MAWPS. On the two remaining datasets, AQuA and ASDiv, PaLM with chain-of-thought prompting comes within 2% of the state of the art.

To better understand why chain-of-thought prompting works, the authors manually examined the chains of thought that LaMDA 137B generated for GSM8K. Of 50 random examples where the model returned the correct final answer, all of the generated chains of thought were logically and mathematically correct except two that arrived at the correct answer by coincidence. The authors also examined 50 random examples for which the model gave the wrong answer: 46% of the chains of thought were almost correct, barring minor mistakes, and the other 54% had major errors in semantic understanding or coherence. To gain insight into why scaling improves chain-of-thought reasoning, the authors performed a similar analysis of the errors made by PaLM 62B and checked whether those errors were fixed by scaling to PaLM 540B. In summary, scaling PaLM to 540B fixes a large portion of the one-step-missing and semantic understanding errors of the 62B model.

3. Ablation Study

Given the gains from chain-of-thought prompting, a natural question is whether the same improvements could be obtained through other types of prompting. The figure above shows an ablation study with three variations of chain of thought.

Equation only

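One possible reason chain-of-thought prompting helps is that it produces the mathematical equation to be evaluated. The paper therefore tests a variation in which the model is prompted to output only a mathematical equation before giving the answer. The figure above shows that equation-only prompting does not help much on GSM8K, which implies that directly producing an equation, without the natural language reasoning steps, is still too hard for the model. For one-step or two-step problems, however, equation-only prompting does improve performance, because the equation can be easily derived from the question.

Variable compute only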

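Another intuition is that chain of thought allows the model to spend more computation on harder problems. To isolate the effect of variable computation from chain-of-thought reasoning, the paper tests a configuration in which the model is prompted to output only a sequence of dots whose length equals the number of characters in the equation needed to solve the problem. This variant performs about the same as the baseline, which suggests that variable computation by itself is not the reason for the success of chain-of-thought prompting, and that expressing intermediate steps in natural language is useful.

Chain of thought after answer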

Another potential benefit of chain-of-thought prompting is simply that such prompts let the model better access relevant knowledge acquired during pretraining. The authors therefore test a configuration in which the chain of thought is given only after the answer. This variant performs about the same as the baseline, which suggests that the sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge.
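The three ablations differ only in what the exemplar's target text looks like, which a short sketch can make concrete. The function names, the "The answer is ..." suffix, and the exact string layouts below are assumptions for illustration; in particular, counting every character of the equation (including spaces) for the dots is one reading of "number of characters in the equation".

```python
def chain_of_thought_target(reasoning, answer):
    """Full method: natural language reasoning, then the answer."""
    return f"{reasoning} The answer is {answer}."


def equation_only_target(equation, answer):
    """Ablation 1: only the equation precedes the answer."""
    return f"{equation}. The answer is {answer}."


def variable_compute_target(equation, answer):
    """Ablation 2: one dot per character of the equation (assumed to
    include spaces), so the model emits extra tokens but no reasoning."""
    dots = "." * len(equation)
    return f"{dots} The answer is {answer}."


def reasoning_after_answer_target(reasoning, answer):
    """Ablation 3: the answer comes first, the reasoning afterwards."""
    return f"The answer is {answer}. {reasoning}"


# Illustrative values for a two-step word problem.
reasoning = ("Roger started with 5 balls. 2 cans of 3 tennis balls each "
             "is 6 tennis balls. 5 + 6 = 11.")
equation = "5 + 2 * 3 = 11"

for target in (
    chain_of_thought_target(reasoning, "11"),
    equation_only_target(equation, "11"),
    variable_compute_target(equation, "11"),
    reasoning_after_answer_target(reasoning, "11"),
):
    print(target)
```

Seen this way, each ablation removes exactly one candidate explanation: the equation-only variant drops the language, the dots variant keeps only the extra tokens, and the reversed variant keeps the text but removes its role in reaching the answer.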

4. Robustness of Chain of Thought

Sensitivity to exemplars is a key consideration for prompting approaches. For instance, varying the order of the few-shot exemplars can cause the accuracy of GPT-3 on the SST-2 benchmark to range from 54.3% to 93.4%. This subsection evaluates robustness to chains of thought written by different annotators. The figure above shows the results for LaMDA 137B on GSM8K and MAWPS. Although there is variance among the different chain-of-thought annotations, all sets of chain-of-thought prompts outperform the standard baseline by a large margin. This result implies that successful use of chain of thought does not depend on a particular linguistic style.

To confirm that chain-of-thought prompting also works with other sets of exemplars, the paper additionally runs experiments with three sets of eight exemplars randomly sampled from the GSM8K training set. The figure above shows that these prompts perform comparably with the manually written exemplars and also substantially outperform standard prompting.

In addition to robustness to annotators, independently written chains of thought, different exemplars, and various language models, the authors also find that chain-of-thought prompting for arithmetic reasoning is robust to different exemplar orders and varying numbers of exemplars.

IV. Commonsense Reasoning

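Although chain of thought is particularly well suited to math word problems, its language-based nature actually makes it applicable to a broad class of commonsense reasoning problems, which involve reasoning about physical and human interactions under the presumption of general background knowledge. Commonsense reasoning is key for interacting with the world and is still beyond the reach of current natural language understanding systems.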

1. Benchmarks

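The paper considers five datasets covering a diverse range of commonsense reasoning types. CSQA asks commonsense questions about the world that involve complex semantics and often require prior knowledge. StrategyQA requires models to infer a multi-hop strategy to answer questions. Two specialized evaluation sets are chosen from BIG-bench: Date Understanding, which involves inferring a date from a given context, and Sports Understanding, which involves determining whether a sentence relating to sports is plausible or implausible. Finally, the SayCan dataset involves mapping a natural language instruction to a sequence of robot actions.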

2. Prompts

The experimental setup is the same as in the previous section. For CSQA and StrategyQA, examples are randomly selected from the training set and chains of thought are manually composed for them to serve as few-shot exemplars. The two BIG-bench tasks have no training sets, so the first ten examples of the evaluation set are used as few-shot exemplars. For SayCan, six examples are taken from the training set and chains of thought are again composed manually.

3. Results

The figure above highlights the results for PaLM. For all tasks, scaling up model size improves the performance of standard prompting; chain-of-thought prompting leads to further gains, with the improvements appearing to be largest for PaLM 540B. With chain-of-thought prompting, PaLM 540B achieves strong performance relative to the baselines. It outperforms the prior state of the art on StrategyQA (75.6% vs 69.4%) and outperforms an unaided sports enthusiast on Sports Understanding. These results show that chain-of-thought prompting can also improve performance on tasks that require a range of commonsense reasoning abilities.

V. Symbolic Reasoning

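The final experiments evaluate symbolic reasoning, which is simple for humans but can be very challenging for language models. The experiments show that chain-of-thought prompting not only enables language models to perform symbolic reasoning tasks, but also facilitates length generalization to inference-time inputs that are longer than those seen in the few-shot exemplars.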

1. Tasks

Last letter concatenation

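This task asks the model to concatenate the last letters of the words in a name (for example, Amy Brown → yn). It is a more challenging version of first letter concatenation, which language models can already perform without chain of thought. Full names are generated by randomly concatenating names from the top one thousand first and last names in name census data.

Coin flip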
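The last letter concatenation task is simple enough to state exactly in code. The sketch below computes the ground-truth answer and also builds one plausible chain of thought for an exemplar; the function names and the wording of the reasoning template are assumptions for illustration, not taken from the text above.

```python
def last_letter_concat(name):
    """Ground truth: join the last letter of every word in the name."""
    return "".join(word[-1] for word in name.split())


def last_letter_chain_of_thought(name):
    """Build an illustrative chain of thought for one name.

    The sentence template is an assumed format for this sketch.
    """
    words = name.split()
    steps = " ".join(
        f'The last letter of "{word}" is "{word[-1]}".' for word in words
    )
    answer = last_letter_concat(name)
    return f'{steps} Concatenating them is "{answer}". The answer is {answer}.'


print(last_letter_concat("Amy Brown"))
print(last_letter_chain_of_thought("Amy Brown"))
```

Because the same two functions work for names of any length, they can generate both two-word exemplars and longer test names without modification.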

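This task asks the model to answer whether a coin is still heads up after people either flip or do not flip it (for example: "A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?" → "no").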

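Because the construction of these symbolic reasoning tasks is well defined, each task has an in-domain test set, whose examples have the same number of steps as the training/few-shot exemplars, and an out-of-domain (OOD) test set, whose examples have more steps than the exemplars. For last letter concatenation, the model only sees exemplars of names with two words, and then performs the task on names with three and four words. The same is done for the number of potential flips in the coin flip task. Chains of thought for the few-shot exemplars of each task are composed manually.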
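A sketch of how in-domain and OOD coin flip sets could be generated follows. The sentence template mirrors the example given above; the name pool, the 50% flip probability, the set sizes, and the choice of two people for in-domain versus three and four for OOD (mirroring the last letter setup) are assumptions made for this illustration.

```python
import random

# Hypothetical name pool for this sketch.
NAME_POOL = ["Phoebe", "Osvaldo", "Maria", "Jamal", "Ingrid", "Kenji"]


def make_coin_flip_example(names, flips):
    """names: list of people; flips: list of bools (True = flips the coin).

    The coin starts heads up, so it is still heads up exactly when the
    number of actual flips is even.
    """
    sentences = ["A coin is heads up."]
    for name, flipped in zip(names, flips):
        verb = "flips" if flipped else "does not flip"
        sentences.append(f"{name} {verb} the coin.")
    sentences.append("Is the coin still heads up?")
    answer = "yes" if sum(flips) % 2 == 0 else "no"
    return " ".join(sentences), answer


def sample_split(num_people, size, seed=0):
    """Sample `size` examples that each involve `num_people` potential flips."""
    rng = random.Random(seed)
    examples = []
    for _ in range(size):
        names = rng.sample(NAME_POOL, num_people)
        flips = [rng.random() < 0.5 for _ in range(num_people)]
        examples.append(make_coin_flip_example(names, flips))
    return examples


in_domain = sample_split(num_people=2, size=5)          # same length as exemplars
out_of_domain = sample_split(num_people=4, size=5, seed=1)  # more steps than exemplars
```

Keeping the number of potential flips as an explicit parameter is what makes the length-generalization test possible: the exemplars and the OOD questions come from the same generator and differ only in that one argument.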

2. Results

The figure above shows the in-domain and OOD evaluation results for PaLM. With PaLM 540B, chain-of-thought prompting leads to solve rates of almost 100%. Note that these in-domain evaluations are "toy tasks" in the sense that the perfect solution structure is already provided by the chains of thought in the few-shot exemplars; at test time, all the model has to do is repeat the same steps with the new symbols. Even so, small models still fail: the ability to perform abstract manipulations on unseen symbols for these tasks only arises at a scale of about 100B model parameters.

In the OOD evaluations, standard prompting fails on both tasks. With chain-of-thought prompting, language models show upward scaling curves, although performance is lower than in the in-domain setting. Hence, for language models of sufficient scale, chain-of-thought prompting facilitates length generalization beyond the chains of thought seen in the exemplars.