Techniques to improve the reliability of OpenAI models

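When GPT-3 fails on a task, what should you do?

Search for a better prompt that elicits more reliable answers?
Invest in thousands of examples to fine-tune a custom model?
Assume the model is incapable of the task, and move on?

There is no simple answer; it depends. However, if your task involves logical reasoning or complexity, consider trying the techniques in this article to build more reliable, high-performing prompts.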

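Why GPT-3 fails on complex tasks

If you were asked to multiply 13 by 17, would the answer pop immediately into your mind? For most of us, probably not. Yet that doesn't mean humans are incapable of two-digit multiplication. With a few seconds and some pen and paper, it's not too taxing to work out that 13 x 17 = 13 x 10 + 10 x 7 + 3 x 7 = 130 + 70 + 21 = 221.

Similarly, if you give GPT-3 a task that's too complex to do in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, as with humans, that doesn't necessarily mean the model is incapable of the task. With some time and space to reason things out, the model may still be able to answer reliably.

As an example, if you ask text-davinci-002 the following math problem about juggling balls, it answers incorrectly: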

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A:

There are 8 blue golf balls.

Does this mean that GPT-3 cannot do simple math problems? No; in fact, prompting the model with "Let's think step by step" leads it to solve the problem reliably:

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.

There are 16 balls in total. 
Half of the balls are golf balls. 
That means that there are 8 golf balls. 
Half of the golf balls are blue. 
That means that there are 4 blue golf balls.

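Of course, it's hard to tell from a single example whether this step-by-step trick works in general or just got lucky on this particular problem. But it really does work. On a benchmark of word math problems, the "Let's think step by step" trick raised GPT-3's solve rate massively, from a worthless 18% to a decent 79%!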
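The trick amounts to a one-line change in how the prompt is assembled. A minimal sketch follows, assuming the Q:/A: layout used in the example above; the function name `build_zero_shot_cot_prompt` is hypothetical, and sending the prompt to a model is left out.

```python
def build_zero_shot_cot_prompt(question: str) -> str:
    """Format a question so the answer starts with a step-by-step cue.

    The "Q:" / "A:" layout mirrors the juggler example; it is one
    reasonable layout, not a required one.
    """
    return f"Q: {question.strip()}\nA: Let's think step by step."


prompt = build_zero_shot_cot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls and half of "
    "the golf balls are blue. How many blue golf balls are there?"
)
print(prompt)
```

Because the cue is placed at the start of the answer, the completion continues the reasoning rather than jumping straight to a number.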

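Model capabilities depend on context

When learning to work with GPT-3, one common conceptual mistake is to believe that its capabilities are fixed across all contexts. For example, if GPT-3 gets a simple logic question wrong, one might conclude that it must be incapable of simple logic.

But as the "Let's think step by step" example illustrates, apparent failures of GPT-3 can sometimes be remedied with a better prompt that helps the model steer itself toward the correct output.

How to improve reliability on complex tasks

The rest of this article shares techniques for improving the reliability of large language models on complex tasks. Although some of the techniques are specific to certain types of problems, many of them are built upon general principles that can be applied to a wide range of tasks, e.g.:

Give clearer instructions
Split complex tasks into simpler subtasks
Structure the instruction to keep the model on task
Prompt the model to explain before answering
Ask for justifications of many possible answers, and then synthesize
Generate many outputs, and then use the model to pick the best one
Fine-tune custom models to maximize performance

Split complex tasks into simpler tasks

One way to give a model more time and space to think is to break tasks into simpler pieces.

As an example, consider a task where we ask the model a multiple-choice question about some text, in this case a game of Clue. When asked directly, text-davinci-002 isn't able to put clues 3 and 5 together, and answers incorrectly: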

Use the following clues to answer the following multiple-choice question.
 
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
 
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
 
Solution:

(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

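Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into the correct answer of (a) Yes.

However, instead of asking for the answer directly, we can split the task into three pieces:

First, go through the clues one by one and consider whether each clue is potentially relevant
Second, combine the relevant clues to reason out the answer to the question
Third, write the final answer: either (a), (b), or (c)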

Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)
 
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
 
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
 
Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:

Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.
 
(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.
 
(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick

Given more time and space to think, and guided along a reasoning plan, the model is able to figure out the correct answer of (a) Yes.
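The procedure prompt above follows a fixed template, so it can be assembled from the clues, the question, and the answer choices. The sketch below is one way to do that; `build_procedure_prompt` and its parameters are hypothetical names, and the clue list is shortened for illustration.

```python
from typing import List, Sequence, Tuple

INSTRUCTION = (
    "Use the following clues to answer the following multiple-choice "
    "question, using the following procedure:"
)
PROCEDURE = [
    "(1) First, go through the clues one by one and consider whether the clue is potentially relevant",
    "(2) Second, combine the relevant clues to reason out the answer to the question",
    "(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)",
]


def build_procedure_prompt(
    clues: Sequence[str], question: str, choices: Sequence[Tuple[str, str]]
) -> str:
    """Assemble the stepwise prompt (illustrative helper, not a library API)."""
    lines: List[str] = [INSTRUCTION, *PROCEDURE, "", "Clues:"]
    lines += [f"{i}. {clue}" for i, clue in enumerate(clues, start=1)]
    lines += ["", f"Question: {question}"]
    lines += [f"({label}) {text}" for label, text in choices]
    # End on the first step so the completion begins with the clue review.
    lines += ["", "Solution:", PROCEDURE[0] + ":"]
    return "\n".join(lines)


clue_prompt = build_procedure_prompt(
    clues=[
        "Colonel Mustard was the only person in the observatory.",
        "The person with the candlestick was in the observatory.",
    ],
    question="Was Colonel Mustard in the observatory with the candlestick?",
    choices=[("a", "Yes"), ("b", "No"), ("c", "Unknown")],
)
print(clue_prompt)
```

Ending the prompt on step (1) nudges the model to start with the clue-by-clue review instead of answering immediately.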

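Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask.

For example, if we ask text-davinci-002 to summarize a text in its original language, the model can lapse back into English: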

Summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Summary:

The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability.

However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable:

First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Language:

Spanish

La estadística es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad.
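If you use the two-step prompt programmatically, the completion then carries two pieces: the language and the summary. The sketch below builds the prompt and splits a completion shaped like the one above; the function names are hypothetical, and the assumption that the first non-empty line is the language is taken only from this single example.

```python
from typing import Tuple


def build_summary_prompt(text: str) -> str:
    """Build the identify-then-summarize prompt shown above."""
    return (
        "First, identify the language of the text. Second, summarize the text "
        "using the original language of the text. The summary should be one "
        "sentence long.\n\n"
        f'Text:\n"""\n{text.strip()}\n"""\n\nLanguage:'
    )


def split_completion(completion: str) -> Tuple[str, str]:
    """Split a completion into (language, summary).

    Assumes the first non-empty line names the language and the rest is
    the summary; real completions may not follow this shape.
    """
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    if not lines:
        return "", ""
    return lines[0], " ".join(lines[1:])


language, summary = split_completion(
    "\nSpanish\n\nLa estadística es una ciencia.\n"
)
print(language)
print(summary)
```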

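Prompt the model to explain before answering

Another powerful technique for improving the reliability of answers is to prompt the model to gradually reason out the answer rather than jumping immediately to the final answer. By "thinking aloud", the model can be far more likely to arrive at the correct answer.

Zero-shot

Method

In a paper published in 2022, Takeshi Kojima et al. showed that the easiest way to prompt a model to reason out the answer is simply to begin the answer with "Let's think step by step." Figure 2 illustrates an example:

Results

Applying this simple trick to the MultiArith math dataset, the authors found that "Let's think step by step" roughly quadrupled the accuracy, from 18% to 79%!

Implications

Although the "Let's think step by step" trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either.

If you apply this technique to your own tasks, don't be afraid to experiment with customizing the instruction. "Let's think step by step" is rather generic, so you may find better performance with instructions that hew to a stricter format customized to your use case. For example, you can try more structured variants such as: "First, think step by step about why X might be true. Second, think step by step about why Y might be true. Third, think step by step about whether X or Y makes more sense." You can even give the model an example format to help keep it on track, e.g.: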

Using the IRS guidance below, answer the following questions using this format:
(1) For each criterion, determine whether it is met by the vehicle purchase
- {Criterion} Let's think step by step. {explanation} {yes or no, or if the question does not apply then N/A}.
(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

IRS guidance:
"""
You may be eligible for a federal tax credit under Section 30D if you purchased a car or truck that meets the following criteria:
- Does the vehicle have at least four wheels?
- Does the vehicle weigh less than 14,000 pounds?
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source?
- Was the vehicle purchased in a year before 2022?
  - If so, has the manufacturer sold less than 200,000 qualifying vehicles? (Tesla and GM have sold more than 200,000 qualifying vehicles.)
- Was the vehicle purchased in a year after 2022?
  - If so, is the vehicle present in the following list of North American-assembled vehicles? (The only electric vehicles assembled in North America are the Audi Q5, BMW 330e, BMW X5, Chevrolet Bolt EUV, Chevrolet Bolt EV, Chrysler Pacifica PHEV, Ford Escape PHEV, Ford F Series, Ford Mustang MACH E, Ford Transit Van, GMC Hummer Pickup, GMC Hummer SUV, Jeep Grand Cherokee PHEV, Jeep Wrangler PHEV, Lincoln Aviator PHEV, Lincoln Corsair Plug-in, Lucid Air, Nissan Leaf, Rivian EDV, Rivian R1S, Rivian R1T, Tesla Model 3, Tesla Model S, Tesla Model X, Tesla Model Y, Volvo S60, BMW 330e, Bolt EV, Cadillac Lyriq, Mercedes EQS SUV, and Nissan Leaf.)
"""

Question: Can I claim a federal tax credit for my Toyota Prius Prime bought in 2021?

Solution:

(1) For each criterion, determine whether it is met by the vehicle purchase
- Does the vehicle have at least four wheels? Let's think step by step.

 The Toyota Prius Prime has four wheels, so the answer is yes.
- Does the vehicle weigh less than 14,000 pounds? Let's think step by step. The Toyota Prius Prime weighs less than 14,000 pounds, so the answer is yes.
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source? Let's think step by step. The Toyota Prius Prime has a battery with at least 4 kilowatt hours that may be recharged from an external source, so the answer is yes.
- Was the vehicle purchased in a year before 2022? Let's think step by step. The Toyota Prius Prime was purchased in 2021, which is before 2022, so the answer is yes.
- Was the vehicle purchased in a year after 2022? N/A
- If so, is the vehicle present in the following list of North American-assembled vehicles? N/A

(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

Because the Toyota Prius Prime meets all of the criteria for a federal tax credit, the answer is likely yes.
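A stricter output format also makes the completion easier to consume in code. As a rough sketch, the final verdict can be pulled out of a completion that ends with the "the answer is likely {yes or no}" phrasing requested above; `extract_final_answer` is a hypothetical helper, and real completions may deviate from the requested format.

```python
import re
from typing import Optional

# Matches the closing phrase requested in the prompt format above.
FINAL_ANSWER = re.compile(r"the answer is likely (yes|no)\b", re.IGNORECASE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return "yes" or "no" from the last matching phrase, else None."""
    matches = FINAL_ANSWER.findall(completion)
    return matches[-1].lower() if matches else None


completion = (
    "(2) After considering each criterion in turn, phrase the final answer as "
    '"Because of {reasons}, the answer is likely {yes or no}."\n\n'
    "Because the Toyota Prius Prime meets all of the criteria for a federal "
    "tax credit, the answer is likely yes."
)
final_answer = extract_final_answer(completion)
print(final_answer)
```

Returning None when the phrase is missing lets the caller retry or flag the response instead of guessing.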

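Few-shot examples

Method

Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ("few-shot"), as studied by Jason Wei and Denny Zhou et al. from Google. Here's an example few-shot chain-of-thought prompt:

More demonstrations of reasoning chains written by human labelers:

Results

Testing on grade school math problems, the authors found that chain-of-thought prompting roughly tripled the solve rate, from 18% to 57%.

In addition to math problems, chain-of-thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (fewer than 8 or so).

Implications

One advantage of the few-shot example-based approach relative to the "Let's think step by step" technique is that you can more easily specify the format, length, and style of reasoning that you want the model to perform before landing on its final answer. This can be particularly helpful in cases where the model isn't initially reasoning in the right way or depth.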
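A few-shot chain-of-thought prompt is a series of worked demonstrations followed by the new question. The sketch below assembles one; the helper name, the "The answer is ..." suffix, and the second question are illustrative assumptions, not the exact format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(
    examples: List[Tuple[str, str, str]], question: str
) -> str:
    """Join (question, reasoning, answer) demonstrations with a new question."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # Leave the last answer open so the model writes its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


few_shot_prompt = build_few_shot_cot_prompt(
    examples=[
        (
            "A juggler has 16 balls. Half of the balls are golf balls and half "
            "of the golf balls are blue. How many blue golf balls are there?",
            "There are 16 balls in total. Half of the balls are golf balls, so "
            "there are 8 golf balls. Half of the golf balls are blue, so there "
            "are 4 blue golf balls.",
            "4",
        )
    ],
    # Hypothetical new question, for illustration only.
    question="A baker has 12 rolls and sells a third of them. How many are left?",
)
print(few_shot_prompt)
```

The demonstrations control the style and depth of reasoning, so editing them is the main lever for shaping the output.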

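Fine-tuned

Method

In general, to eke out maximum performance on a task, you'll need to fine-tune a custom model. However, fine-tuning a model using explanations may take many thousands of example explanations, which are costly to write.

In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and only keep the explanations that produce the correct answer. Then, to get additional explanations for some of the incorrect answers, retry the few-shot prompt but with the correct answers given as part of the question. The authors called their procedure STaR (Self-taught Reasoner):

With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations.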
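The data-collection step described above can be sketched as a filter with a retry. This is a simplified illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for the few-shot prompt, it covers a single pass only, and the fine-tuning step itself is omitted.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Example(NamedTuple):
    question: str
    rationale: str
    answer: str


# Hypothetical stand-in for the few-shot prompt: takes a question and an
# optional hint (the correct answer), returns (rationale, answer).
Generate = Callable[[str, Optional[str]], Tuple[str, str]]


def collect_explanations(
    problems: List[Tuple[str, str]], generate: Generate
) -> List[Example]:
    """Keep only rationales that lead to the known correct answer."""
    kept: List[Example] = []
    for question, correct in problems:
        rationale, answer = generate(question, None)
        if answer != correct:
            # Retry with the correct answer given as part of the question.
            rationale, answer = generate(question, correct)
        if answer == correct:
            kept.append(Example(question, rationale, correct))
    return kept


def toy_generate(question: str, hint: Optional[str]) -> Tuple[str, str]:
    """Toy model that always answers "4" unless it is given a hint."""
    if hint is not None:
        return f"Working toward {hint}.", hint
    return "2 + 2 is 4.", "4"


dataset = collect_explanations(
    [("What is 2 + 2?", "4"), ("What is 3 + 3?", "6")], toy_generate
)
print(dataset)
```

The resulting list of question, rationale, and answer triples is what would then be used as fine-tuning data.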

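Results

When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%):

Implications

Using a few-shot prompt to extend or modify a fine-tuning dataset is an idea that can be generalized beyond explanation writing. For example, if you have large quantities of unstructured text that you want to train on, you may find opportunities to use a prompt to extract a structured dataset from your unstructured text, and then fine-tune a custom model on that structured dataset.

Extensions to chain-of-thought prompting

A number of extensions of chain-of-thought prompting have been published as well.

Selection-inference prompting

Method

Published by Antonia Creswell et al., one extension of the chain-of-thought technique is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text (the "selection prompt"). Then, a second prompt infers a conclusion from the selected facts (the "inference prompt"). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea with a figure in their paper.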
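The alternation of selection and inference can be sketched as a simple loop. In the sketch below, `select` and `infer` are hypothetical callables standing in for the two prompts, the toy versions are rule-based placeholders rather than model calls, and the fixed step count is an assumption made for illustration.

```python
from typing import Callable, List

Select = Callable[[str, List[str]], List[str]]  # (question, known facts) -> chosen facts
Infer = Callable[[List[str]], str]              # chosen facts -> one new conclusion


def selection_inference(
    question: str, facts: List[str], select: Select, infer: Infer, steps: int
) -> List[str]:
    """Alternate selection and inference for a fixed number of steps."""
    known = list(facts)
    conclusions: List[str] = []
    for _ in range(steps):
        selected = select(question, known)
        conclusion = infer(selected)
        conclusions.append(conclusion)
        known.append(conclusion)  # later selections can build on it
    return conclusions


def toy_select(question: str, known: List[str]) -> List[str]:
    return [fact for fact in known if "observatory" in fact]


def toy_infer(selected: List[str]) -> str:
    return f"Combining {len(selected)} facts: Colonel Mustard had the candlestick."


steps_taken = selection_inference(
    "Was Colonel Mustard in the observatory with the candlestick?",
    [
        "Miss Scarlett was the only person in the lounge.",
        "Colonel Mustard was the only person in the observatory.",
        "The person with the candlestick was in the observatory.",
    ],
    toy_select,
    toy_infer,
    steps=1,
)
print(steps_taken)
```

Each conclusion is added to the pool of known facts, so a later step can select it alongside the original text.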

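Results

When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbi and ProofWriter benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined selection-inference prompting with fine-tuning.

Implications

Although the gains on these benchmarks were large, these benchmarks were specifically chosen because they required longer sequences of reasoning. On problems that don't require reasoning with many steps, the gains are likely smaller.

The results highlight a couple of general lessons for working with large language models. One, splitting up complex tasks into smaller tasks is a great way to improve reliability and performance; the more atomic the task, the less room there is for the model to err. Two, getting maximum performance often means combining fine-tuning with whatever approach you've chosen.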

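Faithful reasoning architecture

A few months after publishing the selection-inference prompting technique, the authors extended the technique in a follow-up paper, with ideas for:

figuring out when the selection-inference cycle should stop or continue
adding a value function to help search over multiple reasoning paths
reducing hallucination of fake facts by fine-tuning models to reason about sentence labels (e.g., sen1) rather than writing out the sentences themselves

Method

In the original selection-inference technique, specialized "selection" and "inference" prompts are alternated to select facts and make inferences from those facts, combining to generate a sequence of reasoning steps.

The authors extend this technique with two additional components.

First, the authors add a "halter" model that, after each inference step, is asked whether the inferences thus far are sufficient to answer the question. If yes, then the model generates a final answer.

The halter model brings a couple of advantages:

it can tell the selection-inference process to stop or keep going, as necessary
if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess

Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability: instead of generating a single answer from the model, generate a set of answers and then use some type of value function / discriminator / verifier model to pick the best one.

In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context.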
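The two added components can be sketched independently of any particular model. In this rough illustration, `next_step`, `is_sufficient`, and `value` are hypothetical callables standing in for the inference step, the halter, and the value function; the paper's actual search procedure is more involved than picking a maximum.

```python
from typing import Callable, List, Optional, Sequence

Trace = List[str]


def run_with_halter(
    next_step: Callable[[Trace], str],
    is_sufficient: Callable[[Trace], bool],
    max_steps: int,
) -> Optional[Trace]:
    """Add reasoning steps until the halter says the trace is sufficient.

    Returns None if the halter never fires, so the caller gets no answer
    rather than a guess.
    """
    trace: Trace = []
    for _ in range(max_steps):
        trace.append(next_step(trace))
        if is_sufficient(trace):
            return trace
    return None


def pick_best(traces: Sequence[Trace], value: Callable[[Trace], float]) -> Trace:
    """Pick the highest-scoring trace from a non-empty set of candidates."""
    return max(traces, key=value)


# Toy stand-ins: each step is numbered, and two steps are enough.
halted = run_with_halter(
    next_step=lambda trace: f"step {len(trace) + 1}",
    is_sufficient=lambda trace: len(trace) >= 2,
    max_steps=5,
)
gave_up = run_with_halter(
    next_step=lambda trace: f"step {len(trace) + 1}",
    is_sufficient=lambda trace: False,
    max_steps=3,
)
# Toy value function that prefers shorter traces.
best = pick_best([["a", "b", "c"], ["a"]], value=lambda trace: -len(trace))
print(halted, gave_up, best)
```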

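Results

The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and EntailmentBankQA (shown). The technique increased accuracy substantially, especially on harder reasoning problems.

In addition, their sentence label manipulation trick essentially eliminated hallucination!

Implications

This paper illustrates a number of helpful lessons for improving the reliability of large language models:

Split complex tasks into smaller, more reliable subtasks
Generate your answer in a step-by-step fashion, evaluating it along the way
Generate many possible answers and use another model or function to pick the ones that look best
Reduce hallucination by constraining what the model can say (e.g., by using sentence labels instead of sentences)
Maximize performance of models by fine-tuning them on specialized tasks

Least-to-most prompting

In addition to doing poorly on long reasoning chains (where selection-inference shines), chain-of-thought prompting can especially struggle when the examples are short but the task is long.

Method

Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like "To solve {question}, we need to first solve:". Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced.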
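The loop described above can be sketched as follows. Here `next_subtask` and `solve` are hypothetical callables standing in for the two prompts, the toy versions are hard-coded for the juggler problem, and the stopping rule (no further subtask, or a round limit) is an assumption made for illustration.

```python
from typing import Callable, Optional


def least_to_most(
    question: str,
    next_subtask: Callable[[str], Optional[str]],
    solve: Callable[[str, str], str],
    max_rounds: int = 5,
) -> str:
    """Solve subtasks one at a time, appending each solution to the context."""
    context = question
    for _ in range(max_rounds):
        # Stand-in for: "To solve {question}, we need to first solve:"
        subtask = next_subtask(context)
        if subtask is None:
            break
        answer = solve(context, subtask)
        context = f"{context}\n\nQ: {subtask}\nA: {answer}"
    # Finally, answer the original question with all solutions in context.
    return solve(context, question)


QUESTION = (
    "A juggler has 16 balls. Half of the balls are golf balls and half of "
    "the golf balls are blue. How many blue golf balls are there?"
)
SUBTASK = "How many golf balls are there?"


def toy_next_subtask(context: str) -> Optional[str]:
    return None if f"Q: {SUBTASK}" in context else SUBTASK


def toy_solve(context: str, task: str) -> str:
    if task == SUBTASK:
        return "8"
    return "4" if "A: 8" in context else "unknown"


result = least_to_most(QUESTION, toy_next_subtask, toy_solve)
print(result)
```

The toy solver only reaches the final answer once the subtask's solution is in the context, which is the dependency the technique relies on.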

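Results

When applied to benchmarks involving long reasoning chains using code-davinci-002 (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%!

Implications

Although the above gains from least-to-most prompting are impressive, they are measured on a very narrow set of tasks that require long reasoning chains.

Still, they illustrate a common theme: increase reliability by (a) breaking complex tasks into smaller subtasks and (b) giving the model more time and space to work out the answer.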