type
status
date
slug
summary
tags
category
icon
password
The Practice and Lore of Neural Net Training神经网络训练的实践与知识
特别是在过去的十年里,神经网络训练的艺术取得了许多进展。是的,这基本上是一门艺术。有时候,尤其是事后,人们可以看到至少有一点“科学解释”来解释正在进行的事情。但大多数情况下,这些发现都是通过反复试验、添加想法和技巧来逐步建立了关于如何处理神经网络的重要知识。
有几个关键部分。首先,关键是要考虑为特定任务使用什么样的神经网络架构。然后是如何获取用于训练神经网络的数据的关键问题。而且,越来越多的情况下,不再需要从头开始训练网络:相反,一个新网络可以直接整合另一个已经训练好的网络,或者至少可以使用该网络为自己生成更多的训练示例。
人们可能会认为,对于每一种特定的任务,都需要不同的神经网络架构。但事实上发现,同样的架构通常似乎适用于明显不同的任务。在某种程度上,这让人想起了通用计算的概念(以及我的计算等价原则),但正如我将在后面讨论的那样,我认为这更多地反映了我们通常试图让神经网络执行的任务是“类人”的任务,而神经网络可以捕捉相当普遍的“类人过程”。
在神经网络的早期,人们倾向于认为应该“让神经网络尽可能少地工作”。例如,在将语音转换为文本时,人们认为应该首先分析语音的音频,将其分解为音素等。但发现的是,至少对于“类人任务”,通常最好的方法是尝试训练神经网络解决“端到端问题”,让它自己“发现”必要的中间特征、编码等。
还有一个想法,即应该将复杂的个体组件引入神经网络中,使其实际上“明确实现特定的算法思想”。但再次发现,这在大多数情况下并不值得;相反,最好只处理非常简单的组件,并让它们“自行组织”(尽管通常以我们无法理解的方式)来实现(可能是)那些算法思想的等效物。
这并不意味着对神经网络没有相关的“结构性思想”。例如,在处理图像的早期阶段,具有局部连接的神经元的 2D 数组似乎至少非常有用。而且,集中于“按顺序回顾”的连接模式似乎很有用,正如我们稍后将看到的,用于处理人类语言等事物,例如在 ChatGPT 中。
但神经网络的一个重要特征是,就像计算机一样,它们最终只是在处理数据。而当前的神经网络——以及当前的神经网络训练方法——具体处理的是数字数组。但在处理过程中,这些数组可以被完全重新排列和重塑。举个例子,我们用于识别上面数字的网络从一个 2D“图像样式”数组开始,迅速“加厚”到许多通道,然后“浓缩”成一个最终包含代表不同可能输出数字的元素的 1D 数组。
但神经网络的一个重要特征是,就像计算机一样,它们最终只是在处理数据。
但是,好吧,一个人如何能够确定为特定任务需要多大的神经网络呢?这有点像一门艺术。在某种程度上,关键是要知道“任务有多难”。但对于类似人类的任务,通常很难估计。是的,可能有一种系统化的方法可以通过计算机“机械地”完成任务。但很难知道是否有可以让人以至少“类似人类水平”更轻松地完成任务的技巧或捷径。可能需要列举一个巨大的游戏树来“机械地”玩某个游戏;但可能有一种更简单(“启发式”)的方法来实现“人类水平的游戏”。
当处理微小神经网络和简单任务时,有时可以明确看到“无法从这里到达那里”。例如,在上一节的任务中,似乎最好的做法是使用几个小型神经网络:
我们看到的是,如果神经网络太小,它就无法复制我们想要的功能。但是在某个大小以上,它就没有问题——至少如果训练足够长时间,提供足够的例子。顺便说一句,这些图片说明了神经网络传统的一部分:如果中间有一个“挤压”,强制所有内容通过较少的中间神经元,通常可以使用较小的网络。值得一提的是,“无中间层”或所谓的“感知器”网络只能学习基本线性函数,但只要有一个中间层,原则上总是可能以任意好的方式逼近任何函数,至少如果有足够的神经元,尽管为了使其可行训练,通常需要一些正则化或归一化。
好的,假设某人已经确定了某种神经网络架构。现在问题在于获取用于训练网络的数据。神经网络以及机器学习的许多实际挑战都集中在获取或准备必要的训练数据上。在许多情况下(“监督学习”),人们希望获得输入的明确示例以及期望从中获得的输出。因此,例如,一个人可能希望对图像进行标记,或者标记其他属性。也许一个人将不得不明确地经过一番努力进行标记。但很多时候,事实证明可以依赖已经完成的工作,或者将其用作某种代理。例如,一个人可能会使用网络上为图像提供的 alt 标签。或者在不同领域,一个人可能会使用为视频创建的闭路字幕。或者在语言翻译训练中,一个人可能会使用存在于不同语言中的网页或其他文档的平行版本。
How much data do you need to show a neural net to train it for a particular task? Again, it’s hard to estimate from first principles. Certainly the requirements can be dramatically reduced by using “transfer learning” to “transfer in” things like lists of important features that have already been learned in another network. But generally neural nets need to “see a lot of examples” to train well. And at least for some tasks it’s an important piece of neural net lore that the examples can be incredibly repetitive. And indeed it’s a standard strategy to just show a neural net all the examples one has, over and over again. In each of these “training rounds” (or “epochs”) the neural net will be in at least a slightly different state, and somehow “reminding it” of a particular example is useful in getting it to “remember that example”. (And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)
你需要展示多少数据给神经网络来训练它完成特定任务?再次强调,从第一原则出发很难估计。当然,通过使用“迁移学习”可以大大减少需求,将已经在另一个网络中学习过的重要特征列表“迁入”。但通常神经网络需要“看到很多例子”才能训练良好。对于一些任务来说,神经网络传统上需要看到大量重复的例子。事实上,一个标准策略是反复向神经网络展示所有已有的例子。在每个“训练轮次”(或“时代”)中,神经网络至少会处于稍微不同的状态,某种方式“提醒它”某个特定例子对于让它“记住那个例子”是有用的。(是的,也许这类似于人类记忆中重复的有用性。)
But often just repeating the same example over and over again isn’t enough. It’s also necessary to show the neural net variations of the example. And it’s a feature of neural net lore that those “data augmentation” variations don’t have to be sophisticated to be useful. Just slightly modifying images with basic image processing can make them essentially “as good as new” for neural net training. And, similarly, when one’s run out of actual video, etc. for training self-driving cars, one can go on and just get data from running simulations in a model videogame-like environment without all the detail of actual real-world scenes.
但通常仅仅重复相同的例子并不足够。还需要展示示例的神经网络变化。神经网络传说的一个特点是,这些“数据增强”变化并不一定要复杂才能有用。只需用基本图像处理轻微修改图像,就可以使它们在神经网络训练中基本上“焕然一新”。同样,当一个人用于自动驾驶汽车训练的实际视频等资源用尽时,可以继续从在类似模拟视频游戏环境中运行的模拟中获取数据,而无需所有实际现实场景的细节。
How about something like ChatGPT? Well, it has the nice feature that it can do “unsupervised learning”, making it much easier to get it examples to train from. Recall that the basic task for ChatGPT is to figure out how to continue a piece of text that it’s been given. So to get it “training examples” all one has to do is get a piece of text, and mask out the end of it, and then use this as the “input to train from”—with the “output” being the complete, unmasked piece of text. We’ll discuss this more later, but the main point is that—unlike, say, for learning what’s in images—there’s no “explicit tagging” needed; ChatGPT can in effect just learn directly from whatever examples of text it’s given.ChatGPT 这样的东西怎么样?嗯,它有一个很好的特性,即它可以进行“无监督学习”,这样更容易为其提供训练示例。回想一下,ChatGPT 的基本任务是找出如何继续给定的一段文本。因此,要获得“训练示例”,我们只需获取一段文本,将其末尾掩盖,然后将其用作“训练输入”,“输出”是完整的、未掩盖的文本。我们稍后会更详细地讨论这一点,但主要观点是,与学习图像内容不同,不需要“显式标记”;ChatGPT 实际上可以直接从其获得的文本示例中学习。
OK, so what about the actual learning process in a neural net? In the end it’s all about determining what weights will best capture the training examples that have been given. And there are all sorts of detailed choices and “hyperparameter settings” (so called because the weights can be thought of as “parameters”) that can be used to tweak how this is done. There are different choices of loss function (sum of squares, sum of absolute values, etc.). There are different ways to do loss minimization (how far in weight space to move at each step, etc.). And then there are questions like how big a “batch” of examples to show to get each successive estimate of the loss one’s trying to minimize. And, yes, one can apply machine learning (as we do, for example, in Wolfram Language) to automate machine learning—and to automatically set things like hyperparameters.好的,那么神经网络中的实际学习过程是怎样的呢?最终,一切都在于确定哪些权重能最好地捕捉到已给出的训练示例。有各种详细的选择和“超参数设置”(因为权重可以被视为“参数”),可以用来调整这样的操作方式。有不同的损失函数选择(平方和、绝对值和等)。有不同的损失最小化方法(在每一步中在权重空间中移动多远等)。然后还有一些问题,比如展示多大“批量”的示例以获得每个连续估计的损失,以及如何最小化。是的,可以应用机器学习(例如在 Wolfram 语言中所做的)来自动化机器学习,并自动设置超参数等事项。
But in the end the whole process of training can be characterized by seeing how the loss progressively decreases (as in this Wolfram Language progress monitor for a small training):但最终整个训练过程可以通过观察损失如何逐渐减少来描述(如在这个 Wolfram 语言的小训练进度监视器中)
And what one typically sees is that the loss decreases for a while, but eventually flattens out at some constant value. If that value is sufficiently small, then the training can be considered successful; otherwise it’s probably a sign one should try changing the network architecture.通常情况下,人们会看到损失会在一段时间内减少,但最终会趋于稳定在某个常数值。如果该值足够小,则可以认为训练是成功的;否则,这可能是需要尝试更改网络架构的迹象。
Can one tell how long it should take for the “learning curve” to flatten out? Like for so many other things, there seem to be approximate power-law scaling relationships that depend on the size of neural net and amount of data one’s using. But the general conclusion is that training a neural net is hard—and takes a lot of computational effort. And as a practical matter, the vast majority of that effort is spent doing operations on arrays of numbers, which is what GPUs are good at—which is why neural net training is typically limited by the availability of GPUs.有人能说出“学习曲线”要多久才会趋于平缓吗?就像许多其他事情一样,似乎存在着依赖于使用的神经网络大小和数据量的近似幂律缩放关系。但总的结论是,训练神经网络很困难,需要大量的计算工作。而实际上,绝大部分的工作都是在数字数组上进行操作,而这正是 GPU 擅长的领域,这也是为什么神经网络训练通常受到 GPU 可用性的限制。
In the future, will there be fundamentally better ways to train neural nets—or generally do what neural nets do? Almost certainly, I think. The fundamental idea of neural nets is to create a flexible “computing fabric” out of a large number of simple (essentially identical) components—and to have this “fabric” be one that can be incrementally modified to learn from examples. In current neural nets, one’s essentially using the ideas of calculus—applied to real numbers—to do that incremental modification. But it’s increasingly clear that having high-precision numbers doesn’t matter; 8 bits or less might be enough even with current methods.在未来,是否会有根本更好的方法来训练神经网络,或者一般地做神经网络所做的事情?我几乎可以肯定。神经网络的基本理念是利用大量简单(基本上相同)的组件创建一个灵活的“计算结构”,并使这个“结构”能够逐步修改以从示例中学习。在当前的神经网络中,基本上是使用微积分的思想——应用于实数——来进行这种逐步修改。但越来越明显的是,高精度数字并不重要;即使使用当前的方法,8 位或更少可能已经足够。
With computational systems like cellular automata that basically operate in parallel on many individual bits it’s never been clear how to do this kind of incremental modification, but there’s no reason to think it isn’t possible. And in fact, much like with the “deep-learning breakthrough of 2012” it may be that such incremental modification will effectively be easier in more complicated cases than in simple ones.像元胞自动机这样的计算系统基本上是并行地在许多个体位上运行,如何进行这种渐进式修改一直不太清楚,但没有理由认为这是不可能的。实际上,就像“2012 年的深度学习突破”一样,这种渐进式修改在更复杂的情况下可能会比简单情况下更容易。
Neural nets—perhaps a bit like brains—are set up to have an essentially fixed network of neurons, with what’s modified being the strength (“weight”) of connections between them. (Perhaps in at least young brains significant numbers of wholly new connections can also grow.) But while this might be a convenient setup for biology, it’s not at all clear that it’s even close to the best way to achieve the functionality we need. And something that involves the equivalent of progressive network rewriting (perhaps reminiscent of our Physics Project) might well ultimately be better.神经网络——或许有点像大脑——被设置为具有基本固定的神经元网络,修改的是它们之间连接的强度(“权重”)。 (也许至少在年轻大脑中,完全新的连接也可以增长。)但是,虽然这对生物学可能是一个方便的设置,但并不清楚这是否接近我们需要实现功能的最佳方式。 而涉及等效于渐进网络重写的东西(也许让人想起我们的物理项目)最终可能会更好。
But even within the framework of existing neural nets there’s currently a crucial limitation: neural net training as it’s now done is fundamentally sequential, with the effects of each batch of examples being propagated back to update the weights. And indeed with current computer hardware—even taking into account GPUs—most of a neural net is “idle” most of the time during training, with just one part at a time being updated. And in a sense this is because our current computers tend to have memory that is separate from their CPUs (or GPUs). But in brains it’s presumably different—with every “memory element” (i.e. neuron) also being a potentially active computational element. And if we could set up our future computer hardware this way it might become possible to do training much more efficiently.但即使在现有神经网络的框架内,目前存在一个关键限制:神经网络训练目前基本上是顺序进行的,每一批示例的影响都会被传播回去更新权重。事实上,即使考虑到 GPU,当前的计算机硬件在训练过程中大部分时间都是“空闲”的,只有一部分在更新。从某种意义上说,这是因为我们当前的计算机往往具有与 CPU(或 GPU)分开的内存。但在大脑中,情况可能不同——每个“记忆元素”(即神经元)也可能是一个潜在的活跃计算元素。如果我们能够这样设置未来的计算机硬件,可能会更有效地进行训练。
“Surely a Network That’s Big Enough Can Do Anything!”“一个足够大的网络肯定可以做任何事情!”
The capabilities of something like ChatGPT seem so impressive that one might imagine that if one could just “keep going” and train larger and larger neural networks, then they’d eventually be able to “do everything”. And if one’s concerned with things that are readily accessible to immediate human thinking, it’s quite possible that this is the case. But the lesson of the past several hundred years of science is that there are things that can be figured out by formal processes, but aren’t readily accessible to immediate human thinking.像 ChatGPT 这样的东西的能力似乎令人印象深刻,人们可能会想象,如果能够不断“继续”并训练更大更大的神经网络,那么最终它们将能够“做任何事情”。如果一个人关心的是那些对立即人类思维可读的事物,这种情况是很可能的。但过去几百年科学的教训是,有些事情可以通过正式过程找出,但对立即人类思维并不容易理解。
Nontrivial mathematics is one big example. But the general case is really computation. And ultimately the issue is the phenomenon of computational irreducibility. There are some computations which one might think would take many steps to do, but which can in fact be “reduced” to something quite immediate. But the discovery of computational irreducibility implies that this doesn’t always work. And instead there are processes—probably like the one below—where to work out what happens inevitably requires essentially tracing each computational step:非平凡的数学是一个很好的例子。但一般情况下,真正的问题是计算。最终问题是计算不可简化现象。有一些计算,人们可能认为需要很多步骤才能完成,但实际上可以“简化”为某种相当直接的东西。但计算不可简化的发现意味着这并不总是奏效。相反,有一些过程——可能像下面这个——需要基本上追踪每个计算步骤才能弄清楚发生了什么:
The kinds of things that we normally do with our brains are presumably specifically chosen to avoid computational irreducibility. It takes special effort to do math in one’s brain. And it’s in practice largely impossible to “think through” the steps in the operation of any nontrivial program just in one’s brain.我们通常用大脑做的事情,可能是特意选择的,以避免计算的不可简化性。在大脑中进行数学运算需要特别的努力。而且在实践中,仅凭大脑“思考”任何非平凡程序操作的步骤几乎是不可能的。
But of course for that we have computers. And with computers we can readily do long, computationally irreducible things. And the key point is that there’s in general no shortcut for these.但是当然,我们有计算机。有了计算机,我们可以轻松地做长时间的、计算上不可简化的事情。关键点是,通常这些事情没有捷径。
Yes, we could memorize lots of specific examples of what happens in some particular computational system. And maybe we could even see some (“computationally reducible”) patterns that would allow us to do a little generalization. But the point is that computational irreducibility means that we can never guarantee that the unexpected won’t happen—and it’s only by explicitly doing the computation that you can tell what actually happens in any particular case.是的,我们可以记住许多特定计算系统中发生的具体例子。也许我们甚至可以看到一些(“计算可简化”)模式,使我们能够进行一些泛化。但关键是,计算不可简化意味着我们永远无法保证意外不会发生 - 只有通过明确进行计算,您才能知道在任何特定情况下实际发生了什么。
And in the end there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.最后,学习能力和计算不可简化之间存在根本的张力。学习实际上涉及通过利用规律性来压缩数据。但计算不可简化意味着最终存在规律性可能性的限制。
As a practical matter, one can imagine building little computational devices—like cellular automata or Turing machines—into trainable systems like neural nets. And indeed such devices can serve as good “tools” for the neural net—like Wolfram|Alpha can be a good tool for ChatGPT. But computational irreducibility implies that one can’t expect to “get inside” those devices and have them learn.作为一个实际问题,人们可以想象将小型计算设备(如细胞自动机或图灵机)构建到像神经网络这样的可训练系统中。事实上,这样的设备可以作为神经网络的良好“工具”——就像 Wolfram|Alpha 可以成为 ChatGPT 的良好工具一样。但计算不可简化意味着人们不能指望“进入”这些设备并让它们学习。
Or put another way, there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.换句话说,能力和可训练性之间存在着最终的权衡:您希望系统充分利用其计算能力,它就会表现出计算不可简化性,从而变得不太可训练。而基本上可训练的系统,就越难进行复杂计算。
(For ChatGPT as it currently is, the situation is actually much more extreme, because the neural net used to generate each token of output is a pure “feed-forward” network, without loops, and therefore has no ability to do any kind of computation with nontrivial “control flow”.)对于当前的 ChatGPT 来说,情况实际上更加极端,因为用于生成每个输出标记的神经网络是纯粹的“前馈”网络,没有循环,因此无法进行任何非平凡“控制流”计算。
Of course, one might wonder whether it’s actually important to be able to do irreducible computations. And indeed for much of human history it wasn’t particularly important. But our modern technological world has been built on engineering that makes use of at least mathematical computations—and increasingly also more general computations. And if we look at the natural world, it’s full of irreducible computation—that we’re slowly understanding how to emulate and use for our technological purposes.当然,人们可能会想知道是否真的很重要能够进行不可约计算。事实上,在人类历史的大部分时间里,这并不特别重要。但我们现代技术世界是建立在利用至少数学计算的工程基础上的,而且越来越多地也使用更一般的计算。如果我们观察自然界,它充满了不可约计算,我们正在逐渐理解如何模拟和利用它们来实现我们的技术目的。
Yes, a neural net can certainly notice the kinds of regularities in the natural world that we might also readily notice with “unaided human thinking”. But if we want to work out things that are in the purview of mathematical or computational science the neural net isn’t going to be able to do it—unless it effectively “uses as a tool” an “ordinary” computational system.是的,神经网络肯定可以注意到自然界中我们可能也很容易用“未经辅助的人类思维”注意到的规律。但是,如果我们想解决数学或计算科学范畴内的问题,神经网络是无法做到的,除非它有效地将一个“普通”的计算系统作为“工具”使用。
But there’s something potentially confusing about all of this. In the past there were plenty of tasks—including writing essays—that we’ve assumed were somehow “fundamentally too hard” for computers. And now that we see them done by the likes of ChatGPT we tend to suddenly think that computers must have become vastly more powerful—in particular surpassing things they were already basically able to do (like progressively computing the behavior of computational systems like cellular automata).但所有这一切可能会让人感到困惑。过去有很多任务,包括写文章,我们一直认为对计算机来说“基本上太难了”。现在我们看到像 ChatGPT 这样的人做这些事情,我们往往会突然认为计算机一定变得更加强大——特别是超越了它们已经基本能够做到的事情(比如逐步计算细胞自动机等计算系统的行为)。
But this isn’t the right conclusion to draw. Computationally irreducible processes are still computationally irreducible, and are still fundamentally hard for computers—even if computers can readily compute their individual steps. And instead what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.但这并不是正确的结论。计算上不可简化的过程仍然是计算上不可简化的,对于计算机来说仍然是基本困难的,即使计算机可以轻松计算它们的各个步骤。相反,我们应该得出的结论是,我们人类可以做的任务,比如写文章,我们以前认为计算机无法完成,实际上在某种意义上比我们想象的要容易。
In other words, the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought. And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.换句话说,神经网络能够成功地撰写文章的原因是因为撰写文章实际上是一个比我们想象中更“计算较浅”的问题。从某种意义上说,这使我们更接近“拥有一个理论”,即我们人类如何处理撰写文章这类事情,或者一般来说如何处理语言。
If you had a big enough neural net then, yes, you might be able to do whatever humans can readily do. But you wouldn’t capture what the natural world in general can do—or that the tools that we’ve fashioned from the natural world can do. And it’s the use of those tools—both practical and conceptual—that have allowed us in recent centuries to transcend the boundaries of what’s accessible to “pure unaided human thought”, and capture for human purposes more of what’s out there in the physical and computational universe.如果你有一个足够大的神经网络,那么,是的,你可能能够做任何人类可以轻松做到的事情。但你不会捕捉到自然界总体可以做到的事情,或者我们从自然界中制造出来的工具可以做到的事情。正是这些工具的使用——无论是实际的还是概念性的——在最近几个世纪里让我们超越了“纯粹无辅助的人类思维”的界限,并为人类目的捕捉到更多物理和计算宇宙中的事物。
The Concept of Embeddings嵌入的概念
Neural nets—at least as they’re currently set up—are fundamentally based on numbers. So if we’re going to to use them to work on something like text we’ll need a way to
. And certainly we could start (essentially as ChatGPT does) by just assigning a number to every word in the dictionary. But there’s an important idea—that’s for example central to ChatGPT—that goes beyond that. And it’s the idea of “embeddings”. One can think of an embedding as a way to try to represent the “essence” of something by an array of numbers—with the property that “nearby things” are represented by nearby numbers.
神经网络——至少目前的设置是基于数字的。因此,如果我们要用它们来处理文本之类的东西,我们需要一种用数字表示文本的方法。当然,我们可以像 ChatGPT 一样,从简单地为字典中的每个单词分配一个数字开始。但有一个重要的想法——比如 ChatGPT 的核心——超越了这一点。那就是“嵌入”的概念。人们可以将嵌入看作是一种尝试用一组数字来表示某物的“本质”的方法,其中“附近的事物”由附近的数字表示。
And so, for example, we can think of a word embedding as trying to
in which words that are somehow “nearby in meaning” appear nearby in the embedding. The actual embeddings that are used—say in ChatGPT—tend to involve large lists of numbers. But if we project down to 2D, we can show examples of how words are laid out by the embedding:
因此,例如,我们可以将词嵌入视为试图在一种“含义空间”中布置单词,其中在含义上“相近”的单词会在嵌入中附近出现。实际使用的嵌入(比如在 ChatGPT 中)往往涉及大量数字列表。但如果我们投影到二维空间,我们可以展示单词如何被嵌入布置的示例:
And, yes, what we see does remarkably well in capturing typical everyday impressions. But how can we construct such an embedding? Roughly the idea is to look at large amounts of text (here 5 billion words from the web) and then see “how similar” the “environments” are in which different words appear. So, for example, “alligator” and “crocodile” will often appear almost interchangeably in otherwise similar sentences, and that means they’ll be placed nearby in the embedding. But “turnip” and “eagle” won’t tend to appear in otherwise similar sentences, so they’ll be placed far apart in the embedding.而且,我们所看到的在捕捉典型的日常印象方面表现得非常出色。但是我们如何构建这样的嵌入呢?大致的想法是查看大量文本(这里是来自网络的 50 亿字),然后看看不同单词出现在其中的“环境”有多“相似”。例如,“鳄鱼”和“鳄鱼”通常会在其他情况下几乎可以互换使用的句子中出现,这意味着它们会在嵌入中靠近放置。但是“萝卜”和“老鹰”不太可能出现在其他情况下相似的句子中,因此它们在嵌入中会被放置得很远。
But how does one actually implement something like this using neural nets? Let’s start by talking about embeddings not for words, but for images. We want to find some way to characterize images by lists of numbers in such a way that “images we consider similar” are assigned similar lists of numbers.但是如何使用神经网络实际实现这样的东西呢?让我们首先讨论的是嵌入,不是针对单词,而是针对图像。我们希望找到一种方法,通过一系列数字来表征图像,使得“我们认为相似的图像”被分配相似的数字列表。
How do we tell if we should “consider images similar”? Well, if our images are, say, of handwritten digits we might “consider two images similar” if they are of the same digit. Earlier we discussed a neural net that was trained to recognize handwritten digits. And we can think of this neural net as being set up so that in its final output it puts images into 10 different bins, one for each digit.我们如何判断是否应该“考虑图像相似”?嗯,如果我们的图像是手写数字,我们可能会认为两个图像相似,如果它们是相同的数字。之前我们讨论过一个经过训练以识别手写数字的神经网络。我们可以将这个神经网络看作是被设置成在最终输出中将图像放入 10 个不同的箱子,每个箱子对应一个数字。
But what if we “intercept” what’s going on inside the neural net before the final “it’s a ‘4’” decision is made? We might expect that inside the neural net there are numbers that characterize images as being “mostly 4-like but a bit 2-like” or some such. And the idea is to pick up such numbers to use as elements in an embedding.但是如果我们在最终做出“是‘4’”决定之前“拦截”神经网络内部发生的事情呢?我们可能期望神经网络内部有一些数字,用来描述图像“大部分像 4 但有点像 2”之类的特征。而这个想法是提取这些数字,作为嵌入中的元素。
So here’s the concept. Rather than directly trying to characterize “what image is near what other image”, we instead consider a well-defined task (in this case digit recognition) for which we can get explicit training data—then use the fact that in doing this task the neural net implicitly has to make what amount to “nearness decisions”. So instead of us ever explicitly having to talk about “nearness of images” we’re just talking about the concrete question of what digit an image represents, and then we’re “leaving it to the neural net” to implicitly determine what that implies about “nearness of images”.所以这里是概念。与其直接尝试表征“哪个图像靠近哪个其他图像”,我们反而考虑一个明确定义的任务(在这种情况下是数字识别),我们可以获得明确的训练数据,然后利用这样一个事实,即在执行这个任务时,神经网络隐含地必须做出类似于“接近决策”的决策。因此,我们不必明确讨论“图像的接近度”,而只需讨论图像代表的数字是什么这个具体问题,然后“交给神经网络”隐含地确定这对“图像的接近度”意味着什么。
So how in more detail does this work for the digit recognition network? We can think of the network as consisting of 11 successive layers, that we might summarize iconically like this (with activation functions shown as separate layers):那么,这对于数字识别网络的工作细节是如何的呢?我们可以将网络看作由 11 个连续层组成,我们可以像这样图标化地总结(激活函数显示为单独的层):
At the beginning we’re feeding into the first layer actual images, represented by 2D arrays of pixel values. And at the end—from the last layer—we’re getting out an array of 10 values, which we can think of saying “how certain” the network is that the image corresponds to each of the digits 0 through 9.
在开始时,我们将实际图像输入到第一层,用 2D 像素值数组表示。而在最后一层,我们得到一个包含 10 个值的数组,我们可以认为这些值表示了网络认为图像与数字 0 到 9 中的每个数字对应的“确定程度”。
Feed in the image
and the values of the neurons in that last layer are:
输入图像 和该最后一层神经元的值为:
In other words, the neural net is by this point “incredibly certain” that this image is a 4—and to actually get the output “4” we just have to pick out the position of the neuron with the largest value.
换句话说,神经网络在这一点上“非常确定”这张图片是一个 4 - 而要实际获得输出“4”,我们只需挑选出具有最大值的神经元的位置。
But what if we look one step earlier? The very last operation in the network is a so-called
which tries to “force certainty”. But before that’s been applied the values of the neurons are:
但如果我们往前看一步呢?网络中的最后一个操作是所谓的 softmax,它试图“强制确定性”。但在应用之前,神经元的值是:
The neuron representing “4” still has the highest numerical value. But there’s also information in the values of the other neurons. And we can expect that this list of numbers can in a sense be used to characterize the “essence” of the image—and thus to provide something we can use as an embedding. And so, for example, each of the 4’s here has a slightly different “signature” (or “feature embedding”)—all very different from the 8’s:
代表“4”的神经元仍然具有最高的数值。但其他神经元的值中也包含信息。我们可以期待这个数字列表在某种意义上可以用来表征图像的“本质”,从而提供我们可以用作嵌入的东西。因此,例如,这里的每个 4 都有略有不同的“签名”(或“特征嵌入”)——与 8 完全不同:
Here we’re essentially using 10 numbers to characterize our images. But it’s often better to use much more than that. And for example in our digit recognition network we can get an array of 500 numbers by tapping into the preceding layer. And this is probably a reasonable array to use as an “image embedding”.在这里,我们基本上使用 10 个数字来描述我们的图像。但通常最好使用更多的数字。例如,在我们的数字识别网络中,我们可以通过访问前一层来获得一个包含 500 个数字的数组。这可能是一个合理的用作“图像嵌入”的数组。
If we want to make an explicit visualization of “image space” for handwritten digits we need to “reduce the dimension”, effectively by projecting the 500-dimensional vector we’ve got into, say, 3D space:如果我们想要为手写数字制作一个明确的“图像空间”可视化,我们需要“降维”,有效地将我们得到的 500 维向量投影到比如说 3D 空间中:
We’ve just talked about creating a characterization (and thus embedding) for images based effectively on identifying the similarity of images by determining whether (according to our training set) they correspond to the same handwritten digit. And we can do the same thing much more generally for images if we have a training set that identifies, say, which of 5000 common types of object (cat, dog, chair, …) each image is of. And in this way we can make an image embedding that’s “anchored” by our identification of common objects, but then “generalizes around that” according to the behavior of the neural net. And the point is that insofar as that behavior aligns with how we humans perceive and interpret images, this will end up being an embedding that “seems right to us”, and is useful in practice in doing “human-judgement-like” tasks.
我们刚刚讨论了如何有效地基于识别图像的相似性来创建图像的表征(从而嵌入),方法是通过确定(根据我们的训练集)图像是否对应于相同的手写数字。如果我们有一个能够识别每个图像属于 5000 种常见对象类型(猫、狗、椅子等)的训练集,我们也可以对图像做同样的事情。通过这种方式,我们可以创建一个由我们对常见对象的识别“锚定”的图像嵌入,然后根据神经网络的行为“泛化”。关键在于,只要这种行为与我们人类感知和解释图像的方式一致,这将最终成为一个“对我们来说看起来正确”的嵌入,并且在实践中对执行“类似人类判断”的任务非常有用。
OK, so how do we follow the same kind of approach to find embeddings for words? The key is to start from a task about words for which we can readily do training. And the standard such task is “word prediction”. Imagine we’re given “the ___ cat”. Based on a large corpus of text (say, the text content of the web), what are the probabilities for different words that might “fill in the blank”? Or, alternatively, given “___ black ___” what are the probabilities for different “flanking words”?好的,那么我们如何采用相同的方法来找到单词的嵌入呢?关键是从一个关于单词的任务开始,我们可以很容易进行训练。标准的任务是“单词预测”。想象一下,我们有“the ___ cat”。根据大量的文本语料库(比如,网络文本内容),不同单词可能“填补空白”的概率是多少?或者,给定“___ black ___”,不同“两侧单词”的概率是多少?
How do we set this problem up for a neural net? Ultimately we have to formulate everything in terms of numbers. And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, “the” might be 914, and “ cat” (with a space before it) might be 3542. (And these are the actual numbers used by GPT-2.) So for the “the ___ cat” problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible “fill-in” words. And once again, to find an embedding, we want to “intercept” the “insides” of the neural net just before it “reaches its conclusion”—and then pick up the list of numbers that occur there, and that we can think of as “characterizing each word”.我们如何为神经网络设置这个问题?最终,我们必须用数字来表达一切。一种方法是为英语中的大约 50,000 个常见单词中的每一个分配一个唯一的数字。例如,“the”可能是 914,“cat”(前面有一个空格)可能是 3542。(这些是 GPT-2 使用的实际数字。)因此,对于“the ___ cat”问题,我们的输入可能是{914, 3542}。输出应该是什么样的呢?嗯,它应该是一个包含大约 50,000 个数字的列表,有效地给出每个可能的“填充”单词的概率。再次,为了找到一个嵌入,我们希望在神经网络“达到结论”之前“拦截”其“内部”,然后获取出现在那里的数字列表,我们可以将其视为“表征每个单词”。
OK, so what do those characterizations look like? Over the past 10 years there’ve been a sequence of different systems developed (word2vec, GloVe, BERT, GPT, …), each based on a different neural net approach. But ultimately all of them take words and characterize them by lists of hundreds to thousands of numbers.好的,那些特征是什么样的?在过去的 10 年里,已经开发了一系列不同的系统(如 word2vec、GloVe、BERT、GPT 等),每个系统都基于不同的神经网络方法。但最终,所有这些系统都会将单词转化为由数百到数千个数字组成的列表。
In their raw form, these “embedding vectors” are quite uninformative. For example, here’s what GPT-2 produces as the raw embedding vectors for three specific words:在其原始形式中,这些“嵌入向量”相当无信息。例如,这是 GPT-2 为三个特定单词生成的原始嵌入向量:
If we do things like measure distances between these vectors, then we can find things like “nearnesses” of words. Later we’ll discuss in more detail what we might consider the “cognitive” significance of such embeddings. But for now the main point is that we have a way to usefully turn words into “neural-net-friendly” collections of numbers.如果我们做类似测量这些向量之间的距离的事情,那么我们可以找到词语之间的“接近度”之类的东西。稍后我们将更详细地讨论我们可能认为这种嵌入的“认知”意义。但现在的主要观点是,我们有一种有用的方法,可以将单词转化为“神经网络友好”的数字集合。
But actually we can go further than just characterizing words by collections of numbers; we can also do this for sequences of words, or indeed whole blocks of text. And inside ChatGPT that’s how it’s dealing with things. It takes the text it’s got so far, and generates an embedding vector to represent it. Then its goal is to find the probabilities for different words that might occur next. And it represents its answer for this as a list of numbers that essentially give the probabilities for each of the 50,000 or so possible words.但实际上,我们可以进一步对单词进行表征,不仅可以用数字集合来描述单词;我们还可以对单词序列,甚至整个文本块进行这样的操作。在 ChatGPT 内部,它就是这样处理事情的。它获取到目前为止的文本,生成一个嵌入向量来表示它。然后,它的目标是找到可能出现的不同单词的概率。它将这个答案表示为一组数字,基本上给出了每个可能单词的概率,大约有 50,000 个左右。
(Strictly, ChatGPT does not deal with words, but rather with “tokens”—convenient linguistic units that might be whole words, or might just be pieces like “pre” or “ing” or “ized”. Working with tokens makes it easier for ChatGPT to handle rare, compound and non-English words, and, sometimes, for better or worse, to invent new words.)(严格来说,ChatGPT 不处理单词,而是处理“标记”——方便的语言单位,可能是整个单词,也可能只是像“pre”、“ing”或“ized”这样的部分。使用标记使 ChatGPT 更容易处理罕见、复合和非英语单词,有时,无论好坏,也更容易发明新单词。)
Inside ChatGPT ChatGPT 内部
OK, so we’re finally ready to discuss what’s inside ChatGPT. And, yes, ultimately, it’s a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights. In many ways this is a neural net very much like the other ones we’ve discussed. But it’s a neural net that’s particularly set up for dealing with language. And its most notable feature is a piece of neural net architecture called a “transformer”.好的,那么我们终于准备好讨论 ChatGPT 内部是什么了。是的,最终,它是一个巨大的神经网络——目前是一个拥有 1750 亿个权重的所谓 GPT-3 网络的版本。在许多方面,这是一个非常像我们讨论过的其他神经网络的神经网络。但它是一个特别为处理语言而设置的神经网络。它最显著的特征是一种名为“transformer”的神经网络架构。
In the first neural nets we discussed above, every neuron at any given layer was basically connected (at least with some weight) to every neuron on the layer before. But this kind of fully connected network is (presumably) overkill if one’s working with data that has particular, known structure. And thus, for example, in the early stages of dealing with images, it’s typical to use so-called convolutional neural nets (“convnets”) in which neurons are effectively laid out on a grid analogous to the pixels in the image—and connected only to neurons nearby on the grid.在上面我们讨论的第一个神经网络中,每个神经元在任何给定层都基本上与前一层的每个神经元连接(至少有一些权重)。但是,如果一个人正在处理具有特定已知结构的数据,这种全连接网络可能过度。因此,例如,在处理图像的早期阶段,通常使用所谓的卷积神经网络("convnets"),其中神经元有效地布置在类似于图像中的像素的网格上,并且仅与网格上附近的神经元连接。
The idea of transformers is to do something at least somewhat similar for sequences of tokens that make up a piece of text. But instead of just defining a fixed region in the sequence over which there can be connections, transformers instead introduce the notion of “attention”—and the idea of “paying attention” more to some parts of the sequence than others. Maybe one day it’ll make sense to just start a generic neural net and do all customization through training. But at least as of now it seems to be critical in practice to “modularize” things—as transformers do, and probably as our brains also do.变压器的概念是为了对组成文本片段的标记序列进行至少在某种程度上类似的处理。但是,变压器并不只是定义一个固定区域,使得序列中可以建立连接,而是引入了“注意力”的概念——以及“更多地关注”序列中的某些部分的想法。也许有一天,通过训练来进行所有定制的通用神经网络会变得合理。但至少目前来看,在实践中似乎至关重要的是“模块化”事物——就像变压器所做的,也可能是我们的大脑所做的。
OK, so what does ChatGPT (or, rather, the GPT-3 network on which it’s based) actually do? Recall that its overall goal is to continue text in a “reasonable” way, based on what it’s seen from the training it’s had (which consists in looking at billions of pages of text from the web, etc.) So at any given point, it’s got a certain amount of text—and its goal is to come up with an appropriate choice for the next token to add.好的,那么 ChatGPT(或者更确切地说,它所基于的 GPT-3 网络)实际上是做什么的呢?回想一下,它的总体目标是以“合理”的方式继续文本,基于它从训练中看到的内容(这包括查看来自网络等的数十亿页文本)。因此,在任何给定的时刻,它都有一定量的文本,其目标是提出适当的选择,以添加下一个标记。
It operates in three basic stages. First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these. Then it operates on this embedding—in a “standard neural net way”, with values “rippling through” successive layers in a network—to produce a new embedding (i.e. a new array of numbers). It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for different possible next tokens. (And, yes, it so happens that there are about the same number of tokens used as there are common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.)它分为三个基本阶段。首先,它获取与迄今为止的文本相对应的标记序列,并找到代表这些标记的嵌入(即一组数字数组)。然后,它以“标准神经网络方式”对这个嵌入进行操作,数值“通过”网络中的连续层产生一个新的嵌入(即一个新的数字数组)。然后,它获取这个数组的最后部分,并从中生成大约 50,000 个值的数组,这些值转化为不同可能下一个标记的概率。(是的,碰巧有大约相同数量的标记用作英语中常见单词的数量,尽管只有大约 3000 个标记是完整单词,其余是片段。)
A critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is “explicitly engineered”; everything is just “learned” from training data.一个关键点是,这个管道的每个部分都是由神经网络实现的,其权重是由网络的端到端训练确定的。换句话说,实际上除了整体架构之外,没有任何东西是“明确设计”的;一切都只是从训练数据中“学习”的。
There are, however, plenty of details in the way the architecture is set up—reflecting all sorts of experience and neural net lore. And—even though this is definitely going into the weeds—I think it’s useful to talk about some of those details, not least to get a sense of just what goes into building something like ChatGPT.然而,建筑设置中有很多细节,反映了各种经验和神经网络传说。尽管这显然是在深入细节,但我认为谈论其中一些细节是有用的,至少可以让我们了解构建像 ChatGPT 这样的东西需要付出多少努力。
First comes the embedding module. Here’s a schematic Wolfram Language representation for it for GPT-2:首先是嵌入模块。这是 GPT-2 的 Wolfram 语言表示的示意图:
The input is a vector of n tokens (represented as in the previous section by integers from 1 to about 50,000). Each of these tokens is converted (by a single-layer neural net) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT’s GPT-3). Meanwhile, there’s a “secondary pathway” that takes the sequence of (integer) positions for the tokens, and from these integers creates another embedding vector. And finally the embedding vectors from the token value and the token position are added together—to produce the final sequence of embedding vectors from the embedding module.输入是一个包含 n 个标记的向量(如前一节中所表示的,这些标记是从 1 到大约 50,000 的整数)。这些标记中的每一个都会被转换(通过一个单层神经网络)成一个嵌入向量(对于 GPT-2 是长度为 768,对于 ChatGPT 的 GPT-3 是长度为 12,288)。同时,还有一个“次要路径”,它接受标记的(整数)位置序列,并从这些整数中创建另一个嵌入向量。最后,从标记值和标记位置的嵌入向量相加,以产生来自嵌入模块的最终嵌入向量序列。
Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. And it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right” it’s usually possible to home in on details just by doing sufficient training, without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.为什么要将令牌值和令牌位置嵌入向量相加呢?我认为这并没有什么特别的科学依据。只是尝试了各种不同的方法,这种方法似乎有效。神经网络的传统认为,只要设置大致正确,通常只需通过充分的训练就可以逐渐细化细节,而无需真正理解神经网络是如何配置自身的。
Here’s what the embedding module does, operating on the string hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye:这是嵌入模块的操作,它在字符串 hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye 上运行:
The elements of the embedding vector for each token are shown down the page, and across the page we see first a run of “hello” embeddings, followed by a run of “bye” ones. The second array above is the positional embedding—with its somewhat-random-looking structure being just what “happened to be learned” (in this case in GPT-2).每个标记的嵌入向量元素显示在页面下方,横跨页面,我们首先看到一系列“hello”嵌入,然后是一系列“bye”嵌入。上面的第二个数组是位置嵌入,其看起来有些随机的结构正好是“学到的”(在这种情况下是在 GPT-2 中学到的)。
OK, so after the embedding module comes the “main event” of the transformer: a sequence of so-called “attention blocks” (12 for GPT-2, 96 for ChatGPT’s GPT-3). It’s all pretty complicated—and reminiscent of typical large hard-to-understand engineering systems, or, for that matter, biological systems. But anyway, here’s a schematic representation of a single “attention block” (for GPT-2):好的,在嵌入模块之后是变压器的“主要事件”:一系列所谓的“注意力块”(GPT-2 有 12 个,ChatGPT 的 GPT-3 有 96 个)。这一切都相当复杂,让人想起典型的大型难以理解的工程系统,或者说生物系统。但无论如何,这里是一个单个“注意力块”的示意图(适用于 GPT-2):
Within each such attention block there are a collection of “attention heads” (12 for GPT-2, 96 for ChatGPT’s GPT-3)—each of which operates independently on different chunks of values in the embedding vector. (And, yes, we don’t know any particular reason why it’s a good idea to split up the embedding vector, or what the different parts of it “mean”; this is just one of those things that’s been “found to work”.)
在每个这样的注意力块中,都有一组“注意力头”(GPT-2 有 12 个,ChatGPT 的 GPT-3 有 96 个)——每个注意力头在嵌入向量中的不同数值块上独立运行。(是的,我们不知道为什么将嵌入向量分割是个好主意,或者它的不同部分“意味着”什么;这只是已经“被发现有效”的事情之一。)
OK, so what do the attention heads do? Basically they’re a way of “looking back” in the sequence of tokens (i.e. in the text produced so far), and “packaging up the past” in a form that’s useful for finding the next token. In the first section above we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the “attention” mechanism in transformers does is to allow “attention to” even much earlier words—thus potentially capturing the way, say, verbs can refer to nouns that appear many words before them in a sentence.好的,那么注意力头是做什么的?基本上,它们是一种在令牌序列中“回顾”的方式(即在迄今为止生成的文本中),并以一种对于找到下一个令牌有用的形式“打包过去”。在上面的第一节中,我们谈到使用 2 元概率来选择基于它们的直接前导词的单词。变压器中的“注意力”机制所做的是允许“关注”甚至更早的单词,从而潜在地捕捉动词可以指代出现在句子中许多单词之前的名词的方式。
At a more detailed level, what an attention head does is to recombine chunks in the embedding vectors associated with different tokens, with certain weights. And so, for example, the 12 attention heads in the first attention block (in GPT-2) have the following (“look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens”) patterns of “recombination weights” for the “hello, bye” string above:在更详细的层面上,一个注意力头的作用是重新组合与不同标记相关联的嵌入向量中的块,具有一定的权重。因此,例如,在第一个注意力块(在 GPT-2 中)中的 12 个注意力头具有上述“重新组合权重”的“hello, bye”字符串的“一直回溯到标记序列开头”的模式。
After being processed by the attention heads, the resulting “re-weighted embedding vector” (of length 768 for GPT-2 and length 12,288 for ChatGPT’s GPT-3) is passed through a standard
. It’s hard to get a handle on what this layer is doing. But here’s a plot of the 768×768 matrix of weights it’s using (here for GPT-2):
经过注意力头处理后,生成的“重新加权嵌入向量”(对于 GPT-2 长度为 768,对于 ChatGPT 的 GPT-3 长度为 12,288)通过标准的“全连接”神经网络层。很难掌握这一层在做什么。但这里是它正在使用的 768×768 权重矩阵的图表(这里是针对 GPT-2 的)。
Taking 64×64 moving averages, some (random-walk-ish) structure begins to emerge:
采用 64×64 移动平均值,一些(类似随机漫步的)结构开始显现:
What determines this structure? Ultimately it’s presumably some “neural net encoding” of features of human language. But as of now, what those features might be is quite unknown. In effect, we’re “opening up the brain of ChatGPT” (or at least GPT-2) and discovering, yes, it’s complicated in there, and we don’t understand it—even though in the end it’s producing recognizable human language.这种结构是由什么决定的?最终可能是人类语言特征的某种“神经网络编码”。但就目前而言,这些特征可能是什么还是相当未知的。实际上,我们正在“打开 ChatGPT”(或至少是 GPT-2)的大脑,并发现,是的,在那里很复杂,我们并不理解它——尽管最终它产生了可识别的人类语言。
OK, so after going through one attention block, we’ve got a new embedding vector—which is then successively passed through additional attention blocks (a total of 12 for GPT-2; 96 for GPT-3). Each attention block has its own particular pattern of “attention” and “fully connected” weights. Here for GPT-2 are the sequence of attention weights for the “hello, bye” input, for the first attention head:好的,所以经过一个注意力块后,我们得到了一个新的嵌入向量—然后依次通过额外的注意力块(对于 GPT-2 共 12 个;对于 GPT-3 共 96 个)。每个注意力块都有自己特定的“注意力”和“全连接”权重模式。这里是 GPT-2 中“hello, bye”输入的第一个注意力头的注意力权重序列:
And here are the (moving-averaged) “matrices” for the fully connected layers:
这里是全连接层的(移动平均)“矩阵”:
Curiously, even though these “matrices of weights” in different attention blocks look quite similar, the distributions of the sizes of weights can be somewhat different (and are not always Gaussian):
有趣的是,尽管不同注意力块中的这些“权重矩阵”看起来非常相似,但权重大小的分布可能会有所不同(并不总是高斯分布):
So after going through all these attention blocks what is the net effect of the transformer? Essentially it’s to transform the original collection of embeddings for the sequence of tokens to a final collection. And the particular way ChatGPT works is then to pick up the last embedding in this collection, and “decode” it to produce a list of probabilities for what token should come next.经过所有这些注意力块后,变压器的净效果是什么?本质上,它是将令牌序列的原始嵌入集合转换为最终集合。ChatGPT 的特定工作方式是从该集合中选择最后一个嵌入,并“解码”它以生成下一个应该出现的令牌的概率列表。
So that’s in outline what’s inside ChatGPT. It may seem complicated (not least because of its many inevitably somewhat arbitrary “engineering choices”), but actually the ultimate elements involved are remarkably simple. Because in the end what we’re dealing with is just a neural net made of “artificial neurons”, each doing the simple operation of taking a collection of numerical inputs, and then combining them with certain weights.这就是 ChatGPT 内部的大致情况。这可能看起来很复杂(主要是因为其中许多不可避免的有些任意的“工程选择”),但实际上所涉及的最终元素非常简单。因为最终我们所处理的只是一个由“人工神经元”组成的神经网络,每个神经元都执行将一组数字输入进行简单操作,然后用特定权重组合它们。
The original input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and what happens when ChatGPT “runs” to produce a new token is just that these numbers “ripple through” the layers of the neural net, with each neuron “doing its thing” and passing the result to neurons on the next layer. There’s no looping or “going back”. Everything just “feeds forward” through the network.ChatGPT 的原始输入是一组数字(到目前为止令牌的嵌入向量),当 ChatGPT“运行”以生成新令牌时,这些数字只是“涟漪”通过神经网络的层,每个神经元“做自己的事情”并将结果传递给下一层的神经元。没有循环或“回溯”。一切都只是通过网络“前馈”。
It’s a very different setup from a typical computational system—like a Turing machine—in which results are repeatedly “reprocessed” by the same computational elements. Here—at least in generating a given token of output—each computational element (i.e. neuron) is used only once.这是与典型的计算系统(如图灵机)非常不同的设置,在这种系统中,结果会被相同的计算元素反复“重新处理”。在这里,至少在生成给定的输出令牌时,每个计算元素(即神经元)只被使用一次。
But there is in a sense still an “outer loop” that reuses computational elements even in ChatGPT. Because when ChatGPT is going to generate a new token, it always “reads” (i.e. takes as input) the whole sequence of tokens that come before it, including tokens that ChatGPT itself has “written” previously. And we can think of this setup as meaning that ChatGPT does—at least at its outermost level—involve a “feedback loop”, albeit one in which every iteration is explicitly visible as a token that appears in the text that it generates.但在某种意义上,即使在 ChatGPT 中仍然存在一个“外部循环”来重复使用计算元素。因为当 ChatGPT 要生成一个新的标记时,它总是“阅读”(即将其之前的所有标记作为输入),包括 ChatGPT 自己之前“编写”的标记。我们可以将这种设置看作是意味着 ChatGPT 至少在其最外层涉及一个“反馈循环”,尽管每次迭代都明确可见,作为出现在其生成的文本中的标记。
But let’s come back to the core of ChatGPT: the neural net that’s being repeatedly used to generate each token. At some level it’s very simple: a whole collection of identical artificial neurons. And some parts of the network just consist of (“fully connected”) layers of neurons in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly with its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that “all neurons are connected”—but some just have zero weight.)但让我们回到 ChatGPT 的核心:神经网络被反复用来生成每个令牌。在某种程度上,它非常简单:一整套相同的人工神经元。网络的一些部分只是由(“全连接”)神经元层组成,其中给定层上的每个神经元都与前一层上的每个神经元连接(带有一些权重)。但特别是在其变压器架构中,ChatGPT 具有更多结构的部分,其中只连接不同层上的特定神经元。(当然,人们仍然可以说“所有神经元都连接在一起” - 但有些神经元权重为零。)
In addition, there are aspects of the neural net in ChatGPT that aren’t most naturally thought of as just consisting of “homogeneous” layers. And for example—as the iconic summary above indicates—inside an attention block there are places where “multiple copies are made” of incoming data, each then going through a different “processing path”, potentially involving a different number of layers, and only later recombining. But while this may be a convenient representation of what’s going on, it’s always at least in principle possible to think of “densely filling in” layers, but just having some weights be zero.此外,在 ChatGPT 中,神经网络的某些方面并不是最自然地被认为只是由“同质”层组成。例如,正如上面的标志性摘要所示,在注意力块内部有一些地方会“复制多份”传入数据,然后每个数据经过不同的“处理路径”,可能涉及不同数量的层,只有在稍后才重新组合。虽然这可能是对正在发生的事情的一种方便表示,但至少从原则上讲,可以考虑“密集填充”层,只是让一些权重为零。
If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights. Implementationally these calculations can be somewhat organized “by layer” into highly parallel array operations that can conveniently be done on GPUs. But for each token that’s produced, there still have to be 175 billion calculations done (and in the end a bit more)—so that, yes, it’s not surprising that it can take a while to generate a long piece of text with ChatGPT.如果我们看一下通过 ChatGPT 的最长路径,涉及大约 400 个(核心)层——在某种程度上并不是一个巨大的数字。但是有数百万个神经元——总共有 1750 亿个连接,因此有 1750 亿个权重。要意识到的一件事是,每次 ChatGPT 生成一个新的标记时,它都必须进行一次涉及所有这些权重的计算。在实现上,这些计算可以在高度并行的数组操作中有些组织“按层”进行,这些操作可以方便地在 GPU 上完成。但对于每个生成的标记,仍然必须进行 1750 亿次计算(最后还要多一点)——所以,是的,生成一段长文本可能需要一段时间。
But in the end, the remarkable thing is that all these operations—individually as simple as they are—can somehow together manage to do such a good “human-like” job of generating text. It has to be emphasized again that (at least so far as we know) there’s no “ultimate theoretical reason” why anything like this should work. And in fact, as we’ll discuss, I think we have to view this as a—potentially surprising—scientific discovery: that somehow in a neural net like ChatGPT’s it’s possible to capture the essence of what human brains manage to do in generating language.但最终,值得注意的是,所有这些操作——虽然单独看起来很简单——却可以一起完成如此出色的“类人”文本生成工作。必须再次强调的是(至少就我们所知),我们不知道为什么类似这样的事情会奏效的“最终理论原因”。事实上,正如我们将要讨论的那样,我认为我们必须将这视为一项——可能令人惊讶的——科学发现:在像 ChatGPT 这样的神经网络中,一种可能性是可以捕捉到人类大脑在生成语言方面所做的本质。
The Training of ChatGPTChatGPT 的培训
OK, so we’ve now given an outline of how ChatGPT works once it’s set up. But how did it get set up? How were all those 175 billion weights in its neural net determined? Basically they’re the result of very large-scale training, based on a huge corpus of text—on the web, in books, etc.—written by humans. As we’ve said, even given all that training data, it’s certainly not obvious that a neural net would be able to successfully produce “human-like” text. And, once again, there seem to be detailed pieces of engineering needed to make that happen. But the big surprise—and discovery—of ChatGPT is that it’s possible at all. And that—in effect—a neural net with “just” 175 billion weights can make a “reasonable model” of text humans write.好的,现在我们已经概述了 ChatGPT 在设置后的工作原理。但是它是如何设置的呢?它的神经网络中的那 1750 亿个权重是如何确定的呢?基本上,它们是通过基于人类撰写的大量文本语料库(在网络上、书籍中等)进行的大规模训练的结果。正如我们所说,即使给定了所有这些训练数据,神经网络能够成功产生“类似人类”的文本显然并不明显。而且,再次强调,似乎需要详细的工程技术来实现这一点。但是 ChatGPT 的一个重大惊喜和发现是,这是完全可能的。实际上,一个拥有“仅”1750 亿个权重的神经网络可以制作出人类撰写的“合理模型”文本。
In modern times, there’s lots of text written by humans that’s out there in digital form. The public web has at least several billion human-written pages, with altogether perhaps a trillion words of text. And if one includes non-public webpages, the numbers might be at least 100 times larger. So far, more than 5 million digitized books have been made available (out of 100 million or so that have ever been published), giving another 100 billion or so words of text. And that’s not even mentioning text derived from speech in videos, etc. (As a personal comparison, my total lifetime output of published material has been a bit under 3 million words, and over the past 30 years I’ve written about 15 million words of email, and altogether typed perhaps 50 million words—and in just the past couple of years I’ve spoken more than 10 million words on livestreams. And, yes, I’ll train a bot from all of that.)在现代,有许多人类编写的文本以数字形式存在。公共网络上至少有数十亿人类编写的页面,总共可能有万亿字的文本。如果包括非公开网页,这个数字可能至少是 100 倍以上。到目前为止,已经有超过 500 万本数字化图书可供使用(总共大约有 1 亿本左右出版的书籍),提供了另外约 1000 亿字的文本。更不用说从视频等中提取的文本了。(作为个人比较,我一生中发表的总文字量略少于 300 万字,过去 30 年来我写了约 1500 万字的电子邮件,总共可能输入了大约 5000 万字——仅在过去几年里,我在直播中说了超过 1000 万字。是的,我将从中训练一个机器人。)
But, OK, given all this data, how does one train a neural net from it? The basic process is very much as we discussed it in the simple examples above. You present a batch of examples, and then you adjust the weights in the network to minimize the error (“loss”) that the network makes on those examples. The main thing that’s expensive about “back propagating” from the error is that each time you do this, every weight in the network will typically change at least a tiny bit, and there are just a lot of weights to deal with. (The actual “back computation” is typically only a small constant factor harder than the forward one.)但是,好吧,考虑到所有这些数据,如何从中训练神经网络呢?基本过程与我们在上面讨论的简单示例中讨论的非常相似。您提供一批示例,然后调整网络中的权重,以最小化网络在这些示例上产生的错误(“损失”)。从错误“反向传播”的主要昂贵之处在于,每次执行此操作时,网络中的每个权重通常至少会稍微改变一点,而且要处理的权重数量很多。(实际的“反向计算”通常只比前向计算难一点点。)
With modern GPU hardware, it’s straightforward to compute the results from batches of thousands of examples in parallel. But when it comes to actually updating the weights in the neural net, current methods require one to do this basically batch by batch. (And, yes, this is probably where actual brains—with their combined computation and memory elements—have, for now, at least an architectural advantage.)使用现代 GPU 硬件,可以很容易地并行计算成千上万个示例的结果。但是,当涉及实际更新神经网络中的权重时,当前的方法要求基本上逐批次进行。 (是的,这可能是实际大脑目前至少在架构上具有优势的地方,因为它们具有结合计算和记忆元素。)
Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does this mean we’ll need in order to train a “human-like language” model? There doesn’t seem to be any fundamental “theoretical” way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.即使在我们之前讨论的学习数字函数的看似简单案例中,我们发现我们经常需要使用数百万个示例才能成功地训练一个网络,至少是从头开始。那么这意味着我们需要多少示例来训练一个“类人语言”模型呢?似乎没有任何基本的“理论”方法可以知道。但在实践中,ChatGPT 成功地在数千亿字的文本上进行了训练。
Some of the text it was fed several times, some of it only once. But somehow it “got what it needed” from the text it saw. But given this volume of text to learn from, how large a network should it require to “learn it well”? Again, we don’t yet have a fundamental theoretical way to say. Ultimately—as we’ll discuss further below—there’s presumably a certain “total algorithmic content” to human language and what humans typically say with it. But the next question is how efficient a neural net will be at implementing a model based on that algorithmic content. And again we don’t know—although the success of ChatGPT suggests it’s reasonably efficient.有些文本被输入了多次,有些只输入了一次。但不知何故,它从看到的文本中“得到了所需的内容”。但是,考虑到这么多文本需要学习,它需要多大规模的网络才能“学得好”呢?同样,我们目前还没有一种基本的理论方法来说。最终——正如我们将在下文进一步讨论的那样——人类语言和人类通常用它说的话可能存在一定的“总算法内容”。但接下来的问题是,神经网络在基于该算法内容实现模型时会有多高效。同样,我们不知道——尽管 ChatGPT 的成功表明它可能相当高效。
And in the end we can just note that ChatGPT does what it does using a couple hundred billion weights—comparable in number to the total number of words (or tokens) of training data it’s been given. In some ways it’s perhaps surprising (though empirically observed also in smaller analogs of ChatGPT) that the “size of the network” that seems to work well is so comparable to the “size of the training data”. After all, it’s certainly not that somehow “inside ChatGPT” all that text from the web and books and so on is “directly stored”. Because what’s actually inside ChatGPT are a bunch of numbers—with a bit less than 10 digits of precision—that are some kind of distributed encoding of the aggregate structure of all that text.最后我们可以指出,ChatGPT 所做的事情是使用数百亿个权重——与其所接收的训练数据的总词数(或标记)相当。在某种程度上,也许令人惊讶的是(尽管在 ChatGPT 的较小模拟中也有经验观察到),“网络的规模”似乎与“训练数据的规模”如此相近,而且表现良好。毕竟,ChatGPT 内部并不是以某种方式“直接存储”来自网络、书籍等的所有文本。因为实际上存储在 ChatGPT 内部的是一堆数字——精度略低于 10 位数字——这些数字是对所有文本的聚合结构的某种分布式编码。
Put another way, we might ask what the “effective information content” is of human language and what’s typically said with it. There’s the raw corpus of examples of language. And then there’s the representation in the neural net of ChatGPT. That representation is very likely far from the “algorithmically minimal” representation (as we’ll discuss below). But it’s a representation that’s readily usable by the neural net. And in this representation it seems there’s in the end rather little “compression” of the training data; it seems on average to basically take only a bit less than one neural net weight to carry the “information content” of a word of training data.换句话说,我们可以问人类语言的“有效信息内容”是什么,以及通常用它说了什么。有语言示例的原始语料库。然后是 ChatGPT 神经网络中的表示。该表示很可能远离“算法上最小”的表示(如下面我们将讨论的)。但这是一个神经网络可以轻松使用的表示。在这种表示中,似乎最终训练数据的“压缩”相当少;平均来看,基本上只需要不到一个神经网络权重来携带训练数据的“信息内容”。
When we run ChatGPT to generate text, we’re basically having to use each weight once. So if there are n weights, we’ve got of order n computational steps to do—though in practice many of them can typically be done in parallel in GPUs. But if we need about n words of training data to set up those weights, then from what we’ve said above we can conclude that we’ll need about n2 computational steps to do the training of the network—which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.当我们运行 ChatGPT 生成文本时,基本上需要使用每个权重一次。因此,如果有 n 个权重,我们需要大约 n 个计算步骤来完成,尽管在实践中,其中许多步骤通常可以在 GPU 中并行完成。但是,如果我们需要大约 n 个单词的训练数据来设置这些权重,那么根据我们上面所说的,我们可以得出结论,我们将需要大约 n 个计算步骤来训练网络——这就是为什么使用当前方法,人们最终需要谈论数十亿美元的培训工作。
Beyond Basic Training 超越基础训练
The majority of the effort in training ChatGPT is spent “showing it” large amounts of existing text from the web, books, etc. But it turns out there’s another—apparently rather important—part too.在训练 ChatGPT 的过程中,大部分的工作都花在向它展示大量来自网络、书籍等的现有文本上。但事实证明,还有另一个——显然相当重要的——部分。
As soon as it’s finished its “raw training” from the original corpus of text it’s been shown, the neural net inside ChatGPT is ready to start generating its own text, continuing from prompts, etc. But while the results from this may often seem reasonable, they tend—particularly for longer pieces of text—to “wander off” in often rather non-human-like ways. It’s not something one can readily detect, say, by doing traditional statistics on the text. But it’s something that actual humans reading the text easily notice.一旦 ChatGPT 内部的神经网络完成了从原始文本语料库中的“原始训练”,它就准备好开始生成自己的文本,继续从提示等方面进行。但是,尽管这样产生的结果通常看起来是合理的,但它们往往会以相当不像人类的方式“偏离”——尤其是对于较长的文本片段。这不是通过对文本进行传统统计就能轻易检测到的事情。但实际阅读文本的人很容易注意到这一点。
And a key idea in the construction of ChatGPT was to have another step after “passively reading” things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on “how to be a good chatbot”. But how can the neural net use that feedback? The first step is just to have humans rate results from the neural net. But then another neural net model is built that attempts to predict those ratings. But now this prediction model can be run—essentially like a loss function—on the original network, in effect allowing that network to be “tuned up” by the human feedback that’s been given. And the results in practice seem to have a big effect on the success of the system in producing “human-like” output.ChatGPT 构建中的一个关键思想是在“被动阅读”诸如网络之类的事物之后再增加另一步骤:让实际人类积极与 ChatGPT 互动,看看它产生了什么,并实际上给予它“如何成为一个好的聊天机器人”的反馈。但神经网络如何利用这些反馈呢?第一步只是让人类评价神经网络的结果。然后建立另一个神经网络模型,试图预测这些评分。但现在这个预测模型可以运行——本质上类似于损失函数——在原始网络上,实际上允许该网络通过已给出的人类反馈进行“调整”。实践中的结果似乎对系统成功产生“类人”输出有很大影响。
In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it’s “learned something new” one would have to go in and run a training algorithm, adjusting weights, and so on.一般来说,有趣的是,“最初训练过的”网络似乎需要很少的“刺激”就能使其有针对性地前进。人们可能会认为,要使网络表现得好像“学到了新东西”,就必须进行训练算法的运行,调整权重等操作。
But that’s not the case. Instead, it seems to be sufficient to basically tell ChatGPT something one time—as part of the prompt you give—and then it can successfully make use of what you told it when it generates text. And once again, the fact that this works is, I think, an important clue in understanding what ChatGPT is “really doing” and how it relates to the structure of human language and thinking.但事实并非如此。相反,似乎只需基本告诉 ChatGPT 一次——作为您提供的提示的一部分——然后它就可以成功地利用您告诉它的内容来生成文本。而且再次强调,我认为这种方法有效的事实是理解 ChatGPT 到底在“做什么”以及它如何与人类语言和思维结构相关的重要线索。
There’s certainly something rather human-like about it: that at least once it’s had all that pre-training you can tell it something just once and it can “remember it”—at least “long enough” to generate a piece of text using it. So what’s going on in a case like this? It could be that “everything you might tell it is already in there somewhere”—and you’re just leading it to the right spot. But that doesn’t seem plausible. Instead, what seems more likely is that, yes, the elements are already in there, but the specifics are defined by something like a “trajectory between those elements” and that’s what you’re introducing when you tell it something.肯定有一些相当类似人类的东西:至少一次它经过了所有的预训练,你只需告诉它一次,它就能“记住”——至少“足够长时间”来生成一段文本。那么在这种情况下到底发生了什么?可能是“你可能告诉它的一切都已经在那里某个地方了”,你只是引导它到正确的位置。但这似乎不太可能。相反,更有可能的是,是的,元素已经在那里了,但具体内容是由“这些元素之间的轨迹”之类的东西定义的,当你告诉它一些东西时,这就是你正在介绍的内容。
And indeed, much like for humans, if you tell it something bizarre and unexpected that completely doesn’t fit into the framework it knows, it doesn’t seem like it’ll successfully be able to “integrate” this. It can “integrate” it only if it’s basically riding in a fairly simple way on top of the framework it already has.事实上,就像对人类一样,如果你告诉它一些奇怪和意想不到的事情,完全不符合它已知的框架,它似乎无法成功地“整合”这一点。只有当它基本上以一种相当简单的方式在它已经拥有的框架之上运行时,它才能“整合”它。
It’s also worth pointing out again that there are inevitably “algorithmic limits” to what the neural net can “pick up”. Tell it “shallow” rules of the form “this goes to that”, etc., and the neural net will most likely be able to represent and reproduce these just fine—and indeed what it “already knows” from language will give it an immediate pattern to follow. But try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)值得再次指出的是,神经网络在“捕捉”信息方面不可避免地存在“算法限制”。告诉它“表浅”的规则,比如“这个转换为那个”,神经网络很可能能够很好地表示和复制这些规则,而且它从语言中“已知”的内容将为其提供一个立即可遵循的模式。但是,如果尝试为涉及许多潜在计算不可简化步骤的实际“深度”计算提供规则,它就无法胜任。(请记住,在每一步中,它总是在网络中“向前传递数据”,除了生成新的标记外,永远不会循环。)
Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work. And so, yes, just like humans, it’s time then for neural nets to “reach out” and use actual computational tools. (And, yes, Wolfram|Alpha and Wolfram Language are uniquely suitable, because they’ve been built to “talk about things in the world”, just like the language-model neural nets.)当然,网络可以学习特定“不可简化”的计算的答案。但是一旦涉及到组合数量的可能性,就不会有这种“查表式”的方法奏效。因此,是的,就像人类一样,现在是神经网络“伸手”并使用实际的计算工具的时候了。(是的,Wolfram|Alpha 和 Wolfram 语言非常适合,因为它们被构建为“谈论世界中的事物”,就像语言模型神经网络一样。)
What Really Lets ChatGPT Work?什么让 ChatGPT 真正发挥作用?
Human language—and the processes of thinking involved in generating it—have always seemed to represent a kind of pinnacle of complexity. And indeed it’s seemed somewhat remarkable that human brains—with their network of a “mere” 100 billion or so neurons (and maybe 100 trillion connections) could be responsible for it. Perhaps, one might have imagined, there’s something more to brains than their networks of neurons—like some new layer of undiscovered physics. But now with ChatGPT we’ve got an important new piece of information: we know that a pure, artificial neural network with about as many connections as brains have neurons is capable of doing a surprisingly good job of generating human language.人类语言——以及生成它所涉及的思维过程——似乎一直代表着一种复杂性的顶峰。事实上,人类大脑——具有“仅仅”约 1000 亿个神经元(也许有 1 万亿个连接)的网络——能够负责这一切,似乎有些引人注目。也许,人们可能会想象,大脑除了神经元网络之外还有其他东西——比如一些未被发现的新物理层面。但现在有了 ChatGPT,我们得到了一条重要的新信息:我们知道,一个纯粹的人工神经网络,具有与大脑神经元数量相当的连接,能够出人意料地很好地生成人类语言。
And, yes, that’s still a big and complicated system—with about as many neural net weights as there are words of text currently available out there in the world. But at some level it still seems difficult to believe that all the richness of language and the things it can talk about can be encapsulated in such a finite system. Part of what’s going on is no doubt a reflection of the ubiquitous phenomenon (that first became evident in the example of rule 30) that computational processes can in effect greatly amplify the apparent complexity of systems even when their underlying rules are simple. But, actually, as we discussed above, neural nets of the kind used in ChatGPT tend to be specifically constructed to restrict the effect of this phenomenon—and the computational irreducibility associated with it—in the interest of making their training more accessible.而且,是的,这仍然是一个庞大而复杂的系统——拥有与世界上当前可用的文字数量一样多的神经网络权重。但在某种程度上,仍然很难相信语言的丰富性和它所能谈论的事物可以被封装在这样一个有限的系统中。正在发生的部分无疑是普遍现象的反映(最初在规则 30 的例子中首次显现),即计算过程实际上可以大大放大系统的表面复杂性,即使它们的基本规则很简单。但实际上,正如我们上面讨论的那样,ChatGPT 中使用的这种类型的神经网络往往是专门构建的,以限制这种现象的影响——以及与之相关的计算不可简化,以便使它们的训练更易于访问。
So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems. And this means that ChatGPT—even with its ultimately straightforward neural net structure—is successfully able to “capture the essence” of human language and the thinking behind it. And moreover, in its training, ChatGPT has somehow “implicitly discovered” whatever regularities in language (and thinking) make this possible.那么,像 ChatGPT 这样的东西是如何在语言方面取得如此大的进展的呢?我认为,基本答案是,语言在某种基本层面上比看起来要简单。这意味着,即使 ChatGPT 最终的神经网络结构非常直接,它仍然成功地能够“捕捉”人类语言及其背后的思维的“本质”。而且,在训练过程中,ChatGPT 不知何故“隐式地发现”了使这一切成为可能的语言(和思维)中的任何规律。
The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.ChatGPT 的成功,我认为,为我们提供了一项基础和重要的科学证据:它表明我们可以期待发现一些重要的新“语言法则”——以及有效的“思维法则”。在 ChatGPT 中——作为一个神经网络构建的——这些法则最多是隐含的。但如果我们能够以某种方式使这些法则明确化,就有潜力以更加直接、高效和透明的方式做 ChatGPT 所做的事情。
But, OK, so what might these laws be like? Ultimately they must give us some kind of prescription for how language—and the things we say with it—are put together. Later we’ll discuss how “looking inside ChatGPT” may be able to give us some hints about this, and how what we know from building computational language suggests a path forward. But first let’s discuss two long-known examples of what amount to “laws of language”—and how they relate to the operation of ChatGPT.但是,好吧,那么这些法则可能是什么样的呢?最终,它们必须给我们一些关于语言以及我们用它说的话是如何组合在一起的指导。稍后我们将讨论“查看 ChatGPT 内部”可能如何给我们一些关于这一点的提示,以及从构建计算语言中所知道的内容如何提示前进的道路。但首先让我们讨论两个长期已知的关于“语言法则”的例子,以及它们如何与 ChatGPT 的运作相关。
The first is the syntax of language. Language is not just a random jumble of words. Instead, there are (fairly) definite grammatical rules for how words of different kinds can be put together: in English, for example, nouns can be preceded by adjectives and followed by verbs, but typically two nouns can’t be right next to each other. Such grammatical structure can (at least approximately) be captured by a set of rules that define how what amount to “parse trees” can be put together:首先是语言的句法。语言不仅仅是一堆随机的词语。相反,不同种类的词语如何组合在一起有(相当)明确的语法规则:例如,在英语中,名词可以由形容词前置,动词后置,但通常两个名词不能紧挨在一起。这种语法结构可以(至少近似地)通过一组定义“解析树”如何组合的规则来捕捉:
ChatGPT doesn’t have any explicit “knowledge” of such rules. But somehow in its training it implicitly “discovers” them—and then seems to be good at following them. So how does this work? At a “big picture” level it’s not clear. But to get some insight it’s perhaps instructive to look at a much simpler example.ChatGPT 并没有对这些规则有明确的“知识”。但在训练过程中,它以某种方式隐含地“发现”了这些规则,然后似乎擅长遵循这些规则。那么这是如何运作的呢?在“宏观”层面上并不清楚。但为了获得一些见解,也许看一个更简单的例子会有帮助。
Consider a “language” formed from sequences of (’s and )’s, with a grammar that specifies that parentheses should always be balanced, as represented by a parse tree like:考虑一个由(和)序列形成的“语言”,其语法规定括号应始终保持平衡,如由解析树表示的那样:
Can we train a neural net to produce “grammatically correct” parenthesis sequences? There are various ways to handle sequences in neural nets, but let’s use transformer nets, as ChatGPT does. And given a simple transformer net, we can start feeding it grammatically correct parenthesis sequences as training examples. A subtlety (which actually also appears in ChatGPT’s generation of human language) is that in addition to our “content tokens” (here “(” and “)”) we have to include an “End” token, that’s generated to indicate that the output shouldn’t continue any further (i.e. for ChatGPT, that one’s reached the “end of the story”).我们能训练神经网络生成“语法正确”的括号序列吗?在神经网络中处理序列的方法有很多种,但让我们使用变压器网络,就像 ChatGPT 一样。鉴于一个简单的变压器网络,我们可以开始将语法正确的括号序列作为训练样本输入。一个微妙之处(实际上也出现在 ChatGPT 生成人类语言的过程中)是,除了我们的“内容标记”(这里是“(”和“)”),我们还必须包括一个“结束”标记,用于指示输出不应继续(即对于 ChatGPT,表示已经到达“故事的结尾”)。
If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but has 96 attention blocks, each with 96 heads) then it doesn’t seem possible to get it to learn much about parenthesis language. But with 2 attention blocks, the learning process seems to converge—at least after 10 million or so examples have been given (and, as is common with transformer nets, showing yet more examples just seems to degrade its performance).如果我们只使用一个具有 8 个头和长度为 128 的特征向量的注意力块来设置变压器网络(ChatGPT 也使用长度为 128 的特征向量,但具有 96 个注意力块,每个块有 96 个头),那么似乎不可能让它学习括号语言。但是使用 2 个注意力块,学习过程似乎会收敛——至少在提供了大约 1000 万个示例之后(并且,与变压器网络通常情况下一样,展示更多示例似乎只会降低其性能)。
So with this network, we can do the analog of what ChatGPT does, and ask for probabilities for what the next token should be—in a parenthesis sequence:因此,通过这个网络,我们可以做类似于 ChatGPT 的模拟,并询问下一个标记应该是什么的概率—在括号序列中:
And in the first case, the network is “pretty sure” that the sequence can’t end here—which is good, because if it did, the parentheses would be left unbalanced. In the second case, however, it “correctly recognizes” that the sequence can end here, though it also “points out” that it’s possible to “start again”, putting down a “(”, presumably with a “)” to follow. But, oops, even with its 400,000 or so laboriously trained weights, it says there’s a 15% probability to have “)” as the next token—which isn’t right, because that would necessarily lead to an unbalanced parenthesis.在第一种情况下,网络“非常确定”序列不能在这里结束——这是好事,因为如果结束了,括号就会不平衡。然而,在第二种情况下,它“正确地识别”序列可以在这里结束,尽管它也“指出”可以“重新开始”,放下一个“(”,接着可能会有一个“)”。但是,哎呀,即使经过了大约 40 万个辛苦训练的权重,它说下一个标记是“)”的概率为 15%——这是不正确的,因为这必然会导致括号不平衡。
Here’s what we get if we ask the network for the highest-probability completions for progressively longer sequences of (’s:这是我们向网络请求逐渐变长序列的最高概率完成时得到的结果:
And, yes, up to a certain length the network does just fine. But then it starts failing. It’s a pretty typical kind of thing to see in a “precise” situation like this with a neural net (or with machine learning in general). Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” (e.g. explicitly counting parentheses to see if they’re closed) the neural net tends to somehow be “too computationally shallow” to reliably do. (By the way, even the full current ChatGPT has a hard time correctly matching parentheses in long sequences.)而且,一直到某个长度,网络表现得很好。但是然后它开始失败。这是在像这样的“精确”情况下与神经网络(或者机器学习一般)看到的一种相当典型的情况。人类“一眼就能解决”的情况,神经网络也能解决。但是那些需要进行“更算法化”操作的情况(例如,明确计算括号以查看它们是否关闭),神经网络往往在“计算上太浅”而无法可靠地执行。 (顺便说一句,即使是完整的当前 ChatGPT 在长序列中也很难正确匹配括号。)
So what does this mean for things like ChatGPT and the syntax of a language like English? The parenthesis language is “austere”—and much more of an “algorithmic story”. But in English it’s much more realistic to be able to “guess” what’s grammatically going to fit on the basis of local choices of words and other hints. And, yes, the neural net is much better at this—even though perhaps it might miss some “formally correct” case that, well, humans might miss as well. But the main point is that the fact that there’s an overall syntactic structure to the language—with all the regularity that implies—in a sense limits “how much” the neural net has to learn. And a key “natural-science-like” observation is that the transformer architecture of neural nets like the one in ChatGPT seems to successfully be able to learn the kind of nested-tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.那么,这对于像 ChatGPT 这样的东西和英语这样的语言的语法意味着什么呢?括号语言是“简朴”的,更像是一个“算法故事”。但在英语中,根据单词的本地选择和其他线索,“猜测”语法上可能适合的内容要容易得多。是的,神经网络在这方面要好得多,尽管也许会错过一些“形式上正确”的情况,而这些情况人类也可能会错过。但主要观点是,语言存在整体句法结构,带有所有这意味着的规律性,从某种意义上限制了神经网络需要学习的“程度”。一个关键的“类自然科学”的观察是,像 ChatGPT 中的那种变压器架构的神经网络似乎成功地学会了类似嵌套树状的句法结构,这种结构似乎存在于所有人类语言中(至少在某种程度上)。
Syntax provides one kind of constraint on language. But there are clearly more. A sentence like “Inquisitive electrons eat blue theories for fish” is grammatically correct but isn’t something one would normally expect to say, and wouldn’t be considered a success if ChatGPT generated it—because, well, with the normal meanings for the words in it, it’s basically meaningless.语法对语言施加了一种约束。但显然还有更多。像“好奇的电子吃蓝色的理论来换取鱼”这样的句子在语法上是正确的,但并不是人们通常会说的,如果 ChatGPT 生成了这样的句子,也不会被认为是成功的——因为,嗯,用其中的词的正常含义来看,它基本上是毫无意义的。
But is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.但是有没有一种通用的方法来判断一句话是否有意义呢?目前还没有传统的总体理论。但可以将 ChatGPT 看作是在通过在网上等地训练时隐含地“发展了一个理论”,因为它接受了数十亿条(据推测有意义的)句子的训练。
What might this theory be like? Well, there’s one tiny corner that’s basically been known for two millennia, and that’s logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not. Thus, for example, it’s reasonable to say “All X are Y. This is not Y, so it’s not an X” (as in “All fishes are blue. This is not blue, so it’s not a fish.”). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going (“machine-learning-style”) through lots of examples of rhetoric, so too one can imagine that in the training of ChatGPT it will have been able to “discover syllogistic logic” by looking at lots of text on the web, etc. (And, yes, while one can therefore expect ChatGPT to produce text that contains “correct inferences” based on things like syllogistic logic, it’s a quite different story when it comes to more sophisticated formal logic—and I think one can expect it to fail here for the same kind of reasons it fails in parenthesis matching.)这个理论可能是什么样的呢?嗯,有一个微小的角落基本上已经为两千年所熟知,那就是逻辑。当然,在亚里士多德发现的三段论形式中,逻辑基本上是一种说法,即遵循某些模式的句子是合理的,而其他句子则不合理。因此,例如,说“所有的 X 都是 Y。这不是 Y,所以它不是 X”是合理的(如“所有的鱼都是蓝色。这不是蓝色,所以它不是鱼。”)。就像人们可以略带幽默地想象亚里士多德通过大量修辞例子(“机器学习风格”)发现了三段论逻辑一样,人们也可以想象在 ChatGPT 的训练中,它将能够通过查看大量网络文本等方式“发现三段论逻辑”。(是的,虽然人们可以期望 ChatGPT 生成包含基于三段论逻辑等内容的“正确推理”的文本,但当涉及更复杂的形式逻辑时,情况就大不相同了——我认为人们可以期望它在这里失败,原因与它在括号匹配中失败的原因相同。)
But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific “phrasal templates”. But somehow ChatGPT implicitly has a much more general way to do it. And perhaps there’s nothing to be said about how it can be done beyond “somehow it happens when you have 175 billion neural net weights”. But I strongly suspect that there’s a much simpler and stronger story.但除了逻辑的狭窄示例之外,关于如何系统地构建(或识别)甚至可能有意义的文本,还能说些什么呢?是的,有像 Mad Libs 这样使用非常具体的“短语模板”的东西。但不知何故,ChatGPT 隐含地有一种更一般的方法来做到这一点。也许关于如何做到这一点没有什么可说的,除了“当你有 1750 亿个神经网络权重时,它会以某种方式发生”。但我强烈怀疑,这里有一个更简单更有力的故事。
Meaning Space and Semantic Laws of Motion意义空间和语义运动
We discussed above that inside ChatGPT any piece of text is effectively represented by an array of numbers that we can think of as coordinates of a point in some kind of “linguistic feature space”. So when ChatGPT continues a piece of text this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what makes this trajectory correspond to text we consider meaningful. And might there perhaps be some kind of “semantic laws of motion” that define—or at least constrain—how points in linguistic feature space can move around while preserving “meaningfulness”?我们上面讨论过,在 ChatGPT 内部,任何文本片段都可以有效地用一组数字表示,我们可以将其视为某种“语言特征空间”中点的坐标。因此,当 ChatGPT 继续一段文本时,这相当于在语言特征空间中描绘出一条轨迹。但现在我们可以问一下,是什么使得这条轨迹对应于我们认为有意义的文本。也许可能存在某种“语义运动规律”,定义了——或者至少约束了——在保持“有意义性”时,语言特征空间中的点如何移动?
So what is this linguistic feature space like? Here’s an example of how single words (here, common nouns) might get laid out if we project such a feature space down to 2D:那么这种语言特征空间是什么样的呢?以下是一个示例,展示了如果我们将这样一个特征空间投影到二维空间时,单词(这里是普通名词)可能会被排列的方式:
We saw another example above based on words representing plants and animals. But the point in both cases is that “semantically similar words” are placed nearby.我们在上面看到了另一个例子,基于代表植物和动物的词语。但两种情况的重点都是“语义相似的词语”被放在附近。
As another example, here’s how words corresponding to different parts of speech get laid out:作为另一个例子,这里是不同词性对应的单词如何排列的示例:
Of course, a given word doesn’t in general just have “one meaning” (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, one can often “tease apart” different meanings—as in the example here for the word “crane” (bird or machine?):
当然,一个给定的词通常不只有“一个意思”(或者必然对应于一个词性)。通过观察包含一个词的句子在特征空间中的布局,人们通常可以“区分出”不同的意思,就像这里的例子中的“起重机”(鸟还是机器?)。
OK, so it’s at least plausible that we can think of this feature space as placing “words nearby in meaning” close in this space. But what kind of additional structure can we identify in this space? Is there for example some kind of notion of “parallel transport” that would reflect “flatness” in the space? One way to get a handle on that is to look at analogies:
好的,所以至少可以认为我们可以将这个特征空间看作是将“意思相近的词语”放在这个空间中靠近的地方。但是在这个空间中我们能够识别出什么样的额外结构呢?例如,是否存在某种“平行传输”的概念,可以反映空间中的“平坦性”?了解这一点的一种方法是看类比:
And, yes, even when we project down to 2D, there’s often at least a “hint of flatness”, though it’s certainly not universally seen.是的,即使当我们投影到二维时,通常至少会有一丝“平面的暗示”,尽管这并非普遍可见。
So what about trajectories? We can look at the trajectory that a prompt for ChatGPT follows in feature space—and then we can see how ChatGPT continues that:那么轨迹呢?我们可以看一下 ChatGPT 提示在特征空间中的轨迹,然后我们可以看到 ChatGPT 如何继续:
There’s certainly no “geometrically obvious” law of motion here. And that’s not at all surprising; we fully expect this to be a considerably more complicated story. And, for example, it’s far from obvious that even if there is a “semantic law of motion” to be found, what kind of embedding (or, in effect, what “variables”) it’ll most naturally be stated in.这里显然没有“几何上明显”的运动定律。这一点并不令人惊讶;我们完全预料到这将是一个相当复杂的故事。例如,即使有“语义运动定律”可寻,也远非明显,最自然地陈述它将采用何种嵌入(或者说,采用何种“变量”)。
In the picture above, we’re showing several steps in the “trajectory”—where at each step we’re picking the word that ChatGPT considers the most probable (the “zero temperature” case). But we can also ask what words can “come next” with what probabilities at a given point:在上面的图片中,我们展示了“轨迹”中的几个步骤,在每一步中,我们选择 ChatGPT 认为最有可能的单词(“零温度”情况)。但我们也可以询问在某一点上下一个可能出现的单词是什么,以及概率是多少:
And what we see in this case is that there’s a “fan” of high-probability words that seems to go in a more or less definite direction in feature space. What happens if we go further? Here are the successive “fans” that appear as we “move along” the trajectory:
在这种情况下,我们看到的是一组高概率词的“扇形”,似乎在特征空间中朝着一个或多或少明确的方向前进。如果我们继续前进会发生什么?以下是随着我们“沿着”轨迹前进而出现的连续“扇形”:
Here’s a 3D representation, going for a total of 40 steps:
这是一个 3D 表示,共有 40 个步骤:
And, yes, this seems like a mess—and doesn’t do anything to particularly encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”. But perhaps we’re just looking at the “wrong variables” (or wrong coordinate system) and if only we looked at the right one, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics. But as of now, we’re not ready to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about how human language is “put together”.
而且,是的,这看起来像一团乱麻,并且并没有特别鼓励人们期望通过经验研究“ChatGPT 内部正在做什么”来识别“类似数学物理的”“语义运动定律”的想法。但也许我们只是在看“错误的变量”(或错误的坐标系),如果我们只看对了,我们会立刻看到 ChatGPT 正在做一些“数学物理简单”的事情,比如遵循测地线。但就目前而言,我们还没有准备好从其“内部行为”中“经验性解码”ChatGPT 已经“发现”的有关人类语言“如何组合”的信息。
Semantic Grammar and the Power of Computational Language语义语法与计算语言的力量
What does it take to produce “meaningful human language”? In the past, we might have assumed it could be nothing short of a human brain. But now we know it can be done quite respectably by the neural net of ChatGPT. Still, maybe that’s as far as we can go, and there’ll be nothing simpler—or more human understandable—that will work. But my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.生产“有意义的人类语言”需要什么?过去,我们可能认为这只能依赖人类大脑。但现在我们知道,ChatGPT 的神经网络可以相当体面地完成这项任务。也许这已经是我们能够达到的极限了,不会有更简单或更易理解的方法。但我强烈怀疑 ChatGPT 的成功隐含着一个重要的“科学”事实:有意义的人类语言实际上比我们以往所知道的要更具结构和简单性,最终可能会有相当简单的规则来描述如何组织这种语言。
As we mentioned above, syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language. But to deal with meaning, we need to go further. And one version of how to do this is to think about not just a syntactic grammar for language, but also a semantic one.正如我们上面提到的,句法语法规定了词语如何组合成人类语言中不同词类的事物。但要处理意义,我们需要更进一步。如何做到这一点的一个版本是不仅考虑语言的句法语法,还要考虑语义语法。
For purposes of syntax, we identify things like nouns and verbs. But for purposes of semantics, we need “finer gradations”. So, for example, we might identify the concept of “moving”, and the concept of an “object” that “maintains its identity independent of location”. There are endless specific examples of each of these “semantic concepts”. But for the purposes of our semantic grammar, we’ll just have some general kind of rule that basically says that “objects” can “move”. There’s a lot to say about how all this might work (some of which I’ve said before). But I’ll content myself here with just a few remarks that indicate some of the potential path forward.为了句法目的,我们识别诸如名词和动词之类的事物。但是为了语义目的,我们需要“更精细的分级”。因此,例如,我们可能确定“移动”的概念,以及“保持其独立于位置的身份”的“对象”的概念。每个这些“语义概念”都有无数具体的例子。但是为了我们语义语法的目的,我们将只有一些基本规则,基本上是说“对象”可以“移动”。关于所有这些可能如何运作的内容有很多要说的(其中一些我以前已经说过)。但是在这里我将满足于只做一些表明前进方向潜力的评论。
It’s worth mentioning that even if a sentence is perfectly OK according to the semantic grammar, that doesn’t mean it’s been realized (or even could be realized) in practice. “The elephant traveled to the Moon” would doubtless “pass” our semantic grammar, but it certainly hasn’t been realized (at least yet) in our actual world—though it’s absolutely fair game for a fictional world.值得一提的是,即使一句话在语义语法上完全没问题,也不意味着它已经在实践中被实现(甚至可能被实现)。“大象去月球旅行”毫无疑问会“通过”我们的语义语法,但它在我们的现实世界中肯定还没有被实现(至少目前还没有)—尽管它绝对适用于虚构世界。
When we start talking about “semantic grammar” we’re soon led to ask “What’s underneath it?” What “model of the world” is it assuming? A syntactic grammar is really just about the construction of language from words. But a semantic grammar necessarily engages with some kind of “model of the world”—something that serves as a “skeleton” on top of which language made from actual words can be layered.当我们开始谈论“语义语法”时,我们很快就会问:“它的底层是什么?”它假设了什么“世界模型”?句法语法实际上只是关于从单词构建语言。但语义语法必然涉及某种“世界模型”——一种作为“骨架”的东西,可以在其上叠加由实际单词构成的语言。
Until recent times, we might have imagined that (human) language would be the only general way to describe our “model of the world”. Already a few centuries ago there started to be formalizations of specific kinds of things, based particularly on mathematics. But now there’s a much more general approach to formalization: computational language.直到最近,我们可能会想象(人类)语言将是描述我们“世界模型”的唯一一般方式。 几个世纪前,已经开始对特定类型的事物进行形式化,特别是基于数学。 但现在有一种更一般的形式化方法:计算语言。
And, yes, that’s been my big project over the course of more than four decades (as now embodied in the Wolfram Language): to develop a precise symbolic representation that can talk as broadly as possible about things in the world, as well as abstract things that we care about. And so, for example, we have symbolic representations for cities and molecules and images and neural networks, and we have built-in knowledge about how to compute about those things.是的,这已经是我在四十多年的时间里的主要项目(现在体现在 Wolfram 语言中):开发一个精确的符号表示,可以尽可能广泛地谈论世界上的事物,以及我们关心的抽象事物。例如,我们有关于城市、分子、图像和神经网络的符号表示,以及关于如何计算这些事物的内置知识。
And, after decades of work, we’ve covered a lot of areas in this way. But in the past, we haven’t particularly dealt with “everyday discourse”. In “I bought two pounds of apples” we can readily represent (and do nutrition and other computations on) the “two pounds of apples”. But we don’t (quite yet) have a symbolic representation for “I bought”.经过几十年的工作,我们以这种方式涵盖了许多领域。但在过去,我们并没有特别处理“日常话语”。在“我买了两磅苹果”中,我们可以轻松地表示(并对营养和其他计算进行)“两磅苹果”。但是我们(还没有)对“我买了”有符号表示。
It’s all connected to the idea of semantic grammar—and the goal of having a generic symbolic “construction kit” for concepts, that would give us rules for what could fit together with what, and thus for the “flow” of what we might turn into human language.这一切都与语义语法的概念相关联,目标是拥有一个通用的符号“构建工具包”来表达概念,这将为我们提供什么可以与什么搭配的规则,从而为我们可能转化为人类语言的“流动”提供规则。
But let’s say we had this “symbolic discourse language”. What would we do with it? We could start off doing things like generating “locally meaningful text”. But ultimately we’re likely to want more “globally meaningful” results—which means “computing” more about what can actually exist or happen in the world (or perhaps in some consistent fictional world).但假设我们有这种“象征性话语语言”。我们会用它做什么呢?我们可以开始做一些像生成“本地有意义的文本”之类的事情。但最终,我们可能会想要更多“全球有意义”的结果——这意味着“计算”更多关于世界中实际存在或可能发生的事情(或者在某个一致的虚构世界中)。
Right now in Wolfram Language we have a huge amount of built-in computational knowledge about lots of kinds of things. But for a complete symbolic discourse language we’d have to build in additional “calculi” about general things in the world: if an object moves from A to B and from B to C, then it’s moved from A to C, etc.现在在 Wolfram 语言中,我们拥有大量关于各种事物的内置计算知识。但是,要构建一个完整的符号性论述语言,我们需要构建有关世界一般事物的额外“演算法”:如果一个物体从 A 移动到 B,再从 B 移动到 C,那么它就从 A 移动到 C,等等。
Given a symbolic discourse language we might use it to make “standalone statements”. But we can also use it to ask questions about the world, “Wolfram|Alpha style”. Or we can use it to state things that we “want to make so”, presumably with some external actuation mechanism. Or we can use it to make assertions—perhaps about the actual world, or perhaps about some specific world we’re considering, fictional or otherwise.给定一个象征性的话语语言,我们可以用它来做“独立的陈述”。但我们也可以用它来问世界的问题,“沃尔夫勒姆|阿尔法风格”。或者我们可以用它来陈述我们“想要实现的事情”,可能有一些外部的激活机制。或者我们可以用它来做断言——也许是关于实际世界,或者是关于我们正在考虑的某个特定世界,无论是虚构的还是其他。
Human language is fundamentally imprecise, not least because it isn’t “tethered” to a specific computational implementation, and its meaning is basically defined just by a “social contract” between its users. But computational language, by its nature, has a certain fundamental precision—because in the end what it specifies can always be “unambiguously executed on a computer”. Human language can usually get away with a certain vagueness. (When we say “planet” does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we’re making.人类语言基本上是不精确的,主要是因为它并没有“绑定”到特定的计算实现,其含义基本上只是由其用户之间的“社会契约”定义的。但是,计算语言本质上具有一定的基本精度——因为最终它所指定的内容总是可以“在计算机上明确执行”的。人类语言通常可以容忍一定的模糊性。(当我们说“行星”时,是否包括系外行星等?)但在计算语言中,我们必须对我们所做的所有区别保持精确和清晰。
It’s often convenient to leverage ordinary human language in making up names in computational language. But the meanings they have in computational language are necessarily precise—and might or might not cover some particular connotation in typical human language usage.在计算语言中编造名称时,通常可以方便地利用普通人类语言。但是它们在计算语言中的含义必须是精确的,可能会或可能不会涵盖一些典型人类语言用法中的特定内涵。
How should one figure out the fundamental “ontology” suitable for a general symbolic discourse language? Well, it’s not easy. Which is perhaps why little has been done since the primitive beginnings Aristotle made more than two millennia ago. But it really helps that today we now know so much about how to think about the world computationally (and it doesn’t hurt to have a “fundamental metaphysics” from our Physics Project and the idea of the ruliad).一个人应该如何找出适用于一般符号论述语言的基本“本体论”?嗯,这并不容易。这也许是为什么自两千多年前亚里士多德的原始开始以来,很少有人做过这样的事情。但今天我们现在知道如何计算地思考世界的方式,这确实有所帮助(而且从我们的物理项目和“规律”的概念中得到“基本形而上学”也是有益的)。
But what does all this mean in the context of ChatGPT? From its training ChatGPT has effectively “pieced together” a certain (rather impressive) quantity of what amounts to semantic grammar. But its very success gives us a reason to think that it’s going to be feasible to construct something more complete in computational language form. And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.但在 ChatGPT 的背景下,所有这些意味着什么呢?从其训练 ChatGPT 有效地“拼凑”了某种(相当令人印象深刻的)语义语法数量。但它的成功使我们有理由认为,构建更完整的计算语言形式是可行的。与我们迄今为止对 ChatGPT 内部的了解不同,我们可以期望设计计算语言,使其对人类来说易于理解。
When we talk about semantic grammar, we can draw an analogy to syllogistic logic. At first, syllogistic logic was essentially a collection of rules about statements expressed in human language. But (yes, two millennia later) when formal logic was developed, the original basic constructs of syllogistic logic could now be used to build huge “formal towers” that include, for example, the operation of modern digital circuitry. And so, we can expect, it will be with more general semantic grammar. At first, it may just be able to deal with simple patterns, expressed, say, as text. But once its whole computational language framework is built, we can expect that it will be able to be used to erect tall towers of “generalized semantic logic”, that allow us to work in a precise and formal way with all sorts of things that have never been accessible to us before, except just at a “ground-floor level” through human language, with all its vagueness.当我们谈论语义语法时,我们可以类比三段论逻辑。起初,三段论逻辑本质上是关于用人类语言表达的陈述的一系列规则。但(是的,两千年后),当形式逻辑被发展出来时,三段论逻辑的最初基本结构现在可以用来构建包括现代数字电路操作在内的巨大“形式塔”。因此,我们可以期待更一般的语义语法也会如此。起初,它可能只能处理简单的模式,比如以文本形式表达。但一旦建立了整个计算语言框架,我们可以期待它将能够用于建立“广义语义逻辑”的高塔,使我们能够以精确和正式的方式处理各种以前从未接触过的事物,除了通过人类语言的“地面层”方式,带有所有的模糊性。
We can think of the construction of computational language—and semantic grammar—as representing a kind of ultimate compression in representing things. Because it allows us to talk about the essence of what’s possible, without, for example, dealing with all the “turns of phrase” that exist in ordinary human language. And we can view the great strength of ChatGPT as being something a bit similar: because it too has in a sense “drilled through” to the point where it can “put language together in a semantically meaningful way” without concern for different possible turns of phrase.我们可以将计算语言的构建和语义语法看作是在表示事物时的一种终极压缩。因为它使我们能够谈论可能性的本质,例如,不必处理存在于普通人类语言中的所有“成语”。我们可以将 ChatGPT 的巨大优势视为有些类似:因为它也在某种意义上“深入挖掘”到了可以“以语义上有意义的方式组合语言”的程度,而不必考虑不同可能的成语。
So what would happen if we applied ChatGPT to underlying computational language? The computational language can describe what’s possible. But what can still be added is a sense of “what’s popular”—based for example on reading all that content on the web. But then—underneath—operating with computational language means that something like ChatGPT has immediate and fundamental access to what amount to ultimate tools for making use of potentially irreducible computations. And that makes it a system that can not only “generate reasonable text”, but can expect to work out whatever can be worked out about whether that text actually makes “correct” statements about the world—or whatever it’s supposed to be talking about.那么,如果我们将 ChatGPT 应用于基础计算语言,会发生什么呢?计算语言可以描述可能发生的事情。但可以添加的是一种“流行”的感觉——例如基于阅读网络上的所有内容。但在底层,使用计算语言意味着像 ChatGPT 这样的东西立即且根本地可以访问那些用于利用潜在不可简化计算的终极工具。这使得它成为一个系统,不仅可以“生成合理的文本”,而且可以期望解决关于文本是否实际提出了关于世界的“正确”陈述,或者关于它所谈论的任何内容的问题。
So … What Is ChatGPT Doing, and Why Does It Work?那么...ChatGPT 在做什么,为什么它有效呢?
The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural net to generate text that’s “like this”. And in particular, make it able to start from a “prompt” and then continue with text that’s “like what it’s been trained with”.ChatGPT 的基本概念在某种程度上相当简单。从网络、书籍等人类创作文本的大量样本开始。然后训练神经网络生成“类似于这样”的文本。特别是,使其能够从“提示”开始,然后继续生成“与其训练内容相似”的文本。
As we’ve seen, the actual neural net in ChatGPT is made up of very simple elements—though billions of them. And the basic operation of the neural net is also very simple, consisting essentially of passing input derived from the text it’s generated so far “once through its elements” (without any loops, etc.) for every new word (or part of a word) that it generates.正如我们所看到的,ChatGPT 中的实际神经网络由非常简单的元素组成,尽管有数十亿个。神经网络的基本操作也非常简单,基本上是将从迄今为止生成的文本中得出的输入“一次通过其元素”(没有任何循环等)传递给每个生成的新单词(或单词的一部分)。
But the remarkable—and unexpected—thing is that this process can produce text that’s successfully “like” what’s out there on the web, in books, etc. And not only is it coherent human language, it also “says things” that “follow its prompt” making use of content it’s “read”. It doesn’t always say things that “globally make sense” (or correspond to correct computations)—because (without, for example, accessing the “computational superpowers” of Wolfram|Alpha) it’s just saying things that “sound right” based on what things “sounded like” in its training material.但值得注意的——而且出乎意料的——是,这个过程可以生成成功“类似于”网络上、书籍等内容的文本。它不仅是连贯的人类语言,还会“说出”符合其提示的内容,利用其“阅读过”的内容。它并不总是说出“全局意义上有意义的”内容(或与正确计算相符)——因为(例如,没有访问 Wolfram|Alpha 的“计算超能力”),它只是根据其训练材料中的内容“听起来正确”而已。
The specific engineering of ChatGPT has made it quite compelling. But ultimately (at least until it can use outside tools) ChatGPT is “merely” pulling out some “coherent thread of text” from the “statistics of conventional wisdom” that it’s accumulated. But it’s amazing how human-like the results are. And as I’ve discussed, this suggests something that’s at least scientifically very important: that human language (and the patterns of thinking behind it) are somehow simpler and more “law like” in their structure than we thought. ChatGPT has implicitly discovered it. But we can potentially explicitly expose it, with semantic grammar, computational language, etc.ChatGPT 的具体工程使其变得非常引人注目。但最终(至少在它可以使用外部工具之前),ChatGPT“仅仅”是从它积累的“常识统计数据”中提取出一些“连贯的文本线索”。但令人惊讶的是结果有多么类似人类。正如我所讨论的,这表明了至少在科学上非常重要的一点:人类语言(以及背后的思维模式)在结构上比我们想象的要简单和更“法则化”。ChatGPT 已经隐含地发现了这一点。但我们可以潜在地通过语义语法、计算语言等明确地揭示它。
What ChatGPT does in generating text is very impressive—and the results are usually very much like what we humans would produce. So does this mean ChatGPT is working like a brain? Its underlying artificial-neural-net structure was ultimately modeled on an idealization of the brain. And it seems quite likely that when we humans generate language many aspects of what’s going on are quite similar.ChatGPT 在生成文本方面的表现非常令人印象深刻,而且结果通常非常类似于我们人类会产生的内容。那么这是否意味着 ChatGPT 就像大脑一样工作呢?它的基础人工神经网络结构最终是根据对大脑的理想化建模而得出的。而且很可能当我们人类生成语言时,许多方面的情况都是非常相似的。
When it comes to training (AKA learning) the different “hardware” of the brain and of current computers (as well as, perhaps, some undeveloped algorithmic ideas) forces ChatGPT to use a strategy that’s probably rather different (and in some ways much less efficient) than the brain. And there’s something else as well: unlike even in typical algorithmic computation, ChatGPT doesn’t internally “have loops” or “recompute on data”. And that inevitably limits its computational capability—even with respect to current computers, but definitely with respect to the brain.谈到训练(又称学习)大脑和当前计算机的不同“硬件”(以及也许一些尚未开发的算法思想),ChatGPT 被迫采用一种可能相当不同(在某些方面效率远低于)大脑的策略。还有另一点:与典型的算法计算甚至不同,ChatGPT 内部并没有“循环”或“在数据上重新计算”。这不可避免地限制了它的计算能力——即使与当前计算机相比,但绝对是相对于大脑。
It’s not clear how to “fix that” and still maintain the ability to train the system with reasonable efficiency. But to do so will presumably allow a future ChatGPT to do even more “brain-like things”. Of course, there are plenty of things that brains don’t do so well—particularly involving what amount to irreducible computations. And for these both brains and things like ChatGPT have to seek “outside tools”—like Wolfram Language.目前尚不清楚如何“修复”并仍然保持以合理效率训练系统的能力。但这样做很可能会使未来的 ChatGPT 能够做更多“类似大脑的事情”。当然,大脑不擅长做很多事情,特别是涉及到不可简化计算的情况。对于这些情况,大脑和像 ChatGPT 这样的系统都需要寻找“外部工具”,比如 Wolfram 语言。
But for now it’s exciting to see what ChatGPT has already been able to do. At some level it’s a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things. But it also provides perhaps the best impetus we’ve had in two thousand years to understand better just what the fundamental character and principles might be of that central feature of the human condition that is human language and the processes of thinking behind it.但是目前看到 ChatGPT 已经能够做到的事情令人兴奋。在某种程度上,这是一个很好的例子,说明大量简单的计算元素可以做出非凡和意想不到的事情。但它也可能提供了我们在两千年来更好地理解人类语言这一人类条件的核心特征和原则可能是什么的最好动力。
Thanks 谢谢
I’ve been following the development of neural nets now for about 43 years, and during that time I’ve interacted with many people about them. Among them—some from long ago, some from recently, and some across many years—have been: Giulio Alessandrini, Dario Amodei, Etienne Bernard, Taliesin Beynon, Sebastian Bodenstein, Greg Brockman, Jack Cowan, Pedro Domingos, Jesse Galef, Roger Germundsson, Robert Hecht-Nielsen, Geoff Hinton, John Hopfield, Yann LeCun, Jerry Lettvin, Jerome Louradour, Marvin Minsky, Eric Mjolsness, Cayden Pierce, Tomaso Poggio, Matteo Salvarezza, Terry Sejnowski, Oliver Selfridge, Gordon Shaw, Jonas Sjöberg, Ilya Sutskever, Gerry Tesauro and Timothee Verdier. For help with this piece, I’d particularly like to thank Giulio Alessandrini and Brad Klee.我已经关注神经网络的发展大约 43 年了,在这段时间里,我与许多人交流过关于它们的话题。其中一些人——有些是很久以前的,有些是最近的,还有一些是跨越多年的——包括:Giulio Alessandrini、Dario Amodei、Etienne Bernard、Taliesin Beynon、Sebastian Bodenstein、Greg Brockman、Jack Cowan、Pedro Domingos、Jesse Galef、Roger Germundsson、Robert Hecht-Nielsen、Geoff Hinton、John Hopfield、Yann LeCun、Jerry Lettvin、Jerome Louradour、Marvin Minsky、Eric Mjolsness、Cayden Pierce、Tomaso Poggio、Matteo Salvarezza、Terry Sejnowski、Oliver Selfridge、Gordon Shaw、Jonas Sjöberg、Ilya Sutskever、Gerry Tesauro 和 Timothee Verdier。在撰写这篇文章时,我特别要感谢 Giulio Alessandrini 和 Brad Klee 的帮助。
Additional Resources 额外资源
Cite this asStephen Wolfram (2023), "What Is ChatGPT Doing ... and Why Does It Work?," Stephen Wolfram Writings. writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work.斯蒂芬·沃尔夫勒姆(2023 年),《ChatGPT 在做什么...为什么有效?》,斯蒂芬·沃尔夫勒姆著作。writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work.
- Author:NotionNext
- URL:https://tangly1024.com/article/ChatGPT%20%E6%AD%A3%E5%9C%A8%E5%81%9A%E4%BB%80%E4%B9%88%E2%80%A6%E2%80%A6%E4%B8%BA%E4%BB%80%E4%B9%88%E5%AE%83%E6%9C%89%E6%95%88%EF%BC%9F(%E4%B8%8B)
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!