Original source: www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html

Why Registers Are Fast and RAM Is Slow

by Mike Ash

In the previous article on ARM64, I mentioned that one advantage of the new architecture is the fact that it has twice as many registers, allowing code to load data less often from RAM, which is much slower. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?

Distance
Let's start with distance. It's not necessarily a big factor, but it's the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.

Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a roundtrip signal can only get to a component that's two inches away or less, and that assumes that the hardware is perfect and able to transmit information at the speed of light in a vacuum. For a desktop PC, that's pretty significant. However, it's much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.

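To put numbers on that, here's the back-of-the-envelope arithmetic as a small C program; a minimal sketch, where the 3GHz clock and the speed of light are the figures from the paragraph above and everything else is plain unit conversion:

    #include <stdio.h>

    int main(void) {
        const double clock_hz = 3e9;              /* the 3GHz processor from above */
        const double light_m_per_s = 299792458.0; /* speed of light in a vacuum */

        double cycle_s     = 1.0 / clock_hz;           /* one clock period */
        double one_way_m   = light_m_per_s * cycle_s;  /* how far light gets in one cycle */
        double roundtrip_m = one_way_m / 2.0;          /* farthest point a signal can reach and return from */

        printf("one cycle:        %.3f ns\n", cycle_s * 1e9);        /* ~0.333 ns */
        printf("light per cycle:  %.1f cm (~%.1f in)\n",
               one_way_m * 100.0, one_way_m * 39.37);                /* ~10 cm, ~4 in */
        printf("round-trip limit: %.1f cm (~%.1f in)\n",
               roundtrip_m * 100.0, roundtrip_m * 39.37);            /* ~5 cm, ~2 in */
        return 0;
    }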

Cost
Much as we might wish it wasn't, cost is always a factor. In software, when trying to make a program run fast, we don't go through the entire program and give every part equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it'll make the most difference.

Registers get used extremely frequently, and there aren't a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It's worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.

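For the curious, here's the arithmetic behind those figures as a tiny C sketch; the register counts and the 1GB figure are taken from the paragraph above, and the "million times" claim falls right out of the division:

    #include <stdio.h>

    int main(void) {
        /* A7 register file, per the text above */
        long gp_bits  = 32L * 64;    /* 32 64-bit general-purpose registers */
        long fp_bits  = 32L * 128;   /* 32 128-bit floating-point registers */
        long reg_bits = gp_bits + fp_bits;               /* 6,144: "about 6,000 bits" */

        long long ram_bits = 8LL * 1024 * 1024 * 1024;   /* 1GB = 2^30 bytes, 8 bits each */

        printf("register bits: %ld\n", reg_bits);
        printf("RAM bits:      %lld\n", ram_bits);
        printf("ratio:         ~%.1f million to one\n",
               (double)ram_bits / (double)reg_bits / 1e6); /* ~1.4 million */
        return 0;
    }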

Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.

Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you'd expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that's halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier is used to push the charge towards zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.

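As a cartoon of that read process, here's a heavily hedged C sketch. The voltage levels, the size of the charge-sharing bump, and the amplification loop are invented for illustration only; they are not real DRAM electrical parameters, but they show the shape of the process: precharge to the midpoint, nudge, then amplify until the value is unambiguous:

    #include <stdio.h>

    /* Toy model of reading one DRAM bit: precharge, charge sharing, sense amplification. */
    static int dram_read_bit(int stored_bit) {
        double bitline = 0.5;    /* read line precharged halfway between 0 and 1 */
        double bump    = 0.05;   /* tiny, hard-to-measure charge from the cell capacitor */

        /* Connecting the capacitor nudges the line slightly up or down. */
        bitline += stored_bit ? bump : -bump;

        /* The sense amplifier keeps pushing the line toward the nearer rail. */
        while (bitline > 0.01 && bitline < 0.99)
            bitline += (bitline > 0.5) ? 0.1 : -0.1;

        return bitline > 0.5;    /* now it's unambiguously a 0 or a 1 */
    }

    int main(void) {
        printf("stored 1 -> read %d\n", dram_read_bit(1));
        printf("stored 0 -> read %d\n", dram_read_bit(0));
        return 0;
    }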

The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and thereby cost much more.

With RAM, there's also a lot more complexity involved just in figuring out what hardware to talk to, because there's so much more of it. Reading from a register looks like:

  1. Extract the relevant bits from the instruction.
  2. Put those bits onto the register file's read lines.
  3. Read the result.

Reading from RAM looks like:

  1. Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)
  2. Send that pointer off to the MMU.
  3. The MMU translates the virtual address in the pointer to a physical address.
  4. Send the physical address to the memory controller.
  5. The memory controller figures out what bank of RAM the data is in and asks the RAM. (A rough sketch of this address decoding appears below.)
  6. The RAM figures out which particular chunk the data is in, and asks that chunk.
  7. Step 6 may repeat a couple more times before narrowing it down to a single array of cells.
  8. Load the data from the array.
  9. Send it back to the memory controller.
  10. Send it back to the CPU.
  11. Use it!

Whew.
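
To make steps 5 through 7 a bit more concrete, here's a hedged sketch of how a memory controller might carve a physical address into row, bank, column, and offset fields. The field widths below are invented for illustration; real controllers use layouts tuned to the specific DRAM parts they drive:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical split: | row (15 bits) | bank (3) | column (10) | offset (4) | */
    typedef struct {
        unsigned offset, column, bank, row;
    } DramAddr;

    static DramAddr decode(uint64_t phys) {
        DramAddr a;
        a.offset = phys & 0xF;             /* byte within a burst          */
        a.column = (phys >> 4) & 0x3FF;    /* which column in the open row */
        a.bank   = (phys >> 14) & 0x7;     /* which bank inside the chip   */
        a.row    = (phys >> 17) & 0x7FFF;  /* which row to activate        */
        return a;
    }

    int main(void) {
        uint64_t phys = 0x1F3A5C40;        /* some physical address, after the MMU's translation */
        DramAddr a = decode(phys);
        printf("row %u, bank %u, column %u, offset %u\n",
               a.row, a.bank, a.column, a.offset);
        return 0;
    }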

Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, but it can take potentially hundreds of CPU cycles to complete. How does the CPU deal with this?

First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.

Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn't one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.

Now? Not so simple.

Along with increasing clock rates, there's also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 instructions per clock cycle. These days, it's up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn't mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles (a toy sketch of this overlap follows the list below).

On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:

  1. Fetching potentially several instructions at once.
  2. Decoding potentially a completely different set of instructions.
  3. Fetching the data for potentially yet another different set of instructions.
  4. Performing computations for yet more instructions.
  5. Storing data for yet more instructions.
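
Here's a toy C sketch of that overlap: a cartoon five-stage pipeline where each instruction moves one stage per cycle. It's purely illustrative (real pipelines are far deeper and wider than this), but it shows several instructions in flight on every cycle even though each one individually spends five cycles going through the stages:

    #include <stdio.h>

    #define STAGES 5
    #define INSTRUCTIONS 8

    static const char *stage_names[STAGES] = {
        "fetch", "decode", "gather", "compute", "store"
    };

    int main(void) {
        /* Instruction i enters the pipeline on cycle i and sits in
           stage (cycle - i) on each cycle after that. */
        for (int cycle = 0; cycle < INSTRUCTIONS + STAGES - 1; cycle++) {
            printf("cycle %2d:", cycle);
            for (int i = 0; i < INSTRUCTIONS; i++) {
                int stage = cycle - i;
                if (stage >= 0 && stage < STAGES)
                    printf("  insn%d:%s", i, stage_names[stage]);
            }
            printf("\n");
        }
        return 0;
    }

Each instruction still takes five cycles from fetch to store, but once the pipeline fills up, one instruction finishes on every cycle, which is how throughput climbs well past one instruction every several cycles.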

But, you say, how could this possibly work? For example:

    add x1, x1, x2
    add x1, x1, x3

These can't possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!

It's true, that can't possibly work. That's where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn't depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.

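As a small illustration, in C rather than assembly and with purely hypothetical variables, the third statement below doesn't depend on the two additions above it, so an out-of-order CPU is free to get it done while the dependent chain is still in progress (the volatile qualifier is only there to keep the compiler from computing everything at compile time):

    #include <stdio.h>

    int main(void) {
        volatile int a = 1, b = 2, d = 3, c = 10;

        a = a + b;   /* first add                                   */
        a = a + d;   /* depends on the result of the first add      */
        c = c * 2;   /* independent; may effectively execute early  */

        printf("a = %d, c = %d\n", a, c);   /* a = 6, c = 20 */
        return 0;
    }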

What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you're really lucky and the value is in L1 cache, it'll only take a few cycles. If you're unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.

The CPU will try not to twiddle its thumbs, because that's inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it's going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don't depend on the data being loaded, they can still be executed. Finally, once it's executed everything it can and it absolutely cannot proceed any further without the data it's waiting on, it has little choice but to stall and wait for the data to come back from RAM.

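You can get a rough feel for that difference yourself with a pointer-chasing microbenchmark. Here's a minimal, hedged sketch in C: it walks a randomly shuffled cycle so each load depends on the previous one and the hardware can't simply prefetch ahead, then reports nanoseconds per load. The array sizes are assumptions, the small one chosen to fit in cache and the large one to spill out to main RAM, and the exact numbers will vary a lot from machine to machine:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static uint64_t rng = 88172645463325252ULL;
    static uint64_t xorshift64(void) {                  /* tiny RNG; avoids RAND_MAX limits */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    /* Average nanoseconds per dependent load while walking a single cycle of n slots. */
    static double chase_ns(size_t n, size_t hops) {
        size_t *next = malloc(n * sizeof *next);
        if (!next) return -1.0;
        for (size_t i = 0; i < n; i++) next[i] = i;

        /* Sattolo's shuffle: produces one big cycle that visits every slot. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % i);
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        clock_t t0 = clock();
        size_t p = 0;
        for (size_t i = 0; i < hops; i++) p = next[p];  /* each load waits on the previous one */
        clock_t t1 = clock();

        volatile size_t sink = p; (void)sink;           /* keep the loop from being optimized away */
        free(next);
        return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)hops;
    }

    int main(void) {
        size_t hops = 20 * 1000 * 1000;
        printf("small (should fit in cache): ~%.1f ns per load\n", chase_ns((size_t)1 << 12, hops));
        printf("large (spills out to RAM):   ~%.1f ns per load\n", chase_ns((size_t)1 << 24, hops));
        return 0;
    }

On a machine where a clock cycle is a fraction of a nanosecond, a per-load figure in the tens or hundreds of nanoseconds for the large case is exactly the "hundreds of cycles" described above.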

Conclusion

  1. RAM is slow because there's a ton of it.
  2. That means you have to use designs that are cheaper, and cheaper means slower.
  3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code.
  4. That means that the first thing a CPU does while waiting for a RAM load is run other code.
  5. If all else fails, it'll just stop and wait, and wait, and wait, and wait.