视频链接
课件链接
该视频课程使用 64位编译器！
本文使用编译器从Ch.3.6开始换到64位，因此3.6之前地址为4字节，之后为8字节！

Ch1.计算机系统漫游

C编译(ccl)与链接(ld)
Switch是否总时比if-else高效?
while循环总比for循环高效么?
指针引用比数组高效么?
函数的本地临时变量为什么比入参的引用更高效?
算数表达式的括号也能影响运算速度?

Ch2.信息的表示和处理

Integer – 补码与符号位

负数“补码”可视化

事实上，有符号数（two’s complement，补码）的符号位，是具有权重的，只不过需要取反，如-2表示为 1111,1110=−27+∑w=1w=62w+0∗2w=0=−21111,1110=\red {-2^7}+\sum_{\red {w=1}}^{w=6}2^w+0*2^{\red {w=0}}=-21111,1110=−27+∑w=1w=62w+0∗2w=0=−2
“unsigned and singed numbers have same bit pattern，just a bunch of bits to computer itself.”

//sizeof return unsigned int, cast a into unsigned, you got stuck forever
for(int a=1;a-sizeof(a)>=0;a--)
//so be care of unsigned "i" used for array in case a[i]
//i=0; i--=UMAX; a[i] may cause out of bounds
int main()
{unsigned int a=numeric_limits<unsigned int>::max();int b=-1;unsigned int c=-3;cout<<(int)a<<" "<<a<<endl;  //-1 4294967295cout<<(b==a?"True":"Flase")<<endl; //Truecout<<(b>a?"True":"Flase")<<endl; //Flasecout<<std::hex<<c<<" "<<-c<<" "<<c+(-c)<<endl; //fffffffd 3 0cout<<std::hex<<b<<"\n"<<numeric_limits<int>::max()<<endl; //ffffffff 7fffffff cout<<b+numeric_limits<int>::max()<<endl; //7ffffffereturn 0;
}

符号位扩展/截断

0110=−0×23+1×22+1×21+0×20=60110 = -0\times2^{3}+1\times2^{2}+1\times2^{1}+0\times2^{0}=60110=−0×23+1×22+1×21+0×20=6
1110=−1×23+1×22+1×21+0×20=−21110 = -1\times2^{3}+1\times2^{2}+1\times2^{1}+0\times2^{0}=-21110=−1×23+1×22+1×21+0×20=−2
11110=−1×24+1×23+1×22+1×21+0×20=−211110 = \red{-1\times2^{4}+1\times2^{3}}+1\times2^{2}+1\times2^{1}+0\times2^{0}=-211110=−1×24+1×23+1×22+1×21+0×20=−2
符号位左移填充，−1×2n+1+1×2n=−1×2n\red{-1\times2^{n+1}+1\times2^{n}}=-1\times2^{n}−1×2n+1+1×2n=−1×2n“负权重”不变

Floating point – IEEE 754

Numerical Form

(−1)sM×2E(-1)^{s}M\times2^{E} (−1)sM×2E

precision	sign field	exp field	frac field
value	s	exp	frac
single	1 bit	k = 8 bit	23 bit
double	1bit	k = 11 bit	52 bit

Extended precision 英特尔特用 | 1 bit |15 bit | 64 bit
共10字节，对齐16字节，因此后6字节为空

Normalized Values

exp≠000...0exp \neq 000...0exp=000...0 or 111...1111...1111...1

E=exp−bias=exp−(2k−1−1)E = exp - bias =exp - (2^{k-1}-1)E=exp−bias=exp−(2k−1−1)
M=1.xx...x2M = 1.xx...x_2M=1.xx...x2

Why bias =2k−1−1= 2^{k-1}-1=2k−1−1, not 2k−12^{k-1}2k−1?

Denormalized Values

exp=000...0exp = 000...0exp=000...0

E=1−biasE = 1 - biasE=1−bias
M=0.xx...x2M = 0.xx...x_2M=0.xx...x2

	s	exp	frac	represent
denorms	0	0000,0000	11…1	2−126×(2−1+...+2−23)=2−126×(1−2−23)2^{-126}\times(2^{-1}+...+2^{-23})=2^{-126}\times(1-2^{-23})2−126×(2−1+...+2−23)=2−126×(1−2−23)
norms	0	0000,0001	00…0	1.0×2−1261.0\times2^{-126}1.0×2−126
将"1.00…0"移位成0.1×210.1\times2^{1}0.1×21，并将212^121"隐藏"至E中，因此E=1−bias≠0−biasE=1-bias\red\neq 0-biasE=1−bias=0−bias
从而实现了从2−1262^{-126}2−126到2−1272^{-127}2−127，从DENORMmaxDENORM_{max}DENORMmax到NORMminNORM_{min}NORMmin的平滑过渡，使浮点数如无符号整型+1进位！
非标准化值最高精度=0,00000000,000...01=1×2−126−23=2−149=0,00000000,000...01=1\times2^{\red{-126-23}}=2^{-149}=0,00000000,000...01=1×2−126−23=2−149

使用非标准化浮点可以表示更接近“0”的小数，越靠近0，E越小分辨率越高，数与数间距越小

Special Values

	exp	frac	meaning
+∞+\infin+∞	111…1	000…0	overflows
NaN	111…1	≠\red\neq= 000…0	no feasible answer

1.0/−0.0=−∞1.0/-0.0=-\infin1.0/−0.0=−∞
−1=∞−∞=∞×0=NaN\sqrt{-1}=\infin - \infin =\infin \times 0 =NaN−1=∞−∞=∞×0=NaN

Special Properties of IEEE Encoding

Using unsigned Integer Comparison，Except NaN
Round to fit limited “frac field”.（ especially addition and multiplication）， IEEE use Nearest Even. 二进制中，末尾0为偶，1为奇

Round to nearest 2−22^{-2}2−2，watch out nearsest right bit（2−32^{-3}2−3 in this case）

value binary Note Rounded Rounded Value

23322 \frac{3}{32}2323 10.00011210.00\red{0}11_{2}10.000112 0.00011<2−30.00011<2^{-3}0.00011<2−3 10.00210.00_{2}10.002 2

23162\frac{3}{16}2163 10.00110210.00\red{1}10_{2}10.001102 0.00110>2−30.00110>2^{-3}0.00110>2−3 10.01210.01_{2}10.012 2142\frac{1}{4}241

2782\frac{7}{8}287 10.11100210.11\red{1}00_{2}10.111002 0.00100=2−30.00100=2^{-3}0.00100=2−3
got odd (10.11) if drop >it 10.1112+0.0012=11.00210.111_{2}+0.001_{2}=11.00_{2}10.1112+0.0012=11.002 333

2582\frac{5}{8}285 10.10100210.10\red{1}00_{2}10.101002 0.00100=2−30.00100=2^{-3}0.00100=2−3
got even (10.10) if drop >it 10.1012−0.0012=10.10210.101_{2}-0.001_{2}=10.10_{2}10.1012−0.0012=10.102 2122\frac{1}{2}221

value	binary	Note	Rounded	Rounded Value
23322 \frac{3}{32}2323	10.00011210.00\red{0}11_{2}10.000112	0.00011<2−30.00011<2^{-3}0.00011<2−3	10.00210.00_{2}10.002	2
23162\frac{3}{16}2163	10.00110210.00\red{1}10_{2}10.001102	0.00110>2−30.00110>2^{-3}0.00110>2−3	10.01210.01_{2}10.012	2142\frac{1}{4}241
2782\frac{7}{8}287	10.11100210.11\red{1}00_{2}10.111002	0.00100=2−30.00100=2^{-3}0.00100=2−3 got odd (10.11) if drop >it	10.1112+0.0012=11.00210.111_{2}+0.001_{2}=11.00_{2}10.1112+0.0012=11.002	333
2582\frac{5}{8}285	10.10100210.10\red{1}00_{2}10.101002	0.00100=2−30.00100=2^{-3}0.00100=2−3 got even (10.10) if drop >it	10.1012−0.0012=10.10210.101_{2}-0.001_{2}=10.10_{2}10.1012−0.0012=10.102	2122\frac{1}{2}221

Addition is Commutative but not associative（可交换，无结合）

(3.14+1e10)−1e10=1e10−1e10=0(3.14+1e10)-1e10=1e10-1e10=0(3.14+1e10)−1e10=1e10−1e10=0
3.14+(1e10−1e10)=3.14+0=3.143.14+(1e10-1e10)=3.14+0=3.143.14+(1e10−1e10)=3.14+0=3.14

Additive inverse （存在相反数，带符号位相加和为0）except for infinities and NaN
Multiplication Commutative but not Associative

(1e20∗1e20)∗1e−20=∞∗1e−20=∞(1e20*1e20)*1e-20=\infin * 1e-20=\infin(1e20∗1e20)∗1e−20=∞∗1e−20=∞
1e20∗(1e20∗1e−20)=1e20∗1=1e201e20*(1e20*1e-20)=1e20*1=1e201e20∗(1e20∗1e−20)=1e20∗1=1e20
dmin<0d_{min}<0dmin<0，dmin∗2=overflow<0d_{min}*2 =overflow < 0dmin∗2=overflow<0 #负数溢出也小于0

[Key] keep dynamic range in your mind while adding or multiplying floating point.
类型转换改变位(值)，如浮点转整型，直接Truncates fractional part（round toward zero）.

#include <iostream>
#include <limits>
using namespace std;
int main()
{int x=0x7FFFFFFF;float f=0.0;double d=0.0;cout<<"int(x):"<<x<<endl<<"float(x):"<<(float)x<<endl;cout<<((x==(int)(float)x)?"True":"False")<<endl;//返回True，可能有编译器优化f=(float)x;//float仅23个有效位，x中最后9位被round掉cout<<((x==(int)f)?"True":"False")<<endl; //返回Falsereturn 0;
}

Data Lab

Ch3.Machine Level Programming

3.1 x86

Intell x86（字母“x”86，不念“叉86”）

date Transistors MHz feature

8086 1978 29K 5-10 First 16-bit microprocessor,1MB addr space
Slight vatiation was a basis for IBM pc

8286

8386 1985 275K 16-33 32bit + “flat addressing”=> Unix capable
IA32（Intell Architecture 32）

Pentium 4E 2004 125M 2800-3800 First x86-64
power consumption 100W
power budget problem

Core 2 2006 291M 1060-3500 First multi-core Inter processor

Core i7 2008 731M 1700-3900 4 cores — shark machine

1980s，RISC vs. CISC.（Reduced instruction set computer）

	date	Transistors	MHz	feature
8086	1978	29K	5-10	First 16-bit microprocessor,1MB addr space Slight vatiation was a basis for IBM pc
8286
8386	1985	275K	16-33	32bit + “flat addressing”=> Unix capable IA32（Intell Architecture 32）
Pentium 4E	2004	125M	2800-3800	First x86-64 power consumption 100W power budget problem
Core 2	2006	291M	1060-3500	First multi-core Inter processor
Core i7	2008	731M	1700-3900	4 cores — shark machine
1980s，RISC vs. CISC.（Reduced instruction set computer）

Desktop Mode	Server Model
4 cores	8 cores
Integrated graphics	Integrated I/O
3.3-3.8 GHz	2~2.6 GHz
65W	45W

Advanced Micro Devices

years	Intell	AMD
2001	A little bit slower for a lot cheaper	Itanium /aɪˈteɪniəm/ 安腾Arch = IA64 too ideally， disappointing
2003	Come up with x86-64, or called “AMD64”	Insisting focus on IA64
2004		EM64T（almost identical to x86-64） lots of code still run in 32 bit mode.
Cross license allows AMD to produce x86 processors.

Acorn Risc Machine

Sufficiently simple and could be customized（个性化）.
Lower power requirement than x86 machine.
Sell companies the rights (Intellectual property) to use their designs,not chips.

Definitions

terminology	definitions	Examples
Architechture or ISA	Instruction Set Architecture The parts of a processor design that one needs to understand or write machine code.	Instruction Set Specification，Registers.
Microarchitecture	Implementation of the architecture ISA is the abstraction helps hardware people design	Cache sizes and core frequency.
Machine Code	Byte-level programs that processor executes
Assembly Code	Text version of machine code

3.2 Machine Code View

There is no way (or instructions) you can directly access or manipulate cache.

#mermaid-svg-8y33xYa5X8DWSDIr {font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .error-icon{fill:#552222;}#mermaid-svg-8y33xYa5X8DWSDIr .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-8y33xYa5X8DWSDIr .edge-thickness-normal{stroke-width:2px;}#mermaid-svg-8y33xYa5X8DWSDIr .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-8y33xYa5X8DWSDIr .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-8y33xYa5X8DWSDIr .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-8y33xYa5X8DWSDIr .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-8y33xYa5X8DWSDIr .marker{fill:#333333;stroke:#333333;}#mermaid-svg-8y33xYa5X8DWSDIr .marker.cross{stroke:#333333;}#mermaid-svg-8y33xYa5X8DWSDIr svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-8y33xYa5X8DWSDIr .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .cluster-label text{fill:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .cluster-label span{color:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .label text,#mermaid-svg-8y33xYa5X8DWSDIr span{fill:#333;color:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .node rect,#mermaid-svg-8y33xYa5X8DWSDIr .node circle,#mermaid-svg-8y33xYa5X8DWSDIr .node ellipse,#mermaid-svg-8y33xYa5X8DWSDIr .node polygon,#mermaid-svg-8y33xYa5X8DWSDIr .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-8y33xYa5X8DWSDIr .node .label{text-align:center;}#mermaid-svg-8y33xYa5X8DWSDIr .node.clickable{cursor:pointer;}#mermaid-svg-8y33xYa5X8DWSDIr .arrowheadPath{fill:#333333;}#mermaid-svg-8y33xYa5X8DWSDIr .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-8y33xYa5X8DWSDIr .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-8y33xYa5X8DWSDIr .edgeLabel{background-color:#e8e8e8;text-align:center;}#mermaid-svg-8y33xYa5X8DWSDIr .edgeLabel rect{opacity:0.5;background-color:#e8e8e8;fill:#e8e8e8;}#mermaid-svg-8y33xYa5X8DWSDIr .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-8y33xYa5X8DWSDIr .cluster text{fill:#333;}#mermaid-svg-8y33xYa5X8DWSDIr .cluster span{color:#333;}#mermaid-svg-8y33xYa5X8DWSDIr div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-8y33xYa5X8DWSDIr :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;}

addresses data instructions

CPU

Registers

Condition Codes

Memory

Code

Data

Stack

PC：Program counter
Address of next instruction
Called “RIP”（x86-64）
Register file
Heavily used program data
Condition codes
Store status information about most recent arithmetic or logical operation
Used for conditional branching
Memory
Byte addressable array
Code and user data
Stack to support procedures

以之前的浮点实验为例
调用gcc 实际间接调用了一系列（a sequency of program）进程
Options starting with -g, -f, -m, -O, -W, or --param are automatically
【-O】Do optimization
【-Og】Use debug level optimizations to makethe code readable
【-O2】The most common optimization level

Instruction	Function	output
g++ -E *.cpp	Preprocess only	*.i
g++ -Og -S *.cpp	“Stop” after compile	*.s
g++ -c *.s	Compile to get assemblely code	*.o
g++ *.o	Link and get excutable program	*.exe、a.out
objdump -d *.exe	disassemble binary excutable program	*.s

Period indicates “not instructions” but information needs by debuger、linker and so on.

3.3 Machine-Level Programming I：Basics

Disasemble by gdb

#include <iostream>
using namespace std;
int main()
{cout<<"hello world\n";return 0;
}

>gdb .\*.exe
>(gdb) disassemble main
Dump of assembler code for function main:0x00401460 <+0>:     push   %ebp0x00401461 <+1>:     mov    %esp,%ebp0x00401463 <+3>:     and    $0xfffffff0,%esp0x00401466 <+6>:     sub    $0x10,%esp0x00401469 <+9>:     call   0x401a30 <__main>0x0040146e <+14>:    movl   $0x405065,0x4(%esp)0x00401476 <+22>:    movl   $0x408254,(%esp)
End of assembler dump.
>(gdb) x/3xb 0x00401466
0x401466 <main+6>:      0x83    0xec    0x10

x86-64 Integer Registers
%r* means 64bits
%e*x = %r*L （%r*x的low-order 32 bits）
why “ax，bx，ex …”? 历史沿用

Registers	Purposes
EAX	Accumulator for operands and results data
EBX	Pointer to data in the DS segment
ECX	Counter for string and loop operations
EDX	I/O pointer
ESI	Pointer to data in the segment pointed to by the DS register; source pointer for string operations
EDI	Pointer to data in the segment pointed to by the ES register; destination pointer for string operations
ESP	Stack pointer (in the SS segment)
EBP	Pointer to data on the stack (in the SS segment)，or called base pointer

详见 Intel SDM 下载地址

movq Src, Dest

“q” for “quad word” (64bits，Intell terminology)
“l” for “long word” (32bits)
“word” for 16 bits (8086)

Src Types	Example	Dest	C analog（treat reg as var）
Immediate	$0x400	Reg，Mem	temp = 0x4; *p=0x4;
Register	%rax，%r13	Reg，Mem	temp2 = temp1;*p=temp;
Memory	(%rax)	Reg	temp = *p;
Memory Dereference

Normal Form

movq (Reg)，[Reg/Mem]

location in Memory，Address = register value

C type	Machine Level

void swap(long *xp,long *yp)
{
long t0 = *xp;
long t1=*yp;
*xp=t1;
*yp=t0;
}|

swap:
movq (%rdi), %rax
movq (%rsi), %rdx
movq %rdx, (%rdi)
movq %rax, (%rsi)
ret >* [Arguments always come in (at most 6) specific registers in orders]()：rdi，rsi，... >* [Register Allocation algorithm？]()

Displacement

movq Disp(Reg)，[Reg/Mem]

location in memory，Address = value in Reg + const Disp

Most General/Elaborate Form

movq Disp(Rb，Ri，Scale)，Reg/Mem

location in memory，Address = Rb + Scale*Ri + Disp

leaq Src，Dst

Load Effective Address = ampersand(&) operation in C
Preety handy way to do arithmetic and C compiler likes to use it.
Src would be memory refrence.
Dest has to be register，store the address computed from Src， not value.

long m12(long x)
{return 12*x;
}

//g++ -S *.cpp
__Z3m12l:movl   %edx, %eaxaddl  %eax, %eaxaddl  %edx, %eaxsall  $2, %eaxpopl    %ebpret
//g++ -Og -S *.cpp
__Z3m12l:movl   4(%esp), %eaxleal   (%eax,%eax,2), %edx //x+x*2 ==> dxleal    0(,%edx,4), %eax //(x+x*2)*4 ==> axret
//lectureleal   (%eax,%eax,2), %edx //x+x*2 ==> dxsall    $2, %edx //(x+x*2)<<2 ==> axret

Other Instructions

Two Operand

Format	Computation in C form
addq Src, Dest	Dest = Dest + Src
subq Src, Dest	Dest = Dest - Src
imulq Src, Dest	Dest = Dest * Src
salq Src, Dest	Dest = Dest << Src (=shlq)
sarq Src, Dest	Dest = Dest >> Src (Arithmetic)
shrq Src, Dest	Dest = Dest >> Src (Logical)
xorq Src, Dest	Dest = Dest ^ Src
andq Src, Dest	Dest = Dest & Src
orq Src, Dest	Dest = Dest \| Src

One Operand

Format	Computation in C form
incq Dest	Dest = Dest + 1
decq Dest	Dest = Dest - 1
negq Dest	Dest = -1 * Dest (negate 取反)
notq Dest	Dest = ~ Dest (tilde “~” not exclamation “!”)
sarq Src, Dest	Dest = Dest >> Src (Arithmetic)
shrq Src, Dest	Dest = Dest >> Src (Logical)
xorq Src, Dest	Dest = Dest ^ Src
andq Src, Dest	Dest = Dest & Src
orq Src, Dest	Dest = Dest \| Src

3.4 Machine-Level Programming II：Control

So far the registers we should know

Temporary data ：%rax…%rdx，%rsi，%rdi，%r8~%r15…
Location of runtime stack：
%rsp（stack pointer）

%rbp（base pointer）

…
Location of current control ：%rip（instruction pointer）…
Status of recent tests：CF，ZF，SF，OF…Total 8 of them

Condition codes

All of them is one bit flag, get or set not directly but as a side effect of other operation.

Registers	name to memorize	set if
CF	Carry Flag	carry out from most significant bit （unsigned overflow）
ZF	Zero Flag	Dest == 0
SF	Sign Flag	Dest<0（as signed）
OF	Overflow Flag	two’s-complement（signed）overflow a>0，b>0，a+b<0 a<0，b<0，a+b>0 a*b<0，can’t overflow
Attention！ Lea 不影响标志位！

各指令对标志位的影响

cmpq Src2，Src1

Do substraction （Src1 - Src2），and set 4 flags above，but do nothing（like store in Dest）with the result

Src1-Src2	CF	ZF	SF	OF

0|0|0|0|0
=0|0|1|0|0
(unsigned) cmpq 2，1|1|0|1|0
(signed) cmpq 2，1|1|0|1|0
(signed) cmpq INT_MAX，INT_MIN|0|0|0|1

小实验

//test.cpp
#include <iostream>
using namespace std;
int main()
{unsigned int ua=1;unsigned int ub=2;unsigned int uc=0;uc=ua-ub;return 0;
}

g++ -g -DEBUG test.cpp #-g 保留行号
gdb a.exe
(gdb) list #打印行号
(gdb) break 9 #在return前设置断点
(gdb) run #运行并停在第一个断点
(gdb) info registers eflags
eflags 0x297 [ CF PF AF SF IF ] #中括号内Condition Code被置1

个人理解，只要符号位进位，CF便会 set

Src1+Src2|binary form|result|flags
-|-|-|-|-
INT_MIN2+INT_MIN2\frac {INT\_MIN}{2} + \frac {INT\_MIN}{2}2INT_MIN+2INT_MIN|1100…00
+1100…00|(1)10…00|CF=1，SF=1，OF=0
INT_MIN2+INT_MIN2−1\frac {INT\_MIN}{2} + \frac {INT\_MIN}{2} - 12INT_MIN+2INT_MIN−1|1100…00
+1100…00
+1111…11|(1)011…1|CF=1，SF=0，OF=1
负+负=正 overflow

testq Src2，Src1

Like computing a & b without setting destination.
testq Src1, Src2 = Computing（Src1 & Src2） set eflags

SetX Instructions

Set low-order byte of destination to 0 or 1 based on combinations of condition codes，without changing remaining 7 bytes.

Setx	Condition	set True if last result
sete	ZF	=0
setne	~ ZF	≠0\neq 0=0
sets	SF	<0
setns	~ SF	>=0
setg	~ （SF ^ OF）& （~ ZF）	> (signed)
setge	~ （SF ^ OF）	>= (signed)
setl	（SF ^ OF）	< (signed)
setle	（SF ^ OF）\| ZF	<= (signed)
seta	~CF & ~ZF	Above (unsiged)
setb	CF	Below (unsigned)

举例：

bool mycmp(long a,long b)
{return a>b;
}

mycmp:movl   8(%esp), %eaxcmpl   %eax, 4(%esp)setg   %al#movzbq  %al, %eax   #move with zero extension byte to quadret

x86-64’s（AMD）weird quirks
If result is 32 bits，remaining 32 bits will be zeroed，but other-length data type instruction won’t.

Jumping

jmp、je、jne、js、jns、jg、jge、jl、jle、ja、jb, same as setX.
举例：

 long abs(long x,long y){long result;if(x>y)result = x-y;elseresult = y-x;return result;
}

>gcc -Og -S -fno-if-conversion test.cpp
abs: # only exist in assembly code，changing into address in object codemovl  4(%esp), %edx #xmovl    8(%esp), %eax #ycmpl    %eax, %edx # y, xjg L14subl %edx, %eax # y-xret
L14:subl    %eax, %edx # x-ymovl    %edx, %eaxret

Conditional Moves

指令重排：if-else两个分支结果都计算，最后再选择结果返回.

形如 if(test) Dest = Src，straightly simple computations.
95后 x86 processors支持.
safe and no side effects.

>gcc -Og -S test.cpp #去掉-fno-if-conversion，gcc 默认允许指令重排
abs:movq    %rdi, %rax #xsubq   %rsi, %rax #x=x-ymovq  %rsi, %rdx #ysubq   %rdi, %rdx #y=y-xcmpq  %rsi, %rdi #x-ycmovle   %rdx, %rax #if(x<=y)ret(y-x)ret                #result in %rax

Why：Branches are very disruptive to instruction flow through pipelines，Wasteful but more efficient.
See：pipelining、branch prediction.
只要branch prediction足够准确（98%），“管线“执行效率就会很高（提前20条指令）。
预测错误，回头重算，最多花费40时钟周期。
（gcc主动）避免进行指令重排的情况

Expensive Computations in either branch.（如找质因数）
Risky Computations.（value = p ? (*p) : 0 ; //如判断合法性）
Computations with side effects. （value = x>0 ? x*=7 : x+=3; //如都会改变X本身的值）

Loops

“Do-While” Loop

long popcount(unsigned long x)
{long res=0;do{res += x & 0x1;x >>= 1;}while(x);return res;
}

popcount:movl    4(%esp), %edxmovl   $0, %eax
L12:movl    %edx, %ecxandl  $1, %ecxaddl    %ecx, %eaxshrl  %edxjne L12ret

“While” Loop
Test at the very beginning and skip the loop if condition doesn’t hold.

long popcount(unsigned long x)
{ ... while(x){...} ... }

popcount:movl    4(%esp), %edx # xmovl   $0, %eax
L13:testl   %edx, %edxje    L11movl %edx, %ecxandl  $1, %ecxaddl    %ecx, %eaxshrl  %edxjmp L13
L11:ret

“For” Loop
for( Init; Test; Update)
body;
Semantics =
Init;
while(Test)
{ Body; Update; }

long popcount(unsigned long x)
{size_t i=0;long res=0;for(i=0;i<32;i++){res += x & 0x1;x >>= 1;}return res;
}

>g++ -Og -S test.cpp
popcount:pushl  %ebxmovl    8(%esp), %ecx   # xmovl $0, %eax        # res=0movl    $0, %edx        # i=0
L13:cmpl    $31, %edx       # i>31ja L11                 # returnmovl    %ecx, %ebx      andl    $1, %ebx        # x & 1addl %ebx, %eax      # res += ishrl    %ecx            # x >>= 1addl    $1, %edx        # i += 1jmp   L13
L11:popl    %ebxret

提升编译优化等级-O1，无需initial test，转换为"do-while"循环。

>g++ -O1 -S test.cpp
popcount:...movl    $32, %edxmovl   $0, %eax
L4:movl %ecx, %ebx...shrl   %ecxsubl    $1, %edxjne L4...

首次test非真，无循环。

popcount:movl    $0, %eaxret

Switch Statements

条件变量必须是“整型“.
通过“Jump Table”的形式，将分支入口地址，按Case-Value大小排序，记录成表.
较紧凑随机访问，时间复杂度O(1).

long switch_try(unsigned long x)
{long res=0;switch (x){case 1:res += 1;break;case 2:res += 2;case 3:res *= 3;break;case 5:case 4:res -=1;break;case -1:res *= -1;break;default:res = 100; }return res;
}

switch_try:movl  4(%esp), %eax   # xleal 1(%eax), %edx   # case -1负数的情况，通过+偏置1转化为无符号数cmpl    $6, %edx        # case 中最大值5，偏置后为6ja L12                 # 小技巧# 用ja比较，小于-1的负数，偏置后仍为负数# 在无符号数格式下，大于有符号数的正数范围，从而归属 defultjmp   *L14(,%edx,4)       # Indirect jump，L14+4*(x+偏置) 的单元存储的值，作为jump地址.section .rdata,"dr".align 4
L14:                        # Jump Table，compiler给结构，assembler(汇编器)填地址.long   L13             # need a long type value as address x=-1.long  L12             # x=0.long L11             # x=1.long L16             # x=2.long L17             # x=3.long L18             # x=4.long L18             # x=5.text
L17:movl    $0, %eax        # x=3，res=0*3=0
L16:                        # x=2，res+=2，res==x，因此res用%eax表示 有优化leal  (%eax,%eax,2), %eax     # res=2*3=6ret
L13:                        # x=-1movl $0, %eax        # res=0*(-1)=0，compiler直接优化赋值0ret
L18:                        # x=4movl  $-1, %eaxret
L12:                        # default casemovl  $100, %eax
L11:                        # x=1rep ret                   # ja前已偏置+1，故直接返回%eax

稀疏（如 case [0、100]）退化为if-else形式，时间复杂度O(n).

long switch_try(long x)
{long res=0;switch (x){case 1:res=0;break;case 100:res=99;break;default:res = -1; }return res;
}

switch_try:movl  4(%esp), %eaxcmpl   $1, %eaxje  L13cmpl $100, %eaxje    L15movl $-1, %eaxret
L13:movl    $0, %eaxret
L15:movl    $99, %eaxret

较稀疏使用二叉树，时间复杂度O(log2nlog_2^{n}log2n).

Switch是否总时比if-else高效？

根据以上分析，答案是否定的

We are never happy with a simple explanation. We want to understand how we could actually implement it as a program if we ever had to do so.

3.5 Machine-Level Programming II：Procedures

ABI，Application Binary Interface，一种机器码层面的二进制程序接口协定。

Passing control
- Beginning of procedure code
- Back to return point
Passing data
- Procedure arguments
- Return value
Memory management
- Allocate during procedure
- Deallocate upon return
  One of main targets is doing whatever is omly absolutely needed.

Stack Structure

Adress	Values Meaning
High Adress	（%rbp）Stack Bottom
…
Low Adress	（ %rsp ）Stack Top

pushq Src
step 1：Fetch operand at Src（imediate or registers）.
step 2：Decrement %rsp by 8.
step 3：Write operand at address given by %rsp.

popq Dest
step 1：Read value at address given by %rsp.
step 2：Increament %rsp by 8.
step 3：Store value at Dest（must be register）.
Data at top of stack is stll there in the memory，but is no longer part of stack.

passing control

call label
step1：Push return address on stack，sp=sp-sizeof(address)
step2：Jump to label //%rip是不允许被显式操作的
ret
step1：Pop address（of next instruction right after call）from stack，sp=sp+sizeof(address)
step2：Jump to address

long sub_try(long x)
{return x+1;
}
long call_try(long x)
{return sub_try(x+1);
}

sub_try:movl 4(%esp), %eaxaddl   $1, %eaxret
call_try:subl   $4, %espmovl    8(%esp), %eaxaddl   $1, %eaxmovl    %eax, (%esp)call    sub_try         # sp=sp-4; *sp=addr after calladdl    $4, %espret

passing data

ABI规定
前6个整型入参用寄存器{ %rdi、%rsi、%rdx、%rcx、%r8、%r9 }，6个之后的参数适用栈，返回值用 %rax。

long incr(long *p,long val)
{long x=*p;long y=x+val;*p=y;return x;
}
long call_incr()
{long v1=15213;long v2=incr(&v1,3000);return v1+v2;
}

incr:movl    4(%esp), %edx   # dx=*(-28+4)=-4movl (%edx), %eax    # ax=*(-4)=15213movl  %eax, %ecx      # cx=axaddl    8(%esp), %ecx   # cx=cx+*(-20)=15213+3000movl   %ecx, (%edx)    # *(-4)=18213ret                       # sp=sp+4=-24
call_incr:                  # 设sp=0   <--- startsubl    $24, %esp       # sp=-24 分配24字节空间 movl $15213, 20(%esp)# *(-4)=15213movl  $3000, 4(%esp)  # *(-20)=3000leal  20(%esp), %eax  # ax=-4movl    %eax, (%esp)    # *(-24)=-4call    incr            # sp=sp-4=-28，4字节返回地址入栈addl    20(%esp), %eax  # ax=15213+*(-4)=15213+18213addl    $24, %esp       # sp=0 清空栈ret

浮点型入参使用一组特殊的寄存器。

函数的本地临时变量为什么比入参的引用更高效？

因为临时变量用寄存器，而引用需要解引用，或间接寻址，相对低效

必须给 Local Data 分配内存的几种情况：
- 寄存器不够用
- 对Local Variable 应用了 ‘&’（address operator），寄存器没有地址，只能用内存表示
- Local variable 是数组或结构体

Memory management

Code must be “Reentrant”
- Multiple simultaneous instantiations of single procedure
Need place to store state of each instantiation
- Arguments
- Local variables
- Return pointer

stack fame ：Each block we use for particular call。
发生调用时：

[Caller 栈帧 call 指令] sp = sp - sizeof( Addr )
[Caller 栈帧 call 指令] 将Return Addr 写入 sp 指向的栈帧尾部
[Callee 栈帧] sp = sp - frame size，分配 Calle 栈帧
[Callee 栈帧] 基于sp寻址，由高到低地址先后保存 Registers（即 Callee Save），局部变量等
[Callee 栈帧] sp = sp + frame size，准备返回 Caller
[Caller 栈帧 ret 指令] 将 sp 指向的 Return Addr 返回给 ip
[Caller 栈帧 ret 指令] sp = sp + sizeof( Addr )
[Caller 栈帧] ret 执行完毕，开始执行 Caller 返回点命令

P calls Q，Arguments > No.6存在P帧中。
大多数系统限制了栈的最大深度。
%rbp 作为 frame pointer。
某些情况下%rbp会用于记录 caller的栈帧底。
《CS:APP（Third.Ed）》英文版 P.286

void proc(long a1, long *a1p, int a2, int *a2p, short a3, short *a3p, char a4, char *a4p)
{*a1p += a1;
*a2p += a2;
*a3p += a3;
*a4p += a4;
}
long call_proc()
{long x1 = 1;
int x2 = 2;
short x3 = 3;
char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)*(x3-x4);
}

call_proc:               # callee savesubq $32, %rsp         # Allocate 32-byte stack framemovq $1, 24(%rsp)     # Store 1 in &x1movl $2, 20(%rsp)   # Store 2 in &x2movw $3, 18(%rsp)   # Store 3 in &x3movb $4, 17(%rsp)   # Store 4 in &x4leaq 17(%rsp), %rax # Create &x4movq %rax, 8(%rsp)  # Store &x4 as argument 8movl $4, (%rsp)        # Store 4 as argument 7leaq 18(%rsp), %r9   # Pass &x3 as argument 6movl $3, %r8d       # Pass 3 as argument 5leaq 20(%rsp), %rcx # Pass &x2 as argument 4movl $2, %edx         # Pass 2 as argument 3leaq 24(%rsp), %rsi # Pass &x1 as argument 2movl $1, %edi         # Pass 1 as argument 1call procmovslq 20(%rsp), %rdx    # Get x2 and convert to longaddq 24(%rsp), %rdx     # Compute x1+x2movswl 18(%rsp), %eax   # Get x3 and convert to intmovsbl 17(%rsp), %ecx    # Get x4 and convert to intsubl %ecx, %eax      # Compute x3-x4cltq                 # Convert to longimulq %rdx, %rax   # Compute (x1+x2) * (x3-x4)addq $32, %rsp      # Deallocate stack frameret                 # Return
proc:movq 16(%rsp), %raxaddq %rdi, (%rsi) addl %edx, (%rcx) addw %r8w, (%r9) movl 8(%rsp), %edx addb %dl, (%rax) ret

[ABI Conventions] 程序都约定俗成的遵守：
- “Caller Saved” 假设使用到的寄存器的值会被Callee覆写，Caller先保存
  %rax、%rdi、%rsi、%rdx、%rcx、%r8、%r9、%r10、%r11等
- “Callee Saved” 想用寄存器？先入栈保存，ret前先出栈，“物归原主”，Caller畅用寄存器
  %rbx、%12、%13、%14、special { %rbp、%rsp }

“Callee Saved” 的情况较多
《CS:APP（Third.Ed）》英文 P.288

long P(long x, long y)
{long u = Q(y);
long v = Q(x);
return u + v;
}

P:pushq %rbp         # Save %rbp         | Callee-Savedpushq %rbx        # Save %rbxsubq $8, %rsp    # Align stack framemovq %rdi, %rbp # Save x             | Caller-Savedmovq %rsi, %rdi # Move y to first argumentcall Q          # Call Q(y)movq %rax, %rbx # Save resultmovq %rbp, %rdi # Move x to first argumentcall Q            # Call Q(x)addq %rbx, %rax # Add Q(y) to Q(x), believe rbp not changed before & after Q addq $8, %rsp   # Deallocate last part of stackpopq %rbx        # Restore %rbx 注意先进后出，变量出栈反入栈顺序popq %rbp         # Restore %rbp
ret

Illustration of Recursion

视频例题

unsigned long pcount_r(unsigned long x)
{if (x==0)return 0;elsereturn (x & 1) + pcount_r(x >> 1);
}

pcount_r:pushl   %ebx                # *sp = ebx; sp = sp-4;subl   $24, %esp           #  sp = sp-24;movl 32(%esp), %eax      # eax = *(sp+32) = x;testl   %eax, %eax          # jne       L14                 # if(eax != 0) goto L14;
L12:addl    $24, %esp           # sp = sp+24；popl  %ebx                # sp = sp+4; ebx = *sp; ret
L14:movl    %eax, %ebx          # ebx = eax;andl   $1, %ebx            # ebx = ebx & 1;shrl   %eax                # eax >> 1;movl   %eax, (%esp)        # *sp = eax;call   pcount_r            #addl   %ebx, %eax          # eax = eax + { ebx = （x & 1）}# eax 并没有被push，最后一层callee返回时eax = 0jmp       L12
echo "eax即作为输入参数，最in的一层callee中变0后又作为输出暂存，实在是妙啊！！！"

栈帧让每次函数调用能够存储临时变量，寄存器，和返回地址；
栈的先进后出，call / return的变量保护原则，保证层层调用中的数据安全性，除非overflow；
相互递归（mutual recursion，P calls Q，Q calls P）同样适用；

3.6 Machine-Level Programming IV：Data

3.6.1 Array

对于复杂的数据结构，建议拆分用typedef多次嵌套定义，明晰结构

//声明大小为5的数组，元素是函数指针，函数入参为(int)，返回值为int指针
int *(*a[5])(int);
//使用typedef简化声明
typedef int *(*pFun)(int);
pFun a[5];

//声明大小为5的数组，元素是A类函数指针，A类函数入参为B类函数指针，B类函数无入参，无返回值
int *(*b[5])(void(*)(void));
//使用typedef分两步简化声明
typedef void(*pVoidFunc)(void); //定义函数类型B
typedef int *(*pFunc)(pVoidFunc);
pFunc b[5];

注意typedef是存储类关键字(如 static、auto、mutable、register等)

typedef static int STCINT;
>> 编译报错"一个以上的存储类"

汇编程序员期望一种看似高级语言，但又留有汇编层面灵活性、可玩性(技巧层面)，C语言诞生。
之前操作系统都是用汇编写的(=͟͟͞͞=͟͟͞͞(●⁰ꈊ⁰● |||))，Kernighan、Dennis Ritchie等人为实现灵活性，在创造C时引入了指针操作。
在继续探讨指针前需要注意：

int main()
{int *p=NULL;cout << sizeof(p)<<endl; // = 4return 0;
}

int类型的指针大小为4，说明并不是64位地址。使用 gcc -v 查看后醒悟使用的是32位编译器，赶紧切换64位

>> gcc -v
...
Target: x86_64-w64-mingw32
...

int main()
{int *p=NULL;cout << sizeof(p)<<endl;int a[8]={0};cout<<"\nsizeof(a)"<<sizeof(a)<<"\n"                // = 32<<"\nsizeof(a[0])"<<sizeof(a[0])<<"\n"          // = 4  a[0]=*(a+0)<<"\nsizeof(*a)"<<sizeof(*a)<<"\n"<<endl;       // = 4 int b[2][3]={0};cout<<"\nsizeof(b)"<<sizeof(b)<<"\n"                // = 24<<"\nsizeof(b[0])"<<sizeof(b[0])<<"\n"          // = 12 b[0]=*(b+0)<<"\nsizeof(*b)"<<sizeof(*b)<<"\n"              // = 12<<"\nsizeof(b[1][1])"<<sizeof(b[1][1])<<"\n"    // = 4  b[1][1]=*(b[1]+1)<<"\nsizeof(*b[1])"<<sizeof(*b[1])<<"\n"<<endl; // = 4   cout <<a<<"=?="<<&a<<endl;                          // 0x61fdf0=?=0x61fdf0cout <<b[1]<<"=?="<<b[0]<<":"<<b[1]-b[0]<<endl;     // 0x61fddc=?=0x61fdd0:3b[0][1]=1;b[1][0]=2;cout <<*b[1]<<endl;          // *b[1] = 2 = *(*(b+1))，说明 '[]' 优先级> '*'return 0;
}

二维数组的结构 = 数组{数组指针1、数组指针2、…}，而数组指针1指向数组{元素1、元素2、…}，且二维数组是一段地址连续的空间，视频里将这种数组称作 Nested array。

以下举例说明，非直接声明的二维数组，分配的空间地址并不连续，视频里将这种数组称作 Multi-level array。

int get_ele(int arr[3][3],size_t r,size_t c)
{return  arr[r][c];
}
int main()
{int a1[3]={1,2,3},a2[3]={4,5,6},a3[3]={7,8,9};int *(arr[3])={a1,a2,a3};cout<<arr[2]<<"\n"                            // 0x61fdfc<<arr[1]<<"\n"                             // 0x61fe08<<arr[2]-arr[1]<<"\n"   // -3<<(char*)(arr[2])-(char*)(arr[1])<<endl;  // -12int arr2[3][3]={0};cout<<arr2[2]<<"\n"                             // 0x61fdc8<<arr2[1]<<"\n"                            // 0x61fdbc<<arr2[2]-arr2[1]<<"\n"                    // 3<<(char*)(arr2[2])-(char*)(arr2[1])<<endl;  // 12get_ele(arr2,1,2);return 0;
}

get_ele:leaq (%rdx,%rdx,2), %rdx # rdx = rdx + 2*rdx = 3*rdxleaq  0(,%rdx,4), %rax    # rax = 4 * rdx addq   %rax, %rcx          # rcx = rcx + 12 * rmovl  (%rcx,%r8,4), %eax  # eax = *(rcx + j * 4)ret

Nested Array 和 Multi-Level Array 在汇编层面完全不同：

Nested Array 因为空间连续，只需要一次Memory Reference就能拿到元素：
NA[index][digit]=∗(NA+index⋅col⋅sizeof(elem)+digit⋅sizeof(elem))NA[index][digit]=*(NA+index \cdot col \cdot sizeof(elem) + digit \cdot sizeof(elem))NA[index][digit]=∗(NA+index⋅col⋅sizeof(elem)+digit⋅sizeof(elem))

Multi-Level Array 需要两次Memory Reference，第一次拿数组指针，第二次拿元素：
MA[index][digit]=∗(∗(MA+index⋅sizeof(pointer))+digit⋅sizeof(elem))MA[index][digit]=*(*(MA+index \cdot sizeof(pointer)) + digit \cdot sizeof(elem))MA[index][digit]=∗(∗(MA+index⋅sizeof(pointer))+digit⋅sizeof(elem))

3.6.2 Structure

编译器构建空间，处理（一段连续）地址，汇编代码不会体现；
定义决定"域"的先后顺序，不会因为对齐或紧凑而调换位置。

struct A
{int a[4];int i;struct A *next;
};
void set_val(struct A* pA, int val)
{while(pA){pA->a[pA->i]=val;pA=pA->next;}
}

set_val: # rcx := pA, rax := i, edx := val
L7:testq    %rcx, %rcxje        L5movslq    16(%rcx), %rax  # 4 byte value and do sign extensionmovl    %edx, (%rcx,%rax,4)movq 24(%rcx), %rcx  # 注意这里next相对A的起始地址偏移24jmp       L7
L5:ret

注意这里 next 相对 A的起始地址偏移24，是因为数据对齐，i 之后留4空字节(padding bytes)，对齐8字节。现代计算机内存通常一次取64个字节，如果存储对象因为地址没有对齐，横跨两个64字节块，将导致系统花费很多额外的步骤来"拼数据"。x86系统下没有对齐只会导致运行速度变慢，其他系统可能直接就内存错误。

结构体成员大小为 k Bytes，则该成员的起始地址应为 k 的整数倍
结构体“最大”成员 K Bytes，结构体总大小为 K 的整数倍（末尾补空字节）字段

与其声明__attribute__((packed))强制编译器不对齐，不如定义结构体"大"Field在前，"小"Field在后，来减少浪费的 Padding Bytes。
对齐只针对原始数据类型(char、short、int…)，汇编层面不存在“聚合类数据”(数组、结构体…)。

3.6.3 Floating Point

8087 – masterpiece of engineering，单个芯片，具备了实现IEEE浮点数所需的全部硬件，co-developed with IEEE浮点标准x87 FP，但编程模型实在糟糕因此被踢出了教材
SSE FP，special case use of vector instructions
AVX FP，Newest version，similar to SSE
XMM Register
{XXM0、…XXM15}共16个，每个16字节，按需可作为16个char，8个short，4个int，4个float，2个double，1个double long。虽然数据种类不同，但可以将作用这些数据的操作方法合并为一种高级的抽象实现。这些寄存器都是caller-saved。

Scalar Operation
addss = add for scalar single precision
SIMD（single instruction multiple data）Operation
addps = add for pack single precision

整型使用regular registers，浮点型使用XXM registers，当然也可以都使用XXM提高运算速度就是有点浪费。传参时整型与浮点型交错按规矩依次入座

double double_test(float *pd, float Val)
{float x=*(pd);if(Val>x)*pd=x+Val;return x;
}

double_test:movss    (%rcx), %xmm0comiss %xmm0, %xmm1jbe .L6addss    %xmm0, %xmm1movss   %xmm1, (%rcx)
.L6:cvtss2sd    %xmm0, %xmm0ret

3.6 Machine-Level Programming V：Advanced Topics

miscellaneous topics

3.6.1 Memory Layout

目前64位系统只使用了47位地址，约256×1012256 \times 10^{12}256×1012字节约256 Terabytes。
Terabytes << Petabytes << Exabytes(Google累计信息总量) << Zettabyte(全人类信息总量)

HEX Address	Content	note
00007FFFFFFFFFFF	Stack	0x7FFFFFFFFFFF -0x7FFFFF7FFFFF = 2232^{23}223 = 8M
00007FFFFF7FFFFF
	Shared Libraries	Executable machine instructions，read only

	Heap	Dynamically allocated as needed when malloc()、calloc()、new() Address moving up
	Data	Statically allocated data global vars、static vars、const string
	Text	Executable machine instructions，read only
400000

表格自2015年Slider，2020年Slider中，Shared Libraries 处于最高地址，高于Stack。

Cent OS 环境下可使用 ulimit -a 查看全部系统限制：

[root@VM-4-10-centos]# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 14819
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100001
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 14819
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

观察分配地址：

#include <iostream>
using namespace std;
typedef int (*P1)(void);
typedef void (*P2)(void);
int global_arr[20]={0};
int global_var=0;
void stack_frame_obs()
{int local_arr[20]={0};int *pc=(int*)malloc(20);int *pc_last=&pc[20]; cout<<"stack_local_arr:\t"<<&local_arr<<"\n"<<"stack_local_arr_last:\t"<<&local_arr[20]<<"\n"<<"stack_pc:\t"<<pc<<"\n"<<"stack_pc_last:\t"<<pc_last<<endl;return;
}
void memory_obs(void)
{int local_val=0;int local_arr[20]={0};//数组指针强转字符指针，计算最后元素地址int *local_arr_last=&local_arr[20]; int *global_arr_last=&global_arr[20];int *pc=(int*)malloc(20);int *pc_last=&pc[20]; stack_frame_obs();cout<<"local_val:\t"<<&local_val<<"\n"<<"local_arr:\t"<<&local_arr<<"\n"<<"local_arr_last:\t"<<local_arr_last<<"\n"<<"pc:\t"<<pc<<"\n"<<"pc_last:\t"<<pc_last<<"\n"<<"global_var:\t"<<&global_var<<"\n"<<"global_arr:\t"<<global_arr<<"\n"<<"global_arr_last:\t"<<global_arr_last<<endl;return;
}
int main()
{memory_obs();P1 pfunc1=main;P2 pfunc2=memory_obs;cout<<"Main:\t"<<(void *)pfunc1<<endl;cout<<"Memory_obs:\t"<<(void *)pfunc2<<endl;return 0;
}

Cent OS 结果

[root@VM-4-10-centos]# ./a.out
stack_local_arr:        0x7ffe44a783d0
stack_local_arr_last:   0x7ffe44a78420  // 栈地址始终高于堆地址
stack_pc:               0x8abf10        // 地址高于pc，堆按需分配，地址递增
stack_pc_last:          0x8abf60
local_val:              0x7ffe44a7849c
local_arr:              0x7ffe44a78440  // 地址高于stack_local_arr，栈帧地址递减
local_arr_last:         0x7ffe44a78490
pc:                     0x8abeb0        // 堆内、栈帧内数组元素地址递增
pc_last:                0x8abf00        // < stack_pc= 0x8abf10
global_var:             0x6021d0
global_arr:             0x602180
global_arr_last:        0x6021d0
Main:                   0x400bde
Memory_obs:             0x4009f1        //text 可执行指令始终处于最低地址

Win64 环境下堆地址居然高于栈地址？Whatever

>>PS C:Users> .\a.exe
stack_local_arr:        0x61fcb0
stack_local_arr_last:   0x61fd00
stack_pc:               0xec1680
stack_pc_last:          0xec16d0
local_val:              0x61fdac
local_arr:              0x61fd50
local_arr_last:         0x61fda0
pc:                     0xec1620
pc_last:                0xec1670
global_var:             0x408090
global_arr:             0x408040
global_arr_last:        0x408090
Main:                   0x40182f
Memory_obs:             0x401656

3.6.2 Buffer Overflow

Exceeding the memory size allocated for an array，potentially that risk of being a vulnerability.
Most come from (culprit)

Unchecked lengths on string inputs，worst one ⇒ gets()
Particularly for bounded character arrays on stack

gets()编写于1970s，UNIX刚发行，那时人们还不怎么考虑安全问题。

// kind of implementation of Unix function gets()
char *gets(char *dest)
{int c = getchar(); //EOF 应该是整型，char可能不够大char *p = dest;while(c! = EOF && c !='\n'){*p++ = c;c = getchar();}*p='\0';return dest;
}

Others like strcpy、strcat、scanf(%s)、sscanf、fscanf，they all have no idea what limit is on number of characters to read. Typically，return address should be overwrite first.

Code Injection Attacks

Input sting contains byte representation of executable code
Overwrite return address A with address of buffer B
When callee return，exploit code injected within gets() will be executed.

二进制层面的注入，与SQL数据库注入不同。

Original “Morris worm” (1988)

finger user@host
finger 命令使用 gets() 接收信息
finger “exploit-code padding new-return-address”，exploit-code = excuted root shell on victim machine with a direct TCP connection to the attacker.
CERT computer emergency response team 就此成立并安家CMU

“IM wars” (1999)

AOL 聊天软件客户端存在注入漏洞，AOL注入测试PC是不是Microsoft平台，达到 Block MS 的目的，More than 10 skirmishes between MS and AOL

Twilight hack on wii (2000s)
…

Worms and Viruses

Worm
- Run by itself
- Propagate a fully working version of itself to other computers
Virus
- Adds itself to other programs
- Does not run independently，work as changing behavior of program

Protection

Avoid overflow vulnerabilities
- fgets() instead of gets()
- strncpy() instead of strcpy()
- scanf(“%ns”) instead of scanf(“%s”)
Employ system-level protections
- ASLR，Address Space Layout Randomization，随机分配栈大小
- Nonexecutable code segments，硬件工程师配合实现，显式指定内存段可执行权限（AMD 先行，Intell跟上）
Stack Canaries (使用这种策略都可以叫 “xx金丝雀”，名字来源美国早期煤矿工人带雀下矿)
- 栈上缓冲区(Buffer)尾部接常量，ret前检查这个常量是否被改动
- GCC implementation，-fstack-protector（default，但gcc 8.4.1实验中没有发现Canary保护）

Return Oriented Programming

首先说明，这种攻击方式依旧无法破解Canary校验，但可以避开ASLR和堆栈执行权限限制。利用被攻击者代码，如stdlib库函数，地址相对确定，内容相对确定。核心思想：找 Gadget，拼接一系列 gadget 指令序列，完成完整的攻击任务。

假设 ret_orit 是一个库函数

int ret_orit(int a,int b)
{return a+b;
}

0000000000401596 <ret_orit>:401596:    8d 04 11                lea    (%rcx,%rdx,1),%eax401599:    c3                      retq

Gadget address = 0x401596，完成了 %eax = %rcx + %rdx 动作。
有趣的是，在X86架构中，ret指令以 0xc3 结尾，那就很容找到这些片段的位置了。
假设我们始终取 0xc3 的前三个字节凑指令：

void ret_orit(int *p)
{*p=0x11048d22;return;
}

0000000000401596 <ret_orit>:401596:    c7 01 22 8d 04 11       movl   $0x11048d22,(%rcx)40159c:    c3                      retq

Gadget address = 0x401599，三个字节0x8d、0x04、0x11同样完成了 %eax = %rcx + %rdx 。

“Just match the byte patterm of some existing code.”

有了前两种方法，剩下只需要组合这些 Gadget：

Address	Content
stack	address of Gadget n code
…	…
%rsp	address of Gadget 1 code (used to be callee return address)
通过缓冲区溢出，将callee return address 及其后的所有地址，依次替换为 Gadget 的地址，则跳转执行 Gadget 命令后，Gadget 最后的 ret 指令又使得 %rip 从 %rsp - 8 取下一条 Gadget 的地址，再 ret，再跳转 … 直到完成攻击。

3.6.3 Unions

A way to ceate an alias that will let you refrence memory in different ways.
联合体并不改变实际位，只改变解读位的方式。

#include <iostream>
#include <stdio.h>
using namespace std;
typedef union{int a;float b;
}i_a_f;
int main()
{i_a_f t;t.a=1;float b=t.a;cout<<"\nunion(int):"<<t.a   // union(int):1<<"\nunion(float):"<<t.b// union(float):1.4013e-45<<"\ncast:"<<b;            // cast:1return 0;
}

通过 Union 很容易了解到机器的 Byte Ordering
Big Endian 最大的在尾端（地址最低）
Little Endian 最小的在尾端（地址最低）x86、ARM、IOS
Bi Endian 大小端都行

CMU 15-213 CSAPP (Ch1~Ch3)相关推荐

CMU 15-213 CSAPP (Ch5~Ch7)
CMU 15-213 CSAPP (Ch1~Ch3) CMU 15-213 CSAPP (Ch5~Ch7) CMU 15-213 CSAPP (Ch8) CMU 15-213 CSAPP (Ch9) ...
美国计算机名校例如MIT ，CMU等招牌经典公开课程
本篇博客主要来自于知乎上的经典问题: 美国计算机名校例如MIT ,CMU ,有哪些公认的好课并且有课程讲义的,适合国内学生自学的? https://www.zhihu.com/question/575 ...
MIT CMU CS系列课程
1. Distributed Systems(MIT)(快要系统设计了): https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLr ...
rman 备份脚本之总结分析
rman 备份脚本之总结分析脚本一: run{ allocate channel ch1 device type disk; allocate channel ch2 device type dis ...
Golang Devops项目开发(1)
1.1 GO语言基础 1 初识Go语言 1.1.1 开发环境搭建参考文档:<Windows Go语言环境搭建> 1.2.1 Go语言特性-垃圾回收 a. 内存自动回收,再也不需要开发人员 ...
在哪儿能找c语言编程题,C语言程序设计的试题及答案
大家在考程序员时,C语言程序设计大家有了解吗?下面小编为大家分享了,供大家参考. 第一章基础知识一.填空 1. 每个 C 程序都必须有且仅有一个________ 函数. 2. C 语言程序开发到执 ...
计算机专业大学四年应该怎么过才有意义？
写在前面本文章已在知乎获得9104收藏+1845喜欢+1557点赞,并被知乎官方收录,是知乎热门内容. 全文共计1万+字,预计阅读15分钟. 华科&阿里学长的血泪建议! 一句话总结:打牢基础 ...
FT计算机系统,芯片CP/FT测试的基本概念理解
进程与线程理解---进程与线程的一个简单解释进程与线程的简单解释进程(process)和线程(thread)是操作系统的基本概念,但是它们比较抽象,不容易掌握. 最近,我读到一篇材料,发现有一个很 ...
python编程入门指南-编程入门指南
编程入门指南 ----------------------------------------------- 编程入门指南 v1.5 --- https://zhuanlan.zhihu.com/p/ ...

CMU 15-213 CSAPP (Ch1~Ch3)