CMU CSAPP: Decoding Lab

This project is an exercise from the CSAPP course at CMU (Carnegie Mellon University).

Original description

You have just intercepted an encoded message. The message is a sequence of bits which reads as follows in hexadecimal:

6363636363636363724646636F6D6F72
466D203A65693A7243646E206F54540A
5920453A54756F0A6F6F470A21643A6F
594E2020206F776F797275744563200A
6F786F686E6963736C206765796C656B
2C3365737420346E20216F74726F5966
7565636F202061206C61676374206C6F
20206F74747865656561727632727463
6E617920680A64746F69766120646E69
21687467630020656C6C786178742078
6578206F727478787863617800783174

You have no idea how to decode it, but you know that your grade depends on it, so you are willing to do anything to extract the message. Fortunately, one of your many agents on the field has stolen the source code for the decoder. This agent (007) has put the code and the message in the file secret.cpp, which you can download from the laboratory of your technical staff (Q).

Q has noticed that the decoder takes four integers as arguments. Executing the decoder with various arguments seems to either crash the program or produce unintelligible output. It seems that the correct four integers have to be chosen in order for the program to produce the decoded message. These four integers are the "secret keys."

007 has been unable to find the keys, but from the desk of the encrypting personnel he was able to cunningly retrieve the first five characters of the unencoded message. These characters are:

From:

Assignment

Your assignment is to decode the message, and find the keys.

Reminders

This exercise is not extremely difficult. However, the strategy of trying things until something works will be ineffective. Try to understand the material in the course, particularly the following:

Memory contains nothing but bits. Bits are interpreted as integers, characters, or instructions by the compiler, but they have no intrinsic type in memory.
The compiler can be strong-armed into interpreting integers as characters, or even as instructions, and vice versa.
Every group of 8 bits (a byte) has an address.
A pointer in C is merely a stored memory address.
The activation records for each function call are all together in memory, and they are organized in a stack that grows downwards and shrinks upwards on function calls and returns respectively.
The return address of one function as well as the addresses of all of its local variables are allocated within one activation record.

Strategy

The designers of this decoder weren't very good. They made it possible for us to attack the keys in two independent parts. Try to break the first two keys first, and do not try to break the third and fourth keys until you have succeeded with the first two.

You can do the first part by specifying only two integer arguments when you execute the decoder. If you get the first and second keys right, a message that starts with From: will appear. This message is not the true message, but a decoy. It is useful, however, to let you know that you have indeed broken the first two keys.

In breaking the first two keys, realize that the function process_keys12 must be somehow changing the value of the dummy variable. This must be so, because the variables start and stride control the extraction of the message, and they are calculated from the value of dummy.

In breaking the third and fourth keys, try to get the code to invoke extract_message2 instead of extract_message1. This modification must somehow be controlled from within the function process_keys34.

Files

When you are done, write a brief report that includes at least the following:

The secret message.
The secret keys.
One paragraph describing, in your own prose, what process_keys12 does. For example, you might say that it modifies a specific program variable.
The meaning of the first two keys in terms of variables and addresses in the decoder program. For example, you might describe key2 by saying that its X-Y bits contain the value to which variable start is set. Or you might describe key1 by saying, for example, that it must be set equal to the number of memory addresses separating the address of two specific variables. These are only examples.
One paragraph describing, in your own prose, what process_keys34 does.
One paragraph describing the line of source code that is executed when the first call to process_keys34 returns.
The meaning of the third and fourth keys in terms of variables and addresses in the decoder program.

Be precise, clear, and brief in each of the points above. Your report should not, in any case, be longer than one page. Do not get frustrated if this takes a little longer than you expected: brief and clear text often requires more time to write than rambling prose.

Your teacher can tell you what word processors you may use to write your report. Chances are that you can write your report in a number of formats, and for simplicity's sake, you might even want to write it using Notepad.

Enjoy!

Detailed walkthrough

Goal: find the values of the four keys, key1 through key4.

1. Analyzing start, stride, and dummy

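Let's start by looking at the code:

Next, the bodies of the two message-extraction functions:

Both extraction functions do the same job: they copy selected bytes of the data array into the message array and return message, which main then stores in msg1 or msg2.

So the first step in breaking the code is to work out the values of start and stride.

From this line of code:

we can see that start and stride are both derived from dummy, which is an int: start is the value of the byte at dummy's lowest address, and stride is the value of the byte at the next address up. On a little-endian machine such as x86, those are the two least significant bytes of dummy. (The address of dummy, a four-byte int, is cast to char *, so dereferencing it reads a single byte rather than the whole int. The "+1" is ordinary pointer arithmetic: for a pointer p and an integer n, p + n points n elements past p, where an element has the size of the type p points to. Because the pointer here is a char *, +1 moves forward exactly one byte.)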
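A minimal sketch of that byte-level read, assuming a little-endian machine such as x86. The helper name byte_of is made up for illustration and is not part of the lab source; it only mirrors the cast described above.

```c
/* Return the byte at offset `index` inside the int that `p` points to.
 * The int's address is viewed as a char *, so each step of `index`
 * moves one byte, not one int.
 * byte_of is a hypothetical helper, not a function from the lab. */
static int byte_of(const int *p, int index) {
    return (int)(*(((const char *)p) + index));
}
```

With int dummy = 0x00000309, byte_of(&dummy, 0) plays the role of start and byte_of(&dummy, 1) the role of stride. On a big-endian machine the two bytes would come from the other end of the int.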

So the problem reduces to finding the value that dummy must have.

2. Solving for key1

At this point, let's look at the hint the exercise gives us:

In breaking the first two keys, realize that the function process_keys12 must be somehow changing the value of the dummy variable. This must be so, because the variables start and stride control the extraction of the message, and they are calculated from the value of dummy.

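With that hint in mind, let's look at the function process_keys12:

It is not hard to see what this function does:

It stores the value of key2 at the address that lies (value of key1 × 4) bytes away from the address held in key1. In other words the offset is key1 ints, because pointer arithmetic on an int * counts in units of sizeof(int).

(Inside the function, key1 is the address of main's key1, and *key1 is the value of main's key1.)

The hint states that process_keys12 somehow changes the value of dummy, so the pointer that the assignment writes through must be the address of dummy, namely: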

((int *) (key1 + *key1))

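The function changes the value of dummy by writing through this pointer; the address of dummy itself never changes.

Looking at the figure below:

we can see that dummy sits three ints above key1 in memory.

(When several local variables are declared in a row and the compiler lays them out contiguously on the stack, their addresses are adjacent. Because the stack grows from high addresses to low ones, dummy, which is declared first, gets the higher address, and key1, which is declared later, gets a lower one. That is why the address of dummy is (int *) (key1 + *key1), where *key1 is the value of the variable key1.)

So the value of key1 is the distance between dummy and key1 measured in ints, which gives key1 = 3.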
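The same arithmetic can be tried safely on an array standing in for main's stack frame. This is only a sketch: the array layout (key1 in slot 0, dummy three ints above it), the initial value of dummy, and the helper run_keys12 are assumptions made for illustration, and a real compiler is free to arrange locals differently.

```c
/* The assignment quoted in this write-up for process_keys12. */
static void process_keys12(int *key1, int *key2) {
    *((int *)(key1 + *key1)) = *key2;
}

/* Hypothetical stand-in for main's locals, lowest address first:
 * slot 0 = key1, slot 1 = stride, slot 2 = start, slot 3 = dummy.
 * key1_value must stay in 0..3 so the write lands inside the array. */
static int run_keys12(int key1_value, int key2_value) {
    int frame[4] = { key1_value, 0, 0, 1 };  /* dummy starts at an arbitrary 1 */
    int key2 = key2_value;
    process_keys12(&frame[0], &key2);
    return frame[3];                          /* value of "dummy" afterwards */
}
```

With key1 = 3 the write lands on the dummy slot; with any other in-range value it lands on a different slot and dummy keeps its initial value.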

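An easier way is to use the debugger to read off the addresses of dummy and key1 and take their difference, which gives key1 directly. With Visual Studio's default build options, however, the local variables turn out not to be contiguous in memory; in that case you can change the default build options.

With the default build options left unchanged, key1 comes out as 9:

3. Solving for key2

Let's look at the hint first: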

007 has been unable to find the keys, but from the desk of the encrypting personnel he was able to cunningly retrieve the first five characters of the unencoded message. These characters are:

From:

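So the first five characters of the decoded message are From:

Next, let's work out what start and stride actually mean:

The function treats each int of the int array data as four characters (an int occupies four bytes, a char one byte), copies some of the resulting characters into the array message, and stops when it reads a '\0' character.

The analysis shows:

start: reading begins at index start + 1 of the data array viewed as a char array.

stride: when stride > 1, the function reads stride - 1 characters and then skips one. (In the inner loop k is incremented on every pass; once k >= stride the inner loop ends and the outer loop's j++ runs, which skips one character without reading it.) This repeats until a '\0' is read. When stride = 1 nothing is read at all, because the inner loop condition k < stride fails immediately; stride must be greater than 1 for the first read to happen.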

So let's print the data array from start to finish as a char array (that is, dump every byte):

which gives the following output:

cccccccccFFrromo: mFr:ie ndC
TTo:E Y
ouT
Gooo:d!  NYowo turycEhoxoscineg lkelyse3,n4 tto! fYoroceu a  cgalol tto  eextvraectr2 yantd
havioind gth!e caxllx txo xexxtrxacxt1x
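For reference, a dump like the one above can be reproduced with a helper along the following lines. This is a sketch rather than the code used for the output shown: words_to_chars and first_words are illustrative names, and the bytes are extracted with shifts so that the little-endian order of x86 is reproduced on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand 32-bit words into bytes, least significant byte first, which is
 * how x86 lays an int out in memory. `out` needs 4 * n + 1 bytes. */
static void words_to_chars(const uint32_t *words, size_t n, char *out) {
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 4; b++) {
            out[4 * i + b] = (char)((words[i] >> (8 * b)) & 0xFF);
        }
    }
    out[4 * n] = '\0';
}

/* The first seven words of the intercepted message. */
static const uint32_t first_words[7] = {
    0x63636363, 0x63636363, 0x72464663, 0x6F6D6F72,
    0x466D203A, 0x65693A72, 0x43646E20
};
```

Calling words_to_chars(first_words, 7, out) with a 29-byte buffer fills it with the first line of the dump, cccccccccFFrromo: mFr:ie ndC.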

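From the first line we can see that the target From: is made up of Fr (the 11th and 12th characters), om (the 14th and 15th), and : (the 17th).

That is, two characters are read and then one is skipped, so k goes up to at most 2 and stride = 3. Reading begins at index 10 (11 - 1), so start + 1 = 10 and start = 9.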
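The read-two-skip-one pattern can be checked against the first line of the dump with a small sketch of the extraction loop as described above. The name extract_sketch, the len bound, and the stride < 2 guard are additions for safety and illustration; they are not taken from the lab source.

```c
#include <stddef.h>

/* Start at index start + 1, copy stride - 1 bytes, skip one, and repeat
 * until a '\0' is read or the end of the buffer is reached.
 * Assumes start >= 0 and that `out` can hold at least len bytes. */
static void extract_sketch(const char *bytes, size_t len,
                           int start, int stride, char *out) {
    size_t i = 0;
    size_t j = (size_t)start + 1;
    int done = (stride < 2);   /* nothing is read when stride < 2 */
    while (!done) {
        for (int k = 1; k < stride; k++, j++, i++) {
            if (j >= len || bytes[j] == '\0') {
                done = 1;
                break;
            }
            out[i] = bytes[j];
        }
        j++;                   /* the skipped byte */
    }
    out[i] = '\0';
}
```

Run on the string cccccccccFFrromo: mFr:ie ndC with start = 9 and stride = 3, this yields From: Friend, which begins with the five known characters.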

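Combining this with the earlier analysis:

start is the value of the byte at dummy's lowest address, and stride is the value of the byte at the next address up.

So start corresponds to the byte 0x09 and stride to the byte 0x03, which means dummy must be 0xXXXX0309 in hexadecimal (four bytes, because dummy is an int).

What about the upper four hex digits? Only the two lowest bytes of dummy are ever read, so the upper two bytes do not affect start or stride and any value works. The simplest choice is XXXX = 0000, which gives dummy = 0x00000309, or 777 in decimal. So we pass 9 (the key1 value obtained with the Visual Studio build) and 777 as the command-line arguments:

which produces the following result:

As the analysis above showed: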

*((int *) (key1 + *key1)) = *key2;  // (int *) (key1 + *key1) is the address of dummy

That is, key2 is simply the value that dummy has to take, written in decimal, so key2 = 777.
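The packing of start and stride into key2 can be written out as a one-line sketch. make_key2 is a hypothetical helper, and leaving the upper two bytes at zero is just the simplest choice.

```c
/* Put start in the lowest byte and stride in the next byte of an int,
 * leaving the two upper bytes zero. */
static int make_key2(int start, int stride) {
    return ((stride & 0xFF) << 8) | (start & 0xFF);
}
```

For start = 9 and stride = 3 this gives 0x0309, which is 3 × 256 + 9 = 777.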

4. Solving for key3

Let's move on to the next hint:

In breaking the third and fourth keys, try to get the code to invoke extract_message2 instead of extract_message1. This modification must somehow be controlled from within the function process_keys34.

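Let's look at this part of main:

The call to extract_message2 sits inside an if statement that is entered only when *msg1 == '\0', that is, when the message returned by extract_message1 is empty. Let's see how that condition could be met.

Could we meet it by changing start and stride?

Let's look at the contents of the data array:

The only zero bytes are the two terminators, one inside the word 0x63002065 and one at the top of the final word 0x00783174. Pointing start + 1 at one of them would make msg1 empty, but it would also throw away the start and stride values that decode the message, so changing start and stride cannot be the answer. What else can we do?

Let's analyze this function:

It is not hard to see that extract_message2 also treats each int of the array as four characters, copies some of them into the array message, and stops when it reads a '\0' character. (start: reading begins at index start of the char array; stride: one character is read out of every stride, skipping the stride - 1 characters in between.)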
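A sketch of that every-stride-th-character read, under the same caveats as before: extract2_sketch is an illustrative name, and the len bound and the stride < 1 guard are safety additions that the description above does not mention.

```c
#include <stddef.h>

/* Start at index `start` and copy one byte out of every `stride`
 * until a '\0' is read or the end of the buffer is reached.
 * Assumes start >= 0 and that `out` can hold at least len bytes. */
static void extract2_sketch(const char *bytes, size_t len,
                            int start, int stride, char *out) {
    size_t i = 0;
    if (stride >= 1) {
        for (size_t j = (size_t)start;
             j < len && bytes[j] != '\0';
             j += (size_t)stride) {
            out[i++] = bytes[j];
        }
    }
    out[i] = '\0';
}
```

Applied to the first line of the dump shown earlier, cccccccccFFrromo: mFr:ie ndC, with start = 9 and stride = 3, it picks indices 9, 12, 15, 18, 21, 24 and 27 and produces From: C, so the same two values also line up with the known prefix under this second reading rule.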

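The hint says explicitly to get the code to invoke extract_message2 instead of extract_message1. So the extract_message2 call inside the if must run and the extract_message1 call before the if must not, yet we cannot satisfy the if condition. The way out is to overwrite a return address so that execution jumps straight to the statement that calls extract_message2. In other words, we need to change the return address of this function:

So let's look at process_keys34:

This is very similar to how dummy was modified earlier. Since our goal is to change the function's return address, the pointer that its assignment writes through must be the address of the stack slot holding that return address:

(((int *)&key3) + *key3)

Let's break this expression down:

It takes the address of the parameter key3.
It casts that address to int * and adds a constant (the value of key3), which moves it to another address in memory.
It then modifies the value stored at that address.

What does this have to do with the function's return address? To answer that we need the layout of a function's call stack:

So the call stack for this function looks like this:

The parameter key3 sits exactly one int-sized slot above the return address, so together with (((int *)&key3) + *key3) this gives key3 = -1.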
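The effect can be modeled without touching a real stack by using an array as a stand-in for the top of process_keys34's frame. Everything here is an assumption made for illustration: the slot order follows the usual 32-bit x86 cdecl layout (return address directly below the first parameter), the update is assumed to add key4 to the selected slot as the key4 section describes, and patched_return is a hypothetical helper.

```c
/* Hypothetical model of the frame, lowest address first:
 * slot 0 = return address, slot 1 = parameter key3, slot 2 = parameter key4.
 * key3_value must stay in -1..1 so the write lands inside the array. */
static int patched_return(int ret_addr, int key3_value, int key4_value) {
    int frame[3] = { ret_addr, 0, 0 };
    int *addr_of_key3_param = &frame[1];   /* plays the role of (int *)&key3 */
    *(addr_of_key3_param + key3_value) += key4_value;
    return frame[0];                        /* return address afterwards */
}
```

Only key3 = -1 moves the pointer down onto the return-address slot; with key3 = 0 the write hits the parameter itself and the return address is left alone.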

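5. Solving for key4

key4 is easier: the function shifts its own return address by key4, so we only need the original return address and the one we want, and the difference between the two is key4.

So let's look at the disassembly: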

if (key3 != 0 && key4 != 0) {
00D61C6C  cmp         dword ptr [key3],0
00D61C70  je          main+148h (0D61C88h)
00D61C72  cmp         dword ptr [key4],0
00D61C76  je          main+148h (0D61C88h)
    process_keys34(&key3, &key4);
00D61C78  lea         eax,[key4]
00D61C7B  push        eax
00D61C7C  lea         ecx,[key3]
00D61C7F  push        ecx
00D61C80  call        _process_keys34 (0D61217h)
00D61C85  add         esp,8
}
msg1 = extract_message1(start, stride);
00D61C88  mov         eax,dword ptr [stride]
00D61C8B  push        eax
00D61C8C  mov         ecx,dword ptr [start]
00D61C8F  push        ecx
00D61C90  call        _extract_message1 (0D6136Bh)
00D61C95  add         esp,8
00D61C98  mov         dword ptr [msg1],eax
if (*msg1 == '\0') {
00D61C9B  mov         eax,dword ptr [msg1]
00D61C9E  movsx       ecx,byte ptr [eax]
00D61CA1  test        ecx,ecx
00D61CA3  jne         main+19Bh (0D61CDBh)
    process_keys34(&key3, &key4);
00D61CA5  lea         eax,[key4]
00D61CA8  push        eax
00D61CA9  lea         ecx,[key3]
00D61CAC  push        ecx
00D61CAD  call        _process_keys34 (0D61217h)
00D61CB2  add         esp,8
    msg2 = extract_message2(start, stride);
00D61CB5  mov         eax,dword ptr [stride]
00D61CB8  push        eax
00D61CB9  mov         ecx,dword ptr [start]

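The first call to process_keys34 would normally return to 0x00D61C85, and we want it to return to 0x00D61CB2 instead, the instruction right after the second call and just before extract_message2 is invoked. So: key4 = 0x00D61CB2 - 0x00D61C85 = 0x2D = 45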
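The subtraction is easy to get backwards, so here it is as a small sketch. return_offset is a hypothetical helper, and the two addresses come from the listing above; they will differ from build to build.

```c
/* Number of bytes to add to the original return address so that the
 * function returns to target_ret instead. */
static int return_offset(unsigned int original_ret, unsigned int target_ret) {
    return (int)(target_ret - original_ret);
}
```

With original_ret = 0x00D61C85 and target_ret = 0x00D61CB2 the result is 0x2D, that is, 45.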

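At this point we have all four keys:

key1 = 9, key2 = 777, key3 = -1, key4 = 45 (the values for the Visual Studio build)

Finally, let's see what the secret message actually is:

Done! We have successfully completed the exercise.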

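Summary

Overall, this CMU exercise is fairly challenging. It requires the following key knowledge:

C, well enough to read code that relies on pointers and casts
Basics of assembly and disassembly
The layout of the function call stack
How and where variables are stored while a program runs (heap and stack)
Fluency with development tools such as Visual Studio
Solid logical reasoning and English reading skills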