Computer Organization and Architecture (Designing for Performance), Eighth Edition: Complete Solutions

SOLUTIONS MANUAL
COMPUTER ORGANIZATION AND ARCHITECTURE: DESIGNING FOR PERFORMANCE, EIGHTH EDITION
WILLIAM STALLINGS
Copyright 2009: William Stallings

© 2009 by William Stallings. All rights reserved. No part of this document may be reproduced, in any form or by any means, or posted on the Internet, without permission in writing from the author. Selected solutions may be shared with students, provided that they are not available, unsecured, on the Web.

NOTICE
This manual contains solutions to the review questions and homework problems in Computer Organization and Architecture, Eighth Edition. If you spot an error in a solution or in the wording of a problem, I would greatly appreciate it if you would forward the information via email to ws@. An errata sheet for this manual, if needed, is available at . File name is S-COA8e-mmyy W.S.

TABLE OF CONTENTS
Chapter 1 Introduction
Chapter 2 Computer Evolution and Performance
Chapter 3 Computer Function and Interconnection
Chapter 4 Cache Memory
Chapter 5 Internal Memory
Chapter 6 External Memory
Chapter 7 Input/Output
Chapter 8 Operating System Support
Chapter 9 Computer Arithmetic
Chapter 10 Instruction Sets: Characteristics and Functions
Chapter 11 Instruction Sets: Addressing Modes and Formats
Chapter 12 Processor Structure and Function
Chapter 13 Reduced Instruction Set Computers
Chapter 14 Instruction-Level Parallelism and Superscalar Processors
Chapter 15 Control Unit Operation
Chapter 16 Microprogrammed Control
Chapter 17 Parallel Processing
Chapter 18 Multicore Computers
Chapter 19 Number Systems
Chapter 20 Digital Logic
Chapter 21 The IA-64 Architecture
Appendix B Assembly Language and Related Topics

CHAPTER 1 INTRODUCTION

ANSWERS TO QUESTIONS

1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical execution of a program.
Computer organization refers to the operational units and their interconnections that realize the architectural specifications. Examples of architectural attributes include the instruction set, the number of bits used to represent various data types (e.g., numbers, characters), I/O mechanisms, and techniques for addressing memory. Organizational attributes include those hardware details transparent to the programmer, such as control signals; interfaces between the computer and peripherals; and the memory technology used.

1.2 Computer structure refers to the way in which the components of a computer are interrelated. Computer function refers to the operation of each individual component as part of the structure.

1.3 Data processing; data storage; data movement; and control.

1.4 Central processing unit (CPU): Controls the operation of the computer and performs its data processing functions; often simply referred to as processor. Main memory: Stores data. I/O: Moves data between the computer and its external environment. System interconnection: Some mechanism that provides for communication among CPU, main memory, and I/O. A common example of system interconnection is by means of a system bus, consisting of a number of conducting wires to which all the other components attach.

1.5 Control unit: Controls the operation of the CPU and hence the computer. Arithmetic and logic unit (ALU): Performs the computer's data processing functions. Registers: Provide storage internal to the CPU. CPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers.

CHAPTER 2 COMPUTER EVOLUTION AND PERFORMANCE

ANSWERS TO QUESTIONS

2.1 In a stored program computer, programs are represented in a form suitable for storing in memory alongside the data. The computer gets its instructions by reading them from memory, and a program can be set or altered by setting the values of a portion of memory.
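The stored-program idea in 2.1 can be illustrated with a toy simulator in which instructions and data share one memory. This is a minimal sketch with a hypothetical three-instruction machine, not the IAS itself; the opcode names are invented for illustration:

```python
# Toy stored-program machine: instructions and data live in the same
# memory, and the machine fetches each instruction from that memory.
# Hypothetical instruction set (not the IAS): LOAD, ADD, STOR, HALT.

def run(memory):
    ac, pc = 0, 0                      # accumulator and program counter
    while True:
        op, addr = memory[pc]          # fetch the next instruction word
        pc += 1
        if op == "LOAD":               # execute
            ac = memory[addr]
        elif op == "ADD":
            ac += memory[addr]
        elif op == "STOR":
            memory[addr] = ac
        elif op == "HALT":
            return memory

# Program in cells 0-3, data in cells 4-6:
# compute memory[6] = memory[4] + memory[5]
mem = {0: ("LOAD", 4), 1: ("ADD", 5), 2: ("STOR", 6), 3: ("HALT", 0),
       4: 3, 5: 2, 6: 0}
run(mem)
print(mem[6])  # 5
```

Because the program is itself data in memory, it "can be set or altered by setting the values of a portion of memory," exactly as the answer states.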
2.2 A main memory, which stores both data and instructions; an arithmetic and logic unit (ALU) capable of operating on binary data; a control unit, which interprets the instructions in memory and causes them to be executed; and input and output (I/O) equipment operated by the control unit.

2.3 Gates, memory cells, and interconnections among gates and memory cells.

2.4 Moore observed that the number of transistors that could be put on a single chip was doubling every year and correctly predicted that this pace would continue into the near future.

2.5 Similar or identical instruction set: In many cases, the same set of machine instructions is supported on all members of the family. Thus, a program that executes on one machine will also execute on any other. Similar or identical operating system: The same basic operating system is available for all family members. Increasing speed: The rate of instruction execution increases in going from lower to higher family members. Increasing number of I/O ports: In going from lower to higher family members. Increasing memory size: In going from lower to higher family members. Increasing cost: In going from lower to higher family members.

2.6 In a microprocessor, all of the components of the CPU are on a single chip.

ANSWERS TO PROBLEMS

2.1 This program is developed in [HAYE98]. The vectors A, B, and C are each stored in 1,000 contiguous locations in memory, beginning at locations 1001, 2001, and 3001, respectively. The program begins with the left half of location 3. A counting variable N is set to 999 and decremented after each step until it reaches –1. Thus, the vectors are processed from high location to low location.
Location  Instruction        Comments
0         999                Constant (count N)
1         1                  Constant
2         1000               Constant
3L        LOAD M(2000)       Transfer A(I) to AC
3R        ADD M(3000)        Compute A(I) + B(I)
4L        STOR M(4000)       Transfer sum to C(I)
4R        LOAD M(0)          Load count N
5L        SUB M(1)           Decrement N by 1
5R        JUMP+ M(6, 20:39)  Test N and branch to 6R if nonnegative
6L        JUMP M(6, 0:19)    Halt
6R        STOR M(0)          Update N
7L        ADD M(1)           Increment AC by 1
7R        ADD M(2)
8L        STOR M(3, 8:19)    Modify address in 3L
8R        ADD M(2)
9L        STOR M(3, 28:39)   Modify address in 3R
9R        ADD M(2)
10L       STOR M(4, 8:19)    Modify address in 4L
10R       JUMP M(3, 0:19)    Branch to 3L

2.2 a.
Opcode    Operand
00000001  000000000010

b. First, the CPU must access memory to fetch the instruction. The instruction contains the address of the data we want to load. During the execute phase, the CPU accesses memory again to load the data value located at that address, for a total of two trips to memory.

2.3 To read a value from memory, the CPU puts the address of the value it wants into the MAR. The CPU then asserts the Read control line to memory and places the address on the address bus. Memory places the contents of the memory location passed on the data bus. This data is then transferred to the MBR. To write a value to memory, the CPU puts the address of the value it wants to write into the MAR. The CPU also places the data it wants to write into the MBR. The CPU then asserts the Write control line to memory and places the address on the address bus and the data on the data bus. Memory transfers the data on the data bus into the corresponding memory location.

2.4 (two instructions per word, left then right)
Address  Contents
08A      LOAD M(0FA); STOR M(0FB)
08B      LOAD M(0FA); JUMP +M(08D)
08C      LOAD –M(0FA); STOR M(0FB)
08D      (next instruction)

This program will store the absolute value of the content at memory location 0FA into memory location 0FB.

2.5 All data paths to/from MBR are 40 bits. All data paths to/from MAR are 12 bits. Paths to/from AC are 40 bits. Paths to/from MQ are 40 bits.

2.6 The purpose is to increase performance.
When an address is presented to a memory module, there is some time delay before the read or write operation can be performed. While this is happening, an address can be presented to the other module. For a series of requests for successive words, the maximum rate is doubled.

2.7 The discrepancy can be explained by noting that other system components aside from clock speed make a big difference in overall system speed. In particular, memory systems and advances in I/O processing contribute to the performance ratio. A system is only as fast as its slowest link. In recent years, the bottlenecks have been the performance of memory modules and bus speed.

2.8 As noted in the answer to Problem 2.7, even though the Intel machine may have a faster clock speed (2.4 GHz vs. 1.2 GHz), that does not necessarily mean the system will perform faster. Different systems are not comparable on clock speed. Other factors such as the system components (memory, buses, architecture) and the instruction sets must also be taken into account. A more accurate measure is to run both systems on a benchmark. Benchmark programs exist for certain tasks, such as running office applications, performing floating-point operations, graphics operations, and so on. The systems can be compared to each other on how long they take to complete these tasks. According to Apple Computer, the G4 is comparable or better than a higher-clock-speed Pentium on many benchmarks.

2.9 This representation is wasteful because to represent a single decimal digit from 0 through 9 we need to have ten tubes. If we could have an arbitrary number of these tubes ON at the same time, then those same tubes could be treated as binary bits. With ten bits, we can represent 2^10 patterns, or 1024 patterns. For integers, these patterns could be used to represent the numbers from 0 through 1023.

2.10 CPI = 1.55; MIPS rate = 25.8; execution time = 3.87 ns. Source: [HWAN93]

2.11 a.
CPI_A = Σ(CPI_i × I_i) / Ic = [(8×1 + 4×3 + 2×4 + 4×3) × 10^6] / [(8 + 4 + 2 + 4) × 10^6] ≈ 2.22
MIPS_A = f / (CPI_A × 10^6) = (200 × 10^6) / (2.22 × 10^6) = 90
CPU_A = (Ic × CPI_A) / f = (18 × 10^6 × 2.22) / (200 × 10^6) = 0.2 s

CPI_B = Σ(CPI_i × I_i) / Ic = [(10×1 + 8×2 + 2×4 + 4×3) × 10^6] / [(10 + 8 + 2 + 4) × 10^6] ≈ 1.92
MIPS_B = f / (CPI_B × 10^6) = (200 × 10^6) / (1.92 × 10^6) = 104
CPU_B = (Ic × CPI_B) / f = (24 × 10^6 × 1.92) / (200 × 10^6) = 0.23 s

b. Although machine B has a higher MIPS rate than machine A, it requires a longer CPU time to execute the same set of benchmark programs.

2.12 a. We can express the MIPS rate as: [(MIPS rate)/10^6] = Ic/T. So that: Ic = T × [(MIPS rate)/10^6]. The ratio of the instruction count of the RS/6000 to the VAX is [x × 18]/[12x × 1] = 1.5.
b. For the VAX, CPI = (5 MHz)/(1 MIPS) = 5. For the RS/6000, CPI = 25/18 = 1.39.

2.13 From Equation (2.2), MIPS = Ic/(T × 10^6) = 100/T. The MIPS values are:

           Computer A  Computer B  Computer C
Program 1  100         10          5
Program 2  0.1         1           5
Program 3  0.2         0.1         2
Program 4  1           0.125       1

            Arithmetic mean  Rank  Harmonic mean  Rank
Computer A  25.325           1     0.25           2
Computer B  2.8              3     0.21           3
Computer C  3.26             2     2.1            1

2.14 a. Normalized to R:

            Processor
Benchmark   R     M     Z
E           1.00  1.71  3.11
F           1.00  1.19  1.19
H           1.00  0.43  0.49
I           1.00  1.11  0.60
K           1.00  2.10  2.09
Arithmetic mean  1.00  1.31  1.50

b. Normalized to M:

            Processor
Benchmark   R     M     Z
E           0.59  1.00  1.82
F           0.84  1.00  1.00
H           2.32  1.00  1.13
I           0.90  1.00  0.54
K           0.48  1.00  1.00
Arithmetic mean  1.01  1.00  1.10

c. Recall that the larger the ratio, the higher the speed. Based on (a), R is the slowest machine, by a significant amount. Based on (b), M is the slowest machine, by a modest amount.

d. Normalized to R:

            Processor
Benchmark   R     M     Z
E           1.00  1.71  3.11
F           1.00  1.19  1.19
H           1.00  0.43  0.49
I           1.00  1.11  0.60
K           1.00  2.10  2.09
Geometric mean  1.00  1.15  1.18

Normalized to M:

            Processor
Benchmark   R     M     Z
E           0.59  1.00  1.82
F           0.84  1.00  1.00
H           2.32  1.00  1.13
I           0.90  1.00  0.54
K           0.48  1.00  1.00
Geometric mean  0.87  1.00  1.02

Using the geometric mean, R is the slowest no matter which machine is used for normalization.

2.15 a.
Normalized to X:

            Processor
Benchmark   X  Y     Z
1           1  2.0   0.5
2           1  0.5   2.0
Arithmetic mean  1  1.25  1.25
Geometric mean   1  1     1

Normalized to Y:

            Processor
Benchmark   X    Y  Z
1           0.5  1  0.25
2           2.0  1  4.0
Arithmetic mean  1.25  1  2.125
Geometric mean   1     1  1

Machine Y is twice as fast as machine X for benchmark 1, but half as fast for benchmark 2. Similarly, machine Z is half as fast as X for benchmark 1, but twice as fast for benchmark 2. Intuitively, these three machines have equivalent performance. However, if we normalize to X and compute the arithmetic mean of the speed metric, we find that Y and Z are 25% faster than X. Now, if we normalize to Y and compute the arithmetic mean of the speed metric, we find that X is 25% faster than Y and Z is more than twice as fast as Y. Clearly, the arithmetic mean is worthless in this context.

b. When the geometric mean is used, the three machines are shown to have equal performance when normalized to X, and also equal performance when normalized to Y. These results are much more in line with our intuition.

2.16 a. Assuming the same instruction mix means that the additional instructions for each task should be allocated proportionally among the instruction types. So we have the following table:

Instruction Type                  CPI  Instruction Mix
Arithmetic and logic              1    60%
Load/store with cache hit         2    18%
Branch                            4    12%
Memory reference with cache miss  12   10%

CPI = 0.6 + (2 × 0.18) + (4 × 0.12) + (12 × 0.1) = 2.64. The CPI has increased due to the increased time for memory access.

b. MIPS = 400/2.64 = 152. There is a corresponding drop in the MIPS rate.

c. The speedup factor is the ratio of the execution times. Using Equation 2.2, we calculate the execution time as T = Ic/(MIPS × 10^6). For the single-processor case, T1 = (2 × 10^6)/(178 × 10^6) = 11 ms. With 8 processors, each processor executes 1/8 of the 2 million instructions plus the 25,000 overhead instructions.
For this case, the execution time for each of the 8 processors is

T8 = [(2 × 10^6)/8 + 0.025 × 10^6] / (152 × 10^6) = 1.8 ms

Therefore we have

Speedup = (time to execute program on a single processor) / (time to execute program on N parallel processors) = 11/1.8 = 6.11

d. The answer to this question depends on how we interpret Amdahl's law. There are two inefficiencies in the parallel system. First, there are additional instructions added to coordinate between threads. Second, there is contention for memory access. The way that the problem is stated, none of the code is inherently serial. All of it is parallelizable, but with scheduling overhead. One could argue that the memory access conflict means that to some extent memory reference instructions are not parallelizable. But based on the information given, it is not clear how to quantify this effect in Amdahl's equation. If we assume that the fraction of code that is parallelizable is f = 1, then Amdahl's law reduces to Speedup = N = 8 for this case. Thus the actual speedup is only about 75% of the theoretical speedup.

2.17 a. Speedup = (time to access in main memory)/(time to access in cache) = T2/T1.

b. The average access time can be computed as T = H × T1 + (1 – H) × T2. Using Equation (2.8):

Speedup = (execution time before enhancement)/(execution time after enhancement) = T2/T = T2 / [H × T1 + (1 – H) × T2] = 1 / [(1 – H) + H × (T1/T2)]

c. T = H × T1 + (1 – H) × (T1 + T2) = T1 + (1 – H) × T2. This is Equation (4.2) in Chapter 4. Now,

Speedup = (execution time before enhancement)/(execution time after enhancement) = T2/T = T2 / [T1 + (1 – H) × T2] = 1 / [(1 – H) + T1/T2]

In this case, the denominator is larger, so that the speedup is less.

CHAPTER 3 COMPUTER FUNCTION AND INTERCONNECTION

ANSWERS TO QUESTIONS

3.1 Processor-memory: Data may be transferred from processor to memory or from memory to processor. Processor-I/O: Data may be transferred to or from a peripheral device by transferring between the processor and an I/O module.
Data processing: The processor may perform some arithmetic or logic operation on data. Control: An instruction may specify that the sequence of execution be altered.

3.2 Instruction address calculation (iac): Determine the address of the next instruction to be executed. Instruction fetch (if): Read instruction from its memory location into the processor. Instruction operation decoding (iod): Analyze instruction to determine type of operation to be performed and operand(s) to be used. Operand address calculation (oac): If the operation involves reference to an operand in memory or available via I/O, then determine the address of the operand. Operand fetch (of): Fetch the operand from memory or read it in from I/O. Data operation (do): Perform the operation indicated in the instruction. Operand store (os): Write the result into memory or out to I/O.

3.3 (1) Disable all interrupts while an interrupt is being processed. (2) Define priorities for interrupts and allow an interrupt of higher priority to cause a lower-priority interrupt handler to be interrupted.

3.4 Memory to processor: The processor reads an instruction or a unit of data from memory. Processor to memory: The processor writes a unit of data to memory. I/O to processor: The processor reads data from an I/O device via an I/O module. Processor to I/O: The processor sends data to the I/O device. I/O to or from memory: For these two cases, an I/O module is allowed to exchange data directly with memory, without going through the processor, using direct memory access (DMA).

3.5 With multiple buses, there are fewer devices per bus. This (1) reduces propagation delay, because each bus can be shorter, and (2) reduces bottleneck effects.

3.6 System pins: Include the clock and reset pins. Address and data pins: Include 32 lines that are time multiplexed for addresses and data. Interface control pins: Control the timing of transactions and provide coordination among initiators and targets.
Arbitration pins: Unlike the other PCI signal lines, these are not shared lines. Rather, each PCI master has its own pair of arbitration lines that connect it directly to the PCI bus arbiter. Error reporting pins: Used to report parity and other errors. Interrupt pins: These are provided for PCI devices that must generate requests for service. Cache support pins: These pins are needed to support a memory on PCI that can be cached in the processor or another device. 64-bit bus extension pins: Include 32 lines that are time multiplexed for addresses and data and that are combined with the mandatory address/data lines to form a 64-bit address/data bus. JTAG/boundary scan pins: These signal lines support testing procedures defined in IEEE Standard 1149.1.

ANSWERS TO PROBLEMS

3.1 Memory (contents in hex): 300: 3005; 301: 5940; 302: 7006
Step 1: 3005 → IR; Step 2: 3 → AC
Step 3: 5940 → IR; Step 4: 3 + 2 = 5 → AC
Step 5: 7006 → IR; Step 6: AC → Device 6

3.2
1. a. The PC contains 300, the address of the first instruction. This value is loaded into the MAR.
b. The value in location 300 (which is the instruction with the value 1940 in hexadecimal) is loaded into the MBR, and the PC is incremented. These two steps can be done in parallel.
c. The value in the MBR is loaded into the IR.
2. a. The address portion of the IR (940) is loaded into the MAR.
b. The value in location 940 is loaded into the MBR.
c. The value in the MBR is loaded into the AC.
3. a. The value in the PC (301) is loaded into the MAR.
b. The value in location 301 (which is the instruction with the value 5941) is loaded into the MBR, and the PC is incremented.
c. The value in the MBR is loaded into the IR.
4. a. The address portion of the IR (941) is loaded into the MAR.
b. The value in location 941 is loaded into the MBR.
c. The old value of the AC and the value of location MBR are added and the result is stored in the AC.
5. a. The value in the PC (302) is loaded into the MAR.
b.
The value in location 302 (which is the instruction with the value 2941) is loaded into the MBR, and the PC is incremented.
c. The value in the MBR is loaded into the IR.
6. a. The address portion of the IR (941) is loaded into the MAR.
b. The value in the AC is loaded into the MBR.
c. The value in the MBR is stored in location 941.

3.3 a. 2^24 = 16 MBytes
b. (1) If the local address bus is 32 bits, the whole address can be transferred at once and decoded in memory. However, because the data bus is only 16 bits, it will require 2 cycles to fetch a 32-bit instruction or operand.
(2) The 16 bits of the address placed on the address bus can't access the whole memory. Thus a more complex memory interface control is needed to latch the first part of the address and then the second part (because the microprocessor will send the address in two steps). For a 32-bit address, one may assume the first half will decode to access a "row" in memory, while the second half is sent later to access a "column" in memory. In addition to the two-step address operation, the microprocessor will need 2 cycles to fetch the 32-bit instruction/operand.
c. The program counter must be at least 24 bits. Typically, a 32-bit microprocessor will have a 32-bit external address bus and a 32-bit program counter, unless on-chip segment registers are used that may work with a smaller program counter. If the instruction register is to contain the whole instruction, it will have to be 32 bits long; if it will contain only the op code (called the op code register) then it will have to be 8 bits long.

3.4 In cases (a) and (b), the microprocessor will be able to access 2^16 = 64K bytes; the only difference is that with an 8-bit memory each access will transfer a byte, while with a 16-bit memory an access may transfer a byte or a 16-bit word.
For case (c), separate input and output instructions are needed, whose execution will generate separate "I/O signals" (different from the "memory signals" generated with the execution of memory-type instructions); at a minimum, one additional output pin will be required to carry this new signal. For case (d), it can support 2^8 = 256 input and 2^8 = 256 output byte ports and the same number of input and output 16-bit ports; in either case, the distinction between an input and an output port is defined by the different signal that the executed input or output instruction generated.

3.5 Clock cycle = 1/(8 MHz) = 125 ns
Bus cycle = 4 × 125 ns = 500 ns
2 bytes transferred every 500 ns; thus transfer rate = 4 MBytes/sec
Doubling the frequency may mean adopting a new chip manufacturing technology (assuming each instruction will have the same number of clock cycles); doubling the external data bus means wider (maybe newer) on-chip data bus drivers/latches and modifications to the bus control logic. In the first case, the speed of the memory chips will also need to double (roughly) not to slow down the microprocessor; in the second case, the "wordlength" of the memory will have to double to be able to send/receive 32-bit quantities.

3.6 a. Input from the Teletype is stored in INPR. The INPR will only accept data from the Teletype when FGI = 0. When data arrives, it is stored in INPR, and FGI is set to 1. The CPU periodically checks FGI. If FGI = 1, the CPU transfers the contents of INPR to the AC and sets FGI to 0. When the CPU has data to send to the Teletype, it checks FGO. If FGO = 0, the CPU must wait. If FGO = 1, the CPU transfers the contents of the AC to OUTR and sets FGO to 0. The Teletype sets FGO to 1 after the word is printed.
b. The process described in (a) is very wasteful. The CPU, which is much faster than the Teletype, must repeatedly check FGI and FGO. If interrupts are used, the Teletype can issue an interrupt to the CPU whenever it is ready to accept or send data.
The IEN register can be set by the CPU (under programmer control).

3.7 a. During a single bus cycle, the 8-bit microprocessor transfers one byte while the 16-bit microprocessor transfers two bytes. The 16-bit microprocessor has twice the data transfer rate.
b. Suppose we do 100 transfers of operands and instructions, of which 50 are one byte long and 50 are two bytes long. The 8-bit microprocessor takes 50 + (2 × 50) = 150 bus cycles for the transfer. The 16-bit microprocessor requires 50 + 50 = 100 bus cycles. Thus, the data transfer rates differ by a factor of 1.5.

3.8 The whole point of the clock is to define event times on the bus; therefore, we wish for a bus arbitration operation to be made each clock cycle. This requires that the priority signal propagate the length of the daisy chain (Figure 3.26) in one clock period. Thus, the maximum number of masters is determined by dividing the clock period by the amount of time it takes the priority signal to pass through a single bus master.

3.9 The lowest-priority device is assigned priority 16. This device must defer to all the others. However, it may transmit in any slot not reserved by the other SBI devices.

3.10 At the beginning of any slot, if none of the TR lines is asserted, only the priority 16 device may transmit. This gives it the lowest average wait time under most circumstances. Only when there is heavy demand on the bus, which means that most of the time there is at least one pending request, will the priority 16 device not have the lowest average wait time.

3.11 a. With a clocking frequency of 10 MHz, the clock period is 10^–7 s = 100 ns. The length of the memory read cycle is 300 ns.
b. The Read signal begins to fall at 75 ns from the beginning of the third clock cycle (middle of the second half of T3). Thus, memory must place the data on the bus no later than 55 ns from the beginning of T3.

3.12 a. The clock period is 125 ns. Therefore, two clock cycles need to be inserted.
b.
From Figure 3.19, the Read signal begins to rise early in T2. To insert two clock cycles, the Ready line can be pulled low at the beginning of T2 and kept low for 250 ns.

3.13 a. A 5 MHz clock corresponds to a clock period of 200 ns. Therefore, the Write signal has a duration of 150 ns.
b. The data remain valid for 150 + 20 = 170 ns.
c. One wait state.

3.14 a. Without the wait states, the instruction takes 16 bus clock cycles. The instruction requires four memory accesses, resulting in 8 wait states. The instruction, with wait states, takes 24 clock cycles, for an increase of 50%.
b. In this case, the instruction takes 26 bus cycles without wait states and 34 bus cycles with wait states, for an increase of 33%.

3.15 a. The clock period is 125 ns. One bus read cycle takes 500 ns = 0.5 µs. If the bus cycles repeat one after another, we can achieve a data transfer rate of 2 MB/s.
b. The wait state extends the bus read cycle by 125 ns, for a total duration of 0.625 µs. The corresponding data transfer rate is 1/0.625 = 1.6 MB/s.

3.16 A bus cycle takes 0.25 µs, so a memory cycle takes 1 µs. If both operands are even-aligned, it takes 2 µs to fetch the two operands. If one is odd-aligned, the time required is 3 µs. If both are odd-aligned, the time required is 4 µs.

3.17 Consider a mix of 100 instructions and operands. On average, they consist of 20 32-bit items, 40 16-bit items, and 40 bytes. The number of bus cycles required for the 16-bit microprocessor is (2 × 20) + 40 + 40 = 120. For the 32-bit microprocessor, the number required is 100. This amounts to an improvement of 20/120 or about 17%.

3.18 The processor needs another nine clock cycles to complete the instruction. Thus, the Interrupt Acknowledge will start after 900 ns.
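The bus-cycle arithmetic in 3.15 and 3.17 can be double-checked with a short script. One assumption is made explicit: in 3.15, one byte is transferred per read cycle, which is consistent with the stated 2 MB/s:

```python
# 3.15: a bus read cycle is 4 clocks of 125 ns = 500 ns; one byte is
# transferred per read cycle (assumption consistent with the stated 2 MB/s).
clock_ns = 125
read_cycle_ns = 4 * clock_ns
rate = 1e9 / read_cycle_ns / 1e6                   # MB/s, back-to-back cycles
rate_wait = 1e9 / (read_cycle_ns + clock_ns) / 1e6 # one wait state added

# 3.17: 100 transfers -- 20 x 32-bit, 40 x 16-bit, 40 x 8-bit items.
items = [(32, 20), (16, 40), (8, 40)]

def bus_cycles(width_bits):
    # each item needs ceil(item_bits / bus_width) bus cycles
    return sum(-(-bits // width_bits) * count for bits, count in items)

print(rate, rate_wait, bus_cycles(16), bus_cycles(32))  # 2.0 1.6 120 100
```

The 120-versus-100 cycle counts reproduce the "improvement of 20/120 or about 17%" claimed in 3.17.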
3.19 [Timing diagram: a PCI bus read transaction spanning nine clock cycles (CLK 1–9), showing the signals FRAME#, AD (one address phase followed by Data-1, Data-2, and Data-3), C/BE# (byte enables), IRDY#, TRDY#, and DEVSEL#, with three wait states inserted during the transaction.]

CHAPTER 4 CACHE MEMORY

ANSWERS TO QUESTIONS

4.1 Sequential access: Memory is organized into units of data, called records. Access must be made in a specific linear sequence. Direct access: Individual blocks or records have a unique address based on physical location. Access is accomplished by direct access to reach a general vicinity plus sequential searching, counting, or waiting to reach the final location. Random access: Each addressable location in memory has a unique, physically wired-in addressing mechanism. The time to access a given location is independent of the sequence of prior accesses and is constant.

4.2 Faster access time, greater cost per bit; greater capacity, smaller cost per bit; greater capacity, slower access time.

4.3 It is possible to organize data across a memory hierarchy such that the percentage of accesses to each successively lower level is substantially less than that of the level above. Because memory references tend to cluster, the data in the higher-level memory need not change very often to satisfy memory access requests.

4.4 In a cache system, direct mapping maps each block of main memory into only one possible cache line. Associative mapping permits each main memory block to be loaded into any line of the cache. In set-associative mapping, the cache is divided into a number of sets of cache lines; each main memory block can be mapped into any line in a particular set.

4.5 One field identifies a unique word or byte within a block of main memory. The remaining two fields specify one of the blocks of main memory. These two fields are a line field, which identifies one of the lines of the cache, and a tag field, which identifies one of the blocks that can fit into that line.
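The address-field splits described in 4.4 and 4.5 follow mechanically from the cache geometry. A sketch (the function and its argument order are my own; it is checked here against the numbers of Problem 4.2 later in this chapter):

```python
from math import log2

def field_sizes(addr_bits, block_bytes, cache_bytes, k_way):
    """Split a physical address into (tag, set, word) bit widths for a
    k-way set-associative cache; k_way = 1 gives direct mapping, in
    which case the set field is the line field."""
    word = int(log2(block_bytes))                  # byte within a block
    n_sets = cache_bytes // (block_bytes * k_way)  # sets (or lines if k=1)
    set_bits = int(log2(n_sets))
    return addr_bits - set_bits - word, set_bits, word

# Check against Problem 4.2: 8-Kbyte two-way cache, 16-byte blocks,
# 64-Mbyte main memory (26-bit address) -> tag 14, set 8, word 4.
print(field_sizes(26, 16, 8 * 1024, 2))  # (14, 8, 4)
```

Fully associative mapping is the degenerate case with no set field: the tag is simply the address minus the word bits.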
4.6 A tag field uniquely identifies a block of main memory. A word field identifies a unique word or byte within a block of main memory.

4.7 One field identifies a unique word or byte within a block of main memory. The remaining two fields specify one of the blocks of main memory. These two fields are a set field, which identifies one of the sets of the cache, and a tag field, which identifies one of the blocks that can fit into that set.

4.8 Spatial locality refers to the tendency of execution to involve a number of memory locations that are clustered. Temporal locality refers to the tendency for a processor to access memory locations that have been used recently.

4.9 Spatial locality is generally exploited by using larger cache blocks and by incorporating prefetching mechanisms (fetching items of anticipated use) into the cache control logic. Temporal locality is exploited by keeping recently used instruction and data values in cache memory and by exploiting a cache hierarchy.

ANSWERS TO PROBLEMS

4.1 The cache is divided into 16 sets of 4 lines each. Therefore, 4 bits are needed to identify the set number. Main memory consists of 4K = 2^12 blocks. Therefore, the set plus tag lengths must be 12 bits and therefore the tag length is 8 bits. Each block contains 128 words. Therefore, 7 bits are needed to specify the word.

Main memory address:
TAG  SET  WORD
8    4    7

4.2 There are a total of 8 kbytes/16 bytes = 512 lines in the cache. Thus the cache consists of 256 sets of 2 lines each. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. Main memory consists of 64 Mbytes/16 bytes = 2^22 blocks. Therefore, the set plus tag lengths must be 22 bits, so the tag length is 14 bits and the word field length is 4 bits.

Main memory address:
TAG  SET  WORD
14   8    4

4.3
Address           111111    666666     BBBBBB
a. Tag/Line/Word  11/444/1  66/1999/2  BB/2EEE/3
b. Tag/Word       44444/1   199999/2   2EEEEE/3
c.
Tag/Set/Word      22/444/1  CC/1999/2  177/EEE/3

4.4 a. Address length: 24; number of addressable units: 2^24; block size: 4; number of blocks in main memory: 2^22; number of lines in cache: 2^14; size of tag: 8.
b. Address length: 24; number of addressable units: 2^24; block size: 4; number of blocks in main memory: 2^22; number of lines in cache: 4000 hex; size of tag: 22.
c. Address length: 24; number of addressable units: 2^24; block size: 4; number of blocks in main memory: 2^22; number of lines in set: 2; number of sets: 2^13; number of lines in cache: 2^14; size of tag: 9.

4.5 Block frame size = 16 bytes = 4 doublewords
Number of block frames in cache = 16 KBytes / 16 Bytes = 1024
Number of sets = Number of block frames / Associativity = 1024/4 = 256 sets

[Figure: four-way set-associative cache with a 20-bit tag, 8-bit set, and 4-bit offset field. A decoder selects one of 256 sets, and four comparators (Comp1–Comp4) match the 20-bit tag against the four lines of the selected set to generate a hit. Example: the doubleword at location ABCDE8F8 is mapped onto set 143, any line, doubleword 2: (1000)ABCDE(1111)(1000); 8F8 → Set = 143.]

4.6 12 bits; 10 bits.

4.7 A 32-bit address consists of a 21-bit tag field, a 7-bit set field, and a 4-bit word field. Each set in the cache includes 3 LRU bits and four lines. Each line consists of 4 32-bit words, a valid bit, and a 21-bit tag.

4.8 a. 8 leftmost bits = tag; 5 middle bits = line number; 3 rightmost bits = byte number
b. slot 3; slot 6; slot 3; slot 21
c. Bytes with addresses 0001 1010 0001 1000 through 0001 1010 0001 1111 are stored in the cache
d. 256 bytes
e. Because two items with two different memory addresses can be stored in the same place in the cache. The tag is used to distinguish between them.

4.9 a. The bits are set according to the following rules with each access to the set:
1. If the access is to L0 or L1, B0 ← 1.
2. If the access is to L0, B1 ← 1.
3. If the access is to L1, B1 ← 0.
4. If the access is to L2 or L3, B0 ← 0.
5. If the access is to L2, B2 ← 1.
6. If the access is to L3, B2 ← 0.
The replacement algorithm works as follows (Figure 4.15): When a line must be replaced, the cache will first determine whether the most recent use was from L0 and L1 or from L2 and L3. Then the cache will determine which of the pair of blocks was least recently used and mark it for replacement. When the cache is initialized or flushed, all 128 sets of three LRU bits are set to zero.

b. The 80486 divides the four lines in a set into two pairs (L0, L1 and L2, L3). Bit B0 is used to select the pair that has been least recently used. Within each pair, one bit is used to determine which member of the pair was least recently used. However, the ultimate selection only approximates LRU. Consider the case in which the order of use was L0, L2, L3, L1. The least recently used pair is (L2, L3) and the least recently used member of that pair is L2, which is selected for replacement. However, the least recently used line of all is L0. Depending on the access history, the algorithm will always pick either the least recently used entry or the second least recently used entry.

c. The most straightforward way to implement true LRU for a four-line set is to associate a two-bit counter with each line. When an access occurs, the counter for that block is set to 0; all counters with values lower than the original value for the accessed block are incremented by 1. When a miss occurs and the set is not full, a new block is brought in, its counter is set to 0, and all other counters are incremented by 1. When a miss occurs and the set is full, the block with counter value 3 is replaced; its counter is set to 0 and all other counters are incremented by 1. This approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires 2N bits. A more efficient scheme can be designed which requires only N(N–1)/2 bits. The scheme operates as follows. Consider a matrix R with N rows and N columns, and take the upper-right triangular portion of the matrix, not counting the diagonal.
For N = 4, we have the following layout:

R(1,2) R(1,3) R(1,4)
       R(2,3) R(2,4)
              R(3,4)

When line I is referenced, row I of R(I,J) is set to 1, and column I of R(J,I) is set to 0. The LRU block is the one for which the row is entirely equal to 0 (for those bits in the row; the row may be empty) and for which the column is entirely 1 (for all the bits in the column; the column may be empty). As can be seen, for N = 4 a total of 6 bits are required.

4.10 Block size = 4 words = 2 doublewords; associativity K = 2; cache size = 4096 words; C = 1024 block frames; number of sets S = C/K = 512; main memory = 64K × 32 bits = 256 Kbytes = 2^18 bytes; address = 18 bits.

Main memory address = Tag (6) | Set (9) | Word (2) | (1)

[Figure: two-way set-associative organization; the 9-bit set field drives a 512-way decoder, two comparators check the 6-bit tags of the two lines in the selected set, and the word field selects a word within the 8-word set entry.]

4.11 a. Address format: Tag = 20 bits; Line = 6 bits; Word = 6 bits
Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in cache = 2^r = 2^6 = 64; size of tag = 20 bits.
b. Address format: Tag = 26 bits; Word = 6 bits
Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in cache = undetermined; size of tag = 26 bits.
c. Address format: Tag = 9 bits; Set = 17 bits; Word = 6 bits
Number of addressable units = 2^(s+w) = 2^32 bytes; number of blocks in main memory = 2^s = 2^26; number of lines in set = k = 4; number of sets in cache = 2^d = 2^17; number of lines in cache = k × 2^d = 2^19; size of tag = 9 bits.

4.12 a. Because the block size is 16 bytes and the word size is 1 byte, there are 16 words per block. We will need 4 bits to indicate which word we want out of a block. Each cache line/slot matches a memory block, so each cache slot contains 16 bytes. If the cache is 64 Kbytes, then 64 Kbytes/16 = 4096 cache slots. To address these 4096 cache slots, we need 12 bits (2^12 = 4096).
Consequently, a 20-bit (1-MByte) main memory address is divided as follows:
Bits 0–3: word offset (4 bits)
Bits 4–15: cache slot (12 bits)
Bits 16–19: tag (remaining bits)

F0010 = 1111 0000 0000 0001 0000
Word offset = 0000 = 0
Slot = 0000 0000 0001 = 001
Tag = 1111 = F

01234 = 0000 0001 0010 0011 0100
Word offset = 0100 = 4
Slot = 0001 0010 0011 = 123
Tag = 0000 = 0

CABBE = 1100 1010 1011 1011 1110
Word offset = 1110 = E
Slot = 1010 1011 1011 = ABB
Tag = 1100 = C

b. We need to pick any address where the slot is the same, but the tag (and optionally, the word offset) is different. Here are two examples where the slot is 1111 1111 1111:
Address 1: Word offset = 1111, Slot = 1111 1111 1111, Tag = 0000; Address = 0FFFF
Address 2: Word offset = 0001, Slot = 1111 1111 1111, Tag = 0011; Address = 3FFF1

c. With a fully associative cache, the address is split up into only a TAG and a WORD OFFSET field. We no longer need to identify which slot a memory block might map to, because a block can be in any slot and we search each cache slot in parallel. The word offset must be 4 bits to address each individual word in the 16-word block. This leaves 16 bits for the tag.
F0010: Word offset = 0h, Tag = F001h
CABBE: Word offset = Eh, Tag = CABBh

d. As computed in part a, we have 4096 cache slots. If we implement a two-way set-associative cache, then we put two cache slots into one set. Our cache now holds 4096/2 = 2048 sets, where each set has two slots. To address these 2048 sets we need 11 bits (2^11 = 2048). Once we address a set, we simultaneously search both cache slots to see if one has a tag that matches the target.
Our 20-bit address is now broken up as follows:
Bits 0–3: word offset
Bits 4–14: cache set
Bits 15–19: tag

F0010 = 1111 0000 0000 0001 0000
Word offset = 0000 = 0
Cache set = 000 0000 0001 = 001
Tag = 1 1110 = 1E

CABBE = 1100 1010 1011 1011 1110
Word offset = 1110 = E
Cache set = 010 1011 1011 = 2BB
Tag = 1 1001 = 19

4.13 Associate a 2-bit counter with each of the four blocks in a set. Initially, arbitrarily set the four values to 0, 1, 2, and 3 respectively. When a hit occurs, the counter of the block that is referenced is set to 0. The other counters in the set with values originally lower than the referenced counter are incremented by 1; the remaining counters are unchanged. When a miss occurs, the block in the set whose counter value is 3 is replaced and its counter set to 0. All other counters in the set are incremented by 1.

4.14 Writing back a line takes 30 + (7 × 5) = 65 ns, enough time for 2.17 single-word memory operations. If the average line that is written at least once is written more than 2.17 times, the write-back cache will be more efficient.

4.15 a. A reference to the first instruction is immediately followed by a reference to the second.
b. The ten accesses to a[i] within the inner for loop, which occur within a short interval of time.

4.16 Define:
Ci = average cost per bit, memory level i
Si = size of memory level i
Ti = time to access a word in memory level i
Hi = probability that a word is in memory i and in no higher-level memory
Bi = time to transfer a block of data from memory level (i + 1) to memory level i

Let cache be memory level 1; main memory, memory level 2; and so on, for a total of N levels of memory. Then

Cs = (Σ_{i=1..N} Ci·Si) / (Σ_{i=1..N} Si)

The derivation of Ts is more complicated. We begin with the result from probability theory that

Expected value of x = Σ_{i=1..N} i·Pr[x = i]

We can write:

Ts = Σ_{i=1..N} Ti·Hi

We need to realize that if a word is in M1 (cache), it is read immediately.
If it is in M2 but not M1, then a block of data is transferred from M2 to M1 and then read. Thus

T2 = B1 + T1

Further,

T3 = B2 + T2 = B1 + B2 + T1

Generalizing:

Ti = Σ_{j=1..i–1} Bj + T1

So

Ts = Σ_{i=2..N} (Σ_{j=1..i–1} Bj)·Hi + Σ_{i=1..N} T1·Hi

But

Σ_{i=1..N} Hi = 1

Finally,

Ts = Σ_{i=2..N} (Σ_{j=1..i–1} Bj)·Hi + T1

4.17 Main memory consists of 512 blocks of 64 words. Cache consists of 16 sets; each set consists of 4 slots; each slot consists of 64 words. Locations 0 through 4351 in main memory occupy blocks 0 through 67. On the first fetch sequence, blocks 0 through 15 are read into sets 0 through 15; blocks 16 through 31 are read into sets 0 through 15; blocks 32–47 are read into sets 0 through 15; blocks 48–63 are read into sets 0 through 15; and blocks 64–67 are read into sets 0 through 3. Because each set has 4 slots, no replacement is needed through block 63. The last 4 groups of blocks involve a replacement. On each successive pass, replacements will be required in sets 0 through 3, but all of the blocks in sets 4 through 15 remain undisturbed. Thus, on each successive pass, 48 blocks are undisturbed, and the remaining 20 must be read in.

Let T be the time to read 64 words from cache. Then 10T is the time to read 64 words from main memory. If a word is not in the cache, then it can only be read by first transferring the word from main memory to the cache and then reading the cache. Thus the time to read a 64-word block from cache if it is missing is 11T. We can now express the improvement factor as follows.

With no cache: Fetch time = (10 passes)(68 blocks/pass)(10T/block) = 6800T
With cache: Fetch time = (68)(11T) [first pass] + (9)(48)(T) + (9)(20)(11T) [other passes] = 3160T
Improvement = 6800T/3160T = 2.15

4.18 a.
Access 63: 1 miss; block 3 → slot 3
Access 64: 1 miss; block 4 → slot 0
Accesses 65–70: 6 hits
Access 15: 1 miss; block 0 → slot 0 (first loop)
Access 16: 1 miss; block 1 → slot 1
Accesses 17–31: 15 hits
Access 32: 1 miss; block 2 → slot 2
Access 80: 1 miss; block 5 → slot 1
Accesses 81–95: 15 hits
Access 15: 1 hit (second loop)
Access 16: 1 miss; block 1 → slot 1
Accesses 17–31: 15 hits
Access 32: 1 hit
Access 80: 1 miss; block 5 → slot 1
Accesses 81–95: 15 hits
Access 15: 1 hit (third loop)
Access 16: 1 miss; block 1 → slot 1
Accesses 17–31: 15 hits
Access 32: 1 hit
Access 80: 1 miss; block 5 → slot 1
Accesses 81–95: 15 hits
Access 15: 1 hit (fourth loop)
… The pattern continues through the tenth loop.

Lines 63–70: 2 misses, 6 hits
First loop, 15–32, 80–95: 4 misses, 30 hits
Second loop, 15–32, 80–95: 2 misses, 32 hits
Third loop, 15–32, 80–95: 2 misses, 32 hits
Fourth loop, 15–32, 80–95: 2 misses, 32 hits
Fifth loop, 15–32, 80–95: 2 misses, 32 hits
Sixth loop, 15–32, 80–95: 2 misses, 32 hits
Seventh loop, 15–32, 80–95: 2 misses, 32 hits
Eighth loop, 15–32, 80–95: 2 misses, 32 hits
Ninth loop, 15–32, 80–95: 2 misses, 32 hits
Tenth loop, 15–32, 80–95: 2 misses, 32 hits
Total: 24 misses, 324 hits
Hit ratio = 324/348 = 0.931

b.
Access 63: 1 miss; block 3 → set 1, slot 2
Access 64: 1 miss; block 4 → set 0, slot 0
Accesses 65–70: 6 hits
Access 15: 1 miss; block 0 → set 0, slot 1 (first loop)
Access 16: 1 miss; block 1 → set 1, slot 3
Accesses 17–31: 15 hits
Access 32: 1 miss; block 2 → set 0, slot 0
Access 80: 1 miss; block 5 → set 1, slot 2
Accesses 81–95: 15 hits
Access 15: 1 hit (second loop)
Accesses 16–31: 16 hits
Access 32: 1 hit
Accesses 80–95: 16 hits
… All hits for the next eight iterations.

Lines 63–70: 2 misses, 6 hits
First loop, 15–32, 80–95: 4 misses, 30 hits
Second loop, 15–32, 80–95: 0 misses, 34 hits
Third loop, 15–32, 80–95: 0 misses, 34 hits
Fourth loop, 15–32, 80–95: 0 misses, 34 hits
Fifth loop, 15–32, 80–95: 0 misses, 34 hits
Sixth loop, 15–32, 80–95: 0 misses, 34 hits
Seventh loop, 15–32, 80–95: 0 misses, 34 hits
Eighth loop, 15–32, 80–95: 0 misses, 34 hits
Ninth loop, 15–32, 80–95: 0 misses, 34 hits
Tenth loop, 15–32, 80–95: 0 misses, 34 hits
Total: 6 misses, 342 hits
Hit ratio = 342/348 = 0.983

4.19 a. Cost = Cm × 8 × 10^6 = 8 × 10^3 ¢ = $80
b. Cost = Cc × 8 × 10^6 = 8 × 10^4 ¢ = $800
c. From Equation (4.1):
1.1 × T1 = T1 + (1 – H)T2
(0.1)(100) = (1 – H)(1200)
H = 1190/1200

4.20 a. Under the initial conditions, using Equation (4.1), the average access time is
T1 + (1 – H)T2 = 1 + (0.05)T2
Under the changed conditions, the average access time is
1.5 + (0.03)T2
For improved performance, we must have
1 + (0.05)T2 > 1.5 + (0.03)T2
Solving for T2, the condition is T2 > 25
b. As the access time on a cache miss becomes larger, it becomes more important to increase the hit ratio.

4.21 a. First, 2.5 ns are needed to determine that a cache miss occurs. Then, the required line is read into the cache. Then an additional 2.5 ns are needed to read the requested word.
Tmiss = 2.5 + 50 + (15)(5) + 2.5 = 130 ns
b. The value Tmiss from part (a) is equivalent to the quantity (T1 + T2) in Equation (4.1).
Under the initial conditions, using Equation (4.1), the average access time is
Ts = H × T1 + (1 – H) × (T1 + T2) = (0.95)(2.5) + (0.05)(130) = 8.875 ns
Under the revised scheme, we have
Tmiss = 2.5 + 50 + (31)(5) + 2.5 = 210 ns
and
Ts = H × T1 + (1 – H) × (T1 + T2) = (0.97)(2.5) + (0.03)(210) = 8.725 ns

4.22 There are three cases to consider:

Location of referenced word | Probability | Total access time (ns)
In cache | 0.9 | 20
Not in cache, but in main memory | (0.1)(0.6) = 0.06 | 60 + 20 = 80
Not in cache or main memory | (0.1)(0.4) = 0.04 | 12 ms + 60 + 20 = 12,000,080

So the average access time would be
Avg = (0.9)(20) + (0.06)(80) + (0.04)(12,000,080) = 480,026 ns

4.23 a. Consider the execution of 100 instructions. Under write-through, this creates 200 cache references (168 read references and 32 write references). On average, the read references result in (0.03) × 168 = 5.04 read misses. For each read miss, a line of memory must be read in, generating 5.04 × 8 = 40.32 physical words of traffic. For write misses, a single word is written back, generating 32 words of traffic. Total traffic: 72.32 words. For write-back, 100 instructions create 200 cache references and thus 6 cache misses. Assuming 30% of lines are dirty, on average 1.8 of these misses require a line write before a line read. Thus, total traffic is (6 + 1.8) × 8 = 62.4 words. The traffic rates:
Write-through = 0.7232 words/instruction
Write-back = 0.624 words/instruction
b. For write-through: [(0.05) × 168 × 8] + 32 = 99.2 → 0.992 words/instruction
For write-back: (10 + 3) × 8 = 104 → 1.04 words/instruction
c. For write-through: [(0.07) × 168 × 8] + 32 = 126.08 → 1.2608 words/instruction
For write-back: (14 + 4.2) × 8 = 145.6 → 1.456 words/instruction
d. A 5% miss rate is roughly a crossover point. At that rate, the memory traffic is about equal for the two strategies. For a lower miss rate, write-back is superior. For a higher miss rate, write-through is superior.

4.24 a.
One clock cycle equals 60 ns, so a cache access takes 120 ns and a main memory access takes 180 ns. The effective length of a memory cycle is (0.9 × 120) + (0.1 × 180) = 126 ns.
b. The calculation is now (0.9 × 120) + (0.1 × 300) = 138 ns. Clearly the performance degrades. However, note that although the memory access time increases by 120 ns, the average access time increases by only 12 ns.

4.25 a. For a 1-MIPS processor, the average instruction takes 1000 ns to fetch and execute. On average, an instruction uses two bus cycles for a total of 600 ns, so the bus utilization is 0.6.
b. For only half of the instructions must the bus be used for instruction fetch. Bus utilization is now (150 + 300)/1000 = 0.45. This reduces the waiting time for other bus requestors, such as DMA devices and other microprocessors.

4.26 a. Ta = Tc + (1 – H)Tb + W(Tm – Tc)
b. Ta = Tc + (1 – H)Tb + Wb(1 – H)Tb = Tc + (1 – H)(1 + Wb)Tb

4.27 Ta = [Tc1 + (1 – H1)Tc2] + (1 – H2)Tm

4.28 a. Miss penalty = 1 + 4 = 5 clock cycles
b. Miss penalty = 4 × (1 + 4) = 20 clock cycles
c. Miss penalty = miss penalty for one word + 3 = 8 clock cycles

4.29 The average miss penalty equals the miss penalty times the miss rate. For a line size of one word, average miss penalty = 0.032 × 5 = 0.16 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = 0.011 × 20 = 0.22 clock cycles. For a line size of 4 words and the burst transfer, average miss penalty = 0.011 × 8 = 0.088 clock cycles.

CHAPTER 5 INTERNAL MEMORY

ANSWERS TO QUESTIONS

5.1 They exhibit two stable (or semistable) states, which can be used to represent binary 1 and 0; they are capable of being written into (at least once), to set the state; they are capable of being read to sense the state.

5.2 (1) A memory in which individual words of memory are directly accessed through wired-in addressing logic.
(2) Semiconductor main memory in which it is possible both to read data from the memory and to write new data into the memory easily and rapidly.

5.3 SRAM is used for cache memory (both on and off chip), and DRAM is used for main memory.

5.4 SRAMs generally have faster access times than DRAMs. DRAMs are less expensive and smaller than SRAMs.

5.5 A DRAM cell is essentially an analog device using a capacitor; the capacitor can store any charge value within a range; a threshold value determines whether the charge is interpreted as 1 or 0. A SRAM cell is a digital device, in which binary values are stored using traditional flip-flop logic-gate configurations.

5.6 Microprogrammed control unit memory; library subroutines for frequently wanted functions; system programs; function tables.

5.7 EPROM is read and written electrically; before a write operation, all the storage cells must be erased to the same initial state by exposure of the packaged chip to ultraviolet radiation. Erasure is performed by shining an intense ultraviolet light through a window that is designed into the memory chip. EEPROM is a read-mostly memory that can be written into at any time without erasing prior contents; only the byte or bytes addressed are updated. Flash memory is intermediate between EPROM and EEPROM in both cost and functionality. Like EEPROM, flash memory uses an electrical erasing technology. An entire flash memory can be erased in one or a few seconds, which is much faster than EPROM. In addition, it is possible to erase just blocks of memory rather than an entire chip. However, flash memory does not provide byte-level erasure. Like EPROM, flash memory uses only one transistor per bit, and so achieves the high density of EPROM (compared with EEPROM).

5.8 A0–A1: address lines. CAS: column address select. D1–D4: data lines. NC: no connect. OE: output enable. RAS: row address select. Vcc: voltage source. Vss: ground. WE: write enable.
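The RAS/CAS pin pair in question 5.8 exists because a DRAM presents its address in two halves over shared pins. A minimal sketch of that multiplexing (the 16-bit address width and 8 address pins here are illustrative assumptions, not taken from the manual):

```python
def multiplex_address(addr, pin_bits=8):
    """Split a DRAM address into the two halves driven onto the shared
    address pins: the row half (latched on RAS) and the column half
    (latched on CAS). pin_bits is the number of address pins."""
    row = (addr >> pin_bits) & ((1 << pin_bits) - 1)
    col = addr & ((1 << pin_bits) - 1)
    return row, col

def demultiplex_address(row, col, pin_bits=8):
    """Recombine the two strobed halves into the full address."""
    return (row << pin_bits) | col
```

Multiplexing halves the pin count at the cost of two strobe phases per access, which is the trade the RAS/CAS protocol makes.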
5.9 A bit appended to an array of binary digits to make the sum of all the binary digits, including the parity bit, always odd (odd parity) or always even (even parity).

5.10 A syndrome is created by the XOR of the code in a word with a calculated version of that code. Each bit of the syndrome is 0 or 1 according to whether there is or is not a match in that bit position for the two codes. If the syndrome contains all 0s, no error has been detected. If the syndrome contains one and only one bit set to 1, then an error has occurred in one of the 4 check bits; no correction is needed. If the syndrome contains more than one bit set to 1, then the numerical value of the syndrome indicates the position of the data bit in error. This data bit is inverted for correction.

5.11 Unlike the traditional DRAM, which is asynchronous, the SDRAM exchanges data with the processor synchronized to an external clock signal and running at the full speed of the processor/memory bus without imposing wait states.

ANSWERS TO PROBLEMS

5.1 The 1-bit-per-chip organization has several advantages. It requires fewer pins on the package (only one data-out line); therefore, a higher density of bits can be achieved for a given size package. Also, it is somewhat more reliable because it has only one output driver. These benefits have led to the traditional use of 1-bit-per-chip for RAM. In most cases, ROMs are much smaller than RAMs, and it is often possible to get an entire ROM on one or two chips if a multiple-bits-per-chip organization is used. This saves on cost and is sufficient reason to adopt that organization.

5.2 In 1 ms, the time devoted to refresh is 64 × 150 ns = 9600 ns. The fraction of time devoted to memory refresh is (9.6 × 10^–6 s)/(10^–3 s) = 0.0096, which is approximately 1%.

5.3 a. Memory cycle time = 60 + 40 = 100 ns. The maximum data rate is 1 bit every 100 ns, which is 10 Mbps.
b. 320 Mbps = 40 MB/s.
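The syndrome behavior described in question 5.10 can be checked with a short sketch. For a Hamming SEC layout like that of problems 5.10–5.11 (check bits at positions 1, 2, 4, 8), the syndrome equals the XOR of the position numbers of every 1 bit in the fetched word: 0 means no error, and a nonzero value names the position in error. This is a sketch of the standard scheme, not code from the manual:

```python
def syndrome(bits):
    """bits[0] is position 1, bits[1] is position 2, and so on, with the
    check bits included in place at positions 1, 2, 4, 8, ...
    Returns 0 for no error, otherwise the 1-based position in error."""
    s = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            s ^= pos
    return s
```

For the stored word of problem 5.10, 001101001111 (positions 12 down to 1), the syndrome is 0; flipping position 8 (check bit C8) yields syndrome 8, matching the solution.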
5.4 [Figure: eight 1-Mb chips share the low-order address lines A0–A19; a decoder driven by A20–A22 activates exactly one of the chip-select outputs S0–S7, so a single chip responds for each 1-Mb region of the address space.]

5.5 a. The length of a clock cycle is 100 ns. Mark the beginning of T1 as time 0. Address Enable returns to a low at 75. RAS goes active 50 ns later, or at time 125. Data must be made available by the DRAMs by time 300 – 60 = 240. Hence, the access time must be no more than 240 – 125 = 115 ns.
b. A single wait state will increase the access time requirement to 115 + 100 = 215 ns. This can easily be met by DRAMs with access times of 150 ns.

5.6 a. The refresh period from row to row must be no greater than 4000/256 = 15.625 µs.
b. An 8-bit counter is needed to count 256 rows (2^8 = 256).

5.7 a. Pulses a through f: write; pulse g: store-disable outputs; pulses h through m: read; pulse n: store-disable outputs.
b. Data is read in via pins (D3, D2, D1, D0):
word 0 = 1111 (written into location 0 during pulse a)
word 1 = 1110 (written into location 1 during pulse b)
word 2 = 1101 (written into location 2 during pulse c)
word 3 = 1100 (written into location 3 during pulse d)
word 4 = 1011 (written into location 4 during pulse e)
word 5 = 1010 (written into location 5 during pulse f)
word 6 = random (nothing was written into this location)
c.
Output leads are (O3, O2, O1, O0):
pulse h: 1111 (read location 0)
pulse i: 1110 (read location 1)
pulse j: 1101 (read location 2)
pulse k: 1100 (read location 3)
pulse l: 1011 (read location 4)
pulse m: 1010 (read location 5)

5.8 8192/64 = 128 chips, arranged in 8 rows of 16 chips, split into an even section and an odd section. [Figure: A0 selects the even or odd section, address bits A1–A6 select the chip within a row, and the higher-order address bits drive a row decoder that enables one of the 8 rows.]

5.9 Total memory is 1 megabyte = 8 megabits. It will take 32 DRAMs to construct the memory (32 × 256 Kb = 8 Mb). The composite failure rate is 2000 × 32 = 64,000 FITS. From this, we get MTBF = 10^9/64,000 = 15,625 hours = 22 months.

5.10 The stored word is 001101001111, as shown in Figure 5.10. Now suppose that the only error is in C8, so that the fetched word is 001111001111. Then the received block results in the following table:

Position: 12 11 10  9  8  7  6  5  4  3  2  1
Bits:     D8 D7 D6 D5 C8 D4 D3 D2 C4 D1 C2 C1
Block:     0  0  1  1  1  1  0  0  1  1  1  1
Position codes of the 1-valued data bits: 1010 (10), 1001 (9), 0111 (7), 0011 (3)

The check bit calculation after reception:

Received check bits: 1111
Position 10: 1010
Position 9:  1001
Position 7:  0111
Position 3:  0011
XOR = syndrome: 1000

The nonzero result detects an error and indicates that the error is in bit position 8, which is check bit C8.

5.11 Data bits with value 1 are in bit positions 12, 11, and 5:

Position: 12 11 10  9  8  7  6  5  4  3  2  1
Bits:     D8 D7 D6 D5 C8 D4 D3 D2 C4 D1 C2 C1
Block:     1  1  0  0     0  0  1     0
Position codes of the 1-valued data bits: 1100 (12), 1011 (11), 0101 (5)

The check bits are in bit numbers 8, 4, 2, and 1.
Check bit 8 is calculated by the values in bit numbers 12, 11, 10, and 9.
Check bit 4 is calculated by the values in bit numbers 12, 7, 6, and 5.
Check bit 2 is calculated by the values in bit numbers 11, 10, 7, 6, and 3.
Check bit 1 is calculated by the values in bit numbers 11, 9, 7, 5, and 3.
Thus, the check bits are 0 0 1 0.

5.12 The Hamming word initially calculated was:

bit number: 12 11 10  9  8  7  6  5  4  3  2  1
             0  0  1  1  0  1  0  0  1  1  1  1

Doing an exclusive-OR of 0111 and 1101 yields 1010, indicating an error in bit 10 of the Hamming word. Thus, the data word read from memory was 00011001.

5.13 We need K check bits such that 1024 + K ≤ 2^K – 1. The minimum value of K that satisfies this condition is 11.

5.14 As Table 5.2 indicates, 5 check bits are needed for an SEC code for 16-bit data words. The layout of data bits and check bits:

Bit position | Position number | Check bit | Data bit
21 | 10101 |     | M16
20 | 10100 |     | M15
19 | 10011 |     | M14
18 | 10010 |     | M13
17 | 10001 |     | M12
16 | 10000 | C16 |
15 | 01111 |     | M11
14 | 01110 |     | M10
13 | 01101 |     | M9
12 | 01100 |     | M8
11 | 01011 |     | M7
10 | 01010 |     | M6
 9 | 01001 |     | M5
 8 | 01000 | C8  |
 7 | 00111 |     | M4
 6 | 00110 |     | M3
 5 | 00101 |     | M2
 4 | 00100 | C4  |
 3 | 00011 |     | M1
 2 | 00010 | C2  |
 1 | 00001 | C1  |

The equations are calculated as before; for example, C1 = M1 ⊕ M2 ⊕ M4 ⊕ M5 ⊕ M7 ⊕ M9 ⊕ M11 ⊕ M12 ⊕ M14 ⊕ M16. For the word 0101000000111001, the code is C16 = 1; C8 = 1; C4 = 1; C2 = 1; C1 = 0. If an error occurs in data bit 4: C16 = 1; C8 = 1; C4 = 0; C2 = 0; C1 = 1. Comparing the two:

C16 C8 C4 C2 C1
 1   1  1  1  0
 1   1  0  0  1
---------------
 0   0  1  1  1

The result is an error identified in bit position 7, which is data bit 4.

CHAPTER 6 EXTERNAL MEMORY

ANSWERS TO QUESTIONS

6.1 Improvement in the uniformity of the magnetic film surface to increase disk reliability. A significant reduction in overall surface defects to help reduce read/write errors. Ability to support lower fly heights (described subsequently). Better stiffness to reduce disk dynamics.
Greater ability to withstand shock and damage.

6.2 The write mechanism is based on the fact that electricity flowing through a coil produces a magnetic field. Pulses are sent to the write head, and magnetic patterns are recorded on the surface below, with different patterns for positive and negative currents. An electric current in the wire induces a magnetic field across the gap, which in turn magnetizes a small area of the recording medium. Reversing the direction of the current reverses the direction of the magnetization on the recording medium.

6.3 The read head consists of a partially shielded magnetoresistive (MR) sensor. The MR material has an electrical resistance that depends on the direction of the magnetization of the medium moving under it. By passing a current through the MR sensor, resistance changes are detected as voltage signals.

6.4 For the constant angular velocity (CAV) system, the number of bits per track is constant. An increase in density is achieved with multiple zone recording, in which the surface is divided into a number of zones, with zones farther from the center containing more bits than zones closer to the center.

6.5 On a magnetic disk, data are organized on the platter in a concentric set of rings, called tracks. Data are transferred to and from the disk in sectors. For a disk with multiple platters, the set of all the tracks in the same relative position on the platter is referred to as a cylinder.

6.6 512 bytes.

6.7 On a movable-head system, the time it takes to position the head at the track is known as seek time. Once the track is selected, the disk controller waits until the appropriate sector rotates to line up with the head. The time it takes for the beginning of the sector to reach the head is known as rotational delay. The sum of the seek time, if any, and the rotational delay equals the access time, which is the time it takes to get into position to read or write.
Once the head is in position, the read or write operation is then performed as the sector moves under the head; this is the data transfer portion of the operation, and the time for the transfer is the transfer time.

6.8 1. RAID is a set of physical disk drives viewed by the operating system as a single logical drive. 2. Data are distributed across the physical drives of an array. 3. Redundant disk capacity is used to store parity information, which guarantees data recoverability in case of a disk failure.

6.9 0: Nonredundant. 1: Mirrored; every disk has a mirror disk containing the same data. 2: Redundant via Hamming code; an error-correcting code is calculated across corresponding bits on each data disk, and the bits of the code are stored in the corresponding bit positions on multiple parity disks. 3: Bit-interleaved parity; similar to level 2 but instead of an error-correcting code, a simple parity bit is computed for the set of individual bits in the same position on all of the data disks. 4: Block-interleaved parity; a bit-by-bit parity strip is calculated across corresponding strips on each data disk, and the parity bits are stored in the corresponding strip on the parity disk. 5: Block-interleaved distributed parity; similar to level 4 but distributes the parity strips across all disks. 6: Block-interleaved dual distributed parity; two different parity calculations are carried out and stored in separate blocks on different disks.

6.10 The disk is divided into strips; these strips may be physical blocks, sectors, or some other unit. The strips are mapped round robin to consecutive array members. A set of logically consecutive strips that maps exactly one strip to each array member is referred to as a stripe.

6.11 For RAID level 1, redundancy is achieved by having two identical copies of all data. For higher levels, redundancy is achieved by the use of error-correcting codes.
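The round-robin strip mapping described in question 6.10 amounts to simple modular arithmetic: logical strip i of an n-disk array lands on disk i mod n, within stripe i div n. A minimal sketch (a hypothetical helper, not from the manual):

```python
def strip_location(i, n_disks):
    """Map logical strip i to (disk, stripe) under round-robin striping:
    consecutive strips go to consecutive disks, wrapping around, and one
    full pass over all disks forms one stripe."""
    return i % n_disks, i // n_disks
```

With 4 disks, strips 0–3 form stripe 0 (one strip on each disk) and strips 4–7 form stripe 1, which is exactly the "exactly one strip per array member" property that defines a stripe.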
6.12 In a parallel access array, all member disks participate in the execution of every I/O request. Typically, the spindles of the individual drives are synchronized so that each disk head is in the same position on each disk at any given time. In an independent access array, each member disk operates independently, so that separate I/O requests can be satisfied in parallel.

6.13 For the constant angular velocity (CAV) system, the number of bits per track is constant. At a constant linear velocity (CLV), the disk rotates more slowly for accesses near the outer edge than for those near the center. Thus, the capacity of a track and the rotational delay both increase for positions nearer the outer edge of the disk.

6.14 1. Bits are packed more closely on a DVD. The spacing between loops of a spiral on a CD is 1.6 µm and the minimum distance between pits along the spiral is 0.834 µm. The DVD uses a laser with shorter wavelength and achieves a loop spacing of 0.74 µm and a minimum distance between pits of 0.4 µm. The result of these two improvements is about a seven-fold increase in capacity, to about 4.7 GB. 2. The DVD employs a second layer of pits and lands on top of the first layer. A dual-layer DVD has a semireflective layer on top of the reflective layer, and by adjusting focus, the lasers in DVD drives can read each layer separately. This technique almost doubles the capacity of the disk, to about 8.5 GB. The lower reflectivity of the second layer limits its storage capacity so that a full doubling is not achieved. 3. The DVD-ROM can be two sided, whereas data is recorded on only one side of a CD. This brings total capacity up to 17 GB.

6.15 The typical recording technique used in serial tapes is referred to as serpentine recording. In this technique, when data are being recorded, the first set of bits is recorded along the whole length of the tape.
When the end of the tape is reached, the heads are repositioned to record a new track, and the tape is again recorded on its whole length, this time in the opposite direction. That process continues, back and forth, until the tape is full.

ANSWERS TO PROBLEMS

6.1 It will be useful to keep the following representation of the N tracks of a disk in mind:

0 | 1 | … | j – 1 | … | N – j | … | N – 2 | N – 1

a. Let us use the notation Ps[j/t] = Pr[seek of length j when the head is currently positioned over track t]. Recognize that each of the N tracks is equally likely to be requested. Therefore the unconditional probability of selecting any particular track is 1/N. We can then state:

Ps[j/t] = 1/N if t ≤ j – 1 OR t ≥ N – j
Ps[j/t] = 2/N if j – 1 < t < N – j

[…]

… r > 1, can be optimal, even when performance grows by only √r. For a given f, the maximum speedup can occur at one big core, n base cores, or with an intermediate number of middle-sized cores. Recall that for n = 256 and f = 0.975, the maximum speedup occurs using 7.1 core equivalents per core.
Implication 2. Researchers should seek ways of increasing core performance even at a high cost.
Result 3. Moving to denser chips increases the likelihood that cores will be nonminimal. Even at f = 0.99, minimal base cores are optimal at chip size n = 16, but more powerful cores help at n = 256.
Implication 3. As Moore's law leads to larger multicore chips, researchers should look for ways to design more powerful cores.

CHAPTER 19 NUMBER SYSTEMS

ANSWERS TO PROBLEMS

19.1 a. 12 b. 3 c. 28 d. 60 e. 42
19.2 a. 28.375 b. 51.59375 c. 682.5
19.3 a. 1000000 b. 1100100 c. 1101111 d. 10010001 e. 11111111
19.4 a. 100010.11 b. 11001.01 c. 11011.0011
19.5 A BAD ADOBE FACADE FADED (Source: [KNUT98])
19.6 a. 12 b. 159 c. 3410 d. 1662 e. 43981
19.7 a. 15.25 b. 211.875 c. 4369.0625 d. 2184.5 e. 3770.75
19.8 a. 10 b. 50 c. A00 d. BB8 e. F424
19.9 a. CC.2 b. FF.E c. 277.4 d. 2710.01
19.10 a. 1110 b. 11100 c. 101001100100 d.
11111.11 e. 1000111001.01

19.11 a. 9.F b. 35.64 c. A7.EC

19.12 1/2^k = 5^k/10^k

19.13 a. 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23, 24
b. 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 20, 21, 22, 23, 24, 25, 30, 31, 32
c. 1, 2, 3, 4, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30, 31, 32, 33, 34, 40
d. 1, 2, 10, 11, 12, 20, 21, 22, 100, 101, 102, 110, 111, 112, 120, 121, 122, 200, 201, 202

19.14 a. 134 b. 105 c. 363 d. 185

19.15 Given the representation of a number x in base n and base n^p, every p digits in the base n representation can be converted to a single base n^p digit. For example, the base 3 representation of decimal 77 is 2212 and the base 9 representation is 85. Thus it is easy to convert between a base n representation and a base n^p representation without the intermediate step of converting to base 10. In other cases, the intermediate step facilitates conversion.

CHAPTER 20 DIGITAL LOGIC

ANSWERS TO PROBLEMS

20.1
A B C   a b c d
0 0 0   1 1 0 0
0 0 1   0 0 0 0
0 1 0   0 0 0 0
0 1 1   0 0 0 1
1 0 0   0 1 0 1
1 0 1   0 0 1 1
1 1 0   0 0 1 0
1 1 1   1 1 0 0

20.2 Recall the commutative law: AB = BA; A + B = B + A
a. A B + CDE + C DE
b. AB + AC
c. (LMN)(AB)(CDE)
d. F(K + R) + SV + W X

20.3 a. F = V'·A'·L'. This is just a generalization of DeMorgan's Theorem, and is easily proved. b. F = A'B'C'D'. Again, a generalization of DeMorgan's Theorem.

20.4 a. A = ST + VW b. A = TUV + Y c. A = F d. A = ST e. A = D + E f. A = YZ(W + X + YZ) = YZ g. A = C

20.5 A XOR B = A'B + AB'

20.6 ABC = NOR(A', B', C')

20.7 Y = NAND(A, B, C, D) = (ABCD)'

20.8 a.
X1 X2 X3 X4   Z1 Z2 Z3 Z4 Z5 Z6 Z7
 0  0  0  0    1  1  1  0  1  1  1
 0  0  0  1    0  0  1  0  0  1  0
 0  0  1  0    1  0  1  1  1  0  1
 0  0  1  1    1  0  1  1  0  1  1
 0  1  0  0    0  1  1  1  0  1  0
 0  1  0  1    1  1  0  1  0  1  1
 0  1  1  0    0  1  0  1  1  1  1
 0  1  1  1    1  0  1  0  0  1  0
 1  0  0  0    1  1  1  1  1  1  1
 1  0  0  1    1  1  1  1  0  1  0
 1  0  1  0    0  0  0  0  0  0  0
 1  0  1  1    0  0  0  0  0  0  0
 1  1  0  0    0  0  0  0  0  0  0
 1  1  0  1    0  0  0  0  0  0  0
 1  1  1  0    0  0  0  0  0  0  0
 1  1  1  1    0  0  0  0  0  0  0

b.
All of the terms have the form illustrated as follows:

   Z5 = X1'X2'X3'X4' + X1'X2'X3X4' + X1'X2X3X4' + X1X2'X3'X4'

c. Whereas the SOP lists all combinations that produce an output of 1, the POS lists all combinations that produce an output of 0. For example,

   Z3 = (X1'X2X3'X4)' (X1'X2X3X4')'

20.9 Label the 8 inputs I0, ..., I7 and the select lines S0, S1, S2. Then

   F = I0 S2'S1'S0' + I1 S2'S1'S0 + I2 S2'S1S0' + I3 S2'S1S0 + I4 S2S1'S0' + I5 S2S1'S0 + I6 S2S1S0' + I7 S2S1S0

20.10 Add a data input line and connect it to the input side of each AND gate.

20.11 Define the input leads as B2, B1, B0 and the output leads as G2, G1, G0. Then

   G2 = B2
   G1 = B2'B1 + B2B1'
   G0 = B1'B0 + B1B0'

20.12 The input is A4A3A2A1A0. Use A2A1A0 as the input to each of the four 3 × 8 decoders. There are a total of 32 outputs from these four 3 × 8 decoders. Use A4A3 as input to a 2 × 4 decoder and have the four outputs go to the enable leads of the four 3 × 8 decoders. The result is that one and only one of the 32 outputs will have a value of 1.

20.13 SUM = A ⊕ B ⊕ C; CARRY = AB ⊕ AC ⊕ BC

20.14 a. The carry to the second stage is available after 20 ns; the carry to the third stage is available 20 ns after that, and so on. When the carry reaches the 32nd stage, another 30 ns are needed to produce the final sum. Thus

   T = 31 × 20 + 30 = 650 ns

b. Each 8-bit adder produces a sum in 30 ns and a carry in 20 ns. Therefore,

   T = 3 × 20 + 30 = 90 ns

20.15 a.
Characteristic table:
Current SR   Current state Qn   Next state Qn+1
    00              0                 —
    00              1                 —
    01              0                 1
    01              1                 1
    10              0                 0
    10              1                 0
    11              0                 0
    11              1                 1

Simplified characteristic table:
S R   Qn+1
0 0    —
0 1    1
1 0    0
1 1    Qn

b.
t    0  1  2  3  4  5  6  7  8  9
S    0  1  1  1  1  1  0  1  0  1
R    1  1  0  1  0  1  1  1  0  0
Qn   0  0  1  1  1  1  0  0  —  1

20.16 (The answer is a circuit diagram: a clocked D flip-flop built from an S–R flip-flop, with the Data line applied to one input directly and to the other through an inverter, gated by the Clock line Ck; outputs Q and Q'.)

20.17 (The answer is a circuit diagram with inputs A, B, and C and outputs O0, O1, O2, O3.)

20.18 a. Use a PLA with 12-bit addresses and 96 8-bit locations. Each of the 96 locations is set to an ASCII code, and a character is converted by simply using its original 12-bit code as an address to the PLA. The content of that address is the required ASCII code. b. Yes.
This would require a 4K × 8 ROM where only 96 of the 4096 locations are actually used.

CHAPTER 21 THE IA-64 ARCHITECTURE

ANSWERS TO QUESTIONS

21.1 I-unit: For integer arithmetic, shift-and-add, logical, compare, and integer multimedia instructions. M-unit: Load and store between register and memory plus some integer ALU operations. B-unit: Branch instructions. F-unit: Floating-point instructions.

21.2 The template field contains information that indicates which instructions can be executed in parallel.

21.3 A stop indicates to the hardware that one or more instructions before the stop may have certain kinds of resource dependencies with one or more instructions after the stop.

21.4 Predication is a technique whereby the compiler determines which instructions may execute in parallel. With predicated execution, every IA-64 instruction includes a reference to a 1-bit predicate register, and executes only if the predicate value is 1 (true).

21.5 Predicates enable the processor to speculatively execute both branches of an if statement and only commit after the condition is determined.

21.6 With control speculation, a load instruction is moved earlier in the program and its original position replaced by a check instruction. The early load saves cycle time; if the load produces an exception, the exception is not activated until the check instruction determines if the load should have been taken.

21.7 Associated with each register is a NaT bit used to track deferred speculative exceptions. If a ld.s detects an exception, it sets the NaT bit associated with the target register. If the corresponding chk.s instruction is executed, and if the NaT bit is set, the chk.s instruction branches to an exception-handling routine.

21.8 With data speculation, a load is moved before a store instruction that might alter the memory location that is the source of the load. A subsequent check is made to assure that the load receives the proper memory value.
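The predicated if/else execution described in the answers to 21.4 and 21.5 can be modeled with a toy interpreter. This is only a sketch of the idea, not real IA-64 encoding: the tuple instruction format, the register names, and the predicate numbering are invented for illustration. Both arms of the if are issued; only the arm whose predicate turned out true takes effect.

```python
# Minimal sketch of predicated execution (not real IA-64 semantics).
# Each instruction carries a predicate-register number and takes effect
# only if that predicate is 1 (true).

def run(instrs, regs, preds):
    """Execute (pred, op, dst, a, b) tuples in order."""
    for pred, op, dst, a, b in instrs:
        if not preds[pred]:              # squashed: guarding predicate is false
            continue
        if op == "cmp.ge":               # sets two complementary predicates
            taken = regs[b[0]] >= b[1]
            preds[dst], preds[a] = taken, not taken
        elif op == "add":
            regs[dst] = regs[a] + b
    return regs, preds

# if (r4 >= 50) r6 += 1; else r5 += 1;  -- both arms issued, one commits
regs = {"r4": 60, "r5": 0, "r6": 0}
preds = {0: True, 1: False, 2: False}    # p0 is hard-wired true
prog = [
    (0, "cmp.ge", 1, 2, ("r4", 50)),     # p1, p2 = (r4 >= 50), !(r4 >= 50)
    (1, "add", "r6", "r6", 1),           # (p1) r6 = r6 + 1
    (2, "add", "r5", "r5", 1),           # (p2) r5 = r5 + 1
]
regs, preds = run(prog, regs, preds)
print(regs["r5"], regs["r6"])            # -> 0 1
```

Because r4 = 60 satisfies the comparison, p1 is set, the (p1) add commits, and the (p2) add is squashed without any branch instruction having been executed.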
21.9 Software pipelining is a technique in which instructions from multiple iterations of a loop are enabled to execute in parallel. Parallelism is achieved by grouping together instructions from different iterations. Hardware pipelining refers to the use of a physical pipeline as part of the hardware.

21.10 Rotating registers are used for software pipelining. During each iteration of a software-pipeline loop, register references within these ranges are automatically incremented. Stacked registers implement a stack.

ANSWERS TO PROBLEMS

21.1 Eight. The operands and result require 7 bits each, and the controlling predicate 6. A major opcode is specified by 4 bits; 38 bits of the 41-bit syllable are committed, leaving 3 bits to specify a suboperation. Source: [MARK00]

21.2 Table 21.3 reveals that any opcode can be interpreted as referring to one of 6 different execution units (M, B, I, L, F, X). So, the potential maximum number of different major opcodes is 2^4 × 6 = 96.

21.3 16

21.4 a. Six cycles. The single floating-point unit is the limiting factor. b. Three cycles.

21.5 The pairing must not exceed a sum of two M or two I slots within the two bundles. For example, two bundles, both with template 00, or two bundles with templates 00 and 01, could not be paired because they require 4 I-units. Source: [EVAN03]

21.6 Yes. On IA-64s with fewer floating-point units, more cycles are needed to dispatch each group. On an IA-64 with two FPUs, each group requires two cycles to dispatch. A machine with three FPUs will dispatch the first three floating-point instructions within a group in one cycle, and the remaining instruction in the next. Source: [MARK00]

21.7
p1            comparison   p2   p3
not present       0         0    1
not present       1         1    0
    0             0         0    0
    0             1         0    0
    1             0         0    1
    1             1         1    0

21.8 a. (3) and (4); (5) and (6)
b. The IA-64 template field gives a great deal of flexibility, so that many combinations are possible.
One obvious combination would be (1), (2), and (3) in the first instruction; (4), (5), and (6) in the second instruction; and (7) in the third instruction.

21.9 Branching to label error should occur if and only if at least one of the 8 bytes in register r16 contains a non-digit ASCII code. So the comments are not inaccurate but are not as helpful as they could be. Source: [EVAN03]

21.10 a.
     mov r1, 0
     mov r2, 0
     ld r3, addr(A)
L1:  ld r4, mem(r3+r2)
     bge r4, 50, L2
     add r5, r5, 1
     jump L3
L2:  add r6, r6, 1
L3:  add r1, r1, 1
     add r2, r2, 4
     blt r1, 100, L1

b.
     mov r1, 0
     mov r2, 0
     ld r3, addr(A)
L1:  ld r4, mem(r3+r2)
     cmp.ge p1, p2 = r4, 50
(p2) add r5 = 1, r5
(p1) add r6 = 1, r6
     add r1 = 1, r1
     add r2 = 4, r2
     blt r1, 100, L1

21.11 a.
     fmpy t = p, q     // floating-point multiply
     ldf.a c = [rj];;  // advanced floating-point load: load the value stored in the
                       // location specified by the address in register rj; place the
                       // value in floating-point register c (assume rj points to a[j])
     stf [ri] = t;;    // store the value in floating-point register t in the location
                       // specified by the address in register ri (assume ri points to a[i])
     ldf.c c = [rj];;  // executes only if ri = rj

If the advanced load succeeded, the ldf.c will complete in one cycle, and c can be used in the following instruction. The effective latency of the ldf.a instruction has been reduced by the latency of the floating-point multiplication. The stf and ldf.c cannot be in the same instruction group, because there may be a read-after-write dependency.

b.
     fmpy t = p, q
     cmp.ne p8, p9 = ri, rj;;
(p8) ldf c = [rj];;    // p8 ⇒ no conflict
     stf [ri] = t;;    // if ri = rj, then c = t
(p9) mov c = t;;

c. In the predicated version, the load begins one cycle later than with the advanced load. Also, two predicate registers are required. Source: [MARK00]

21.12 a. The number of output registers is SOO = SOF – SOL = 48 – 16 = 32
b.
Because the stacked register group starts at r32, the local register and output register groups consist of:

Local register group: r32 through r47
Output register group: r48 through r63

Source: [TRIE01]

APPENDIX B ASSEMBLY LANGUAGE AND RELATED TOPICS

ANSWERS TO QUESTIONS

B.1 1. It clarifies the execution of instructions. 2. It shows how data is represented in memory. 3. It shows how a program interacts with the operating system, processor, and the I/O system. 4. It clarifies how a program accesses external devices. 5. Understanding assembly language makes students better high-level language (HLL) programmers, by giving them a better idea of the target language that the HLL must be translated into.

B.2 Assembly language is a programming language that is one step away from machine language. Assembly language includes symbolic names for locations. It also includes directives and macros.

B.3 1. Development time. Writing code in assembly language takes much longer than in a high-level language. 2. Reliability and security. It is easy to make errors in assembly code. The assembler does not check whether the calling conventions and register save conventions are obeyed. Nobody checks for you whether the number of PUSH and POP instructions is the same in all possible branches and paths. There are so many possibilities for hidden errors in assembly code that it affects the reliability and security of the project unless you have a very systematic approach to testing and verifying. 3. Debugging and verifying. Assembly code is more difficult to debug and verify because there are more possibilities for errors than in high-level code. 4. Maintainability. Assembly code is more difficult to modify and maintain because the language allows unstructured spaghetti code and all kinds of dirty tricks that are difficult for others to understand. Thorough documentation and a consistent programming style are needed. 5. Portability.
Assembly code is very platform-specific. Porting to a different platform is difficult. 6. System code can use intrinsic functions instead of assembly. The best modern C++ compilers have intrinsic functions for accessing system control registers and other system instructions. Assembly code is no longer needed for device drivers and other system code when intrinsic functions are available. 7. Application code can use intrinsic functions or vector classes instead of assembly. The best modern C++ compilers have intrinsic functions for vector operations and other special instructions that previously required assembly programming. 8. Compilers have been improved a lot in recent years. The best compilers are now quite good. It takes a lot of expertise and experience to optimize better than the best C++ compiler.

B.4 1. Debugging and verifying. Looking at compiler-generated assembly code or the disassembly window in a debugger is useful for finding errors and for checking how well a compiler optimizes a particular piece of code. 2. Making compilers. Understanding assembly coding techniques is necessary for making compilers, debuggers and other development tools. 3. Embedded systems. Small embedded systems have fewer resources than PCs and mainframes. Assembly programming can be necessary for optimizing code for speed or size in small embedded systems. 4. Hardware drivers and system code. Accessing hardware, system control registers, etc. may sometimes be difficult or impossible with high-level code. 5. Accessing instructions that are not accessible from high-level language. Certain assembly instructions have no high-level language equivalent. 6. Self-modifying code. Self-modifying code is generally not profitable because it interferes with efficient code caching. It may, however, be advantageous for example to include a small compiler in math programs where a user-defined function has to be calculated many times. 7. Optimizing code for size.
Storage space and memory are so cheap nowadays that it is not worth the effort to use assembly language for reducing code size. However, cache size is still such a critical resource that it may be useful in some cases to optimize a critical piece of code for size in order to make it fit into the code cache. 8. Optimizing code for speed. Modern C++ compilers generally optimize code quite well in most cases. But there are still many cases where compilers perform poorly and where dramatic increases in speed can be achieved by careful assembly programming. 9. Function libraries. The total benefit of optimizing code is higher in function libraries that are used by many programmers. 10. Making function libraries compatible with multiple compilers and operating systems. It is possible to make library functions with multiple entries that are compatible with different compilers and different operating systems. This requires assembly programming.

B.5 label, mnemonic, operand, and comment

B.6 Instructions: symbolic representations of machine language instructions. Directives: instructions to the assembler to perform specified actions during the assembly process. Macro definitions: A macro definition is a section of code that the programmer writes once, and then can use many times. When the assembler encounters a macro call, it replaces the macro call with the macro itself. Comment: A statement consisting entirely of a comment.

B.7 A two-pass assembler takes a first pass through the assembly program to construct a symbol table that contains a list of all labels and their associated location counter values. It then takes a second pass to translate the assembly program into object code. A one-pass assembler combines both operations in a single pass, and resolves forward references on the fly.

ANSWERS TO PROBLEMS

B.1 a.
When it executes, this instruction copies itself to the next location and the program counter is incremented, thus pointing to the instruction just copied. Thus, Imp marches through the entire memory, placing a copy of itself in each location, and wiping out any rival program.

b. Dwarf "bombs" the core at regularly spaced locations with DATAs, while making sure it won't hit itself. The ADD instruction adds the immediate value 4 to the contents of the location 3 locations down, which is the DATA location. So the DATA location now has the value 4. Next, the COPY instruction copies the location 2 locations down, which is the DATA location, to the address contained in that location, which is a 4, so the COPY goes to the relative location 4 words down from the DATA location. Then we jump back to the ADD instruction, which adds 4 to the DATA location, bringing the value to 8. This process continues, so that data is written out in every fourth location. When memory wraps around, the data writes will miss the first three lines of Dwarf, so that Dwarf can continue indefinitely. We assume that the memory size is divisible by 4.

c.
Loop      ADD #4, MemoryPtr
          COPY 2, @MemoryPtr
          JUMP Loop
MemoryPtr DATA 0

B.2 The barrage of data laid down by Dwarf moves through the memory array faster than Imp moves, but it does not necessarily follow that Dwarf has the advantage. The question is: Will Dwarf hit Imp even if the barrage does catch up? If Imp reaches Dwarf first, Imp will in all probability plow right through Dwarf's code. When Dwarf's JUMP –2 instruction transfers execution back two steps, the instruction found there will be Imp's COPY 0, 1. As a result Dwarf will be subverted and become a second Imp endlessly chasing the first one around the array. Under the rules of Core War the battle is a draw. (Note that this is the outcome to be expected "in all probability." Students are invited to analyze other possibilities and perhaps discover the bizarre result of one of them.)
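The Imp behavior described in part a can be checked with a toy simulation. This is only a sketch under simplifying assumptions: a small circular core, one instruction per cell, and Imp alone in the core; the real Core War/MARS rules are richer.

```python
# Toy model of Imp (COPY 0, 1): each turn it copies the instruction at
# the program counter to the next cell, then advances to that cell.
CORE_SIZE = 16
IMP = "COPY 0, 1"

core = ["DATA 0"] * CORE_SIZE
core[0] = IMP
pc = 0
for _ in range(CORE_SIZE):                 # let Imp take CORE_SIZE turns
    if core[pc] == IMP:
        core[(pc + 1) % CORE_SIZE] = IMP   # copy self to the next location
    pc = (pc + 1) % CORE_SIZE              # pc now points at the copy

print(all(cell == IMP for cell in core))   # -> True
```

After one full circuit of the (wrapping) core, every location holds a copy of Imp, which is exactly the "marches through the entire memory" behavior the answer describes.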
B.3
Loop      COPY #0, @MemoryPtr
          ADD #1, MemoryPtr
          JUMP Loop
MemoryPtr DATA 0

B.4 This program (call it P) is intended to thwart Imp, by overwriting location Loop – 1, thus terminating the march of Imp from lower memory. However, timing is critical. Suppose Imp is currently located at Loop – 2 and P has just executed the JUMP instruction. If it is now P's turn to execute, we have the following sequence: 1. P executes the COPY instruction, placing a 0 in Loop – 1. 2. Imp copies itself to location Loop – 1. 3. P executes the JUMP instruction, setting its local program counter to Loop. 4. Imp copies itself to location Loop. 5. P executes the Imp instruction at Loop. The P program has been wiped out. On the other hand, suppose that Imp is currently located at Loop – 2; P has just executed the JUMP instruction; and it is now Imp's turn to execute. We have the following sequence: 1. Imp copies itself to location Loop – 1. 2. P executes the COPY instruction, placing a 0 in Loop – 1. 3. Imp attempts to execute at location Loop – 1, but there is only a null instruction there. Imp has been wiped out.

B.5 a. CF = 0 b. CF = 1

B.6 If there is no overflow, then the difference will have the correct value and must be non-negative. Thus, SF = OF = 0. However, if there is an overflow, the difference will not have the correct value (and in fact will be negative). Thus, SF = OF = 1.
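The flag reasoning in B.5 and B.6 can be checked with a small model of x86-style flag setting for SUB/CMP. The 8-bit width and the helper name are chosen for illustration only; the flag definitions (CF as borrow out, SF as the result's sign bit, OF as signed overflow) follow the usual x86 rules.

```python
# Sketch: compute (CF, ZF, SF, OF) for an 8-bit subtraction a - b,
# the operation performed by SUB and (without storing) by CMP.
def sub_flags(a, b):
    diff = (a - b) & 0xFF
    cf = int((a & 0xFF) < (b & 0xFF))    # borrow out of the subtraction
    zf = int(diff == 0)
    sf = diff >> 7                       # sign bit of the 8-bit result
    sa, sb = (a >> 7) & 1, (b >> 7) & 1
    # signed overflow: operand signs differ and the result's sign
    # differs from the minuend's sign
    of = int(sa != sb and sf != sa)
    return cf, zf, sf, of

# B.6, no overflow: difference correct and non-negative, SF = OF = 0
print(sub_flags(5, 3))        # -> (0, 0, 0, 0)
# B.6, overflow: 127 - (-2) wraps negative, SF = OF = 1
print(sub_flags(127, 0xFE))   # -> (1, 0, 1, 1)
```

The second call passes -2 as its 8-bit two's complement pattern 0xFE; the wrapped result 0x81 is negative, so SF and OF are both set, matching the answer to B.6.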
B.7 jmp next

B.8
avg: resd 1   ; integer average
i1:  dd 20    ; first number in the average
i2:  dd 13    ; second number in the average
i3:  dd 82    ; third number in the average
main:
     mov avg, i1
     add avg, i2
     add avg, i3
     idiv avg, 3    ; get integer average

B.9
     cmp eax, 0     ; sets ZF if eax = 0
     je thenblock   ; if ZF set, branch to thenblock
     mov ebx, 2     ; ELSE part of IF statement
     jmp next       ; jump over THEN part of IF
thenblock:
     mov ebx, 1     ; THEN part of IF
next:

B.10 msglen is assigned the constant 12

B.11
V1:  resw 1    ; values must be assigned
V2:  resw 1    ; before program starts
V3:  resw 1
main:
     mov ax, V1   ; load V1 for testing
     cmp ax, V2   ; if ax <= V2 then
     jbe L1       ; jump to L1
     mov ax, V2   ; else move V2 to ax
L1:  cmp ax, V3   ; if ax <= V3 then
     jbe L2       ; jump to L2
     mov ax, V3   ; else move V3 to ax
L2:

B.12 The compare instruction subtracts the second argument from the first argument, but does not store the result; it only sets the status flags. The effect of this instruction is to copy the zero flag to the carry flag. That is, the value of CF after the cmp instruction is equal to the value of ZF just before the instruction.

B.13 a.
push ax
push bx
pop ax
pop bx

b.
xor ax, bx
xor bx, ax
xor ax, bx

B.14
IF X=A AND Y=B THEN
   { do something }
ELSE
   { do something else }
END IF

B.15 a. The algorithm makes repeated use of the equation gcd(a, b) = gcd(b, a mod b) and begins by assuming a ≥ b. By definition, if both a and b are 0, then the gcd is 1. Also by definition, if b = 0, then gcd = a. The remainder of the C program implements the repeated application of the mod operator.

b.
gcd: mov ebx,eax
     mov eax,edx
     test ebx,ebx   ; bitwise AND to set CC bits
     jne L1         ; jump if ebx not equal to 0
     test edx,edx
     jne L1
     mov eax,1
     ret            ; return value in eax
L1:  test eax,eax
     jne L2
     mov eax,ebx
     ret
L2:  test ebx,ebx
     je L5          ; jump if ebx equal to 0
L3:  cmp ebx,eax
     je L5          ; jump if ebx = eax
     jae L4         ; jump if ebx above/equal eax
     sub eax,ebx
     jmp L3
L4:  sub ebx,eax
     jmp L3
L5:  ret

c.
gcd: neg eax        ; take twos complement of eax
     je L3          ; jump if eax equal to 0
L1:  neg eax
     xchg eax,edx   ; exchange contents of eax and edx
L2:  sub eax,edx
     jg L2          ; jump if eax greater than edx
     jne L1         ; jump if eax not equal to edx
L3:  add eax,edx
     jne L4
     inc eax
L4:  ret

B.16 a. The reason is that instructions are assembled in pass 2, where all the symbols are already in the symbol table; certain directives, however, are executed in pass 1, where future symbols have not been found yet. Thus pass 1 directives cannot use future symbols. b. The simplest way is to add another pass. The directive 'A EQU B+1' can be handled in three passes. In the first pass, label A cannot be defined, since label B is not yet in the symbol table. However, later in the same pass, B is found and is stored in the symbol table. In the second pass label A can be defined and, in the third pass, the program can be assembled. This, of course, is not a general solution, since it is possible to nest future symbols very deep. Imagine something like:

A EQU B
-
B EQU C
-
C EQU D
-
-
D -

Such a program requires four passes just to collect all the symbol definitions, followed by another pass to assemble instructions. Generally one could design a percolative assembler that would perform as many passes as necessary, until no more future symbols remain. This may be a nice theoretical concept but its practical value is nil. Cases such as 'A EQU B', where B is a future symbol, are not important and can be considered invalid.

B.17 It is executed in pass 1 since it affects the symbol table. It is executed by evaluating and comparing the expressions in the operand field.
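The two-pass structure described in B.7 and B.16 can be sketched as follows. The three-instruction toy language, its "one word per instruction" sizing, and the label syntax are invented purely for illustration; a real assembler also handles operand encodings, directives, and expressions.

```python
# Sketch of a two-pass assembler: pass 1 builds the symbol table so that
# forward references (like "jmp end" below) resolve; pass 2 translates.
def assemble(lines):
    # Pass 1: record the location-counter value of every label.
    symtab, lc = {}, 0
    for line in lines:
        if line.endswith(":"):
            symtab[line[:-1]] = lc
        else:
            lc += 1                      # assume one word per instruction
    # Pass 2: translate, resolving label operands via the symbol table.
    code = []
    for line in lines:
        if line.endswith(":"):
            continue
        op, _, arg = line.partition(" ")
        code.append((op, symtab.get(arg, arg)))
    return symtab, code

symtab, code = assemble(["jmp end", "top:", "nop", "end:", "halt"])
print(symtab)     # -> {'top': 1, 'end': 2}
print(code[0])    # -> ('jmp', 2)
```

The forward reference in the first line is the point: a one-pass assembler would have to remember the unresolved use of `end` and patch it later, which is exactly the "resolves forward references on the fly" behavior mentioned in B.7.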
