Printed Pages: 4 TCS-802 (Following Paper ID and Roll No. to be filled in your Answer Book) PAPER ID: 0148 Roll No. B. Tech. ## (SEM. VIII) EXAMINATION, 2007-08 ADVANCE COMPUTER ARCHITECTURE Time: 3 Hours] [Total Marks: 100 - **Notes**: (1) Attempt all questions. - All question carry equal marks. (2) - Be precise in your answer. (3) - No second Answer book will be provided. - Attempt any two parts of the following: - Explain how instruction set, compiler technology, CPU (a) implementation and control and memory hierarchy affect the CPU performance and justify the effects in term of program length, clock rate and effective CPI. - Distinguish between the following: (b) - Medium-grain and fine-grain multicomputers in (i) their architectures and programming requirements. - Single threaded and multi-threaded processor (ii) architectures. - (c) Consider a shared bus parallel computer built using 32 bit RISC processors running at 150 MHz with CPI-1. Assume that 15% of the instructions are loads and 10% are stores. Assume 0.95 hit ratio to cache for read and write through caches. The bandwidth of the bus is 2 GB / sec. - How many processors can the bus support without getting saturated? U-0148] [Contd... - (ii) If caches are not there, how many processors can the bus support assuming the main memory is as fas as the cache? - 2 Attempt any two parts of the following: - (a) Discuss data and resource dependences to exploit implicit parallelism in the program. Analyze the following instructions sequence: S1 : Load $R_1$ , 1024 / $R_1 \leftarrow 10241$ / S2 : Load $R_2$ , M(10) / $R_2 \leftarrow$ Mem (10) / S3 : add $R_1$ , $R_2 / R_1 \leftarrow R_1 + R_2$ S4: store M (1024), $R_1/Mem$ ((1024)) $\leftarrow$ ( $R_1$ ) S5 : store M ((R<sub>2</sub>)), 1024 / Memo (R<sub>2</sub>) $\leftarrow$ 1024)/ Memory (10) constains 64 initially. - (i) Draw a dependence graph to show all the dependences. - (ii) Are there any resource dependences? If only one copy of functional unit is available in the CPU. - (b) Explain the cache address mapping techniques. Consider a cache memory (M<sub>1</sub>) and memory (M<sub>2</sub>) hierarchy with, the following characteristics: M<sub>1</sub>: 16 k words, 50 ns access time M<sub>2</sub>: 1 m words, 400 m access time. Assum A words cache blocks and a set size of 256 words with set anociative mapping. - (i) Show the mapping between $M_2$ and $M_1$ . - (ii) Calculat the effective memory access time with a cache hit ratio G = 0.95. - (c) Give the block diagram for a pipelined floating point adder. Assume that exponent matching takes 0.1 mec., mantissa alignment 0.2 n sec., adding mantissas 1.0 n sec and normalizing result 0.2 n sec. What will be the highest clock speed which can be used to drive the adder'. If two vectors of 100 components are to be added using this adder what will be the time of addition. Attempt any two parts of the following: (a) Consider the five-stage pipelined processor specified by the following reservation table: | | 1 | 2 | 3 | 4 | 5 | 6 | |-------|------|------|---------|---|------|------------------| | $S_1$ | X | in i | , hj | | | $\boldsymbol{X}$ | | $S_2$ | | X | 14-1 | | X | - | | $S_3$ | gon; | | X | | 33.5 | | | $S_4$ | | | neto L. | X | - | | | $S_5$ | | X | | | | X | - (i) List the set of forbidden patencies and the collision vector. - (ii) Draw the transition diagram showing all possible initial sequences without causing collision in the pipeline. - (iii) Identyfy simple cycles, greedy cycles and MAL (minimum average latency). - (iv) What will be the maximum through put of this pipeline. - (b) Discuss the superscalar and superpipelined processing. Also estimate the performance of superpiplined superscalar processor of degree (m, n). - (c) Give the difference between a thread, a trace and a process. Also explain how simultaneous multithreading is superior to multithreading (blocked and interleaved). What extra processor resources are required to support simultaneous multihreading? Attempt any two parts of the following: (a) Discuss the matrix multiplecation on a Mesh. Give the algorithem that uses n × n processors arranged in a mesh contiguration. Also find the time complexity of the algerithm. - (b) Give the PRAM algorithm for solving a first order linear recurrence - $x_i = a_i \ x_{i-1} + d_i$ for i = 1, 2...n, where the value of $x_0$ , $a_1$ , $a_2$ ... $a_n$ and $d_1$ , $d_2$ ... $d_n$ are given. Assume $x_0 = 0$ and $a_1 = 0$ . - (c) Explain the Bidirectional Gaussian elemination for solving a set of linear algebraic equation. - 5 Attempt any two parts of the following: - (a) Consider the following double toop ( $L_1 \& L_2$ ) $L_1:$ for (2 = 0; i<4, 1++) { $L_2:$ for (J = 0; J<4; J++) S = A [L+1] [J] = B[i] [J] + C [i] [J], T: B [i] [J+1] = A [2] [J+1] + 1 U: D (i, j) = B [i] [J-2] 2} - (i) Find the dependence caused by all possible pairs of variables. Identify the type of each dependences and find the its distance and direction vector. - (ii) Draw the statement and iteration dependence graph for the loop. - (b) Explain the following terms associated with fast and efficient synchronization schemes on a shared-memory multiprocessor: - (i) Bersy-wait verses sleep-wait protocols for sole access of a critical section. - (ii) Lock mechanisms for pre-synchronization to achieve sole access to a critical section. - (iii) Post synchronization method. - (c) Write short notes on any two of the following: - (i) Combined parallel work-sharing construct - (ii) Run-time library routines - (iii) Parallel execution environment routines.