Solution to HW2 of CMSC611




1.   Assume that the total number of instructions is 100.

                   Data reference : percentage = 35%, total = 35 instructions
                   Branch         : percentage = 19%, total = 19 instructions
                   ALU operations : percentage = 46%, total = 46 instructions

          a. Allowed offset lengths are 0, 8, and 16 bits (including the sign bit).
                For data references:
                                                17% * 35 =  5.95    -----------> offset of 0
                                                40% * 35 = 14.00    -----------> offset of 8
                                                43% * 35 = 15.05    -----------> offset of 16
                For branch instructions:
                                                no instructions have an offset of 0.
                                                93% * 19 = 17.67    -----------> offset of 8
                                                 7% * 19 =  1.33    -----------> offset of 16
                For ALU instructions:
                                                all 46 instructions have an offset of 0, so they are all 16-bit instructions.

               # bits = (46 + 5.95)*16 + (14 + 17.67)*24 + (15.05 + 1.33)*32 = 2115.4
                # bytes = 2115.4 / 8 = 264.4

                # bits / instruction = 2115.4 / 100 = 21.15

       b.  Since all instructions now have a fixed length of 24 bits, the instructions with 0- and 8-bit offsets do not change,
           but the 32-bit instructions (those with a 16-bit offset) must be split into two 24-bit instructions.

                # bits = (46 + 5.95)*24 + (14 + 17.67)*24 + (15.05 + 1.33)*2*24 = 2793.1
                # bytes = 2793.1 / 8 = 349.1
                Comparing (b) with (a):  349.1 / 264.4 = 1.32

        c.   The instructions are assumed to have a fixed offset field of 16 bits, so every instruction that previously
              used an 8-bit offset now spends 16 bits on it.

                # bits = (46 + 5.95)*16 + (14 + 17.67)*32 + (15.05 + 1.33)*32 = 2368.8
                # bytes = 2368.8 / 8 = 296.1

                Comparing (c) with (b):  (c) / (b) = 296.1 / 349.1 = 0.848
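
                A quick Python check of parts (a)-(c) (a minimal sketch; the instruction
                counts and offset distributions are the ones assumed above):

                    # instruction mix out of 100 instructions
                    data_ref, branch, alu = 35, 19, 46
                    off0  = 0.17*data_ref + alu           # 16-bit instructions (0-bit offset)
                    off8  = 0.40*data_ref + 0.93*branch   # 24-bit instructions (8-bit offset)
                    off16 = 0.43*data_ref + 0.07*branch   # 32-bit instructions (16-bit offset)

                    bits_a = off0*16 + off8*24 + off16*32     # (a) variable-length encoding
                    bits_b = off0*24 + off8*24 + off16*2*24   # (b) fixed 24-bit instructions
                    bits_c = off0*16 + off8*32 + off16*32     # (c) offsets padded to 16 bits

                    print(bits_a/8, bits_b/8, bits_c/8)   # ~264.4, ~349.1, ~296.1 bytes
                    print(bits_b/bits_a, bits_c/bits_b)   # ~1.32, ~0.85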
 
 

2. Let the total instruction count be IC.
    CPU time = IC * CPI * clock cycle

    The fraction of ALU instructions is:
    ALU = 23.8% + 8.3% + 1.3% + 7.0% + 9.4% + 4.8% + 2.0% = 56.6%

    The fraction of jump instructions is:
    Jumps = 0.5% + 0.4% + 0.5% = 1.4%

    Conditional branches (taken)     = 15% * 60% = 9%
    Conditional branches (not taken) = 15% * 40% = 6%

    so, CPUtime(old) = IC * (56.6%*0.8 + 1.4%*1.6 + 20.9%*1.7 + 5.1%*1.0
                   + 9%*2.2 + 6%*0.6) * clock cycle
                 = 1.1155 * IC / 600

    Assume a fraction x of the load & store instructions must use the update addressing
    mode for the CPU speed to remain the same. Each use of the update addressing mode
    eliminates one add or subtract (CPI 0.8), so if both machines run the same program
    they should take the same amount of time. Loads and stores make up 26% of the
    instruction count (20.9% + 5.1%), so:

    CPI_orig / 600 = (CPI_orig - 0.8 * 26% * x) / 550
    1.1155 / 600 = (1.1155 - 0.8 * 26% * x) / 550
    x = 44.69%
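
    A small Python check of these numbers (a sketch only; the 600 and 550 MHz clock
    rates and the per-class CPIs are those given in the problem):

        cpi_old = 0.566*0.8 + 0.014*1.6 + 0.209*1.7 + 0.051*1.0 + 0.09*2.2 + 0.06*0.6
        # x = fraction of loads/stores (26% of all instructions) that must use the
        # update mode, each use saving one add/subtract with CPI 0.8
        x = (cpi_old - cpi_old*550/600) / (0.8*0.26)
        print(cpi_old, x)   # ~1.1155, ~0.447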
 

3. A = X[B] + Y[B]
    E = Y[B] + Z[B]
    (Here d is taken to be the array element size, so the byte offset of element B is B*d.)

    Implementation for accumulator:
                            ---comments---
    load B
    mul d
    store tmp      ; tmp = B*d (offset of element B)
    add X
    store oper1    ; oper1 = address of X[B]
    load Y
    add tmp
    store oper2    ; oper2 = address of Y[B]
    load (oper1)
    add (oper2)
    store A        ; A = X[B] + Y[B]

    load Z
    add tmp
    store oper3    ; oper3 = address of Z[B]
    load (oper2)   ; accumulator = Y[B]
    add (oper3)    ; accumulator = Y[B] + Z[B]
    store E        ; E = Y[B] + Z[B]

    Instruction bytes fetched = 16*3 + 5 = 53 bytes
    Memory data bytes = 16*4 = 64 bytes
    -----------
    Implementation for stack:

    push B
    push d
    mul                ; top of stack = B*d
    pop tmp            ; tmp = B*d
    push tmp
    push X
    add
    pop addrX          ; addrX = address of X[B]
    push tmp
    push Y
    add
    pop addrY          ; addrY = address of Y[B]
    push (addrY)
    push (addrX)
    add
    pop A              ; A = X[B] + Y[B]

    push tmp
    push Z
    add
    pop addrZ          ; addrZ = address of Z[B]
    push (addrZ)
    push (addrY)
    add
    pop E              ; E = Y[B] + Z[B]

    Instruction bytes fetched =3*17+6+5=62 bytes
    Memory data bytes = 4*17=68 bytes
    ----------------------
    Implementation for load-store:

    load R1,B
    load R2,d
    mul R3,R1,R2     ; R3 = B*d (offset)
    load R4,X
    add R5,R3,R4     ; R5 = address of X[B]
    load R6,Y
    add R7,R3,R6     ; R7 = address of Y[B]
    load R8,(R5)     ; R8 = X[B]
    load R9,(R7)     ; R9 = Y[B]
    add R10,R8,R9    ; R10 = X[B] + Y[B] = A
    load R11,Z
    add R12,R11,R3   ; R12 = address of Z[B]
    load R13,(R12)   ; R13 = Z[B]
    add R14,R13,R9   ; R14 = Y[B] + Z[B] = E
    store A,R10
    store E,R14

    Instruction bytes fetched = 3*2+6*4+1*6+5*3=51 bytes
    Memory data bytes = 6*4+1*0+3*4+5*0=36 bytes
    -----------------------
    Implementation for memory-memory:

    mul dis,B,d               ; dis = B*d (offset)
    add addrX,X,dis           ; addrX = address of X[B]
    add addrY,Y,dis           ; addrY = address of Y[B]
    add A,(addrX),(addrY)     ; A = X[B] + Y[B]
    add addrZ,Z,dis           ; addrZ = address of Z[B]
    add E,(addrY),(addrZ)     ; E = Y[B] + Z[B]

    Instruction bytes fetched = 5*7 + 9 = 44 bytes
    Memory data bytes transferred = 17*4 = 68 bytes
    Since the memory-memory architecture fetches the fewest instruction bytes, it is the
    most efficient measured by code size.  The load-store architecture has the smallest
    combined total of instruction bytes and data bytes, so it is the most efficient in
    total memory bandwidth.
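
    A short Python tally of the byte counts computed above (the numbers are taken
    directly from this solution):

        totals = {                       # (instruction bytes, memory data bytes)
            "accumulator":   (53, 64),
            "stack":         (62, 68),
            "load-store":    (51, 36),
            "memory-memory": (44, 68),
        }
        for arch, (code, data) in totals.items():
            print(arch, code, data, code + data)   # last column = total memory traffic
        # smallest code size: memory-memory (44); smallest total traffic: load-store (87)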
 

4.  a. Effective CPI for compress:
            ALU instructions: add, sub, compare, load imm, shift, and, or, other
                     14.4% + 1.8% + 15.4% + 8.1% + 6.5% + 2.1% + 6.0% + 1.0% = 55.3%
            CPI = 0.8*55.3% + 1.7*19.8% + 1.0*5.6% + 2.2*60%*17.4% + 0.6*40%*17.4%
                       + 1.6*(1.5% + 0.1% + 0.1%)
                    = 1.134

        b. Effective CPI for hydro2d:
            ALU instructions:
                       10.9% + 0.2% + 1.2% + 0.2% + 2.4% + 0.1% + 5.2% = 20.2%
            CPI= 0.8*20.2% + 1.7*(0.1%+24.1%) + 2.2*60%*11.7% + 0.6*40%*11.7%
                      + 2.1*(3.6%+7.9%+10.4%) + 2.5*9.4% + 4.8*1.6% + 5.5*0.2% + 1.0*9.9%
                   = 1.637
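
            A quick Python re-computation of the effective CPIs in (a) and (b) (a minimal
            sketch; the frequencies and per-class CPIs are the ones quoted above from the figure):

                cpi_compress = (0.8*0.553 + 1.7*0.198 + 1.0*0.056
                                + 2.2*0.60*0.174 + 0.6*0.40*0.174
                                + 1.6*(0.015 + 0.001 + 0.001))
                cpi_hydro2d  = (0.8*0.202 + 1.7*(0.001 + 0.241)
                                + 2.2*0.60*0.117 + 0.6*0.40*0.117
                                + 2.1*(0.036 + 0.079 + 0.104)
                                + 2.5*0.094 + 4.8*0.016 + 5.5*0.002 + 1.0*0.099)
                print(round(cpi_compress, 3), round(cpi_hydro2d, 3))   # ~1.134, ~1.637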
        c. When the FP operations in hydro2d are replaced by integer instructions, we can see
            from the figure that the CPI of integer ALU operations (0.8) is much less than that
            of the FP operations (2.1, 2.5, 4.8). We can therefore expect the average CPI to be
            smaller after the replacement, although the instruction count (and hence the CPU
            time) will grow, since each FP operation must be emulated by a sequence of integer
            instructions.
 

5.         From the assembly code we get the following distribution of instructions,
     where N is the number of times the Dhrystone benchmark runs.

        Load            84*N + 174
        Store           67*N + 36
        Add             70*N + 8
        Sub             14*N + 2
        Mul             1*N
        Div             1*N
        Compare         11*N
        Load imm        27*N + 20
        Cond branch     21*N + 3
        Jump            27*N + 69
        Call
        Return, jmp ind
        Shift           15*N + 1
        And             6*N
        Or              1*N
        Others          24*N + 45

     When N = 1, we get
        Load            84*1 + 174      35.49%
        Store           67*1 + 36       14.17%
        Add             70*1 + 8        10.73%
        Sub             14*1 + 2        2.2%
        Mul             1*1             0.138%
        Div             1*1             0.138%
        Compare         11*1            1.51%
        Load imm        27*1 + 20       6.46%
        Cond branch     21*1 + 3        3.3%
        Jump            27*1 + 69       13.20%
        Call
        Return, jmp ind
        Shift           15*1 + 1        2.2%
        And             6*1             0.825%
        Or              1*1             0.138%
        Others          24*1 + 45       9.49%

     When N = 100, we get
        Load            84*100 + 174    23.01%
        Store           67*100 + 36     18.08%
        Add             70*100 + 8      18.81%
        Sub             14*100 + 2      3.76%
        Mul             1*100           0.268%
        Div             1*100           0.268%
        Compare         11*100          2.95%
        Load imm        27*100 + 20     7.3%
        Cond branch     21*100 + 3      5.64%
        Jump            27*100 + 69     7.43%
        Call
        Return, jmp ind
        Shift           15*100 + 1      4.03%
        And             6*100           1.61%
        Or              1*100           0.268%
        Others          24*100 + 45     6.56%

     When N = 1000, we get
        Load            84*1000 + 174    22.79%
        Store           67*1000 + 36     18.15%
        Add             70*1000 + 8      18.95%
        Sub             14*1000 + 2      3.79%
        Mul             1*1000           0.271%
        Div             1*1000           0.271%
        Compare         11*1000          2.98%
        Load imm        27*1000 + 20     7.32%
        Cond branch     21*1000 + 3      5.69%
        Jump            27*1000 + 69     7.33%
        Call
        Return, jmp ind
        Shift           15*1000 + 1      4.06%
        And             6*1000           1.62%
        Or              1*1000           0.271%
        Others          24*1000 + 45     6.51%

     When N = 10000, we get
        Load            84*10000 + 174    22.77%
        Store           67*10000 + 36     18.16%
        Add             70*10000 + 8      18.97%
        Sub             14*10000 + 2      3.8%
        Mul             1*10000           0.271%
        Div             1*10000           0.271%
        Compare         11*10000          2.98%
        Load imm        27*10000 + 20     7.32%
        Cond branch     21*10000 + 3      5.69%
        Jump            27*10000 + 69     7.32%
        Call
        Return, jmp ind
        Shift           15*10000 + 1      4.06%
        And             6*10000           1.63%
        Or              1*10000           0.271%
        Others          24*10000 + 45     6.50%

        From the above we see that, once N is large enough, the percentage of each
     instruction class is almost constant, because the fixed startup counts become
     negligible compared with the per-iteration counts. Comparing these frequencies
     with the instruction frequencies in Figure 2.26, no single program matches the
     Dhrystone benchmark very well, though gcc comes closest. The reason is that the
     most significant contributors in the Dhrystone benchmark (as in gcc) are the
     Load, Add, and Store instructions. Dhrystone is evidently an integer-computation
     benchmark.
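
     A small Python sketch that reproduces the percentages above from the per-iteration
     counts (the coefficients and constants are the ones tabulated above):

         mix = {  # instruction class: (count per iteration N, fixed count)
             "Load": (84, 174), "Store": (67, 36), "Add": (70, 8), "Sub": (14, 2),
             "Mul": (1, 0), "Div": (1, 0), "Compare": (11, 0), "Load imm": (27, 20),
             "Cond branch": (21, 3), "Jump": (27, 69), "Shift": (15, 1),
             "And": (6, 0), "Or": (1, 0), "Others": (24, 45),
         }
         for N in (1, 100, 1000, 10000):
             total = sum(a*N + b for a, b in mix.values())
             print(N, {k: round(100*(a*N + b)/total, 2) for k, (a, b) in mix.items()})
         # as N grows the constants become negligible, so each percentage converges
         # to a / sum of all a's -- e.g. Load -> 84/369 = 22.8%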