The prototype is optimized to do multiway branching every cycle,
without branch stalls. Here is an example: a C code fragment for
finding the maximum of an array:
for(i=0; i < N; i++)
if (a[i] > max) max=a[i];
At cycle n, the relevant register contents are (assuming
i=0; N>0;):
mar0, r4=&a[i]; r8=&a[N-1]; r2=max=-INF;
The VLIW code for the loop starts as follows
(LOOP ((LOAD mar$0 (r$1)) ;r1=load a[i] (*r4)
(ADD r$4 4 (r$4 mar$0)) ;increment array pointer r4 (=&a[i+1])
;also copy it to mar0
(GOTO L1)))
Compare operations set the condition codes cc0, cc1 in
cycle n+1 (notice that the second iteration is started before the
first one is finished; this is called a software
pipelining technique):
(L1 ((GT r$1 r$2 (cc$0)) ;cc0= (a[i] > max)
(GT r$4 r$8 (cc$1)) ;cc1= (&a[i+1] > &a[N-1])
(LOAD? mar$0 (r$3)) ;r3=load a[i+1] (*r4) (2nd iteration)
;cannot overwrite r1 (=a[i]) yet,
;still need to update max
;"?" means a speculative operation
(ADD? r$4 4 (r$4 mar$0)) ;r4,mar0= &a[i+2] (2nd iteration)
(GOTO L2)))
Then. multiway branching is done in cycle n+2 using these condition
codes, and the correct target instruction (
L2 or exit)
is fetched (further compares may set condition codes at the same time):
(L2 ((IF (NOT cc$0) ;if not(a[i]>max)
((COPY? r$3 (r$1)) ;can overwrite r1 with a[i+1] now
(IF (NOT cc$1) ;if not(&a[i+1]>&a[N-1])
((GT r$3 r$2 (cc$0)) ;cc0= (a[i+1] > max) (2nd it.)
(GT r$4 r$8 (cc$1)) ;cc1= (&a[i+2]>&a[N-1]) (2nd it.)
(LOAD? mar$0 (r$3)) ;r3=load a[i+2] (*r4) (3rd it.)
(ADD? r$4 4 (r$4 mar$0)) ;r4,mar0=&a[i+3] (3rd it.)
(GOTO L2))
ELSE ;(&a[i+1]>&a[N-1])
((GOTO exit)) )) ;r2= maximum of array at exit
ELSE ;if (a[i]>max)
((COPY r$1 (r$2)) ;update max= a[i]
(COPY? r$3 (r$1)) ;can overwrite r1 with a[i+1] now
(IF (NOT cc$1) ;if not(&a[i+1]>&a[N-1])
((GT r$3 r$1 (cc$0)) ;cc0= (a[i+1] > a[i]) (2nd it.)
;GT uses OLD r$1
(GT r$4 r$8 (cc$1)) ;cc1= (&a[i+2]>&a[N-1]) (2nd it.)
(LOAD? mar$0 (r$3)) ;r2=load a[i+2] (3rd it.)
(ADD? r$4 4 (r$4 mar$0)) ;r4,mar0=&a[i+3] (3rd it.)
(GOTO L2))
ELSE ;(&a[i+1]>&a[N-1])
((GOTO exit)) )) ))) ;r2= maximum of array at exit
Then, the branch target VLIW is executed in cycle n+3 (either
L2 or exit). Notice that the loop is executed at
a rate of one iteration per cycle, and 4 way branching is done
every cycle, without incurring any branch stalls.
Identical operations that occur in multiple places in the tree
(such as (COPY? r$3 (r$1))) require only one ALU. So the
VLIW L2 uses a total of 7 ALUs, since there
are only 7 distinct operations.