mikeg@watson.ibm.com


VLIW at IBM Research 
  VLIW prototype semicustom chip 

The VLIW prototype includes an IBM CMOS IIs standard-cell chip designed at the gate level. Because the CMOS IIs technology was both area-limited and pin-limited for the VLIW design, extensive partitioning was required. The same semicustom chip implements one of three functions, selected by a "chip characteristic" input:

  1. a 24-port register file;
  2. a next address multiplexer for 8-way branching and conditional execution; and
  3. a crossbar switch for performing 4 memory accesses from 8 interleaved memory banks.

Register file

The register file (RF) consists of sixty-four 33-bit registers with 24 ports, allowing 16 reads and 8 writes per cycle for the 8 ALUs. The result of an operation performed in cycle n is written into the register file in cycle n+1, during the subsequent write-back pipeline stage. However, bypass paths are implemented between all ALUs, so that any ALU can use the result of any other ALU in the immediately following cycle.

Partitioning:
There are two copies of the register file, each providing 8 read and 8 write ports. Each chip implements a 5-bit slice of one 8R/8W copy, so seven chips cover the 33 bits of a copy (14 RF chips total).
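The chip and port counts follow from simple arithmetic. The Python sketch below reproduces them, assuming (as the 16-read/8-write total implies) that every write goes to both register-file copies while reads are split between them; the function and constant names are illustrative only.

```python
import math

# Figures from the text: 64 x 33-bit registers, 5-bit slice per chip,
# two copies of the register file, each with 8 read and 8 write ports.
REG_WIDTH_BITS = 33
SLICE_BITS = 5
RF_COPIES = 2
READ_PORTS_PER_COPY = 8
WRITE_PORTS = 8  # assumed shared: each write updates both copies


def chips_per_copy(width: int = REG_WIDTH_BITS, slice_bits: int = SLICE_BITS) -> int:
    """Number of bit-slice chips needed to cover one register-file copy."""
    return math.ceil(width / slice_bits)


def rf_chip_count(copies: int = RF_COPIES) -> int:
    """Total RF chips: one full set of slices per copy."""
    return chips_per_copy() * copies


def total_ports(copies: int = RF_COPIES) -> int:
    """Reads are split across the copies; writes go to every copy."""
    return copies * READ_PORTS_PER_COPY + WRITE_PORTS


print(chips_per_copy())  # 7
print(rf_chip_count())   # 14
print(total_ports())     # 24
```

Duplicating the register file is what lets each chip get by with 8 read ports while the machine as a whole still offers 16.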

Critical delays:
At the beginning of an execution cycle, each ALU input is read from the register file or taken from an immediate field; alternatively, data that has not yet been written to the register file is bypassed from an ALU write buffer (also inside the RF chip). This path contains three gate levels (plus output driver overhead).
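The operand selection on this path can be modelled behaviourally as below. This is a minimal sketch, not the gate-level design: `write_buffer` is a hypothetical stand-in for the ALU write buffers, holding results from the previous cycle that have not yet reached the register file.

```python
from typing import Dict, Optional


def read_operand(reg: Optional[int],
                 immediate: Optional[int],
                 regfile: Dict[int, int],
                 write_buffer: Dict[int, int]) -> int:
    """Select one ALU input at the start of an execution cycle.

    write_buffer models results computed in the previous cycle that are
    still waiting for the write-back stage; if the requested register is
    there, its value is bypassed instead of reading the stale RF entry.
    """
    if reg is None:
        if immediate is None:
            raise ValueError("operand needs a register or an immediate field")
        return immediate
    if reg in write_buffer:      # bypass: result of cycle n used in cycle n+1
        return write_buffer[reg]
    return regfile[reg]          # normal register-file read


regfile = {r: 0 for r in range(64)}   # 64 registers, all zero
regfile[5] = 10                       # old value of r5
write_buffer = {5: 42}                # r5 produced last cycle, not yet written back

print(read_operand(5, None, regfile, write_buffer))     # 42 (bypassed)
print(read_operand(6, None, regfile, write_buffer))     # 0  (from the RF)
print(read_operand(None, 7, regfile, write_buffer))     # 7  (immediate)
```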

Next-address multiplexer

The next-address multiplexer (NAMUX) implements the 8-way branching of a tree instruction, conditional execution, and the interrupt logic.

Partitioning:
There are four copies of the chip, each responsible for driving the address of two of the target instructions of the current tree instruction. The outputs of the four chips are ORed together externally and sent to the instruction memory. Only one of the 8 possible next addresses is driven to the instruction memory.

Critical delays:
Each NAMUX chip keeps its own copy of the condition codes. The next instruction address is computed from the condition codes within two AND-OR gate levels in each chip (plus output driver overhead and the external OR).
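The selection logic can be illustrated with a small behavioural model, assuming a full tree instruction of depth three (8 leaves) whose internal nodes each test one condition code. Selecting a leaf is the AND of the conditions along its path; each of four hypothetical `namux_chip` instances handles two leaves, and an OR of their outputs gives the next address, mirroring the external OR described above. The heap numbering of nodes, the `node_cc` mapping and the target addresses are assumptions made for the sketch.

```python
from typing import List

DEPTH = 3  # 8-way branching: a full binary tree with 8 leaves


def leaf_selected(leaf: int, node_cc: List[int], cc: List[bool]) -> bool:
    """AND of the branch conditions along the root-to-leaf path.

    Internal nodes are numbered in heap order (children of k are 2k+1 and
    2k+2); node_cc[k] is the condition-code index tested at node k.  Bit
    (DEPTH-1-d) of `leaf` gives the direction at depth d: 1 means the
    condition must be true, 0 means it must be false.
    """
    node = 0
    for d in range(DEPTH):
        bit = (leaf >> (DEPTH - 1 - d)) & 1
        if cc[node_cc[node]] != bool(bit):
            return False
        node = 2 * node + 1 + bit
    return True


def namux_chip(chip: int, targets: List[int], node_cc: List[int], cc: List[bool]) -> int:
    """One of four chips: drives the address of whichever of its two
    leaves is selected, or 0 so that the external OR ignores it."""
    out = 0
    for leaf in (2 * chip, 2 * chip + 1):
        if leaf_selected(leaf, node_cc, cc):
            out |= targets[leaf]
    return out


def next_address(targets: List[int], node_cc: List[int], cc: List[bool]) -> int:
    """External OR of the four chip outputs."""
    addr = 0
    for chip in range(4):
        addr |= namux_chip(chip, targets, node_cc, cc)
    return addr


targets = [0x100 + 0x10 * j for j in range(8)]  # hypothetical target addresses
node_cc = [0, 1, 2, 3, 4, 5, 6]                 # node k tests condition code k
print(hex(next_address(targets, node_cc, [False] * 7)))  # 0x100 (leaf 0)
cc = [True, False, True, False, False, False, True]
print(hex(next_address(targets, node_cc, cc)))           # 0x170 (leaf 7)
```

Because exactly one root-to-leaf path is consistent with any setting of the condition codes, exactly one chip drives a target, which is why a plain OR suffices to merge the four outputs.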

Crossbar switch

A 4x8 crossbar switch (XBAR) allows four memory accesses to any of the eight interleaved banks of data memory (i.e., 4 memory ports are connected to 8 memory banks). The crossbar switch takes its addresses from special "MAR" (memory address) registers inside the crossbar, which must have been loaded in a previous cycle. All four accesses complete in one cycle when there are no bank conflicts. When more than one port accesses the same bank, the machine stalls for the minimum number of cycles needed to serialize the conflicting accesses.
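The stall rule can be sketched as follows, assuming byte addresses, 4-byte words interleaved across the banks (consecutive words in consecutive banks), and one access per bank per cycle. Under these assumptions the stall length is set by the most heavily used bank. `bank_of` and `stall_cycles` are hypothetical names, not part of the prototype.

```python
from collections import Counter
from typing import List, Optional

NUM_BANKS = 8
WORD_BYTES = 4  # assumed 32-bit words, byte addressing


def bank_of(addr: int) -> int:
    """Assumed word interleaving: consecutive words go to consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def stall_cycles(port_addrs: List[Optional[int]]) -> int:
    """Extra cycles needed for the accesses on the four ports (None = idle).

    Each bank is assumed to serve one access per cycle, so the busiest
    bank determines how long the accesses take; without conflicts every
    bank is used at most once and no stall occurs.
    """
    per_bank = Counter(bank_of(a) for a in port_addrs if a is not None)
    return max(per_bank.values(), default=1) - 1


print(stall_cycles([0x00, 0x04, 0x08, 0x0C]))  # 0: banks 0, 1, 2, 3
print(stall_cycles([0x00, 0x20, None, None]))  # 1: both ports hit bank 0
print(stall_cycles([0x00, 0x20, 0x40, 0x04]))  # 2: three accesses to bank 0
```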

Partitioning:
There are eight copies of the chip, each implementing a 4-bit slice of the 4+8 data/address paths. Each data path is subdivided into 4-bit slices with the following bit configuration: (0,8,16,24), (1,9,17,25), ...(7,15,23,31). Each chip therefore holds the same bit position of all four bytes of a word, so the byte rotates required for byte and halfword loads can be performed within a single chip.
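The effect of this bit assignment can be checked in a few lines of Python. Rotating a word by whole bytes only permutes bits among positions i, i+8, i+16 and i+24, so each chip's four bits are merely rotated among themselves. The sketch uses LSB-0 bit numbering; the set of slices is the same if bit 0 is taken as the most significant bit. The helper names are illustrative.

```python
def slice_bits(chip: int) -> list:
    """Bit positions handled by XBAR chip `chip` (0-7): one bit per byte."""
    return [chip + 8 * k for k in range(4)]


def rotate_bytes(word: int, n: int) -> int:
    """Rotate a 32-bit word left by n whole bytes."""
    s = 8 * (n % 4)
    return ((word << s) | (word >> (32 - s))) & 0xFFFFFFFF


def extract_slice(word: int, chip: int) -> int:
    """Gather a chip's four bits into a 4-bit value (bit k comes from byte k)."""
    return sum(((word >> pos) & 1) << k for k, pos in enumerate(slice_bits(chip)))


def rotate4(v: int, n: int) -> int:
    """Rotate a 4-bit value left by n positions."""
    n %= 4
    return ((v << n) | (v >> (4 - n))) & 0xF


# A whole-byte rotate of the word is just a 4-bit rotate inside every chip,
# so no bit ever has to cross from one XBAR chip to another.
word = 0x12345678
for chip in range(8):
    for n in range(4):
        assert extract_slice(rotate_bytes(word, n), chip) == rotate4(extract_slice(word, chip), n)
print(hex(rotate_bytes(word, 1)))  # 0x34567812
```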

Critical delays:
A copy of the low-order bits of all MAR registers (used for bank selection and for byte/halfword selection within a word) is kept in every XBAR chip. For a load, the 4-way port-to-bank address multiplexing is done in two AND-OR gate levels starting from the MAR registers (including one logic level hidden inside LSSD latches). The 8-way bank-to-port data multiplexing, store buffer bypassing, and rotate for byte/halfword loads take two AND-OR levels on the way back from the data memory (plus driver/receiver overhead).

Critical paths

We concentrated on reducing the following path delays:

  • RF access or bypass, ALU operation (RF clock to data output + ALU + wire delays + skew + setup of "condition code" inputs for NAMUX, before next clock);
  • data cache load (XBAR clock to bank address output + PAL for SRAM chip select + SRAM access time + XBAR bank-to-port mux delay + wire delays + skew + setup of "write buffer" inputs for RF, before next clock);
  • computation of the next address from the condition codes, and fetch of the next VLIW (NAMUX clock to instruction address output + external logic for ORing addresses + external logic for multiplexing the address from NAMUX with the address from the PS/2 + SRAM access time + wire delays + skew + setup of "source register number" inputs for RF, before next clock).

These paths were all measured to be under 60 ns. However, at least one unexpected longer path was discovered after the system was built, forcing a longer cycle time (90 ns) than in the original design. This long path occurs, for example, when two store operations cause a bank conflict by accessing the same memory bank:

  • from the condition codes, a NAMUX chip generates a signal that indicates a store operation is on the taken path of the tree;
  • the signals from the 4 NAMUX chips are ORed externally;
  • based on this signal, XBAR generates a signal that there is a bank conflict, i.e. more than one memory operation is accessing the same bank (this XBAR delay was longer than anticipated);
  • the bank-conflict signal is ORed on the board, in a PAL, with other stall signals, generating a clock gate signal;
  • the clock gate signal is driven to many places in the machine and must be available sufficiently before the next clock period starts, so less than a full clock period is available to generate it.
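The logical dependencies along this path can be written down as a purely functional sketch. It models no delays, but it shows the serial chain: the clock gate depends on the XBAR conflict check, which in turn depends on the externally ORed NAMUX signals. The `Store` representation (a bank number plus one "store on taken path" signal per NAMUX chip) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# A store: (bank number, per-NAMUX-chip "store is on the taken path" signals).
Store = Tuple[int, List[bool]]


def store_on_taken_path(namux_signals: List[bool]) -> bool:
    """External OR of the four NAMUX chip outputs for one store."""
    return any(namux_signals)


def xbar_bank_conflict(stores: List[Store]) -> bool:
    """XBAR check: two stores on the taken path that target the same bank."""
    banks = [bank for bank, signals in stores if store_on_taken_path(signals)]
    return len(banks) != len(set(banks))


def clock_gate(stores: List[Store], other_stalls: List[bool]) -> bool:
    """PAL: OR of the bank-conflict signal with the other stall sources.

    In hardware this result must then be distributed across the machine
    before the next clock edge; that final step is not modelled here.
    """
    return xbar_bank_conflict(stores) or any(other_stalls)


both_taken = [(3, [False, True, False, False]), (3, [False, False, False, True])]
one_taken = [(3, [False, True, False, False]), (3, [False] * 4)]
print(clock_gate(both_taken, [False]))  # True: bank conflict on bank 3
print(clock_gate(one_taken, [False]))   # False: second store not on taken path
print(clock_gate(one_taken, [True]))    # True: some other stall source
```

Each function consumes the output of the previous one, so in hardware their delays add up within a single cycle; that accumulation is what made this path longer than the ones measured earlier.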
