Simulation and performance evaluation are critical during the early
stages of designing, refining and experimenting with new computer
architectures. Ideally, the tools used for these purposes should
be efficient, allow experimentation with realistic workloads, be
easily reconfigurable, and permit the evaluation of a variety of
system designs. Moreover, the tools should support two distinct needs: the
development process, which requires fast simulation turn-around time, and
the collection of accurate performance estimates, which may be
more time-consuming.
Our VLIW environment uses an integrated/modular approach to simulation
and performance measurement, oriented towards an early-stage evaluation
of our VLIW architecture. This environment, which achieves a high
degree of efficiency and versatility, consists of three major
components:
- The VLIW compiler, which generates tree-instructions.
- A translator, which maps the VLIW program (tree-instructions) into a
  simulation executable that is run on an IBM RISC System/6000. Thus, the
  simulation executable consists of RS/6000 code that directly emulates
  the original native code of the VLIW architecture, as opposed to an
  interpreter using the native code as input.
- A processor model and a memory model, which are invoked by the
  simulation executable generated by the translator.
Performance measurement capabilities are integrated into the simulation
executable through instrumentation to collect statistics regarding the
emulated VLIW program. These mechanisms allow fast turn-around time for
experimenting with architectural features and compiler algorithms,
without yet introducing the detailed description of a processor and
memory implementation.
In addition, the simulation executable can include a decoded form of
the original VLIW code, and calls to a generic timer routine. When
such an augmented simulation executable is run, a timer is invoked
before each VLIW instruction is emulated, and passed the decoded
version of the VLIW as well as an image of the current machine state.
The machine state, maintained by the simulator, specifies information
such as the contents of the registers of the VLIW architecture.
In this way, there is a clear separation between simulation of the
instruction-set architecture (ISA) and simulation of a particular
implementation of the architecture, though both levels of
simulation are possible with the integrated environment.
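The augmented simulation executable described above can be pictured with a minimal Python sketch. It is an illustration only, not the environment's actual code: the names `Descriptor`, `MachineState`, `run`, and the two toy VLIWs are hypothetical, and the real simulation executable is RS/6000 code produced by the translator. The sketch shows the one property the text states: an optional timer is called before each VLIW is emulated and receives the decoded VLIW and the current machine state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical decoded form of one VLIW (tree-instruction)."""
    label: str
    reads: Tuple[str, ...]
    writes: Tuple[str, ...]


@dataclass
class MachineState:
    """Hypothetical machine state: here, only register contents."""
    regs: Dict[str, int] = field(default_factory=dict)


Timer = Callable[[Descriptor, MachineState], None]


# Stand-ins for translated code: each VLIW becomes host code that
# emulates it directly, rather than being interpreted at run time.
def vliw_0(state: MachineState) -> None:
    state.regs["r1"] = 5
    state.regs["r2"] = 7


def vliw_1(state: MachineState) -> None:
    state.regs["r3"] = state.regs["r1"] + state.regs["r2"]


PROGRAM: List[Tuple[Descriptor, Callable[[MachineState], None]]] = [
    (Descriptor("V0", (), ("r1", "r2")), vliw_0),
    (Descriptor("V1", ("r1", "r2"), ("r3",)), vliw_1),
]


def run(program, timer: Optional[Timer] = None) -> MachineState:
    """Emulate the program; call the timer (if any) before each VLIW."""
    state = MachineState()
    for descriptor, emulate in program:
        if timer is not None:
            timer(descriptor, state)  # ISA simulation knows nothing about timing
        emulate(state)
    return state
```

Running `run(PROGRAM)` without a timer corresponds to pure ISA-level simulation; passing a timer adds implementation-level modeling without changing the emulation code.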
The timer invoked by the simulation executable consists of two parts:
a processor model and a memory model. The processor model maintains
the cycle count and other performance statistics, dealing with items
such as register dependencies and operation latencies, for a given
processor implementation. For memory operations, the processor model
invokes the memory model, passing information such as the operation
type and effective address. The processor and memory models each
have a clearly defined interface, allowing a variety of models to be
used interchangeably, with the models differing in both the system
configuration they implement and in the degree of detail and
accuracy involved. This versatility is further enhanced by ensuring
that the interfaces provide all of the information required by the
most detailed or accurate model that may be needed, even if that
information is not necessary for earlier, simpler models.
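As a rough illustration of interchangeable models behind a fixed interface, the following Python sketch defines a hypothetical memory-model interface (`access(kind, address)` returning a latency in cycles) and two models of different detail. All names, latencies, and the one-line timing rule in `SimpleProcessorModel` are assumptions made for the example; they are not taken from the environment itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Sequence


@dataclass(frozen=True)
class MemOp:
    kind: str      # operation type, e.g. "load" or "store"
    address: int   # effective address


class MemoryModel(Protocol):
    """Hypothetical interface: return the latency of one access in cycles."""
    def access(self, kind: str, address: int) -> int: ...


class FlatMemory:
    """Simplest model: every access takes the same time."""
    def __init__(self, latency: int = 1) -> None:
        self.latency = latency

    def access(self, kind: str, address: int) -> int:
        return self.latency


class DirectMappedCache:
    """More detailed model: a single direct-mapped cache level."""
    def __init__(self, lines: int = 4, line_bytes: int = 16,
                 hit: int = 1, miss: int = 10) -> None:
        self.lines, self.line_bytes = lines, line_bytes
        self.hit, self.miss = hit, miss
        self.tags: List[Optional[int]] = [None] * lines

    def access(self, kind: str, address: int) -> int:
        block = address // self.line_bytes
        index = block % self.lines
        if self.tags[index] == block:
            return self.hit
        self.tags[index] = block   # fill the line on a miss
        return self.miss


class SimpleProcessorModel:
    """Toy rule: a VLIW costs as many cycles as its slowest memory
    operation, or one cycle if it has none."""
    def __init__(self, memory: MemoryModel) -> None:
        self.memory = memory
        self.cycles = 0

    def time_vliw(self, mem_ops: Sequence[MemOp]) -> None:
        latencies = [self.memory.access(op.kind, op.address) for op in mem_ops]
        self.cycles += max(latencies, default=1)


def total_cycles(memory: MemoryModel, program) -> int:
    processor = SimpleProcessorModel(memory)
    for mem_ops in program:
        processor.time_vliw(mem_ops)
    return processor.cycles
```

Because both memory models accept the same arguments, either can be handed to the processor model unchanged. The interface already carries the effective address, which `FlatMemory` ignores but the cache model needs, mirroring the point that the interface should serve the most detailed model anticipated.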
The high efficiency of the timing environment is largely achieved
through the use of the pre-decoded descriptors for each VLIW.
A single read-only descriptor of a VLIW is repeatedly used during
the simulation, so the size of the descriptor is not an important
consideration. Thus, our descriptors are designed to minimize the
processing overhead of the timer. (In contrast, a conventional
trace-driven timing environment typically must strive to minimize
the size of the instruction and machine state information in the
trace, at the expense of decoding overhead in the timer.)
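The trade-off between pre-decoded descriptors and compact trace records can be sketched as follows. This is a toy Python model under stated assumptions: the textual VLIW encoding, the `Descriptor` fields, and the `CountingDecoder` helper are all hypothetical and exist only to count how often decoding work is done in each style.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical read-only, fully decoded form of one VLIW."""
    ops: Tuple[Tuple[str, Tuple[str, ...]], ...]
    n_mem_ops: int


class CountingDecoder:
    """Decodes a toy textual encoding and counts how often it is called."""
    def __init__(self) -> None:
        self.calls = 0

    def decode(self, text: str) -> Descriptor:
        self.calls += 1
        ops = []
        for part in text.split(";"):
            opcode, _, operands = part.strip().partition(" ")
            ops.append((opcode, tuple(x.strip() for x in operands.split(","))))
        n_mem = sum(1 for opcode, _ in ops if opcode in ("load", "store"))
        return Descriptor(tuple(ops), n_mem)


STATIC_CODE: List[str] = ["add r3,r1,r2; load r4,r5", "store r4,r6"]
DYNAMIC_PATH: List[int] = [0, 1, 0, 1, 0, 1]   # a loop executed three times


def time_with_predecoded(decoder: CountingDecoder) -> int:
    # Decode each static VLIW once; the timer then only reads the table.
    table = [decoder.decode(text) for text in STATIC_CODE]
    return sum(table[i].n_mem_ops for i in DYNAMIC_PATH)


def time_with_trace_style(decoder: CountingDecoder) -> int:
    # Compact records must be decoded again for every dynamic instance.
    return sum(decoder.decode(STATIC_CODE[i]).n_mem_ops for i in DYNAMIC_PATH)
```

Both functions report the same memory-operation count, but the pre-decoded version performs one decode per static VLIW while the trace-style version performs one per executed VLIW. The descriptor can therefore be as large and as convenient for the timer as needed.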
In practice, our timing environment has allowed us to dispense with the
generation of traces, and measure the performance of realistic workloads.
Programs such as the SPEC92 benchmark suite, the Linpack benchmark, and
the Livermore loops benchmark have been timed in their entirety.
Our simulation executables without timer calls typically run only about
15 times slower than the optimized native RS/6000 code for the same
program. Using a timer that models the VLIW processor at the functional
unit level and a memory hierarchy consisting of two levels of cache and
main memory, a full timing is slower than the simulation executable by
an additional factor of 75.
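Reading the two slowdown figures as multiplicative (an interpretation of "an additional factor", not a separately reported measurement), the overall cost of a full timing run relative to native execution can be estimated as:

```latex
\[
  T_{\mathrm{sim}} \approx 15\, T_{\mathrm{native}}, \qquad
  T_{\mathrm{timed}} \approx 75\, T_{\mathrm{sim}}
  \;\Longrightarrow\;
  T_{\mathrm{timed}} \approx 15 \times 75\, T_{\mathrm{native}}
  = 1125\, T_{\mathrm{native}} .
\]
```

Here $T_{\mathrm{native}}$, $T_{\mathrm{sim}}$, and $T_{\mathrm{timed}}$ are symbols introduced for this estimate: the run times of the optimized native RS/6000 code, the simulation executable without timer calls, and the simulation executable with the full timer, respectively.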
Our approach is particularly well suited to modeling a VLIW processor:
the timer is invoked once per VLIW rather than once per operation, so the
fewer invocations reduce procedure-call overhead, while the larger
descriptors needed for VLIWs do not affect the efficiency of the
environment. However, the approach can also be used for other architectures.
Since in some cases it is still desirable to generate traces, the
environment allows linking a trace dumper into the simulation
executable instead of a timer module. Using the same approach, we
have also envisioned the possibility of replacing the timer module
with some form of debugging environment.
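The idea of swapping the timer for a trace dumper can be sketched in Python by giving both modules the same call signature. This is an assumption-laden illustration: `run`, `CycleCounter`, `TraceDumper`, and the trace line format are hypothetical, and in the actual environment the choice is made when the module is linked into the simulation executable, not at run time.

```python
import io
from typing import Callable, Dict, List, TextIO, Tuple

State = Dict[str, int]
Hook = Callable[[str, State], None]   # shared interface: (VLIW label, machine state)


def run(program: List[Tuple[str, Callable[[State], None]]], hook: Hook) -> State:
    """Emulate each VLIW, calling the attached module before it executes."""
    state: State = {}
    for label, emulate in program:
        hook(label, state)
        emulate(state)
    return state


class CycleCounter:
    """Trivial stand-in for a timer: one cycle per VLIW."""
    def __init__(self) -> None:
        self.cycles = 0

    def __call__(self, label: str, state: State) -> None:
        self.cycles += 1


class TraceDumper:
    """Writes one trace record per VLIW instead of timing it."""
    def __init__(self, out: TextIO) -> None:
        self.out = out

    def __call__(self, label: str, state: State) -> None:
        regs = ",".join(f"{name}={value}" for name, value in sorted(state.items()))
        self.out.write(f"{label} [{regs}]\n")


def set_r1(state: State) -> None:
    state["r1"] = 1


def inc_into_r2(state: State) -> None:
    state["r2"] = state["r1"] + 1


PROGRAM = [("V0", set_r1), ("V1", inc_into_r2)]
```

For example, `run(PROGRAM, TraceDumper(io.StringIO()))` produces a trace while `run(PROGRAM, CycleCounter())` produces a cycle count, with the emulation code untouched. A debugging module could plausibly attach through the same hook.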
- Simulation/evaluation environment for a VLIW processor architecture
  [Abstract], published in the IBM Journal of Research and Development,
  May 1997.
- An integrated approach to architectural simulation... (Foils, pdf, 54 KB)