IBM
Skip to main content
 
Search IBM Research
     Home  |  Products & services  |  Support & downloads  |  My account
Select a country
IBM Home
IBM Research
VLIW Home
The VLIW project
Basic Principles
A VLIW based on
tree instructions
Processor Prototype
VLIW Compiler
Simulation Environment
DAISY dynamic translation
More information
Talks and
Presentations
Publications
and Patents
Selected Abstracts
mikeg@watson.ibm.com


VLIW at IBM Research 
  Abstracts of Some Papers 

BOA: a second generation DAISY architecture

E.R. Altman, M. Gschwind
ISCA 2004 Tutorial
München, Germany, June 2004

Abstract

We describe the use of dynamic compilation technology in dynamically optimizing code to achieve peak performance. This is based on collecting and exploiting runtime system information to dynamically reoptimize code for specific workload bahvior. The dynamic runtime optimization system operates on a simple static architecture designed to achieve high ILP and great clock rates. We believe that this combination of raw execution speed and high-level code adaptation is an attractive way to build future architectures. While dynamic compilation and code optimization techniques described in this talk can be used to optimize code for native PowerPC execution, we believe that even greater benefits can be obtained by combining the runtime environment with a specially designed architecture, which provides additional capabilities such as additional rename registers and hardware support for exception recovery. In addition to these optimization advantages, the virtualization layer introduced by the dynamic compilation system offers the ability to customize the underlying execution engine and completely redefine the hardware interface, while maintaining binary compatibility at the software level (at either the program or operating system level, depending on the implementation choices made).

The DAISY system first introduced the techniques to dynamically optimize code in 1996, and since then, a number of dynamic compilation systems based on this technoology have expanded possible uses, such as the IBM BOA porject, the University of Wisconsin's Co-Designed Virtual Machines project, Transmeta's Crusoe processor technology, HP's Dynamo dynamic optimization system, and the Itanium-based Aries and IA32-EL execution layers.
[Presentation foils (pdf) ] [DAISY and BOA Bibliography (sorted by date)]

[ Top ]


Inherently Lower Complexity Architectures using Dynamic Optimization

M. Gschwind, E.R. Altman
Workshop on Complexity-Effective Design 2002
Anchorage, AK, May 2002.

Abstract

Based on the conviction that modern superscalar out-of-order designs squandered useful resources for little incremental gain, the BOA team embarked on a design effort to develop an architecture where computational elements dominated the design. At the same time, we wanted to preserve the ability to adapt to changing workload behavior dynamically, but without the overhead inherent in traditional out-of-order designs. We turned to maturing dynamic compilation technology to achieve dynamic adaptability, while keeping core complexity low.
[Presentation foils (pdf) ]

[ Top ]

Precise Exceptions in Dynamic Optimization

M. Gschwind, E.R. Altman
2002 Symposium on Compiler Construction,
Grenoble, France, April 2002
© Springer Verlag 2002

Abstract

In this work, we demonstrate how aggressive optimization can be used in conjunction with dynamic compilation without the need for specialized hardware. The approach is based on maintaining enough state to recompute the processor state when an unpredicted event such as a synchronous exception may make otherwise dead processor state visible. The transformations necessary to preserve precise exception capability can be performed in linear time.
[Presentation foils (pdf) ]

[ Top ]

Optimization and Precise Exceptions in Dynamic Compilation

M. Gschwind, E.R. Altman
Workshop on Binary Translation 2000,
Philadelphia, PA, October 2000.

Abstract

Maintaining precise exceptions is an important aspect of achieving full compatibility with a legacy architecture. While asynchronous exceptions can be deferred to an appropriate boundary in the code, synchronous exceptions must be taken when they occur. This introduces uncertainty into liveness analysis since processor state that is otherwise dead may be exposed when an exception handler is invoked. Previous systems either had to sacrifice full compatibility to achieve more freedom to perform optimization, use less aggressive optimization or rely on hardware support.
[Updated and extended foils presented at CC'02 (pdf) ]

[ Top ]

Binary Translation and Architecture Convergence Issues for IBM System/390

M. Gschwind, K. Ebcioglu, E.R. Altman, S. Sathaye
International Conference on Supercomputing 2000 (ICS '00)
Santa Fe, New Mexico, May 2000.

Abstract

We describe the design issues in an implementation of the ESA/390 architecture based on binary translation to a very long instruction word (VLIW) processor. During binary translation, complex ESA/390 instructions are decomposed into instruction primitives which are then scheduled onto a wide-issue machine. The aim is to achieve high instruction level parallelism due to the increased scheduling and optimization opportunities which can be exploited by binary translation software, combined with the efficiency of long instruction word architectures. A further aim is to study the feasibility of a common execution platform for different instruction set architectures, such as ESA/390, RS/6000, AS/400 and the Java Virtual Machine, so that multiple systems can be built around a common execution platform.
[ Foils (pdf) ]

[ Top ]

Optimizations and Oracle Parallelism with Dynamic Translation

K. Ebcioglu, E.R. Altman, S. Sathaye, and M. Gschwind
Proceedings of MICRO-32,
Haifa, Israel, November 1999.
© ACM 1999

Abstract

We describe several optimizations which can be employed in a it dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated, and more time-consuming algorithms. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.
[Presentation foils (pdf, 588 KB) ]

[ Top ]

LaTTe: A Java VM Just-in-Time Compiler with Fast and Efficient Register Allocation

B.S. Yang, S.M. Moon, S. Park, J. Lee, S. Lee, J. Park, Y. C. Chung, S. Kim, K. Ebcioglu, E. Altman
Proc. PACT '99,
pp. 128-138, IEEE Computer Society Press, October 1999.
© IEEE 1999

Abstract

For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. Higher Java performance can be achieved by Just-in-Time (JIT) compilers which translate the stack-based bytecode into register-based machine code on demand. One crucial problem in Java JIT compilation is how to map and allocate stack entries and local variables into registers efficiently and quickly, so as to improve the Java performance.

This paper introduces LaTTe, a Java JIT compiler that performs fast and efficient register mapping and allocation for RISC machines. LaTTe first translates the bytecode into pseudo RISC code with symbolic registers, which is then register allocated while coalescing those copies corresponding to pushes and pops between local variables and the stack. The LaTTe JVM also includes an enhanced object model, a lightweight monitor, a fast mark-and-sweep garbage collector, and an on-demand exception handling mechanism, all of which are closely coordinated with LaTTe's JIT compilation.

Our experimental results on the SPARC platform with SPECJVM98 benchmarks and 15 non-trivial Java programs indicate that the current LaTTe JVMs achieve performance better than or comparable to the latest SUN JIT compilers (JDK 1.1.6 and HotSpot). It is also shown that LaTTe makes a reasonable trade-off between quality and speed of register allocation (i.e., the translation overhead consistently takes 1-2 seconds for SPECJVM98 which runs 40-80 seconds on LaTTe).

[ Top ]


On-Demand Translation of Java Exception Handlers in the LaTTe JVM Just-in-Time Compiler

S. I. Lee, B.S. Yang, S. Kim, S. Park, S.M. Moon, K. Ebcioglu, E. Altman
1999 Workshop on Binary Translation (Binary99),
Newport Beach, California, October 1999.

Abstract

Exceptions are provided by the Java programming language in order to handle errors more gracefully. However, exceptions are rare in most programs, so that most catch blocks are left unused. This paper describes an on-demand catch block translation scheme which translates catch blocks only when they are actually used. By ignoring exception handling while translating normal flow, we can reduce translation overhead and make use of more optimization opportunities. In addition, we directly connect normal flow and catch blocks after an exception is thrown, which results in faster exception handling. This is useful since many exceptions thrown at certain points are usually of the same type and handled by the same catch blocks.

From experimental results, we show that the existence of catch blocks indeed do not interfere with the translation of normal flow when on-demand catch block translation is done. Also, the results show that direct connection along with method inlining results in faster exception handling.

[ Top ]


BOA: Targeting Multi-Gigahertz with Binary Translation

S. Sathaye, P. Ledak, J. LeBlanc, S. Kosonocky, M. Gschwind, J. Fritts, Z. Filan, A. Bright, D. Appenzeller, E. Altman, and C. Agricola
1999 Workshop on Binary Translation (Binary99),
New Port Beach, California, October 1999.

Abstract

BOA is a processor designed to achieve high frequency by using software dynamic binary translation. Processors for software binary translation are very conducive to high frequency because they can assume a simple hardware design. Binary translation eliminates the binary compatibility problem faced by other processors, while dynamic recompilation enables re-optimization of critical program code sections and eliminates the need for dynamic scheduling hardware. In this work we examine the implications of binary translation on processor architecture and software translation and how we support a very high frequency PowerPC implementation via dynamic binary translation.
[Presentation foils (pdf) ]

[ Top ]

Execution-Based Scheduling for VLIW Architectures

K. Ebcioglu, E. Altman, S. Sathaye, M. Gschwind
Proc. Europar '99 (P. Amestoy, P. Berger, M. Dayde, I. Duff, V. Fraysse, L. Giraud, D. Ruiz, eds.),
pp. 1269-1280, Lecture Notes in Computer Science 1685, Springer-Verlag 1999.

Abstract

We describe a new dynamic software scheduling technique for VLIW architectures, which compiles into VLIW code the program paths that are actually executed. Unlike trace processors, or DIF, the technique executes operations speculatively on multiple paths through the code, is resilient to branch mispredictions, and can achieve very large dynamic window sizes necessary for high ILP. Aggressive optimizations are applied to frequently executed portions of the code. Encouraging performance results were obtained on SPECint95 and TPC-C. The technique can be used for binary translation for achieving architectural compatibility with an existing processor, or as a VLIW scheduling technique in its own right.
[Presentation foils (pdf) ]

[ Top ]

Lightweight Monitor in Java Virtual Machine

B.S. Yang, J. Lee, J. Park, S.M. Moon, and K. Ebcioglu.
ACM SIGARCH Computer Architecture News, March 1999.
Also in proc. of the Third Workshop on Interaction Between Compilers and Computer Architectures (INTERACT-3). In conjunction with ASPLOS-VIII, San Jose, CA, October 1998.

Abstract

This paper introduces the lightweight monitor in JAVA VM, that is fast on single-threaded programs as well as on multi-threaded programs with little lock contention. A 32 bit lock is embedded in each object for efficient access, while the lock queue and the wait set is managed through a hash table. The lock manipulation code is highly optimized and inlined by our JAVA VM JIT compiler called LaTTe, wherever the lock is accessed. In most cases, only 9 SPARC instructions are spent for lock acquisition, and 5 instructions for lock release. Our experimental results indicate that the lightweight monitor is faster than the monitor in the latest SUN (TM) JDK 1.2 release candidate 1 by up to 21 times in the absence of lock contention, and by up to 7 times in the presence of lock contention.

[ Top ]

An eight-issue tree-VLIW processor for dynamic binary translation

K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. Altman, K. Kailas, T. Bright
Proc. International Conference on Computer Design: VLSI in Computers and Processors, 1998, pp. 488-495

Abstract

Presented is an 8-issue tree-VLIW processor designed for efficient support of dynamic binary translation. This processor confronts two primary problems faced by VLIW architectures: binary compatibility and branch performance. Binary compatibility with existing architectures is achieved through dynamic binary translation which translates and schedules PowerPC instructions to take advantage of the available instruction level parallelism. Efficient branch performance is achieved through tree instructions that support multi-way path and branch selection within a single VLIW instruction. The processor architecture is described, along with design details of the branch unit, pipeline, register file and memory hierarchy for a 0.25 micron standard-cell design. Performance simulations show that the simplicity of a VLIW architecture allows a wide-issue processor to operate at high frequencies.
[Presentation foils (pdf) ]

[ Top ]

Compilers for Instruction-level Parallelism

M. Schlansker (HP), T.M. Conte (North Carolina State U.), J. Dehnert (SGI), K. Ebcioglu (IBM), J.Z. Fang (Intel), C.L. Thompson (HP).
IEEE Computer Magazine 30(12), Dec. 1997, pp. 63-69
(This was a report from a cross-industry task force on ILP)

Abstract

Discovering and exploiting instruction level parallelism in code will be key to future increases in microprocessor performance. What technical challenges must compiler writers meet to better use ILP? Instruction level parallelism allows a sequence of instructions derived from a sequential program to be parallelized for execution on multiple pipelined functional units. If industry acceptance is a measure of importance, ILP has blossomed. It now profoundly influences the design of almost all leading edge microprocessors and their compilers. Yet the development of ILP is far from complete, as research continues to find better ways to use more hardware parallelism over a broader class of applications.

[ Top ]

Parallelizing Non-Numerical Code with Selective Scheduling and Software Pipelining

S.M. Moon, K. Ebcioglu
ACM Transactions on Programming Languages and Systems, November 1997, Vol. 19, No. 6, pp. pp. 853-898, ACM Press.
© ACM 1997

Abstract

Instruction-level parallelism (ILP) in nonnumerical code is regarded as scarce and hard to exploit due to its irregularity. In this article, we introduce a new code-scheduling technique for irregular ILP called "selective scheduling" which can be used as a component for superscalar and VLIW compilers. Selective scheduling can compute a wide set of independent operations across all execution paths based on renaming and forward-substitution and can compute available operations across loop iterations if combined with software pipelining. This scheduling approach has better heuristics for determining the usefulness of moving one operation versus moving another and can successfully find useful code motions without resorting to branch profiling. The compile-time overhead of selective scheduling is low due to its incremental computation technique and its controlled code duplication. We parallelized the SPEC integer benchmarks and five AIX utilities without using branch probabilities. The experiments indicate that a fivefold speedup is achievable on realistic resources with a reasonable overhead in compilation time and code expansion and that a solid speedup increase is also obtainable on machines with fewer resources. These results improve previously known characteristics of irregular ILP.

[ Top ]

Scalable Instruction-level Parallelism through Tree-instructions

J.H. Moreno, M. Moudgill
1997 International Conference on Supercomputing
Vienna, Austria, July 7-11, 1997, pp. 1-11.
© ACM 1997

Abstract

We describe a representation of instruction-level parallelism which does not require checking dependencies at run-time, and which is suitable for processor implementations with varying issue-width. In this approach, a program is represented as a sequence of tree-instructions, each containing multiple primitive operations. These tree-instructions are executable either in one or multiple cycles, do not require a specific processor organization, and are generated assuming an implementation capable of large instruction-level parallelism. During instruction cache reloading/accessing, tree-instructions are decomposed into subtrees which fit the actual resources available in an implementation. The resulting subtrees require a simple instruction-dispatch mechanism, as in the case of statically scheduled processors. The representation makes practical the use of the same parallelized code in implementations with different issue-width; simulation results indicate that the instruction-level parallelism achievable with this approach degrades less than 10% with respect to code compiled for each specific implementation.
[ Presentation foils (pdf, 80 KB) ]

[ Top ]

DAISY: Dynamic Compilation for 100% Architectural Compatibility

K. Ebcioglu, E.R. Altman
24th Annual International Symposium on Computer Architecture
Denver, Colorado, June 1997, pp. 26-37.
© ACM 1997

Abstract

Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor} (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out).

We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
[ Presentation foils (pdf, 186 KB) ]

[ Top ]


Simulation/evaluation environment for a VLIW processor architecture

J.H. Moreno, M. Moudgill, K. Ebcioglu, E. Altman, B. Hall, R. Miranda, S.K. Chen, A. Polyak
IBM Journal of Research and Development, Vol. 41, No. 3, May 1997
© IBM 1997

Abstract

We describe the environment used for the simulation and evaluation of a processor architecture based on very long instruction word (VLIW) principles. In this architecture, a program consists of a set of tree instructions, each one containing multiple branches and operations which can be performed simultaneously. The simulation/evaluation environment comprises
  • An optimizing compiler, which generates tree instructions in a VLIW assembly language.
  • A translator from VLIW assembly code into PowerPC assembly code which emulates the functionality of the VLIW processor for the specific VLIW program. The emulating code also includes instrumentation for collecting execution counts of VLIWs, profiling information, and generation of predecoded execution traces.
  • A cycle timer, invoked by the emulating code on a VLIW-by-VLIW basis, which processes VLIW execution traces as they are generated.

The environment supports the evaluation of alternatives and trade-offs among the VLIW architecture, its compiler, and processor implementations. Emphasis has been placed on providing fast turn-around time for the development of compilation algorithms, and for an efficient compilation to simulation cycle which allows analysis of architecture/compiler trade-offs over complete execution runs of realistic workloads.

[ Top ]


Compiler/architecture interaction in a tree-based VLIW processor

M. Moudgill, J.H. Moreno, K. Ebcioglu, E. Altman, S.K. Chen, A. Polyak
IEEE TCCA Newsletter, June 1997, pp. 10-12.
Also in proceedings of the Workshop on Interaction between Compilers and Computer Architectures, 1997 High-Performance Computer Architecture Conference (HPCA97), San Antonio, February 1997.
© IEEE, June 1997.

Abstract

This paper describes a compilation and simulation environment designed to explore the interaction among compiler and architecture for the case of a very long instruction word (vliw) processor based on tree instructions. The environment is characterized by its flexibility and fast turnaround time, allowing the exploration of architecture/compiler trade-offs in several dimensions over complete execution runs of standard benchmarks. Chameleon, our research compiler, uses state-of-the-art optimizing techniques to extract and exploit instruction-level parallelism. ForestaPC, the vliw architecture, has an instruction set based on the PowerPC architecture. The exploration of interactions and the evaluation of trade-offs has led to the development of some novel ideas in the architecture as well as in the compiler.

[ Top ]

VLIW Compilation Techniques in a Superscalar Environment

K. Ebcioglu, R. Groves, K.C. Kim, G. Silberman, I. Ziv.
Proceedings of the 1994 Programming Languages Design and Implementation (PLDI) Conference, pp.36-48, June 1994.
© ACM 1994

Abstract

We describe techniques for converting the intermediate code representation of a given program, as generated by a modern compiler, to another representation which produces the same run-time results, but can run faster on a superscalar machine. The algorithms, based on novel parallelization techniques for Very Long Instruction Word (VLIW) architectures, find and place together independently executable operations that may be far apart in the original code, i.e., they may be separated by many conditional branches or belonging to different iterations of a loop. As a result, the functional units in the superscalar are presented with more work than can proceed in parallel, thus achieving higher performance than the approach of using hardware instruction dispatch techniques alone.

While general scheduling techniques improve performance by removing idle pipeline cycles, to further improve performance on a superscalar with only a few functional units requires a reduction in the pathlength. We have designed a new set of algorithms for reducing pathlength and removing stalls due to branches, namely speculative load-store motion out of loops, unspeculation, limited combining, basic block expansion, and prolog tailoring. These algorithms were implemented in a prototype version of the IBM RS/6000 xlc compiler and have shown significant improvement in SPEC integer benchmarks on the IBM POWER machines.

Also, we describe a new technique to obtain profiling information with low overhead, and some applications of profiling directed feedback, including scheduling heuristics, code reordering and branch reversal.

[ Top ]


Making Compaction-based Parallelization Affordable

T. Nakatani, K. Ebcioglu,
IEEE Transactions on Parallel and Distributed Systems 4(9), Sept. 1993, pp. 1014-1029.

Abstract

Compaction-based parallelization suffers from long compile time and large code size because of its inherent code explosion problem. If software pipelining is performed for loop parallelization along with compaction, as in our compiler, the code explosion problem becomes more serious. In this paper, we propose the software lookahead heuristic for use in software pipelining, which allows inter-basic-block movement of code within a prespecified number of operations, called the software lookahead window, on any path emanating from the currently processed instruction at compile time. Software lookahead enables instruction-level parallelism to be exploited in a much greater code area than a single basic block, but the lookahead region is still limited to a constant depth by means of a user-specifiable window, and thus code explosion is restricted. The proposed scheme has been implemented in our VLIW parallelizing compiler. To study the code explosion problem and instruction-level parallelism for branch-intensive code, we compiled five AIX utilities: sort, fgrep, sed, yacc, and compress. It is demonstrated that the software lookahead heuristic effectively alleviates the code explosion problem while successfully extracting a substantial amount of inter-basic-block parallelism.

[ Top ]

A Study on the Number of Memory Ports in Multiple Instruction Issue Machines

S.M. Moon, K. Ebcioglu.
Proceedings of MICRO-26, 1993, pp. 49-58.

Abstract

One of the key design concerns of multiple instruction issue (MII) processors is deciding how many memory ports need to be provided, considering performance and efficiency of the target processor. For an MII processor that exploits instruction-level parallelism (ILP) in non-numerical code, this decision is difficult to make due to its irregularity. The authors perform an empirical study aimed at characterizing a suitable MII organization that best exploits irregular ILP. The study is based on the selective scheduling compiler that performs precise memory disambiguation for concurrent execution of multiple memory operations, along with renaming, speculation, and software pipelining. The result indicates that a small number of memory ports (i.e. less than half of the issue rate) is enough for exploiting most of irregular ILP. The authors also examine related issues such as the utilization of memory ports and additional data cache misses caused by speculative loads.

[ Top ]

An Architectural Framework for Supporting Heterogeneous Instruction-Set Architectures

G.M. Silberman and K. Ebcioglu,
IEEE Computer, Vol. 26, No. 6, pp. 39-56
© IEEE, June 1993.

Abstract

The authors describe a novel architectural framework that allows software applications and operating system code written for a given instruction set to migrate to a different, higher performance architecture. Only minor changes are needed on existing software to support this migration, and the performance advantages can be significant. The framework provides a hardware mechanism for seamless switching between two instruction sets, resulting in a machine that enhances application performance while keeping the same program behavior from a user perspective. The framework is designed to include program exceptions, self-modifying code, tracing, and debugging. High execution speed on migrated applications is achieved through automated compiler-assisted translation of the object code of one machine to that of the other, using advanced global optimization and scheduling techniques.

The result is a path for moving from complex instruction-set computers (CISCs) to newer architectures, such as reduced instruction-set computers (RISCs), superscalars, or very long instruction word (VLIW) machines, while protecting the extensive economic investment represented by existing software. Also, the framework can support runtime application migration between machines whose instruction sets it supports, thus providing a mechanism for load balancing and/or fault-tolerance.

Examples are given for IBM System/390 operating-system code and AIX utilities, showing the performance potential of the scheme using a VLIW machine as the high performance target architecture. The need to present the user an image consistent with the original application results in code size and execution time overheads, which are quantified by examining a few applications.

[ Top ]


An efficient resource-constrained global scheduling technique for superscalar and VLIW processors

S.M. Moon and K. Ebcioglu,
Proceedings of MICRO-25, pp. 55-71
IEEE Press
© IEEE, 1992.

Abstract

We describe a new algorithm for parallelization of sequential code that eliminates anti and output dependences by renaming registers on an as-needed basis during scheduling. A dataflow attribute at the beginning of each basic block indicates what operations are available for moving up through this basic block. Scheduling consists of choosing the best operation from the set of operations that can move to a point, moving the instances of the operation to the point, making bookkeeping copies for edges that join the moving path but are not on it, and updating the dataflow attributes of basic blocks only on the paths that were traversed by the instances of the moved operations. The code motions are done globally without going through atomic transformations of percolation scheduling, for better efficiency. For performing the code motions, we use an intermediate representation that is directly executable as sequential RISC code, rather than VLIW code. As a result, the new algorithm can be used to generate parallelized code for multiple ALU superscalar processors as well. The enhanced pipeline scheduling algorithm for software pipelining of arbitrary code is reformulated within the framework of the new sequential RISC representation. The new algorithm has been implemented, and preliminary results on AIX utilities indicate that it requires significantly less compilation time than the percolation scheduling approach.

[ Top ]

On Optimal Parallelization of Arbitrary Loops

U. Schwiegelshohn, F. Gasperoni, K. Ebcioglu
Journal of Parallel and Distributed Computing 11, 130-134 (1991).
© Academic Press, 1991

Abstract

The problem of automatic loop parallelization has received a lot of attention in the area of parallelizing compilers. Automatic loop parallelization can be achieved by several algorithms. In this paper, we address the problem of time optimal parallelization of loops with conditional jumps. We prove that even for machines with unlimited resources, there are simple loops for which no semantically and algorithmically equivalent time optimal program exists.

[ Top ]

Some Global Compiler Optimizations and Architectural Features for Improving Performance of Superscalars

K. Ebcioglu, R. Groves
IBM T.J. Watson Research Center research report no. RC 16145, 10/02/90, Computer Science, 13 pages.

Abstract

We describe a method for converting a given program in the assembly language of a superscalar machine to another program written in the same assembly language, such that the resulting program produces the same final results as to the original one, and can run significantly faster. The method, inspired by new global parallelization techniques for VLIW architectures, finds and places together independently executable operations that may be far apart in the original code (i.e., that may be separated by many conditional branches or that may belong to different iterations of a loop.) As a result, the pipelined functional units in the machine, which could not be kept busy because of the limited size of the execution lookahead window in the hardware, are given more work to do, and higher performance is achieved. We discuss some new architectural features and software support required for speculative operations, that result from moving operations above conditional jumps as part of the techniques. As a preliminary demonstration of the value of the techniques, an inner loop of the sequential natured SPEC Lisp Interpreter benchmark is automatically parallelized; the result indicates a potential performance of 3.5 RISC instructions/cycle in this inner loop.

[ Top ]

A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture

K. Ebcioglu, T. Nakatani
Languages and Compilers for Parallel Computing,
D. Gelernter, A. Nicolau, D. Padua (eds.), pp. 213-229, Research Monographs in Parallel and Distributed Computing, MIT Press, 1990.

Abstract

Enhanced pipeline-percolation scheduling is a new compilation technique we developed to extract fine-grain parallelism uniformly from nested loops in general software. The compilation technique generates code for execution on the IBM Very Long Instruction Word (VLIW) machine, which is now being built at the IBM T. J. Watson Research Center. The IBM VLIW architecture has features that facilitate high performance on general sequential-natured code with unpredictable branches: a decision-tree shaped instruction capable of performing multiple conditional branches emanating from arbitrary paths in the sequential code, a conditional execution feature that reduces pathlengths, multiple functional units and multiple ports to data memory, and a shared register file which eliminates communication delays between functional units. The preliminary version of the parallelizing compiler for the IBM VLIW machine is now operational both at the IBM Tokyo Research Laboratory and the IBM T. J. Watson Research Center. It is shown that a pathlength reduction of 5-11X over a RISC processor can be achieved on some of the Stanford integer benchmarks and other sequential-natured C programs.

[ Top ]

Combining as a Compilation Technique for VLIW Architectures

T. Nakatani and K. Ebcioglu.
Proceedings of MICRO-22, pp. 43-55. August 1989.
© ACM 1989.

Abstract

Combining is a local compiler optimization technique that can enhance the performance of global compaction techniques for VLIW machines. Given two adjacent operations of a certain class that are flow (read-after-write) dependent and that cannot be placed in the same micro-instruction, the combining technique can transform the operations so that the modified operations have no dependence. The transformed operations can be executed in the same micro-instruction, thus allowing the total execution time of the program to be reduced. In this paper, combining a pair of flow-dependent operations into a wide instruction word is suggested as an important compilation technique for VLIW architectures. Combining is particularly effective with software pipelining and loop unrolling since combinable operations can come together with a higher probability when these compilation techniques are used. We implemented combining in our parallelizing compiler for the wide instruction word architecture, which is now being built at the IBM T. J. Watson Research Center. It is shown that ten percent speedup is obtained on the Stanford integer benchmarks and other sequential-natured C programs, in comparison to compaction techniques that do not use combining. For a class of inner loops, combining can remove the inter-iteration dependences completely and can improve performance in the same ratio as the loop is unrolled.

[ Top ]

A Global Resource-constrained Parallelization Technique

K. Ebcioglu, A. Nicolau
Proceedings, 1989 International Conference on Supercomputing, pp. 154-163, June 1989.
© ACM 1989

Abstract

This paper presents a new approach to resource-constrained compiler extraction of fine-grain parallelism, targeted toward VLIW supercomputers, and in particular, the IBM VLIW (Very Long Instruction Word) processor. The algorithms described integrate resource limitations into Percolation Scheduling-a global parallelization technique- to deal with resource constraints, without sacrificing the generality and completeness of Percolation Scheduling in the process. This is in sharp contrast with previous approaches which either applied only to conditional-free code, or drastically limited the parallelization process by imposing relatively local heuristic resource constraints early in the scheduling process.

[ Top ]

Some design ideas for a VLIW architecture for sequential-natured software

K. Ebcioglu,
Parallel Processing, pp.3-21
M. Cosnard, M.H. Barton and M. Vanneschi (Editors)
Elsevier Science Publishers B.V. (North-Holland)
© IFIP, 1988.

Abstract

In this paper, we describe a Very Long Instruction Word architecture, now being designed at the IBM T.J. Watson Research Center, which is intended to achieve good performance not only in scientific code, but also in sequential, non-numerical code. Communication delays between processing elements are minimized via a single shared register file with a large number of ports. To perform well on programs with unpredictable branches, the architecture features decision-tree shaped instructions that allow multiway branching, and that allow operations to be executed conditionally depending on where the instruction branches to. To add to the parallelism achievable via existing compilation techniques for VLIW architectures, we have developed a compilation technique called pipeline scheduling, which is an extension of the "doacross" and "dopipe" techniques proposed for multiprocessors by D. Kuck's group. This technique can initiate a new iteration of an inner loop (possible containing arbitrary if-then-else statements and conditional exits) on every clock period whenever dependences and resources permit.

[ Top ]

A Compilation Technique for Software Pipelining of Loops with Conditional Jumps

K. Ebcioglu,
Proceedings of MICRO-20, pp. 69-79
© ACM Press, 1987.

Abstract

We describe a compilation algorithm for efficient software pipelining of general inner loops, where the number of iterations and the time taken by each iteration may be unpredictable, due to arbitrary if-then-else statements and conditional exit statements within the loop. As our target machine, we assume a wide instruction word architecture that allows multi-way branching in the form of if-then-else trees, and that allows conditional register transfers depending on where the microinstruction branches to (a hardware implementation proposal for such a machine is briefly described in the paper). Our compilation algorithm, which we call the pipeline scheduling technique, produces a software-pipelined version of a given inner loop, which allows a new iteration of the loop to begin on every cycle whenever dependences and resources permit. The correctness and termination properties of the algorithm are studied in the paper.

[ Top ]

 
  Related Research 
arrow DAISY
arrow LaTTe: an open-source JIT compiler
  More Information
arrow Talks and Presentations
arrow Publications and Patents
arrow Selected Abstracts

 
  About IBM  | Privacy  | Legal  | Contact