The current focus of Distributed and High Performance Computing research is on improving the performance and manageability of distributed infrastructure for business computing.
Service-oriented Computing (SOC) is emerging as a promising paradigm for meeting the heterogeneity and agility requirements (both functional and non-functional) of enterprise IT infrastructure in a cost-efficient manner.
We have been working on open standards-based compositional systems since our inception, demonstrating how the data and control dependencies of a composite service can be analyzed so that its code can be partitioned into smaller components that facilitate decentralized execution of the service. The goal has been to achieve higher throughput, better scalability, and lower response times through decentralized orchestration. A more recent focus is the algorithmic analysis, at design time, of a composite application together with the target IT environment to predict non-functional properties (such as response time, throughput, and availability). We are defining new and easy ways to construct a ‘total picture’ from the metadata available in the IT environment, and to analyze this total picture to predict performance. We leverage standards such as XML and WS-*, and use techniques from performance modeling, algorithmic analysis, systems management, statistical inference, and programming technologies and software engineering. The following projects capture our ongoing and past research.
- iQuilt: Bridging the chasm between business and IT
- Symphony: Decentralized orchestration of composite Web services
- Grid Services
- eUtilities
In health monitoring and fault management, the enterprise IT infrastructure is dynamic and continuously evolving: clusters of application servers serve web requests, with backend systems comprising database servers and legacy systems that cater to data and information. Typically, multiple copies of services execute within these servers, and applications, servers, nodes, backend systems, etc. may start and stop in a possibly unscheduled manner. A fault or performance anomaly at any of these elements interacts with other elements in an intricate manner and manifests itself at the application and business process layer as a functional or non-functional problem. Our focus is on the problem determination and resolution aspects of such enterprise infrastructure, using statistical modeling, estimation, and inference techniques. The following are our active and past projects in this area.
- Problem determination and localization in dynamic enterprise middleware systems
- Problem determination and prediction in large scale e-business systems
|
The TOP500 list (www.top500.org), updated twice a year, ranks the world's 500 most powerful computers by their performance on the LINPACK benchmark (www.netlib.org), and it makes for both gratifying and interesting reading. According to the last available report (November 2006), IBM's Blue Gene/L, installed at the Department of Energy's Lawrence Livermore National Laboratory, is the fastest supercomputer in the world. In addition, four of the top five most powerful computers in the world are from IBM.
Simultaneous with the above trend is the increasing amount of data being generated in retail, finance, insurance, and government, and the requirement for these enterprises to analyze these large volumes of data quickly. Design innovations that produce teraflops while consuming a fraction of the power have prompted the consideration of these massively parallel machines for non-scientific workloads as well.
Perhaps, now more than ever before, these massively parallel machines with their unique topologies and interconnection fabrics require algorithms, data flows, and backend optimizations to fully define their computational limits.
The focus of the HPC group is to design cutting-edge parallel programs as well as to analyze and improve the performance of engineering, scientific, and business applications on high performance platforms.
The following list highlights some of our recent and ongoing work:
- Performance on multi-core processors
- Mathematical libraries, business intelligence and analytics applications
- Blue Gene performance
- HPCC, SPEC, financial and scientific applications
- Cluster performance
- Impact of noise on parallel program performance
- Grid performance
- Data and compute co-scheduling
- Life science and medical imaging applications
- Checkpoint and fault isolation for Blue Gene/L
Recent publications resulting from our work appear in the bibliography. Three of the recent papers appear in the following pages, and short summaries of these papers appear below.
- The paper titled "Algorithmic Ramifications of Prefetching in Memory Hierarchy" by Akshat Verma (IBM -- India Research Lab) and Sandeep Sen (IIT Delhi) won the Best Paper Award (Track: Algorithms and Applications) at the 13th International Conference on High Performance Computing, 2006. The work answers two key questions: (i) Are the models of hierarchical memory used for algorithm design effective in predicting the performance of algorithms on high performance architectures with prefetching support (e.g., Blue Gene)? (ii) If the models need to be enhanced, what changes are required in the memory models and the algorithms? The authors show that existing I/O models are insufficient to capture prefetching and that the running times these models predict for various algorithms may be off by a factor of hundreds. The authors then propose the 'Prefetch Model', which faithfully captures prefetching, and propose algorithms that are optimal in the Prefetch Model.
- The paper titled "Software Routing and Aggregation of messages to optimize the performance of the HPCC Randomaccess Benchmark" by Rahul Garg (IBM -- India Research Lab) and Yogish Sabharwal (IBM -- India Research Lab) was a runner-up for the best paper award at the 20th ACM International Conference on Supercomputing, 2006. This work improves the performance of the HPCC Randomaccess benchmark on Blue Gene/L, which has led to IBM/LLNL winning the HPCC Class I award for this benchmark for two years in a row (SC 2005 and SC 2006). The work shows that, on many systems, the bisection bandwidth of the network may be the performance bottleneck of this benchmark. It suggests a technique based on aggregation and software routing and provides performance results obtained with this technique on Blue Gene/L.
- The paper titled "Large Scale Drop Impact Analysis of Mobile Phone Using ADVC on Blue Gene/L" was a Gordon Bell Prize finalist at the 20th ACM International Conference on Supercomputing, 2006. The authors, from several geographies and organizations, including IBM -- India Research Lab, report on optimizations done to enable drop impact analysis of a mobile phone at a scale that is unprecedented in the electronics industry.
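
The aggregation and software routing idea mentioned in the Randomaccess summary above can be illustrated with a toy message count. The sketch below is not the algorithm from the paper: it assumes a hypothetical two-dimensional virtual grid of nodes, a fixed batch size, and a simple row-then-column route, and the names `route_2d` and `count_messages` are made up for this illustration. It only shows why forwarding updates through an intermediate node lets each node fill larger messages.

```python
import math
from collections import Counter


def route_2d(src, dst, cols):
    """Hops from src to dst on a hypothetical virtual grid that is `cols` wide.

    An update first moves within the source row to the destination's column,
    then within that column to the destination (an assumed routing scheme).
    """
    src_row = src // cols
    via = src_row * cols + dst % cols  # same row as src, same column as dst
    hops = []
    if via != src:
        hops.append(via)
    if dst != via:
        hops.append(dst)
    return hops


def count_messages(updates, cols, batch_size, use_routing):
    """Count the network messages needed to deliver (src, dst) updates.

    Every sender aggregates updates per next hop and sends one message per
    batch of up to `batch_size` updates.
    """
    link_load = Counter()  # (sender, receiver) -> number of updates carried
    for src, dst in updates:
        if src == dst:
            continue  # local update, no network traffic
        path = route_2d(src, dst, cols) if use_routing else [dst]
        sender = src
        for hop in path:
            link_load[(sender, hop)] += 1
            sender = hop
    return sum(math.ceil(n / batch_size) for n in link_load.values())


# Toy workload: 16 nodes on a 4 x 4 grid, one update from every node to every other node.
nodes = range(16)
updates = [(s, d) for s in nodes for d in nodes if s != d]

direct = count_messages(updates, cols=4, batch_size=8, use_routing=False)
routed = count_messages(updates, cols=4, batch_size=8, use_routing=True)
print("direct messages:", direct)
print("routed messages:", routed)
```

In this toy setup the direct scheme sends 240 single-update messages, while the routed scheme sends 96 messages that each carry four updates. The trade-off is that most updates now cross the network twice, so whether such a scheme helps on a real machine depends on its per-message overheads and network bandwidth.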
|
- Akshat Verma and Sandeep Sen
"Algorithmic Ramifications of Prefetching in Memory Hierarchy", HiPC, 2006. (Best Paper Award)
- Rahul Garg and Yogish Sabharwal
"Software Routing and Aggregation of messages to optimize the performance of the HPCC Randomaccess Benchmark", ACM Supercomputing, 2006 (runner-up to the best paper award)
- H. Akiba, T. Ohyama, Y. Shibata, K. Yuyama, Y. Katai, R. Takeuchi, T. Hoshino, S. Yoshimura, H. Noguchi, M. Gupta, J. Gunnels, V. Austel, Y. Sabharwal, R. Garg, S. Kato, T. Kawakami, S. Todokoro and J. Ikeda
"Large Scale Drop Impact Analysis of Mobile Phone Using ADVC on Blue Gene/L", ACM Supercomputing, 2006 (Gordon Bell Prize finalist).
- Rahul Garg, Vijay K. Garg and Yogish Sabharwal
"Scalable Algorithms for Global Snapshots in Distributed Systems", International Conference on Supercomputing, 2006.
- Rahul Garg and Yogish Sabharwal
"Optimizing the HPCC Randomaccess Benchmark on Blue Gene/L Supercomputer", SIGMETRICS/Performance, 2006.
- V. Krishna Nandivada and Suresh Jagannathan
"Dynamic State Restoration Using Versioning Exception", Higher-Order and Symbolic Computation, volume 19(1), pages 101-124, March 2006.
- V. Krishna Nandivada and Jens Palsberg
"SARA: Combining Stack Allocation and Register Allocation", European Symposium on Compiler Construction (CC), pages 232-246, 2006.
- Pradipta De and Rahul Garg
"Empirical Evaluation of Impact of Noise on Large Clusters", HiPC 2006.
- Saurabh Aggarwal, Rahul Garg and Nisheeth Vishnoi
"The Impact of Noise on Scaling of Collectives: A Theoretical Approach", HiPC, 2005.