Journal of Research and Development  
Volume 42, Number 2, 1998
MultiMedia Systems
DOI: 10.1147/rd.422.0177

Evolution and challenges in multimedia

by A. Dan, S. I. Feldman and D. N. Serpanos
This paper outlines the challenges and issues in developing multimedia-based application programs and solutions. It provides a brief history of multimedia-based applications and a critical perspective of the research issues that should be addressed in the future.

Introduction

The integration of multiple media, e.g., voice, video, image, and data, provides an effective means of communication to the users of various services. Because of the advances in computer and communication technologies, creating sophisticated multimedia user interfaces is no longer limited to special-purpose applications. The popularity of the World Wide Web, where most applications currently utilize images and data, is adequate testimony to this fact. The increasing power of electronic circuitry in workstations, personal computers, and consumer electronics, in conjunction with the decreasing cost of high-bandwidth and low-latency communication, has created considerable momentum to develop sophisticated multimedia applications as well as to provide new types of services to businesses and homes. These capabilities involve improving user interfaces to many traditional applications as well as creating futuristic applications such as virtual environments and augmented reality [1]. The basic idea behind the latter types of applications is to immerse a user in an imaginary, computer-generated virtual world or to augment the real world (i.e., via augmentation of human perception by supplying information not ordinarily detectable by human senses) around a user with superimposed computer graphics projected onto the walls or onto a head-mounted display. In another example (referred to as telepresence), a multitude of stationary cameras mounted in a remote environment are used to acquire both photometric and depth information. A virtual environment is then constructed in real time and redisplayed in accordance with the local participant's head position and orientation. This allows local users to interact with other remote individuals (say, with a medical consultant) as if they were actually within the same space.

The evolution of multimedia applications can be traced through three major stages. First, even prior to the deployment of delivery-network infrastructure, stand-alone applications (e.g., video arcade games) and CD-ROM-based applications had successfully integrated multiple media, mostly in the form of games, entertainment, and educational materials.

Next, high expectations that technological breakthroughs would make lower-cost delivery bandwidth available (as needed for sophisticated multimedia applications) created tremendous excitement in the area of multimedia. The potential for the convergence of multiple services (e.g., TV, movie, and telephone) had stirred up the marketplace. Almost every day, newspaper headlines announced new field trials and potential mergers of corporations. The focus in multimedia shifted to the creation of large-scale video servers and delivery infrastructure that would be capable of delivering thousands of simultaneous high-quality video streams to homes and businesses. However, the economics of the marketplace has run counter to these high expectations. For example, in movie-on-demand applications, the cost of storage of a large video library and the cost of delivery bandwidth for a two-hour video per customer were found to be prohibitively high in comparison with traditional movie rentals. In addition, the infrastructure for delivering high-quality video was mostly unavailable, particularly to the home. This issue is referred to as the last-mile bandwidth problem. Compounding this problem, most available multimedia content had been created as movies. Video-on-demand applications lacked the depth of content (complexity of presentation) and provided only limited interactivity (e.g., VCR control operations) as compared to Web-based applications. New forms of content have to be created to provide marketable services to users (e.g., distance learning, travel, advertisement, electronic commerce). Furthermore, sophisticated tools must be available for easy creation of such multimedia content.

The popularity of the Web, which can use very low-bandwidth networks in the last mile, and the lack of deployment of large-scale video servers have laid the groundwork for the current stage of multimedia. Applications using high-bandwidth multimedia streams will be deployed in enterprise (high-bandwidth-network) environments and more slowly (as higher-bandwidth infrastructure is deployed) for residential consumers. Such services will likely follow the near-video-on-demand (NVOD) model [2] in which a large number of clients can share a set of broadcast channels, rather than the video-on-demand (VOD) model, which requires an individual stream per client. For example, the Digital Satellite System (DSS) broadcasts (via satellite) a large number of entertainment streams to its paid subscribers. In time, these services will evolve to include NVOD-like features; e.g., VCR control features will be provided by the mechanism of copying a broadcast stream to the local disk of a client or of switching to a different broadcast channel with later playback of the same material. (Of course, any local copying must take into account the protection of intellectual property rights.)
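As a rough illustration of the NVOD model, the sketch below assumes that one title loops continuously on a fixed number of broadcast channels whose start times are evenly staggered. The function names and the 120-minute, 8-channel figures are hypothetical and are not taken from any deployed system. Under that assumption the worst-case wait for a new viewer equals the stagger interval, and a pause can be approximated by rejoining a channel that started later.

```python
def stagger_interval(duration_min, channels):
    """Gap between successive start times when one title loops on `channels` channels."""
    if channels <= 0:
        raise ValueError("need at least one channel")
    return duration_min / channels


def channel_positions(duration_min, channels, now_min):
    """Playback position (minutes into the title) on every channel at time now_min.

    Channel k is assumed to start k stagger intervals after channel 0 and to loop forever.
    """
    interval = stagger_interval(duration_min, channels)
    return [(now_min - k * interval) % duration_min for k in range(channels)]


def channel_after_pause(duration_min, channels, now_min, paused_at_min):
    """Pick the channel to rejoin after a pause.

    Returns (channel index, minutes of material the viewer sees again), choosing
    the channel whose current position is closest behind the paused position.
    """
    positions = channel_positions(duration_min, channels, now_min)

    def overlap(k):
        return (paused_at_min - positions[k]) % duration_min

    best = min(range(channels), key=overlap)
    return best, overlap(best)


# A 120-minute title on 8 channels: a new viewer waits at most 15 minutes.
print(stagger_interval(120, 8))
# A viewer who paused at minute 50 and resumes at wall-clock minute 100.
print(channel_after_pause(120, 8, now_min=100, paused_at_min=50))
```

The overlap returned by `channel_after_pause` is the price of not dedicating a stream to the viewer: more channels per title shrink both the initial wait and the repeated material, at the cost of broadcast bandwidth.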

The primary focus of current research activities, however, is on the creation of multimedia content that can be delivered inexpensively by means of the existing low-bandwidth networking infrastructure as well as on the development of platform-independent applications that can adapt to heterogeneity in environments arising from the differences in capabilities of system components and end-user devices. The new forms of content fall into two categories: low-bandwidth real-time conferencing, and delivery of composite multimedia documents consisting of images, text, and possibly short audio and video clips. The quality of the presentation will change (i.e., dynamically adapt) with the availability of delivery bandwidth. The business and research focus has shifted from just video or audio quality to information delivery. Video-conferencing applications, for example, are replacing special video-conferencing rooms (which have high operational cost) used in many business environments, and bringing the desired functionality to the desktop.

Challenges in multimedia-system design

Efficient multimedia systems are not just simple extensions of conventional computing systems. Delivery of time-sensitive information, such as voice and video, and handling of large volumes of data require special considerations to produce successful widespread applications. Consequently, all system components, e.g., storage, operating systems, and networks, must support these additional requirements. This results in very different architectural and design decisions in all subsystems and their services. More specifically, some of the important issues that have to be addressed when one is designing a multimedia system are

  • Storage organization and management.
  • Available physical bandwidth in the delivery path to the users.
  • Quality-of-service (QoS) management (real-time delivery and adaptability to the environment).
  • Information management (indexing and retrieval).
  • User satisfaction.
  • Security (especially management of content rights).
A generalized environment supporting multimedia applications to homes or to business environments is shown in Figure 1, in which U1 represents the user at home having access to an application or service through an access network, while a business user (U2) is connected through a local area network, i.e., a customer-premises network. A multimedia server delivers services (i.e., multimedia data streams) over a series of (possibly) heterogeneous networks to the clients. (Further processing can take place within the network, e.g., multicasting, transformation of content.) As the figure illustrates, the need to provide flawless service to users places significant requirements on a large number of components, which have to work well together. In order to cope with the complexity of several heterogeneous systems involved in the end-to-end service, one can systematically analyze the components in order to evaluate various options and the design challenges. The design issues are broadly classified into the following categories:

  1. Server subsystems that store, manage, and retrieve the multimedia data stream(s) upon a user request.

  2. Network subsystems that transport, deliver, adapt (and perhaps transform) the data streams isochronously (i.e., at a specified rate, without observable delay) to the clients.

  3. Client subsystems that receive and/or prefetch the data streams and manage the presentation of data.

  4. Application programs that deal with relationships among data frames and media segments, and manage user navigation and retrieval of this data.

Figure 1

Multimedia technology has to address all of the components in coordination, since each component has evolved individually up to now. The following sections describe some of the major challenges to be overcome for modern multimedia applications to become ubiquitous and as influential as we predict. The discussions are not intended to be exhaustive, but to refer to significant research activities that are described in this issue of the IBM Journal of Research and Development or are under way at IBM and elsewhere. For a broader introduction to and survey of the multimedia area, the readers should refer to a companion paper in this issue [3].

  Server-design issues
Multimedia servers store and manage multimedia documents and deliver data streams in an isochronous manner, in response to delivery requests from users. They may also process the stored information before delivery to users. In the future, multimedia servers may range from low-cost, PC-based simple servers that deliver a few streams, to scalable large (either centralized or distributed [4, 5]) servers that will provide thousands of streams. The content may range from long, sequentially accessed videos to composite documents consisting of a mixture of small multiple-media segments (e.g., video, audio, image, text).

The multimedia objects that are large and are played back sequentially require relatively large storage space and playback bandwidth. Therefore, considerable initial research has been directed toward a) efficient retrieval of data from disks, in order to amortize seek time and to avoid jitter [6], b) striping of data across storage devices for load balancing [7], and c) scheduling of system resources to avoid jitter [8-10]. Retrieval of large data blocks is now a well-accepted practice in most commercial servers [7, 8]. The use of double buffering to improve bandwidth utilization without incurring jitter can be cost-effective [11]. While simple striping is effective for load balancing across a set of homogeneous storage devices for long sequential accesses, heterogeneous devices call for more complex block-allocation policies that take into account the bandwidth and storage capacities of the devices. An alternative approach is to partition the storage devices into sets of homogeneous striping groups. A higher-level policy assigns media objects to these groups, matching the storage and bandwidth requirements of the documents to the storage and bandwidth available in the striping groups [12]. Another related problem is dynamic load balancing across servers or striping groups [13]. For example, in [14] a policy of temporarily replicating "hot" data segments (i.e., frequently accessed parts of the media objects) by copying retrieved data in playback streams is used to dynamically balance the load across striping groups. Such a dynamic policy is complementary to the initial data-placement policy.
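Two of the ideas above lend themselves to a small numerical sketch: why large blocks amortize seek time, and how simple round-robin striping spreads consecutive blocks over homogeneous disks. The disk parameters below (10 ms positioning time, 10 MB/s transfer rate) are illustrative assumptions only, and the function names are hypothetical rather than taken from any of the cited systems.

```python
def effective_bandwidth(block_bytes, seek_s, transfer_bytes_per_s):
    """Sustained throughput when every block read pays one seek before transferring."""
    transfer_s = block_bytes / transfer_bytes_per_s
    return block_bytes / (seek_s + transfer_s)


def stripe_location(block_index, num_disks):
    """Round-robin striping: map a logical block to (disk number, block offset on that disk)."""
    return block_index % num_disks, block_index // num_disks


SEEK_S = 0.010          # assumed average positioning time
RATE = 10_000_000       # assumed media transfer rate in bytes per second

# Small blocks waste much of the disk's time on seeks; large blocks approach the raw rate.
for block in (100_000, 1_000_000):
    print(block, round(effective_bandwidth(block, SEEK_S, RATE)))

# Consecutive blocks of one video land on consecutive disks, which balances
# the load for long sequential playback.
print([stripe_location(i, 4) for i in range(6)])
```

With these assumed figures a 100-KB block achieves half of the raw transfer rate, while a 1-MB block achieves roughly 90 percent of it; the price is a proportionally larger buffer per stream.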

The low-cost servers that are designed to serve a few dozen streams may be simple, PC-based servers and may not employ various resource-optimization techniques (other than retrieval of large data blocks). However, large-scale servers that can serve thousands of streams exploit the economies of scale and can be quite complex [8]. For example, in a video-on-demand server, multiple requests for the same popular video that arrive within a short time period can be batched for service by a single playback stream. Various video-scheduling policies take into account user satisfaction while improving effective server capacity [15]. Another optimization technique known as bridging speeds up the following streams while slowing down the preceding streams for the same video in order to achieve an eventual merging of these streams, and hence to avoid multiple playback streams [16]. The main memory buffer of the server can be used to cache retrieved data for later playback. A technique known as interval caching uses buffers to cache the short intervals between successive playback streams of the same video [17]. This caching can also be used for balancing changing loads across storage devices [17].
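Batching and bridging can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not one of the scheduling policies of [15] or [16]: requests for the same video are grouped if they arrive within a fixed window after the first request of the group, and bridging is reduced to the arithmetic of two streams whose playback rates are nudged toward each other. All names and figures are hypothetical.

```python
def batch_requests(requests, window_s):
    """Group (arrival_time_s, video_id) requests into shared playback streams.

    A batch for a video opens at its first request and collects every request
    for that video arriving within window_s; the stream starts when the window closes.
    Returns a list of (video_id, stream_start_s, arrival_times).
    """
    batches = []      # batches in order of creation
    open_batch = {}   # video id -> most recent batch for that video
    for arrival_s, video in sorted(requests):
        batch = open_batch.get(video)
        if batch is None or arrival_s - batch["first"] > window_s:
            batch = {"video": video, "first": arrival_s, "arrivals": []}
            open_batch[video] = batch
            batches.append(batch)
        batch["arrivals"].append(arrival_s)
    return [(b["video"], b["first"] + window_s, b["arrivals"]) for b in batches]


def merge_time_s(gap_s, speedup, slowdown):
    """Wall-clock time until two streams of the same video merge.

    The trailing stream plays at (1 + speedup) times normal rate and the leading
    one at (1 - slowdown), so the gap in content closes at (speedup + slowdown)
    content-seconds per second.
    """
    return gap_s / (speedup + slowdown)


requests = [(0, "A"), (5, "B"), (8, "A"), (40, "A"), (12, "B")]
streams = batch_requests(requests, window_s=10)
print(len(requests), "requests served by", len(streams), "streams")
print(merge_time_s(gap_s=60, speedup=0.05, slowdown=0.05))
```

A longer window saves more streams but makes the earliest requester wait longer, which is the trade-off between server capacity and user satisfaction mentioned above.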

To guarantee QoS and avoid jitter, multimedia servers employ admission control; i.e., a new request is granted only if enough server resources are available. This implies that a video server has to know in advance the number of streams it can serve without jitter. The capacity of a video server can be expressed in statistical terms, e.g., the number of streams it can serve with a certain level of jitter [11]. For a server complex comprising many different components, the capacity is expressed not by a single number but by means of a graph model with individual capacities of the components [8]. The admission control relies on finding a delivery path from the storage devices to the selected network interface and reserving appropriate capacities on all components on this path for normal playback. The admission control may also rely on resource-optimization techniques, e.g., batching and caching. VCR control operations (e.g., pause, resume, fast-forward) introduce new challenges for efficient use of resources. Resources may be released and reacquired upon resumption. However, it may not be possible to guarantee that resources are available upon resumption. For example, a new playback stream must be created upon resumption of a client stream that was originally served via batching. In large-scale servers, a small amount of capacity may be set aside as contingency capacity for dealing with VCR control operations [18].
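The path-based admission control described above can be illustrated with a minimal sketch. It assumes that each server component has a single bandwidth capacity, that the candidate delivery paths for a request are already known, and that a stream needs the same rate on every component of its path. The class, the component names, and the capacities are hypothetical and much simpler than the graph model of [8].

```python
class AdmissionController:
    """Admit a stream only if every component on some delivery path has spare capacity."""

    def __init__(self, capacity_mbps):
        self.capacity = dict(capacity_mbps)
        self.reserved = {name: 0.0 for name in capacity_mbps}

    def admit(self, candidate_paths, rate_mbps):
        """Reserve rate_mbps on the first feasible path; return that path, or None if rejected."""
        for path in candidate_paths:
            if all(self.reserved[c] + rate_mbps <= self.capacity[c] for c in path):
                for c in path:
                    self.reserved[c] += rate_mbps
                return tuple(path)
        return None

    def release(self, path, rate_mbps):
        """Give back the capacity of a finished (or paused) stream."""
        for c in path:
            self.reserved[c] -= rate_mbps


server = AdmissionController({"disk1": 8, "disk2": 8, "bus": 12, "nic": 12})
paths = [("disk1", "bus", "nic"), ("disk2", "bus", "nic")]
for _ in range(4):
    print(server.admit(paths, rate_mbps=4))
```

In this example the fourth request is rejected because the shared bus and network interface are full, even though `disk2` still has spare bandwidth. Releasing resources on pause, as discussed above, is what makes a later resume uncertain: another stream may have claimed the capacity in the meantime.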

In addition to addressing the above resource optimization issues, the next generation of multimedia servers must address the challenges in retrieval of composite multimedia documents. A composite multimedia document consists of many different parts (e.g., video and audio clips, image, text, and data), with very different storage and bandwidth requirements, that are to be presented in a synchronized fashion. Therefore, the bandwidth requirement varies over time. Fine-granularity, time-varying advanced reservation of resources along with prefetching of data can be used to improve resource utilization [19]. The issues must be addressed both at the clients and at the servers. The prefetch schedule for various media objects should also take into account the bandwidth fragmentation in a server where different components may be located in different storage devices.
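To see why the bandwidth requirement of a composite document varies over time, and how prefetching can lower the rate that must be reserved, consider the following sketch. It assumes one-second time slots, segments described as (start, end, rate) triples, and a client buffer large enough to hold whatever is sent early; it is not the reservation scheme of [19], and the names and numbers are illustrative.

```python
def demand_profile(segments, duration_s):
    """Per-second consumption rate of a document whose segments are (start_s, end_s, rate)."""
    profile = [0.0] * duration_s
    for start_s, end_s, rate in segments:
        for t in range(start_s, min(end_s, duration_s)):
            profile[t] += rate
    return profile


def peak_rate(profile):
    """Rate to reserve if data is delivered exactly when it is consumed."""
    return max(profile)


def min_constant_rate(profile, lead_s=0):
    """Smallest constant delivery rate that never starves the client.

    Delivery starts lead_s seconds before playback; the data for slot i must
    have arrived by the end of slot i, so rate * (i + 1 + lead_s) must cover
    the cumulative demand up to and including slot i.
    """
    needed = 0.0
    cumulative = 0.0
    for i, demand in enumerate(profile):
        cumulative += demand
        needed = max(needed, cumulative / (i + 1 + lead_s))
    return needed


# A low-rate stream for 4 s, joined by a high-rate clip during the last 2 s.
profile = demand_profile([(0, 4, 1.0), (2, 4, 3.0)], duration_s=4)
print(profile, peak_rate(profile), min_constant_rate(profile), min_constant_rate(profile, lead_s=1))
```

Here the peak demand is 4 units, but sending the later clip early allows a constant reservation of 2.5 units, or 2 units if delivery starts one second before playback.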

  Network challenges
Network subsystems create delivery paths from the multimedia servers to the individual clients, and transport the media objects. To avoid long latency and the need for large client buffers, some media objects (e.g., large video and audio files) are streamed isochronously to the clients from the servers. [Note, however, that not all media delivery requires isochronous streaming; for example, it is not needed when the user browses noncontinuous media (e.g., text, image).] Table 1 illustrates the Open Systems Interconnection (OSI) layers and their services that should be addressed for end-to-end multimedia applications. Starting from the lowest layers, one finds challenges that are directly or indirectly related to providing higher quality of service to the user.

Table 1   Network layers.

Multimedia application (OSI 6-7)
Transport (OSI 4-5-6)
Network
Data-link control
Physical

Networks, covering the lowest four layers, have to meet several challenges in the direction of multimedia services. New technologies are required to bring high bandwidth to the user. In business environments there is typically a LAN attachment for each user station, which provides shared or point-to-point bandwidth of the order of 10 Mb/s or better. However, the network to the home is currently limited to a few tens of Kb/s (or to 128 Kb/s in the case of ISDN), because of the current telephone network infrastructure. The evolution of technologies such as ADSL [20, 21] and HFC (CATV) [20, 21] will bring bandwidth of the order of Mb/s to the home, which is a requirement for the successful deployment of real-time multimedia to the home.

Raw bandwidth is not enough for effective delivery of multimedia services, especially when this bandwidth is shared among several systems or applications; it is necessary to provide mechanisms that achieve delivery of voice and video information isochronously, so that the corresponding network connections allow users to enjoy flawless delivery of media information. The general concept of QoS, which has been introduced to broadband ISDN, is used in multimedia applications. Although the term QoS is general, it implies all of the properties that are required for a flawless end-to-end service, i.e., delivery of information isochronously, so that the end user experiences a continuous flow of audio and/or video without interruption or noticeable problems of degraded quality.

Network technology and the communication protocol stack(s) used for multimedia information transmission clearly play an important role in achieving the above goals, i.e., to provide high bandwidth to the user with the appropriate QoS. The wide spectrum of technologies in all areas of networking--local, metropolitan, and wide-area--requires the development of interconnecting systems (adapters, bridges, routers, switches, etc.), which support high-speed communication among heterogeneous or similar networks with low latency and low cost. The challenges in this area are numerous: minimization of data-transmission and data-reception overhead in order to preserve link bandwidth for applications [22, 23], low-latency protocol processing [22, 24, 25], and high-speed routing [26, 27], to mention a few. These problems originate from data networking; the stricter latency requirements of video and audio, in conjunction with the differing standards that may be adopted at the various networks, impose even stricter requirements. For noncontinuous, media-based applications, latency is often far more important than jitter, since it may affect the interactive response time. For certain networking technologies (e.g., cable modem with asymmetrical bandwidth, using TCP/IP), this latency can be quite high.

Depending on the level and protocol stack used for the service, different problems may arise; for example, transmission over the Internet or intranets with the use of the currently widespread TCP/IP protocol suite does not guarantee in-time delivery of time-sensitive information, while MPEG** transmission with the use of the AAL-5 protocol over an ATM network requires that the network have bounded jitter [21].

Considering that one of the main targets of multimedia applications is provision of services to users in the same fashion as telephone service, e.g., in kiosks and public multimedia terminals, it is imperative that the network infrastructure support security to appropriate levels, in order to protect customers and ensure privacy. Tamper-free systems, efficient authentication mechanisms, and efficient and reliable transmission of encrypted information are a necessity in such environments. In environments where information is transmitted encrypted and/or encoded (for compression), special attention has to be paid to the recovery of lost information in the network.

  Client-design issues
The design of the client will, as always, be dominated by questions of cost, functionality, and comfort. There are several possible forms for the client device: an upgraded television set (with internal or external "set-top" capability), an optimized personal computer (a network computer with video capability), or a full-fledged personal computer. If the control device is shared by all access devices in the household as part of a residence gateway, higher functionality may be possible than if the circuitry must be replicated in each of the many TVs in a residence. The pressure to restrict the amount of RAM and video-processing power will continue for some years, but limiting buffer sizes and local computing will affect the load on the network. (Small buffers and low processing capability cannot handle highly compressed video, i.e., compression across multiple frames. Also, prefetching and/or double buffering are required to maintain the quality of the displayed image, i.e., jitter reduction, smoothing.) Many functions can be implemented most easily if there is significant local secondary storage, but servicing of moving parts, such as disks, in a home-video environment could be a detriment to sales.
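The link between client buffering and jitter can be made concrete with a small sketch. It assumes that frames are played at a fixed period once playback starts and asks how long the client must delay the start so that no frame arrives after its playout time; a longer delay implies more frames held in client memory. The arrival times and function names are hypothetical.

```python
def min_startup_delay(arrival_times_s, frame_period_s):
    """Smallest startup delay D such that frame i, due at D + i * period, has already arrived."""
    return max(0.0, max(a - i * frame_period_s for i, a in enumerate(arrival_times_s)))


def plays_without_underflow(arrival_times_s, frame_period_s, delay_s, eps=1e-9):
    """True if every frame arrives no later than its playout time for the given delay."""
    return all(a <= delay_s + i * frame_period_s + eps
               for i, a in enumerate(arrival_times_s))


# Five frames at 25 frames/s (40 ms period); the fourth frame is badly delayed by the network.
arrivals = [0.00, 0.05, 0.07, 0.20, 0.21]
delay = min_startup_delay(arrivals, 0.04)
print(round(delay, 3))
print(plays_without_underflow(arrivals, 0.04, delay))
print(plays_without_underflow(arrivals, 0.04, 0.05))
```

A client with very little RAM cannot afford the larger delay, so the jitter must instead be controlled in the network or at the server, which is the shift of load described above.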

The restrictions on local capabilities will certainly affect which functions are performed well locally and which depend on server or network support. Classic VCR control operations can be implemented in many different ways [17, ]; the tradeoffs depend crucially on the split of capability throughout the system. For example, the server can respond to the commands or switch from VOD to NVOD mode, an intermediate node can store entire streams, or the client can manage a cache.

There are a variety of physical methods for enabling user interaction, ranging from standard infrared remote controls to upgraded remote controls to traditional computer pointing devices (e.g., a mouse, TouchPoint). Occasionally, alphabetic or numeric inputs are required, at which time a real keyboard or keypad is desirable. In addition, audio and video inputs (for video conferencing) will be increasingly common, resulting in significant implications for upstream bandwidth and control.

Apart from the above hardware considerations, decisions about the software interfaces, control paradigms, and operating system, such as which functions will be static and which will be loaded with the content or upon need, will have significant impact on cost and usability of the client. A fixed set of capabilities is easy to manage but restricts future upgrades and improvements in types of content that can be viewed. As multimedia applications evolve from simple replay to the inclusion of complex search [28-30] and simultaneous streams [19, 31-33] and further to personalized environments, the client devices (e.g., set-top box, camera, microphone) are likely to be difficult to use. This will result in challenges with regard to optimizing resources and avoiding misbehavior and mischief, just as with current personal computers.

  Application-design issues
Future multimedia applications will utilize valuable and enjoyable types of information, including multiple audio and video streams and time-, user-, and data-dependent behaviors. Multiple applications that operate on the same data will have to manage information, analyze images, search for appropriate items, and mesh live, pre-recorded, and interactive streams. The software framework must make it easy for application developers to create software that satisfies these and other needs.

Creating advanced content will become increasingly challenging. It has always been difficult to produce videos that are pleasant to watch. Advanced tools (e.g., for authoring multimedia content [31, 31]) will be required to make it possible to combine multiple streams in attractive ways that have maximum impact. However, addressing human perceptions and psychological effects of a composition will always remain the realm of creative artists.

Searching image, audio, and video data is far more difficult than doing the same for text, but the rewards can be correspondingly higher. The adage "A picture is worth a thousand words" suggests not only the higher value that people put on multimedia data, but also the higher cost of image and video data. Approaches to finding desired information include creation and analysis of text annotations [28, 29], examining reduced versions of the video (thumbnails or key frames) [28], and full-scale analysis of high-resolution data [30]. Practical considerations force us to maximize the value received from a small amount of computation. The requirements become particularly onerous in interactive and real-time situations such as push technology [34], browsing, and interactive navigation in large sets of multimedia data [30].

Presenting the information is also difficult, especially when system and network capabilities are limited. "Best-effort" networks (with no guarantee of on-time delivery) create challenges for synchronizing audio and video substreams and require adaptation to feedback about user behavior as well as network and server performance. The adaptation may take into account QoS measurements by the operating system and other system components [35], heterogeneous media [36], and heterogeneous environments [37]. Managing controlled collaborations and teleconferences with many audio, video, and data streams is even more challenging [38, 39]. In the future, people will want even more complicated composite moving media, with complex constraints on location and timing of the various parts [31-33].
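One simple form of adaptation on a best-effort network is to pick, from several encodings of the same content, the best one that fits the bandwidth currently measured. The sketch below assumes the encodings are characterized only by their bit rates and that a fixed safety margin is kept; the rates, the 0.8 headroom factor, and the function name are illustrative assumptions rather than part of any cited system.

```python
def choose_variant(variant_rates_kbps, measured_kbps, headroom=0.8):
    """Return the highest-rate variant that fits within a fraction of the measured bandwidth.

    If nothing fits, fall back to the lowest-rate variant so that
    the presentation degrades instead of stopping.
    """
    budget = measured_kbps * headroom
    fitting = [rate for rate in variant_rates_kbps if rate <= budget]
    return max(fitting) if fitting else min(variant_rates_kbps)


variants = [28, 56, 128, 384, 1500]   # hypothetical encodings of one clip, in kb/s
for measured in (20, 200, 2000):
    print(measured, "->", choose_variant(variants, measured))
```

Re-running the selection whenever a new bandwidth measurement arrives gives the dynamic behavior described above; a real system would also smooth the measurements to avoid oscillating between variants.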

Interpretations of media data can also vary from user to user. Thus, there is a need for a framework to accommodate views of different users on the same data and to provide mechanisms for creating and updating views. The BRAHMA [29] framework proposes an approach for annotating media data (i.e., creating views) via user-defined attribute-value pairs for hierarchical organization and retrieval of media data.
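The idea of user-specific views built from attribute-value pairs can be sketched as follows. This is not the BRAHMA interface, which is described in [29]; it is a minimal, flat (non-hierarchical) illustration with hypothetical class, method, and attribute names, showing only that two users can annotate the same media object differently and retrieve it through their own view.

```python
from collections import defaultdict


class AnnotationStore:
    """Per-user attribute-value annotations over shared media objects."""

    def __init__(self):
        # user -> media id -> {attribute: value}
        self._views = defaultdict(dict)

    def annotate(self, user, media_id, **attributes):
        """Create or update this user's view of one media object."""
        self._views[user].setdefault(media_id, {}).update(attributes)

    def find(self, user, **criteria):
        """Media ids whose annotations in this user's view match every criterion."""
        return sorted(
            media_id
            for media_id, attrs in self._views[user].items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


store = AnnotationStore()
store.annotate("radiologist", "clip-17", modality="MRI", finding="normal")
store.annotate("teacher", "clip-17", topic="anatomy", level="introductory")
store.annotate("teacher", "clip-42", topic="anatomy", level="advanced")

print(store.find("teacher", topic="anatomy"))
print(store.find("radiologist", topic="anatomy"))
```

The same clip appears under different attributes for each user, and a query in one view never sees the annotations of another.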

Finally, virtual views can be created from a set of image data. Conventional virtual reality and 3D applications, including those based on virtual reality markup language (VRML), allow continuous movement of the viewpoint but often suffer from poor performance, crude 3D models, large data files, low-quality rendered images of synthetic scenes, and the need for expensive hardware. PanoramIX [40] alleviates these problems by employing image-based rendering algorithms. Although the viewpoint is restricted to discrete points in space, the "look-around performance" is much better, the resulting images are very realistic, the data files are relatively small, and no special hardware is needed. In addition, PanoramIX provides a composite-media environment for embedding and hyperlinking disparate media types, such as streaming audio, video, 1D-3D sprites, and synthetic 3D objects [40].

Summary and conclusions

Up to now, the lack of infrastructure for delivery of high-quality video, the weakness of supporting technologies for easy content creation, the absence of experience with the benefits of futuristic multimedia, and a variety of economic factors have impeded the widespread deployment of multimedia applications. Experience with video on the Web and availability of the Internet for disseminating low- and medium-bandwidth products are breaking down these barriers. People now accept the enormous value of widespread use of multimedia technologies to support societally desirable, financially profitable, and personally rewarding services such as distance-independent learning, low-cost video telephone service, improved customer-relationship and help services, new forms of entertainment and interactive games, and new types of electronic commerce (such as tourism, real-estate, and merchandise sales). There are still many technology barriers, but the papers in this issue of the IBM Journal of Research and Development demonstrate that we are making significant progress at overcoming them.

Acknowledgments

We thank Marc Willebeek-LeMair, Nelson Manohar, Jai Menon, William Pence, Junehwa Song, and anonymous referees for their helpful comments and feedback.

**Trademark or registered trademark of Moving Picture Experts Group.

References

Received January 26, 1998; accepted for publication February 2, 1998