XMT: MPEG-4 Textual Format

for Cross-Standard Interoperability

Michelle Kim

IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7709

mykim@us.ibm.com

Steve Wood

IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7309

woodsp@us.ibm.com

ABSTRACT

This paper describes the Extensible MPEG-4 Textual format (XMT), a framework for representing MPEG-4 content using a textual syntax. The XMT is a work-in-progress standardization effort, targeted to be published as MPEG-4 Systems 2001 Edition Amendment 2. The XMT allows content authors to exchange their content with other authors, tools, or service providers, and facilitates interoperability with both X3D, developed by the Web3D Consortium, and the Synchronized Multimedia Integration Language (SMIL) 2.0 from the W3C.

Categories and Subject Descriptors
[Multimedia Tools] Composite multimedia coding standards, interoperability

General Terms
Standardization, Languages.

Keywords
MPEG-4, VRML, SMIL, interoperability.

1. INTRODUCTION

MPEG-4 [1] is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group) for representing and compressing interactive, audio-visual scenes. The standard defines a set of tools for the coded representation of individual audio-visual objects, text/graphics and synthetic objects. An MPEG-4 scene defines the interactive behavior of these coded objects and the way they are composed in space and time. The scene description is coded in a binary format known as BIFS (BInary Format for Scenes).

The audio-visual streams in MPEG-4 are called Elementary Streams. Their coding is defined in the Object Descriptor (OD) Framework, which describes the streams and their relationships to the scene and to each other. The OD Framework also defines streams for Object Content Information (OCI), MPEG-J (Java-based programs known as MPEGlets), and Intellectual Property Management and Protection (IPMP).

The MPEG-4 scene description, coded in BIFS, is sent to the receiving terminal along with other coded visual/audio objects in further elementary streams. MPEG-4 provides a scene update mechanism to modify the scene dynamically at the receiver and an OD update mechanism to dynamically update Object Descriptors to add or remove media stream objects.

The MPEG-4 specification includes parts for Systems, Visual and Audio. The XMT is a textual representation, using XML, of MPEG-4 Systems (ISO/IEC 14496-1). The MPEG-4 specification provides conformance points to ensure interoperability of bitstreams with receivers so that they can decode and render the content. MPEG-4 allows the representation of content that is both complex and highly interactive, as well as, of course, much simpler content. Binary coded MPEG-4 content can be stored in the MPEG-4 Systems file format commonly known as the mp4 format. While the mp4 file is exchangeable it is often difficult to subsequently use it to edit or re-purpose the stored content. This is because the binary coded representation often cannot be "reverse-engineered" in a consistent manner to represent the content author's original intentions.

The XMT has been designed to provide an exchangeable format between content authors while preserving the author's intentions in a high-level textual format. In addition to providing an author-friendly abstraction of the underlying MPEG-4 technologies, another important consideration for the XMT design was to respect existing practices of content authors such as the Web3D X3D, W3C SMIL and HTML.

The XMT is suitable for many uses including manually authored content as well as machine-generated content using multimedia database material and templates. The XMT may be encoded and stored in the exchangeable mp4 binary file or may also be encoded directly into streams and transmitted. XMT encoding and delivery hints exist to assist this process.

1.1 Cross-Standard Interoperability

The XMT format facilitates content interchange between SMIL players, Virtual Reality Modeling Language (VRML) [4] players, and MPEG-4 players. The XMT-O format can be preprocessed and played directly by a SMIL player, preprocessed to the corresponding X3D nodes and played back by a VRML/X3D player, or compiled to a binary MPEG-4 representation, such as mp4, and played by an MPEG-4 player.

Figure 1 presents a graphical portrayal of the interoperability of the XMT. MPEG-7 [5], as shown in the figure, is an emerging standard for describing multimedia content. The integration of MPEG-7 with XMT, which is currently work in progress, can enable the content-based retrieval of MPEG-4 objects.

XMT Interoperability figure

Figure 1. - Interoperability of XMT with other standards

In this paper we provide an overview of the XMT, with a focus on its cross-standard interoperability with the Web3D X3D and the W3C SMIL 2.0. In Section 2, we describe the two-tier XMT architecture. XMT-A is summarized in Section 3, and XMT-O is summarized in Section 4. In Section 5 we present the cross-standard interoperability experiments that have been conducted, and Section 6 concludes the paper.

2. XMT TWO-TIER ARCHITECTURE

The XMT consists of two levels of textual syntax and semantics: the XMT-O format and the XMT-A format.

The XMT-A is an XML-based version of MPEG-4 content that closely mirrors its binary representation. The goals of the XMT-A format are to provide a textual representation of MPEG-4 Systems binary coding, including a deterministic one-to-one mapping to the binary representations for conformance, and to be interoperable with X3D [2], which is being developed as VRML 200x by the Web3D Consortium.

XMT-A contains a textual representation of BIFS for both the nodes and the commands for scene updates. The XMT textual format for the nodes is fully compatible with the XML representation of the subset of X3D that is contained within MPEG-4. MPEG-4 Systems scene description originally used VRML 97 as a base and added further nodes for synthetic representation, audio and 2D support. VRML 97 is an ASCII textual representation while X3D is based on XML. Currently the X3D specification has additional functionality that is not present in MPEG-4; hence XMT contains a subset of X3D rather than complete coverage. The XML for the additional nodes in MPEG-4 follows the same representation philosophy that X3D defined to guide its own conversion from VRML plain text to XML. This strategy will facilitate change if future harmonization efforts between MPEG-4 and X3D make more functionality common.

XMT-A also has textual representations of features unique to MPEG-4 Systems, not found in X3D, such as the Object Descriptor Framework that includes ODs (Object Descriptors), other descriptors and commands. The ODs are used to associate scene description components to the Elementary Streams that contain the corresponding coded data. These include visual streams, audio streams as well as animation streams that update elements of the scene more efficiently than BIFS commands for complex animations. XMT-A also includes a textual representation to allow MPEG-J streams to be created either from Java classes or zip files.

The XMT-O is a high-level abstraction of MPEG-4 features designed around the W3C SMIL 2.0 [3], an XML-based language that allows authors to create dynamic, interactive multimedia presentations. At the time of writing, however, the W3C Synchronized Multimedia (SYMM) working group is still developing the SMIL 2.0 standard, the successor to SMIL 1.0. Also, as the MPEG standardization process is not yet complete for XMT, the XMT syntax and samples provided in this paper are based on the latest version of the XMT specification.

Using XMT-O, authors can describe the temporal behavior and layout of media objects to form multimedia presentations, as well as associate hyperlinks with media objects in the presentation. Animation and interactive event-based behavior can also be easily expressed, as well as transition effects. Unlike SMIL, the XMT has features that enable control over many intrinsic properties of the media objects in support of the capabilities present in MPEG-4.

For each XMT-O construct, one or more possible mappings exist to MPEG-4 Systems that can represent the same content. Such mappings may be represented in XMT-A by various combinations and sequences of XMT-A elements. It follows that there is no single deterministic mapping between the two levels, because a high-level author's intentions can be expanded to more than one sequence of low-level constructs. However, with such non-determinism comes flexibility. XMT-O content can thus be converted and targeted to MPEG-4 players of varying capability. For example, animation in a scene may be mapped using BIFS commands for a simple player, while for a more capable player the animation may be mapped, potentially more efficiently, to an animation stream.

Within XMT-A and XMT-O a set of common elements such as MPEG-7, authoring elements, encoding hints, delivery hints and publication hints have been identified and collected into a common section used within both formats that is designated XMT-C.

3. XMT-A FORMAT

The XMT-A format is a direct textual representation of MPEG-4 Systems which, at its core, has a compatible scene graph representation of nodes that MPEG-4 has in common with X3D, developed by the Web3D consortium. Such compatibility facilitates content interchange and interoperability with X3D.

XMT-A contains a textual representation for the MPEG-4 Systems BIFS (BInary Format for Scenes) as well as the Object Descriptor Framework. The scene description declares the spatial-temporal relationship of audio, visual, graphic and synthetic objects. This includes the nodes for the scene description, routes that connect fields between nodes and BIFS commands to deliver and update the scene.

The Object Descriptor Framework specifies the elementary stream resources that provide the time-varying data for the scene. It consists of a set of descriptors that allow the identification, description and association of elementary streams to objects in the scene description and to each other. The textual representation of the MPEG-4 Object Descriptor Framework includes the descriptors, commands to update the descriptors and also a representation for elementary stream data.

Elementary stream descriptors contain information about the stream type, configuration information for the decoding process and dependencies between streams for scalable, layered codecs. Alternate stream representations may be present, and language descriptors describe language-based alternatives.

Some elementary streams, such as BIFS and OD, have textual representations to directly create the streams via an encoding process. Other elementary streams, such as video and audio, have no textual representation of the media itself and external media sources are referenced to provide the stream data itself.

When MPEG-4 content is represented as a document using XMT-A, the XML document instance, as it is known, is structured according to defined rules for element placement. These rules are specified using the XML Schema language, developed by the W3C, to provide the definition of XMT-A.

Binary MPEG-4 allows alternate coding schemes, e.g. list versus vector, so encoding XMT-A can produce alternate binary representations that are all valid. XMT-A provides a set of encoding rules to ensure content is coded consistently and also to support deterministic coding for conformance.

4. XMT-O FORMAT

The goals of the XMT-O format are to provide ease of use for content creation, to facilitate content exchange among authors and authoring tools, and to provide a content representation that is compatible and interoperable with the Synchronized Multimedia Integration Language (SMIL) 2.0.

XMT-O describes audio-visual objects and their relationships at a high level, where content requirements are expressed in terms of the author's intent, using media, timing and animation abstractions, rather than by coding explicit nodes and route connections. Higher-level constructs facilitate, among other aspects, content exchange between authors and authoring tools, and content re-purposing.

An algorithm, or an authoring tool, would compile (or map) an XMT-O construct into MPEG-4 content, i.e., into BIFS, OD and media streams etc., along with any appropriate audio/visual media compression or conversions that may be required. Media sources can be of a variety of formats native to the machine where the algorithm is executing, and it is the responsibility of the tool, during the compilation phase, to convert media to suitable target formats for MPEG-4 and appropriate bit-rates etc.

In converting the XMT-O format to MPEG-4 there is not necessarily only one mapping for each construct. MPEG-4 nodes and routes are very powerful tools and there can often be more than one way to represent XMT-O constructs. Also, as MPEG-4 nodes can be 'wired' together with routes in many combinations, it is often difficult to reverse engineer an author's intent from a collection of nodes and routes. When confronted by content containing many nodes and routes, re-authoring and maintenance can be quite challenging if the high-level view of that presentation must be inferred. The XMT-O, however, provides a high-level view with high-level authoring constructs and thus facilitates content exchange, rapid content re-purposing, re-authoring and ongoing maintenance of content.

Recognizing though that some authors may wish to access low-level nodes/routes, XMT-O allows the embedding of the XMT-A node and route definitions, within an identified low-level escape section, to create custom media constructs.

4.1 Re-using SMIL in XMT-O

To create the XMT-O format, as a high-level abstraction and representation of content authors' intentions, SMIL is used as a basis. SMIL is an XML-based language that allows authors to write interactive multimedia presentations. The main strengths of SMIL are that its constructs are self-describing, it is based on XML which provides an excellent format for interchange of data among different applications, it is relatively easy to author, and it is a language familiar to many content creators. The language is also extensible so that new objects or metadata can be inserted easily into the representation.

The XMT-O format re-uses as a base a subset of the modules defined by SMIL where the functions and semantics are compatible. To that base a new set of elements has been added, designed explicitly for XMT-O, that express a high-level view of MPEG-4 specific content. Unlike SMIL, the XMT-O format has not specifically been designed as a playback format; rather, it is intended that the constructs are mapped and compiled to a binary MPEG-4 representation. However, XMT-O may also be processed for exchange with SMIL or mapped to X3D within the limits of compatible capabilities.

To enable reuse of SMIL-defined functions, the SMIL language has been decomposed into a number of functional areas that have been broken down to a finer granularity using modules. These modules, comprising XML elements, attributes, and attribute values, can then be combined and brought together in other host languages such as XMT-O. XMT-O is referred to as a host language since it integrates, or hosts, the modules within a larger XML representation. SMIL provides requirements and guidelines for integration of these modules into other host languages, and XMT-O follows these rules, as far as possible, to preserve compatibility with SMIL and to adhere to both the semantics and the syntax of the modules. SMIL is a language that was designed first, with implementations following. MPEG-4, however, is an existing binary specification and implementation, and XMT-O is being developed as a high-level language to represent it.

XMT-O must be mapped (compiled) into MPEG-4, and maintaining the semantics of certain behaviors specified by SMIL in all cases can be difficult. It is, however, the authoring tool's responsibility to maintain correct semantics during the mapping. To achieve satisfactory mappings the authoring tool may use all the power of the MPEG-4 representation, including Scripting and MPEG-J. It would, of course, limit the use of MPEG-4 tools to those included in the MPEG-4 profile and level for which the presentation is being created.

Mappings for the semantics are both static and dynamic in nature. Static mappings capture the semantics for deterministic behavior that can be fully evaluated at authoring time. Dynamic mappings require runtime support of MPEG-4 player mechanisms and would be utilized to support non-deterministic behavior such as unpredictable user interactions whose timing cannot be fully evaluated into a fixed static timeline.

Providing a detailed account of SMIL modules is beyond the scope of this paper. In this section we will focus on the media and timing modules as these provide the basic core of the format. Media and timing are also the foundation to SMIL 1.0 functions and hence to SMIL 2.0 functions as well.

4.2 Extensible Media (xMedia) Objects

SMIL provides a useful abstraction for media objects. However SMIL concerns itself with multimedia player composition rather than multimedia object composition. As such SMIL coordinates the temporal and spatial layout of media players but does not provide facilities to manipulate the internal properties of the media being played by the media players.

MPEG-4 is, however, very much focused on scene composition and the combination of audio-visual (multimedia) objects and the manipulation of their fundamental properties to create rich, interactive, dynamic presentations. For example SMIL would handle a text media object by passing the media data to a text player capable of handling the associated mime-type. It would let the text player concern itself about any fonts, style, kerning, colors etc., and the representation of the text and any attributes would be part of the internal media data structure for that mime-type and opaque to SMIL. However an MPEG-4 text object includes font, style, alignment, colors etc., and standardizes the representation and data streams so that an MPEG-4 player is intimately aware of the detail and fundamental properties of media objects.

MPEG-4 contains audio, image, video and text media as in SMIL. It also has 2D media elements similar to those that are described in Scalable Vector Graphics (SVG) [6], such as the <Rectangle> and <Circle> elements. SVG also uses modules defined by SMIL, for example the Content Selection and Animation modules. The Animation module was in fact a joint development between SMIL and SVG.

Recognizing both the similarities and differences between SMIL and MPEG-4, the XMT-O media elements are based on the SMIL media elements but have been extended to include additional child elements that represent the fundamental properties of the media. Thus XMT-O defines a set of extensible media (xMedia) elements as basic building blocks, representing multimedia objects, that can be combined in complex spatial and temporal layouts and whose fundamental properties can be animated to create the rich, dynamic, interactive content that MPEG-4 is all about.

An xMedia object is defined by an element, such as <img>, or <rectangle>, which abstracts geometry containing media specific geometric property attributes as well as timing attributes, defined by SMIL, for temporal layout. Spatial properties of an xMedia object can be further defined by a set of common child elements, such as, <transformation>, <material>, <outline>, <chromakey>, <texture>, <light>, and <hotspots>. These child elements of an xMedia object can be used to define either 2D or 3D properties, as the elements provide a combined set of attributes for this purpose. The element is however for 2D only.

An xMedia element abstracts MPEG-4 systems and audio/visual streams and as such it is a high-level abstraction for the BIFS, OD Framework and media streams, etc., to which the XMT-O is mapped.

The following are examples of xMedia objects, where the img media object is fully compatible with SMIL.

     <img src="portrait.jpg"/>
     <circle radius="75"/>
     <rectangle size="320 240"/>
     <curve points="0 0; 10 12; 25 20; 15 14"/>
   

Also added for xMedia objects is the <hotspots> element that allows xMedia elements, such as circle, rectangle, etc., to be used as hotspot links. The <hotspots> element permits media elements to be added to any other media elements and designated as a hotspot. In MPEG-4, any shape can have a TouchSensor attached and therefore can be used as a hotspot. Both 2D and 3D objects can be used and so a wider range of shapes is available for use in comparison to the functionally similar <area> element in SMIL. The objects used as hotspots can be timed and also animated to provide dynamic, changing hotspots over time.

4.3 Animation

XMT-O incorporates the SMIL Animation modules to allow attributes of xMedia objects to be animated to create dynamic content. Animation features can be mapped to MPEG-4 in various ways. The simpler <set> element, that sets a value into an attribute, can be mapped to the MPEG-4 Replace Field BIFS update command. The more complex <animate>, <animateColor> and <animateMotion> can be mapped to MPEG-4 TimeSensors, Interpolators and Routes.

Animation elements are timed, like xMedia elements, using the same set of timing attributes. This allows a wide variety of timed dynamic behavior including event-based timing for the animation. The following example shows a red circle whose color is changed from one static value to another and then later is linearly interpolated between the colors yellow, blue and green.

     <circle radius="160" begin="6s" dur="35s">
       <material color="red">
          <set attributeName="color" to="orange" begin="3s" dur="6s"/>
          <animateColor attributeName="color" values="yellow; blue; green"
                        keyTimes="0; .4; 1" calcMode="linear"
                        begin="15s" dur="12s"/>
       </material>
     </circle>
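
As a rough illustration of what calcMode="linear" with keyTimes means, the following Python sketch evaluates the animateColor above at a given local time. The function names fraction_at and interpolate_color, and the RGB triples chosen for yellow, blue and green, are assumptions made for this sketch and are not part of XMT. Interpolation here is done per RGB channel; an MPEG-4 ColorInterpolator node may interpolate in a different color space.

```python
from bisect import bisect_right


def fraction_at(t, begin, dur):
    """Map a local time (seconds) to a fraction of the duration, clamped to [0, 1]."""
    return min(max((t - begin) / dur, 0.0), 1.0)


def interpolate_color(fraction, key_times, values):
    """Piecewise-linear interpolation between RGB triples.

    key_times is an increasing list starting at 0 and ending at 1;
    values holds one RGB triple per key time.
    """
    if len(key_times) != len(values) or len(key_times) < 2:
        raise ValueError("need matching key_times and values, at least two each")
    # Clamp so that out-of-range fractions hold the first or last color.
    fraction = min(max(fraction, key_times[0]), key_times[-1])
    # Index of the segment [key_times[i], key_times[i + 1]] containing fraction.
    i = min(bisect_right(key_times, fraction) - 1, len(key_times) - 2)
    t0, t1 = key_times[i], key_times[i + 1]
    w = (fraction - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(values[i], values[i + 1]))


# Values from the example: keyTimes="0; .4; 1", begin="15s", dur="12s".
# The RGB triples for the named colors are assumed for illustration.
KEY_TIMES = [0.0, 0.4, 1.0]
VALUES = [(1.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]

if __name__ == "__main__":
    for t in (15.0, 19.8, 21.0, 27.0):
        f = fraction_at(t, begin=15.0, dur=12.0)
        print(t, interpolate_color(f, KEY_TIMES, VALUES))
```

With these key times the first 40% of the 12s duration moves from yellow to blue, and the remaining 60% moves from blue to green.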
   

4.4 Timing

XMT-O elements are temporally arranged and synchronized using the SMIL Timing and Synchronization modules. The terms SMIL timing and XMT-O timing (or simply XMT timing) are used interchangeably in this section but distinguished when necessary. The syntax and semantics of timing elements and attributes are according to the SMIL specification.

  • The <par> element plays one or more child elements allowing "parallel" playback.
  • The <seq> element plays the child elements one at a time in sequence.
  • The <excl> element plays one child at a time, but does not impose any order.

The Timing modules also specify attributes to control an element's timing behavior. The essential timing attributes are:

  • dur: specifies the duration of an element.
  • begin: specifies the begin time of an element in a variety of ways, ranging from simple clock times to event based occurrences, e.g. a mouse-click.
  • end: specifies the end time of an element in a variety of ways as per the begin time.
  • min: specifies the minimum active duration of an element.
  • max: specifies the maximum active duration of an element.

The following shows an example of the use of <par> and <seq> time containers. The <par> will begin at t=5s and at that time the rectangle and the polygons media object in the <seq> will start playing. At t=6s the img will begin (its begin offset of 1s is relative to the start of the <par>), so that three media objects are playing in parallel. At t=10s the rectangle will end, and at t=12s the img will end, leaving only the polygons media object playing. At t=15s the polygons object will end and the next element in the sequence, the circle, will begin to play for its 2s duration. At t=17s the sequence ends with the circle and the par ends too.

     <par begin="5s">
       <rectangle size="10 10" dur="5s"/>
       <img src="my.jpg" begin="1s" dur="6s"/>
       <seq>
         <polygons coord="10 10; 34 45; 23 12" dur="10s"/>
         <circle radius="50" dur="2s"/>
       </seq>
     </par>
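
The timeline of this example can be checked with a small Python sketch that computes start and end times for nested <par> and <seq> containers. It is a simplification under stated assumptions: every media object has an explicit dur, begin is a plain offset in seconds, and a container ends when its last child ends. The Node class and the schedule function are hypothetical names introduced here, not part of XMT or SMIL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str                     # unique label used as the result key
    kind: str = "media"           # "media", "par" or "seq"
    begin: float = 0.0            # offset in seconds from the sync base
    dur: Optional[float] = None   # explicit duration, required for media
    children: List["Node"] = field(default_factory=list)


def schedule(node: Node, base: float = 0.0,
             out: Optional[Dict[str, Tuple[float, float]]] = None):
    """Return {name: (start, end)} in seconds for node and its descendants."""
    if out is None:
        out = {}
    start = base + node.begin
    if node.kind == "media":
        if node.dur is None:
            raise ValueError(f"{node.name}: this sketch needs an explicit dur")
        end = start + node.dur
    elif node.kind == "par":
        # Children are timed relative to the start of the <par>.
        end = start
        for child in node.children:
            schedule(child, start, out)
            end = max(end, out[child.name][1])
    elif node.kind == "seq":
        # Each child is timed relative to the end of the previous child.
        cursor = start
        for child in node.children:
            schedule(child, cursor, out)
            cursor = out[child.name][1]
        end = cursor
    else:
        raise ValueError(f"unknown kind: {node.kind}")
    out[node.name] = (start, end)
    return out


presentation = Node("par", "par", begin=5.0, children=[
    Node("rectangle", dur=5.0),
    Node("img", begin=1.0, dur=6.0),
    Node("seq", "seq", children=[
        Node("polygons", dur=10.0),
        Node("circle", dur=2.0),
    ]),
])

if __name__ == "__main__":
    for name, (start, end) in schedule(presentation).items():
        print(f"{name}: {start}s to {end}s")
```

Event-based begin values, repeatCount, min, max and endsync are deliberately left out; they would require the dynamic, run-time evaluation discussed in Section 4.1.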
   

XMT-O also supports timing attributes, such as repeatCount, restart and endsync to allow repeats, restart, and to force the time container to end when selected child element(s) end.

4.4.1 EventTiming

XMT-O, like SMIL, has event-based timing where the beginning or end of an object can be triggered by events. Events are key to supporting non-deterministic behavior to provide engaging interactive content, whether it is for 2D or 3D content.

As an example, the XMT-O fragment below shows a circle that begins playing (appears) when an object called myButton is clicked.

     <circle radius="240" begin="myButton.click">
       <material color="green"/>
     </circle>
   

Both 2D and 3D xMedia objects can generate events such as click, mouseup, mousedown, mouseout and mouseover. In addition to providing these basic events, XMT-O also supports more advanced events concerned with the interaction of MPEG-4 objects in both 2D and 3D spaces, such as near and collide.

4.5 XMT-O Examples

This section contains an example of the XMT-O high-level format and mapping into MPEG-4 systems.

4.5.1 Rectangle with finite duration color animation on mouse press

The following example shows a rectangle whose color changes over a 6s (second) duration beginning when the mouse button is pressed down.

Using the XMT-O format, a rectangle is defined of size 50 by 50 (a square) whose mid-point is positioned at coordinate x=40, y=75 using a child transformation element. A child material element provides the rectangle with a color and specifies that it is to be drawn as a filled shape rather than an outline. The material then has an animateColor child element that describes a linear interpolation using three color values, beginning when the mouse is clicked and having a duration of 6s.

     <rectangle id="Square" size="50 50">
       <transformation visibility="true" translation="40 75"/>
       <material color="#ee0000" filled="true">
         <animateColor attributeName="color"
                       dur="6s" begin="Square.click"
                       values="#ee0000; #ffcc45; #ffffff"
                       keyTimes="0; 0.3; 1" calcMode="linear"/>
       </material>
     </rectangle>
   

The XMT-A syntax below shows how the XMT-O example above could potentially be mapped to MPEG-4 nodes and routes. The necessary OD Framework elements, such as InitialObjectDescriptor, have been omitted for clarity.

To create the scene we need a BIFS command to Replace Scene. To provide the root node for the scene a top-level node is needed. Here an OrderedGroup is used to create the necessary 2D context for the Rectangle. Then there is an MPEG-4 Switch node so the entire xMedia object can be hidden or shown (visibility="true" attribute above).

The XMT-O rectangle maps to the basic pattern of a Shape containing appearance and geometry, where geometry is a Rectangle and appearance contains a Material2D describing the color. The Rectangle (Shape) is set under a Transform2D to position it. (This is opposite to XMT-O, where the transformation is a child element of rectangle.)

To sense mouse activity a TouchSensor is needed, and to define the duration of the color change a TimeSensor. Then there is a ColorInterpolator to animate the color. To make the behavior work, the TouchSensor touchTime is routed to the TimeSensor startTime, which starts the TimeSensor when the mouse is pressed. Then the fraction_changed output of the TimeSensor is routed to the set_fraction input of the ColorInterpolator, and finally the value_changed output of the ColorInterpolator is routed to the emissiveColor field of the Rectangle's Material2D. The requisite nodes must also be given DEF names so that they can be identified for the routes.

     <Replace>
      <Scene>
        <OrderedGroup>
          <children>
            <Switch whichChoice="0">
              <choice>
                <Transform2D translation="40 75">
                  <children>
                    <Shape>
                      <appearance>
                        <Appearance>
                          <material>
                            <Material2D DEF="SquareMat"
                                        emissiveColor="0.93 0.0 0.0"
                                        filled="TRUE" />
                          </material>
                        </Appearance>
                      </appearance>
                      <geometry>
                        <Rectangle size="50 50"/>
                      </geometry>
                    </Shape>
                    <TouchSensor DEF="Touch" />
                    <TimeSensor  DEF="Timer" cycleInterval="6" />
                    <ColorInterpolator  DEF="Coloring"
                                        key="0.0 0.3 1.0"
                                        keyValue="0.93 0.0 0.0, 1.0 0.8 0.27, 1.0 1.0 1.0"/>
                  </children>
                </Transform2D>
              </choice>
            </Switch>
          </children>
        </OrderedGroup>

        <ROUTE fromNode="Touch"
               fromField="touchTime"
               toNode="Timer"
               toField="startTime"  />

        <ROUTE fromNode="Timer"
               fromField="fraction_changed"
               toNode="Coloring"
               toField="set_fraction" />

        <ROUTE fromNode="Coloring"
               fromField="value_changed"
               toNode="SquareMat"
               toField="emissiveColor" />
       </Scene>
     </Replace>
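
One step of this mapping that is easy to get wrong by hand is the conversion of the XMT-O hexadecimal colors into the floating-point color triples used by the MPEG-4 nodes. The Python sketch below shows one way an authoring tool might do it; the function names hex_to_rgb and to_interpolator_fields and the rounding to two decimals are assumptions made for illustration.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to three floats in [0, 1], rounded to two decimals."""
    if len(color) != 7 or not color.startswith("#"):
        raise ValueError(f"expected '#rrggbb', got {color!r}")
    return tuple(round(int(color[i:i + 2], 16) / 255, 2) for i in (1, 3, 5))


def to_interpolator_fields(key_times, values):
    """Turn semicolon-separated keyTimes and values attribute strings into
    space/comma-separated key and keyValue strings for a ColorInterpolator."""
    key = " ".join(str(float(k)) for k in key_times.split(";"))
    key_value = ", ".join(
        " ".join(str(c) for c in hex_to_rgb(v.strip()))
        for v in values.split(";")
    )
    return key, key_value


if __name__ == "__main__":
    key, key_value = to_interpolator_fields(
        "0; 0.3; 1", "#ee0000; #ffcc45; #ffffff"
    )
    print("key      =", key)
    print("keyValue =", key_value)
```

For the three colors of the XMT-O example this yields the triples 0.93 0.0 0.0, 1.0 0.8 0.27 and 1.0 1.0 1.0.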
   

5. CROSS-STANDARD INTEROPERABILITY EXPERIMENTS

In this section we describe interoperability experiments we have conducted: one set with Web3D X3D and the other with W3C SMIL 2.0.

5.1 XMT-A/X3D Interoperability

A cross-standard interoperability experiment of XMT-A with X3D was conducted at the 54th MPEG meeting in La Baule, France, Oct. 2000. We used a set of X3D examples available from the Web3D site: http://www.web3d.org and performed the following steps to verify interoperability:

  1. Select an X3D example, and render the corresponding VRML file by a VRML browser.
  2. Replace the X3D header of the selected file with the XMT-A header (see below).
  3. Validate the XMT-A file against the XMT-A schema.
  4. Compile (or encode) the XMT-A file to an mp4 file.
  5. Render the mp4 by the MPEG-4 reference player.
  6. Verify that there are no practically observable differences between the two playbacks.

We were able to successfully verify interoperability of XMT with X3D on a subset of the X3D files to the extent the supported functionalities were compatible.

The table below compares the XMT-A and X3D representations to illustrate the high degree of compatibility and the small number of changes necessary to go from X3D to MPEG-4, or vice versa, within the set of elements that are contained in both standards.


XMT-A                                   X3D

  <Header>                                <Header>
    <meta>                                  <meta>
    </meta>                                 </meta>
    <InitialObjectDescriptor .../>
  </Header>                               </Header>
  <Body>
    <Replace>
      <Scene>                             <Scene>
        <!-- The scene contents -->         <!-- The scene contents -->
      </Scene>                            </Scene>
    </Replace>
  </Body>

To completely convert the document instance from X3D to XMT-A, or vice versa, the outer <X3D> or <XMT-A> element, with schema namespace references, needs to be altered accordingly.

Note that an X3D <Scene> does not need to have a <Group> at the top level, while MPEG-4 requires a top-level node such as <Group>, <OrderedGroup>, <Layer2D> or <Layer3D> as the root of the scene graph. If the X3D scene does not have a single <Group> at the root it will also be necessary to add this when converting to XMT-A.

Note also that in X3D, image, video and audio sources are referenced directly by URLs. While MPEG-4 can express the URLs in an identical manner, it is more likely that a conversion would create ObjectDescriptors for these media types and replace the source URL references with ObjectDescriptor IDs.

Following the above rules, the example below illustrates the X3D -> XMT-A conversion by showing identical content first in X3D and then in XMT-A. The emboldened blue text highlights the differences between them.

The content comprises a single red box located at x, y, z coordinates -3, 0, 0.

     <X3D>
       <Scene>
         <Group>
           <children>
             <Transform DEF="ABox" translation="-3.0 0.0 0.0">
               <children>
                 <Shape>
                   <geometry>
                     <Box/>
                   </geometry>
                   <appearance>
                     <Appearance DEF="App">
                       <material>
                         <Material diffuseColor="1.0 0.0 0.0"/>
                       </material>
                     </Appearance>
                   </appearance>
                 </Shape>
               </children>
             </Transform>
           </children>
         </Group>
       </Scene>
     </X3D>
   

To convert the X3D to XMT-A, an InitialObjectDescriptor is added in a Header section, and a Body section is added with a Replace command that contains the Scene. The InitialObjectDescriptor contains the information needed to locate the BIFS stream and the configuration for the BIFS decoder.

     <XMT-A>
       <Header>
         <InitialObjectDescriptor>
           <ProfDescr>
             <esDescr>
               <ES_Descriptor>
                 <decConfigDescr>
                   <DecoderConfigDescriptor bufferSizeDB="auto"
                                objectTypeIndication="1" streamType="3">
                     <decSpecificInfo>
                       <BIFSConfig nodeIDbits="auto"
                                pixelWidth="480"  pixelHeight="320"
                                pixelMetric="true"  routeIDbits="auto"/>
                     </decSpecificInfo>
                   </DecoderConfigDescriptor>
                 </decConfigDescr>
                 <slConfigDescr>
                   <SLConfigDescriptor timeStampLength="auto"
                                          timeStampResolution="auto"
                                          useAccessUnitStartFlag="true"/>
                 </slConfigDescr>
               </ES_Descriptor>
             </esDescr>
           </ProfDescr>
         </InitialObjectDescriptor>
       </Header>
       <Body>
         <Replace>
           <Scene>
             <Group>
               <children>
                 <Transform DEF="ABox" translation="-3.0 0.0 0.0">
                   <children>
                     <Shape>
                       <geometry>
                         <Box/>
                       </geometry>
                       <appearance>
                         <Appearance DEF="App">
                           <material>
                             <Material diffuseColor="1.0 0.0 0.0"/>
                           </material>
                         </Appearance>
                       </appearance>
                     </Shape>
                   </children>
                 </Transform>
               </children>
             </Group>
           </Scene>
         </Replace>
       </Body>
     </XMT-A>
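
The re-wrapping shown in the two listings above can be sketched in Python with the standard library. This is an illustration under stated assumptions rather than the encoder used in the experiment: the function name x3d_to_xmta is hypothetical, the InitialObjectDescriptor is passed in as an empty placeholder, and schema namespaces are omitted.

```python
import copy
import xml.etree.ElementTree as ET

def x3d_to_xmta(x3d: ET.Element, iod: ET.Element) -> ET.Element:
    """Re-wrap an X3D <Scene> as <XMT-A><Header/><Body><Replace><Scene/>.

    Hypothetical helper; the input tree is copied, not modified.
    """
    scene = x3d.find("Scene")
    if scene is None:
        raise ValueError("X3D document has no <Scene>")
    xmta = ET.Element("XMT-A")
    header = ET.SubElement(xmta, "Header")
    # Carry over any <meta> entries, then add the object descriptor.
    for meta in x3d.findall("Header/meta"):
        header.append(copy.deepcopy(meta))
    header.append(copy.deepcopy(iod))
    body = ET.SubElement(xmta, "Body")
    replace = ET.SubElement(body, "Replace")
    replace.append(copy.deepcopy(scene))
    return xmta

x3d = ET.fromstring(
    "<X3D><Scene><Group><children><Shape/></children></Group></Scene></X3D>"
)
# Placeholder only: a real descriptor carries the BIFS decoder configuration.
iod = ET.Element("InitialObjectDescriptor")
xmta = x3d_to_xmta(x3d, iod)
print(ET.tostring(xmta, encoding="unicode"))
```

The scene subtree is copied unchanged, which mirrors the point of the example: only the wrapper elements differ between the two formats.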
   

5.2 Interoperability with SMIL

A cross-standard interoperability experiment of XMT-O with SMIL 2.0 was conducted at the W3C SYMM (SMIL 2.0) Working Group in Cambridge, MA, USA, in March 2001. We used a set of SMIL 2.0 test cases and performed the following steps to verify interoperability:

  1. Select a SMIL example, and render it on a SMIL player.
  2. Replace the SMIL header of the selected file with the XMT-O header (see below).
  3. Validate the XMT-O file against the XMT-O schema.
  4. Compile (or encode) the XMT-O file to an XMT-A file, and encode it to mp4.
  5. Render the mp4 with the MPEG-4 player.
  6. Verify that there are no practically observable differences between the two playbacks.

We successfully verified interoperability of XMT-O with SMIL 2.0 on a subset of the SMIL files, to the extent that the supported functionalities were compatible.

A further cross-standard interoperability experiment of XMT-O with SMIL 2.0 was conducted at the 56th MPEG meeting in Singapore, March 2001, as follows:

  1. Select an XMT-O example with MPEG-4 specific functionalities (see below).
  2. Replace the XMT-O header of the selected file with the SMIL header (see below).
  3. Render the SMIL file on a SMIL player.
  4. Compile (or encode) the XMT-O file to an XMT-A file, and encode it to mp4.
  5. Render the mp4 with the MPEG-4 player.
  6. Verify that there are no practically observable differences between the two playbacks.

This experiment was also successful. The SMIL player used for this experiment was able to ignore MPEG-4 specific elements and rendered compatible functionalities as expected.

The following example shows SMIL -> XMT-O conversion. The emboldened example text highlights the differences, which are minimal.

The content is a slideshow presentation comprising three images in a sequence. Each image is shown for a maximum of 10s and a minimum of 3s. Without any user interaction the slideshow plays for 30s (3 images of 10s each). If the user clicks an image after it has been shown for 3s, the activateEvent (the event raised by the click) ends the image immediately and moves the presentation to the next one. If the image is clicked before the minimum of 3s, it continues until it has been shown for 3s, and only then does the presentation move to the next one.

     <smil>
       <body>
         <seq>
           <img src="Slide1.jpg" max="10s" min="3s" end="activateEvent" />
           <img src="Slide2.jpg" max="10s" min="3s" end="activateEvent" />
           <img src="Slide3.jpg" max="10s" min="3s" end="activateEvent" />
         </seq>
       </body>
     </smil>
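
The timing rules described for this slideshow amount to clamping each click time between the min and max values. The following Python sketch works through that arithmetic; the function names are hypothetical, click times are assumed to be measured in seconds from the moment each image appears, and None stands for an image that is never clicked.

```python
from typing import Optional, Sequence

def shown_for(click_at: Optional[float],
              min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Seconds an image stays on screen, given when it was clicked."""
    if click_at is None:
        return max_s  # no click: the image runs to its maximum
    # A click before min_s is deferred to min_s; one after max_s is too late.
    return min(max(click_at, min_s), max_s)

def slideshow_length(clicks: Sequence[Optional[float]]) -> float:
    """Total duration of the sequence, one click time per image."""
    return sum(shown_for(c) for c in clicks)

print(slideshow_length([None, None, None]))  # 30.0, no interaction
print(slideshow_length([1.0, 5.0, None]))    # 18.0 = 3 + 5 + 10
```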
   

To convert the SMIL example above to XMT-O, the <smil> element is replaced by an <XMT-O> element. Schema namespace definitions have been omitted for clarity.

     <XMT-O>
       <body>
         <seq>
           <img src="Slide1.jpg" max="10s" min="3s" end="activateEvent" />
           <img src="Slide2.jpg" max="10s" min="3s" end="activateEvent" />
           <img src="Slide3.jpg" max="10s" min="3s" end="activateEvent" />
         </seq>
       </body>
     </XMT-O>
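
For content that uses only elements common to both formats, as in this example, the conversion reduces to renaming the root element. A minimal Python sketch follows; the helper name rename_root is hypothetical, and namespaces are omitted as in the listings above.

```python
import xml.etree.ElementTree as ET

def rename_root(doc: ET.Element, new_tag: str) -> ET.Element:
    """Rename the document's outer element in place and return it."""
    doc.tag = new_tag
    return doc

smil = ET.fromstring(
    '<smil><body><seq>'
    '<img src="Slide1.jpg" max="10s" min="3s" end="activateEvent"/>'
    '</seq></body></smil>'
)
xmto = rename_root(smil, "XMT-O")   # SMIL -> XMT-O
print(ET.tostring(xmto, encoding="unicode"))
```

The same call with "smil" as the new tag performs the reverse conversion used in the second experiment.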
   

6. CONCLUSION

This paper described the Extensible MPEG-4 Textual Format (XMT) framework. The XMT framework consists of two levels of textual syntax and semantics: the XMT-A format, which provides a one-to-one textual format representation of the MPEG-4 binary constructs, and the XMT-O format which provides a high-level abstraction to content authors so they can exchange the content with other authors while preserving the original intent. The XMT-A provides interoperability between VRML/X3D and MPEG-4, and the XMT-O provides interoperability between SMIL and MPEG-4.

7. REFERENCES

[1] ISO/IEC FDIS 14496, Information Technology - Generic Coding of Audio-Visual Objects - Part 1: System, Part 2: Visual, Part 3: Audio, Part 6: DMIF, International Organization for Standardization, 1998.
[2] ISO/IEC FDIS 14772:200x, Information Technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML).
[3] "Synchronized Multimedia Integration Language (SMIL) 1.0 Specification", W3C Recommendation, http://www.w3.org/TR/REC-smil/ (June 1998).
[4] ISO/IEC FDIS 14772-1:1997, Information Technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML) - Part 1: Functional specification and UTF-8 encoding.
[5] "MPEG-7: Context, Objectives and Technical Roadmap, V.12", ISO/IEC JTC1/SC29/WG11 N2861 (July 1999).
[6] "Scalable Vector Graphics (SVG) 1.0 Specification", W3C Working Draft, http://www.w3.org/TR/2000/03/WD-SVG-20000303/index.html (March 2000).


