This paper presents an overview of XMT.
XMT: MPEG-4 Textual Format
for Cross-Standard Interoperability
Michelle Kim
IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7709
mykim@us.ibm.com
Steve Wood
IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7309
woodsp@us.ibm.com
ABSTRACT
This paper describes the Extensible MPEG-4 Textual format (XMT), a framework for representing MPEG-4 content using a
textual syntax. XMT is a work-in-progress standardization effort (targeted to be published as MPEG-4 Systems 2001
Edition Amendment 2). The XMT allows content authors to exchange their content with other authors, tools,
or service providers, and facilitates interoperability with both X3D, developed by the Web3D Consortium, and the
Synchronized Multimedia Integration Language (SMIL) 2.0 from the W3C.
Categories and Subject Descriptors
[Multimedia Tools] Composite multimedia coding standards, interoperability
General Terms
Standardization, Languages.
Keywords
MPEG-4, VRML, SMIL, interoperability.
1. INTRODUCTION
MPEG-4 [1] is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group) for representing and
compressing interactive, audio-visual scenes. The standard defines a set of tools for the coded representation of
individual audio-visual objects, text/graphics and synthetic objects. An MPEG-4 scene defines the interactive
behavior of these coded objects and the way they are composed in space and time. The scene description is coded in a
binary format known as BIFS (BInary Format for Scenes).
The audio-visual streams in MPEG-4 are called Elementary Streams whose coding is defined in the Object Descriptor
(OD) Framework that describes the streams and their relationships to the scene and with each other. The OD Framework
also defines streams for Object Content Information (OCI), MPEG-J (Java-based programs known as MPEGlets), and
Intellectual Property Management and Protection (IPMP).
The MPEG-4 scene description, coded in BIFS, is sent to the receiving terminal along with other coded visual/audio
objects in further elementary streams. MPEG-4 provides a scene update mechanism to modify the scene dynamically at
the receiver and an OD update mechanism to dynamically update Object Descriptors to add or remove media stream
objects.
The MPEG-4 specification includes parts for Systems, Visual and Audio. The XMT is a textual representation, using
XML, of MPEG-4 Systems (ISO/IEC 14496-1). The MPEG-4 specification provides conformance points to ensure
interoperability of bitstreams with receivers so that they can decode and render the content. MPEG-4 allows the
representation of content that is both complex and highly interactive, as well as, of course, much simpler content.
Binary coded MPEG-4 content can be stored in the MPEG-4 Systems file format commonly known as the mp4 format. While
the mp4 file is exchangeable it is often difficult to subsequently use it to edit or re-purpose the stored content.
This is because the binary coded representation often cannot be "reverse-engineered" in a consistent manner to
represent the content author's original intentions.
The XMT has been designed to provide an exchangeable format between content authors while preserving the author's
intentions in a high-level textual format. In addition to providing an author-friendly abstraction of the underlying
MPEG-4 technologies, another important consideration for the XMT design was to respect existing practices of content
authors such as the Web3D X3D, W3C SMIL and HTML.
The XMT is suitable for many uses including manually authored content as well as machine-generated content using
multimedia database material and templates. The XMT may be encoded and stored in the exchangeable mp4 binary file or
may also be encoded directly into streams and transmitted. XMT encoding and delivery hints exist to assist this
process.
1.1 Cross-Standard Interoperability
The XMT format facilitates content interchange between SMIL players, Virtual Reality Modeling
Language (VRML) [4] players, and MPEG-4 players. The XMT-O format can be preprocessed and played directly by a SMIL
player, preprocessed to the corresponding X3D nodes and played back by a VRML/X3D player, or compiled to a binary
MPEG-4 representation, such as mp4, and played by an MPEG-4 player.
Figure 1 presents a graphical portrayal of the interoperability of the XMT. MPEG-7 [5], as shown
in the figure, is an emerging standard for describing multimedia content. The integration of MPEG-7 with XMT, which
is currently work in progress, can enable the content-based retrieval of MPEG-4 objects.
Figure 1. - Interoperability of XMT with other standards
In this paper we provide an overview of the XMT, with a focus on its cross-standard
interoperability with the Web3D X3D and the W3C SMIL 2.0. In Section 2, we describe the two-tier XMT architecture.
XMT-A is summarized in Section 3, and XMT-O is summarized in Section 4. We present in Section 5 cross-standard
interoperability experiments that have been conducted, and Section 6 concludes the paper.
2. XMT TWO-TIER ARCHITECTURE
The XMT consists of two levels of textual syntax and semantics: the XMT-O format and the XMT-A
format.
The XMT-A is an XML-based version of MPEG-4 content that closely mirrors its binary
representation. The goals of the XMT-A format are to provide a textual representation of MPEG-4 Systems binary
coding, including supporting a deterministic one-to-one mapping to the binary representations for conformance, and to
be interoperable with X3D [2], which is being developed for VRML 200x (X3D) by the Web3D Consortium.
XMT-A contains a textual representation of BIFS for both the nodes and the commands for scene
updates. The XMT textual format for the nodes is fully compatible with the XML representation of the subset of X3D
that is contained within MPEG-4. MPEG-4 Systems scene description originally used VRML 97 as a base and added
further nodes of synthetic representation, audio and 2D support. VRML 97 is an ASCII textual representation while
X3D is based on XML. Currently the X3D specification has additional functionality that is not present in MPEG-4; hence
XMT contains its subset rather than complete coverage. The XML for additional nodes in MPEG-4 has been based on the
same representation philosophy defined by X3D that originally guided their XML conversion from VRML plain text. This
strategy will facilitate change if future harmonization efforts between MPEG-4 and X3D make more functionalities
common.
XMT-A also has textual representations of features unique to MPEG-4 Systems, not found in X3D,
such as the Object Descriptor Framework that includes ODs (Object Descriptors), other descriptors and commands. The
ODs are used to associate scene description components to the Elementary Streams that contain the corresponding coded
data. These include visual streams, audio streams as well as animation streams that update elements of the scene
more efficiently than BIFS commands for complex animations. XMT-A also includes a textual representation to allow
MPEG-J streams to be created either from Java classes or zip files.
The XMT-O is a high-level abstraction of MPEG-4 features designed around the W3C SMIL 2.0 [3],
an XML-based language that allows authors to create dynamic, interactive multimedia presentations. At the time of
writing, however, the W3C Synchronized Multimedia (SYMM) working group is still developing the SMIL 2.0
standard, the successor to SMIL 1.0. Also, as the MPEG standardization process is not yet complete for XMT, the XMT
syntax and samples as provided in this paper are based on the latest version of the XMT specification.
Using XMT-O, authors can describe the temporal behavior and layout of media objects to form
multimedia presentations, as well as associate hyperlinks with media objects in the presentation. Animation and
interactive event based behavior can also be easily expressed, as well as transition effects. Unlike SMIL the XMT
has features that enable control over many intrinsic properties of the media objects in support of the capabilities
present in MPEG-4.
For each XMT-O construct, one or more possible mappings exist to MPEG-4 Systems that can
represent the same content. Such mappings may be represented in XMT-A by various combinations and sequences of XMT-A
elements. From this discussion it can be concluded that there is no single deterministic mapping between the two
levels, for the obvious reason that a high-level author's intentions can be expanded to more than one sequence of
low-level constructs. However, with such non-determinism comes flexibility. XMT-O content can thus be converted and
targeted to MPEG-4 players of varying capability. For example, animation in a scene may be mapped using BIFS
commands for a simple player, whereas for a more capable player the animation may be mapped, potentially more
efficiently, to an animation stream.
Within XMT-A and XMT-O a set of common elements such as MPEG-7, authoring elements, encoding
hints, delivery hints and publication hints have been identified and collected into a common section used within both
formats that is designated XMT-C.
3. XMT-A FORMAT
The XMT-A format is a direct textual representation of MPEG-4 Systems which, at its core, has a
compatible scene graph representation of nodes that MPEG-4 has in common with X3D, developed by the Web3D consortium.
Such compatibility facilitates content interchange and interoperability with X3D.
XMT-A contains a textual representation for the MPEG-4 Systems BIFS (BInary Format for Scenes) as
well as the Object Descriptor Framework. The scene description declares the spatial-temporal relationship of audio,
visual, graphic and synthetic objects. This includes the nodes for the scene description, routes that connect fields
between nodes and BIFS commands to deliver and update the scene.
The Object Description Framework specifies the elementary stream resources that provide the
time-varying data for the scene and consists of a set of descriptors that allows the identification, description and
association of elementary streams to objects in the scene description and to each other. A textual representation of
the MPEG-4 Object Descriptor Framework includes the descriptors, commands to update the descriptors and also
representation for elementary stream data.
Elementary stream descriptors contain information about the stream type, configuration
information for the decoding process and dependencies between streams for scaleable, layered codecs. Alternate
stream representations may be present and language descriptors describe language-based alternatives.
Some elementary streams, such as BIFS and OD, have textual representations to directly create the
streams via an encoding process. Other elementary streams, such as video and audio, have no textual representation
of the media itself and external media sources are referenced to provide the stream data itself.
When MPEG-4 content is represented as a document using XMT-A, the XML document instance, as it is
known, is structured according to defined rules for element placement. These rules are specified using the
XML Schema language, developed by the W3C, to provide the definition of XMT-A.
When encoding XMT-A into binary, MPEG-4 allows alternate coding schemes, e.g. list versus vector,
which produce alternate binary representations that are all valid. XMT-A provides a set of encoding
rules to ensure content is coded consistently and also to support deterministic coding for conformance.
4. XMT-O FORMAT
The goals of the XMT-O format are to provide ease of use for content creation, to facilitate
content exchange among authors and authoring tools, and to provide a content representation that is compatible and
interoperable with the Synchronized Multimedia Integration Language (SMIL) 2.0.
XMT-O describes audio-visual objects and their relationships at a high level, where content
requirements are expressed in terms of the author's intent, using media, timing and animation abstractions, rather
than by coding explicit nodes and route connections. Higher-level constructs facilitate, among other aspects,
content exchange between authors and authoring tools, and content re-purposing.
An algorithm, or an authoring tool, would compile (or map) an XMT-O construct into MPEG-4
content, i.e., into BIFS, OD and media streams etc., along with any appropriate audio/visual media compression or
conversions that may be required. Media sources can be of a variety of formats native to the machine where the
algorithm is executing, and it is the responsibility of the tool, during the compilation phase, to convert media to
suitable target formats for MPEG-4 and appropriate bit-rates etc.
In converting the XMT-O format to MPEG-4 there is not necessarily only one mapping for each
construct. MPEG-4 nodes and routes are very powerful tools and there can often be more than one way to represent
XMT-O constructs. Also, as MPEG-4 nodes can be 'wired' together with routes in many combinations, it is often
difficult to reverse engineer an author's intent from a collection of nodes and routes. When confronted by content,
containing many nodes and routes, the re-authoring and maintenance can be quite challenging, if the high-level view
of that presentation must be inferred. The XMT-O, however, provides a high-level view with high-level authoring
constructs and thus facilitates content exchange, rapid content re-purposing, re-authoring and ongoing maintenance of
content.
Recognizing though that some authors may wish to access low-level nodes/routes, XMT-O allows the
embedding of the XMT-A node and route definitions, within an identified low-level escape section, to create custom
media constructs.
4.1 Re-using SMIL in XMT-O
To create the XMT-O format, as a high-level abstraction and representation of content authors'
intentions, SMIL is used as a basis. SMIL is an XML-based language that allows authors to write interactive
multimedia presentations. The main strengths of SMIL are that its constructs are self-describing, it is based on XML
which provides an excellent format for interchange of data among different applications, it is relatively easy to
author, and it is a language familiar to many content creators. The language is also extensible so that new objects
or metadata can be inserted easily into the representation.
The XMT-O format re-uses as a base a subset of the modules defined by SMIL where the functions
and semantics are compatible. To that base a new set of elements have been designed explicitly for XMT-O that
express a high-level view of MPEG-4 specific content. The XMT-O format has not specifically been designed as a
playback format, unlike SMIL, rather it is intended that the constructs are mapped and compiled to a binary MPEG-4
representation. However XMT-O may also be processed for exchange with SMIL or mapped to X3D within the limits of
compatible capabilities.
To enable reuse of SMIL defined functions, the SMIL language has been decomposed into a number of
functional areas that have been broken down to a finer granularity using modules. These modules, comprising XML
elements, attributes, and attribute values, can then be combined and brought together in other host languages such as
XMT-O. XMT-O is referred to as a host language since it integrates, or hosts, the modules within a larger set of XML
representation. SMIL provides requirements and guidelines for integration of these modules into other host languages,
and XMT-O follows these rules, as far as possible, to preserve compatibility with SMIL and to adhere to both the
semantics of the modules and their syntax. SMIL is a language that was designed first, with implementations
following. MPEG-4, however, is an existing binary specification and implementation, and XMT-O is being developed
as a high-level language to represent it.
XMT-O must be mapped (compiled) into MPEG-4 and maintaining the semantics of certain behaviors
specified by SMIL for all cases can be difficult. It is however the authoring tool's responsibility to maintain
correct semantics during the mapping. To achieve satisfactory mappings the authoring tool may use all the power of
the MPEG-4 representation, including Scripting and MPEG-J. It would, of course, limit the use of MPEG-4 tools to
those included in the MPEG-4 profile and level for which the presentation is being created.
Mappings for the semantics are both static and dynamic in nature. Static mappings capture the
semantics for deterministic behavior that can be fully evaluated at authoring time. Dynamic mappings require runtime
support of MPEG-4 player mechanisms and would be utilized to support non-deterministic behavior such as unpredictable
user interactions whose timing cannot be fully evaluated into a fixed static timeline.
Providing a detailed account of SMIL modules is beyond the scope of this paper. In this section
we will focus on the media and timing modules as these provide the basic core of the format. Media and timing are
also the foundation to SMIL 1.0 functions and hence to SMIL 2.0 functions as well.
4.2 Extensible Media (xMedia) Objects
SMIL provides a useful abstraction for media objects. However SMIL concerns itself with
multimedia player composition rather than multimedia object composition. As such SMIL coordinates the temporal and
spatial layout of media players but does not provide facilities to manipulate the internal properties of the media
being played by the media players.
MPEG-4 is, however, very much focused on scene composition and the combination of audio-visual
(multimedia) objects and the manipulation of their fundamental properties to create rich, interactive, dynamic
presentations. For example SMIL would handle a text media object by passing the media data to a text player capable
of handling the associated mime-type. It would let the text player concern itself about any fonts, style, kerning,
colors etc., and the representation of the text and any attributes would be part of the internal media data structure
for that mime-type and opaque to SMIL. However an MPEG-4 text object includes font, style, alignment, colors etc.,
and standardizes the representation and data streams so that an MPEG-4 player is intimately aware of the detail and
fundamental properties of media objects.
MPEG-4 contains audio, image, video and text media as in SMIL. It also has 2D media elements
similar to those that are described in Scalable Vector Graphics (SVG) [6], such as the <Rectangle> and <Circle>
elements. SVG also uses modules defined by SMIL, for example the Content Selection and Animation modules. The
Animation module was in fact a joint development between SMIL and SVG.
Recognizing both the similarities and differences between SMIL and MPEG-4, the XMT-O media
elements are based on the SMIL media elements and have been extended to include additional child elements to represent
the fundamental properties of the media. Thus XMT-O defines a set of extensible media (xMedia) elements as basic
building blocks, representing multimedia objects, that can be combined in complex spatial and temporal layouts and
whose fundamental properties can be animated to create the rich, dynamic, interactive content that MPEG-4 is all
about.
An xMedia object is defined by an element, such as <img> or <rectangle>, which abstracts
geometry, containing media-specific geometric property attributes as well as timing attributes, defined by SMIL, for
temporal layout. Spatial properties of an xMedia object can be further defined by a set of common child elements,
such as, <transformation>, <material>, <outline>, <chromakey>, <texture>, <light>, and <hotspots>. These child
elements of an xMedia object can be used to define either 2D or 3D properties, as the elements provide a combined set
of attributes for this purpose. The element is however for 2D only.
An xMedia element abstracts MPEG-4 systems and audio/visual streams and as such it is a
high-level abstraction for the BIFS, OD Framework and media streams, etc., to which the XMT-O is mapped.
The following are examples of xMedia objects, where the img media object is fully compatible with
SMIL.
<img src="portrait.jpg"/>
<circle radius="75"/>
<rectangle size="320 240"/>
<curve points="0 0; 10 12; 25 20; 15 14"/>
Also added for xMedia objects is the <hotspots> element that allows xMedia elements, such
as circle and rectangle, to be used as hotspot links. The <hotspots> element permits media elements to be
added to any other media elements and designated as a hotspot. In MPEG-4, any shape can have a TouchSensor attached
and therefore can be used as a hotspot. Both 2D and 3D objects can be used and so a wider range of shapes is
available for use in comparison to the functionally similar <area> element in SMIL. The objects used as
hotspots can be timed and also animated to provide dynamic, changing hotspots over time.
4.3 Animation
XMT-O incorporates the SMIL Animation modules to allow attributes of xMedia objects to be
animated to create dynamic content. Animation features can be mapped to MPEG-4 in various ways. The simpler
<set> element, that sets a value into an attribute, can be mapped to the MPEG-4 Replace Field BIFS update
command. The more complex <animate>, <animateColor> and <animateMotion> can be mapped to MPEG-4
TimeSensors, Interpolators and Routes.
Animation elements are timed, like xMedia elements, using the same set of timing attributes.
This allows a wide variety of timed dynamic behavior including event-based timing for the animation. The following
example shows a red circle whose color is changed from one static value to another and then later is linearly
interpolated between the colors yellow, blue and green.
<circle radius="160" begin="6s" dur="35s">
<material color="red">
<set attributeName="color" to="orange" begin="3s" dur="6s"/>
<animateColor attributeName="color" values="yellow; blue; green"
keyTimes="0; .4; 1" calcMode="linear"
begin="15s" dur="12s"/>
</material>
</circle>
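As a rough illustration of what values, keyTimes and calcMode="linear" mean for the example above, the following Python sketch evaluates the interpolated color at a given fraction of the animation's duration. It is not part of XMT or SMIL; the function name interpolate and the RGB triples chosen for yellow, blue and green are assumptions made for this illustration.

```python
from bisect import bisect_right

# RGB triples in the 0..1 range. These numeric values are assumptions for the
# named colors used in the example (green taken as #008000).
YELLOW = (1.0, 1.0, 0.0)
BLUE = (0.0, 0.0, 1.0)
GREEN = (0.0, 128 / 255, 0.0)


def interpolate(values, key_times, fraction):
    """Linearly interpolate between RGB triples.

    values    -- list of RGB triples, one per key time
    key_times -- increasing fractions of the duration, from 0 to 1
    fraction  -- elapsed time divided by the animation duration
    """
    if fraction <= key_times[0]:
        return values[0]
    if fraction >= key_times[-1]:
        return values[-1]
    # Find the segment [key_times[i], key_times[i + 1]) containing fraction.
    i = bisect_right(key_times, fraction) - 1
    t0, t1 = key_times[i], key_times[i + 1]
    w = (fraction - t0) / (t1 - t0)
    return tuple(a + (b - a) * w for a, b in zip(values[i], values[i + 1]))


# The animateColor above: three values, keyTimes "0; .4; 1", dur="12s".
values = [YELLOW, BLUE, GREEN]
key_times = [0.0, 0.4, 1.0]
halfway_to_blue = interpolate(values, key_times, 0.2)
```

With these inputs, a fraction of 0.2 lies halfway through the first segment, so the result is midway between yellow and blue. This is the same calculation that a ColorInterpolator node performs when the animation is mapped to MPEG-4 nodes and routes.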
4.4 Timing
XMT-O elements are temporally arranged and synchronized using the SMIL Timing and Synchronization
modules. The terms SMIL timing and XMT-O timing (or simply XMT timing) are used interchangeably in this section but
distinguished when necessary. The syntax and semantics of timing elements and attributes are according to the SMIL
specification.
- The <par> element plays one or more child elements allowing "parallel" playback.
- The <seq> element plays the child elements one at a time in sequence.
- The <excl> element plays one child at a time, but does not impose any order.
The Timing modules also specify attributes to control an element's timing behavior. The
essential timing attributes are:
- dur: specifies the duration of an element.
- begin: specifies the begin time of an element in a variety of ways, ranging from simple clock times to
event based occurrences, e.g. a mouse-click.
- end: specifies the end time of an element in a variety of ways as per the begin time.
- min: specifies the minimum active duration of an element.
- max: specifies the maximum active duration of an element.
The following shows an example of the use of <par> and <seq> time containers. The
<par> will begin at t=5s and at that time the rectangle and the polygons media object in the <seq> will
start playing. At t=6s (1s after the <par> begins) the img will begin so that three media objects are playing in
parallel. At t=10s the rectangle will end, and at t=12s the img will end leaving only the polygons media object
playing. At t=15s the polygons object will end and the next element in the sequence, the circle, will begin to play
for its 2s duration. At t=17s the sequence ends with the circle and the par ends too.
<par begin="5s">
<rectangle size="10 10" dur="5s"/>
<img src="my.jpg" begin="1s" dur="6s"/>
<seq>
<polygons coord="10 10; 34 45; 23 12" dur="10s"/>
<circle radius="50" dur="2s"/>
</seq>
</par>
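The resolved timeline of such a fragment can be checked mechanically. The following Python sketch is only an illustration under simplifying assumptions: it handles clock values of the form "5s", ignores event-based timing, and assumes that a <par> ends when its last child ends and that a <seq> plays its children back to back. The function names seconds, schedule and timeline are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<par begin="5s">
  <rectangle size="10 10" dur="5s"/>
  <img src="my.jpg" begin="1s" dur="6s"/>
  <seq>
    <polygons coord="10 10; 34 45; 23 12" dur="10s"/>
    <circle radius="50" dur="2s"/>
  </seq>
</par>
"""


def seconds(value):
    """Parse a clock value such as '5s'; other clock syntaxes are not handled."""
    return float(value.rstrip("s"))


def schedule(elem, parent_start, out):
    """Append (tag, begin, end) for elem and its descendants; return elem's end."""
    start = parent_start + seconds(elem.get("begin", "0s"))
    if elem.tag == "par":
        # Children start together; the container ends with its last child.
        end = start
        for child in elem:
            end = max(end, schedule(child, start, out))
    elif elem.tag == "seq":
        # Each child starts when the previous one ends.
        end = start
        for child in elem:
            end = schedule(child, end, out)
    else:
        end = start + seconds(elem.get("dur", "0s"))
    out.append((elem.tag, start, end))
    return end


def timeline(xml_text):
    """Return (tag, begin, end) tuples sorted by begin time, then end time."""
    out = []
    schedule(ET.fromstring(xml_text), 0.0, out)
    return sorted(out, key=lambda entry: (entry[1], entry[2]))


for tag, begin, end in timeline(FRAGMENT):
    print(f"{tag:10s} {begin:5.1f}s .. {end:5.1f}s")
```

Because the begin offset of a child is relative to its time container, every absolute time in the output is the container's start plus the child's own offset.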
XMT-O also supports timing attributes, such as repeatCount, restart and endsync to allow
repeats, restart, and to force the time container to end when selected child element(s) end.
4.4.1 EventTiming
XMT-O, like SMIL, has event-based timing where the beginning or end of an object can be triggered
by events. Events are key to supporting non-deterministic behavior to provide engaging interactive content, whether
it is for 2D or 3D content.
As an example, the XMT-O fragment below shows a circle that begins playing (appears) when an
object called myButton is clicked.
<circle radius="240" begin="myButton.click">
<material color="green"/>
</circle>
Both 2D and 3D xMedia objects can generate events such as click, mouseup, mousedown, mouseout and
mouseover. In addition to providing these basic events, XMT-O also supports more advanced events concerned with the
interaction of MPEG-4 objects in both 2D and 3D spaces, such as, near and collide.
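Section 4.1 distinguishes static mappings, which can be fully evaluated at authoring time, from dynamic mappings, which need runtime support in the player. One way an authoring tool might make that decision is by inspecting begin values, as in the Python sketch below. This is a hypothetical helper rather than anything defined by XMT: the name classify_begin and the two regular expressions are assumptions, and they cover only simple clock values and id.event values.

```python
import re

# Simple clock values such as "6s", "1.5s", "200ms" (a subset of SMIL clock syntax).
CLOCK = re.compile(r"^\d+(\.\d+)?(h|min|s|ms)?$")

# Event values such as "myButton.click", optionally with an offset ("myButton.click+2s").
EVENT = re.compile(r"^[A-Za-z_][\w-]*\.[A-Za-z]+([+-]\d+(\.\d+)?(h|min|s|ms)?)?$")


def classify_begin(value):
    """Return 'static' for clock values, 'dynamic' for event-based values."""
    value = value.strip()
    if CLOCK.match(value):
        return "static"   # can be placed on a fixed timeline at authoring time
    if EVENT.match(value):
        return "dynamic"  # needs runtime support, e.g. a sensor and routes
    return "unknown"      # syntax not covered by this sketch


print(classify_begin("6s"))              # a clock value
print(classify_begin("myButton.click"))  # an event-based value
```

A begin value classified as dynamic is the kind of case that would be mapped to sensors and routes rather than to a fixed point on a timeline.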
4.5 XMT-O Examples
This section contains an example of the XMT-O high-level format and mapping into MPEG-4
systems.
4.5.1 Rectangle with finite duration color animation on mouse press
The following example shows a rectangle whose color changes over a 6s (second) duration
beginning when the mouse button is pressed down.
Using the XMT-O format a rectangle is defined of size 50,50 (a square) whose mid-point is
positioned at coordinate x=40 y=75, using a child transformation element. A child material element provides the
rectangle with a color and specifies that it is to be drawn as a filled shape rather than an outline. The material then
has an animateColor child element that describes a linear interpolation over three color values, beginning when the
mouse is clicked and lasting 6s.
<rectangle id="Square" size="50 50">
<transformation visibility="true" translation="40 75"/>
<material color="#ee0000" filled="true">
<animateColor attributeName="color"
dur="6s" begin="Square.click"
values="#ee0000; #ffcc45; #ffffff"
keyTimes="0; 0.3; 1" calcMode="linear"/>
</material>
</rectangle>
The following XMT-A fragment shows one way the XMT-O example above could be mapped to
MPEG-4 nodes and routes. The necessary OD Framework elements, such as InitialObjectDescriptor, have been
omitted for clarity.
To create the scene we need a BIFS command to Replace Scene. To provide the root node for the
scene a top-level node is needed. Here an OrderedGroup is used to create the necessary 2D context for the Rectangle.
Then there is an MPEG-4 Switch node so the entire xMedia object can be hidden or shown (visibility="true" attribute
above).
The XMT-O rectangle is the basic pattern of Shape containing appearance and geometry, where
geometry is a Rectangle and appearance contains a Material2D describing the color. The Rectangle (Shape) is set
under a Transform2D to position it. (This is opposite to XMT-O where the transformation is a child element of
rectangle.)
To sense mouse activity a TouchSensor is needed and to define the duration of the color change a
TimeSensor. Then there is a ColorInterpolator to animate the color. To make the behavior work the TouchSensor
touchTime is routed to the TimeSensor startTime that starts the TimeSensor when the mouse is pressed. Then the
fraction_changed output of the TimeSensor is routed to the set_fraction input of the ColorInterpolator; and finally
the value_changed output of the ColorInterpolator is routed to the emissiveColor field of the Rectangle's Material2D.
The requisite nodes must, of course, be given DEF names so that they can be identified for the routes.
<Replace>
<Scene>
<OrderedGroup>
<children>
<Switch whichChoice="0">
<choice>
<Transform2D translation="40 75">
<children>
<Shape>
<appearance>
<Appearance>
<material>
<Material2D DEF="SquareMat"
emissiveColor="0.93 0.0 0.0"
filled="TRUE" />
</material>
</Appearance>
</appearance>
<geometry>
<Rectangle size="50 50"/>
</geometry>
</Shape>
<TouchSensor DEF="Touch" />
<TimeSensor DEF="Timer" cycleInterval="6" />
<ColorInterpolator DEF="Coloring"
key="0.0 0.3 1.0"
keyValue="0.93 0.0 0.0, 1.0 0.8 0.27, 1.0 1.0 1.0"/>
</children>
</Transform2D>
</choice>
</Switch>
</children>
</OrderedGroup>
<ROUTE fromNode="Touch"
fromField="touchTime"
toNode="Timer"
toField="startTime" />
<ROUTE fromNode="Timer"
fromField="fraction_changed"
toNode="Coloring"
toField="set_fraction" />
<ROUTE fromNode="Coloring"
fromField="value_changed"
toNode="SquareMat"
toField="emissiveColor" />
</Scene>
</Replace>
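The color fields in the XMT-A mapping carry the same colors as the XMT-O values attribute, rescaled from hexadecimal to the 0..1 range. The Python sketch below shows that conversion; the function names hex_to_rgb and key_value and the rounding to two decimals are assumptions made for this illustration, not rules taken from the XMT specification.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to three floats in 0..1, rounded to two decimals."""
    digits = color.strip().lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected #rrggbb, got {color!r}")
    return tuple(round(int(digits[i:i + 2], 16) / 255, 2) for i in (0, 2, 4))


def key_value(values):
    """Turn an XMT-O 'values' list into a ColorInterpolator keyValue string."""
    colors = [hex_to_rgb(v) for v in values.split(";")]
    return ", ".join(" ".join(str(c) for c in rgb) for rgb in colors)


# The values attribute from the XMT-O rectangle example.
print(key_value("#ee0000; #ffcc45; #ffffff"))
```

The keyTimes attribute needs no conversion; its entries are copied into the key field of the ColorInterpolator.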
5. CROSS-STANDARD INTEROPERABILITY EXPERIMENTS
In this section we describe interoperability experiments we have conducted: one set with Web3D
X3D and the other with W3C SMIL 2.0.
5.1 XMT-A/X3D Interoperability
A cross-standard interoperability experiment of XMT-A with X3D was conducted at the 54th MPEG
meeting in La Baule, France, Oct. 2000. We used a set of X3D examples available from the Web3D site:
http://www.web3d.org and performed the following steps to verify interoperability:
- Select an X3D example, and render the corresponding VRML file by a VRML browser.
- Replace the X3D header of the selected file with the XMT-A header (see below).
- Validate the XMT-A file against the XMT-A schema.
- Compile (or encode) the XMT-A file to an mp4 file.
- Render the mp4 by the MPEG-4 reference player.
- Verify that there are no practically observable differences between the two playbacks.
We were able to successfully verify interoperability of XMT with X3D on a subset of the X3D
files to the extent the supported functionalities were compatible.
The table below compares the XMT-A and X3D representations to illustrate the high degree of
compatibility and the small amount of changes necessary to go from X3D to MPEG-4 or vice versa; within the set of
elements that are contained in both standards.
| XMT-A                              | X3D
| <Header>                           | <Header>
|   <meta>                           |   <meta>
|   </meta>                          |   </meta>
|   <InitialObjectDescriptor .../>   |
| </Header>                          | </Header>
| <Body>                             |
|   <Replace>                        |
|     <Scene>                        | <Scene>
|       <!-- The scene contents -->  |   <!-- The scene contents -->
|     </Scene>                       | </Scene>
|   </Replace>                       |
| </Body>                            |
To completely convert the document instance from X3D to XMT-A, or vice versa, the outer
<X3D> or <XMT-A> element, with schema namespace references, needs to be altered accordingly.
Note that an X3D <Scene> does not need to have a <Group> at the top level, while
MPEG-4 requires a top-level node such as <Group>, <OrderedGroup>, <Layer2D> or <Layer3D> as
the root of the scene graph. If the X3D scene does not have a single <Group> at the root it will also be
necessary to add this when converting to XMT-A.
Note also that in X3D, image, video and audio sources are referred to directly by URLs. While MPEG-4
can express the urls in an identical manner it is more likely that a conversion would create ObjectDescriptors for
these media types and replace the source url references by ObjectDescriptor Ids.
According to the above rules the following example illustrates the X3D -> XMT-A conversion by
showing identical content first in X3D and then in XMT-A. The emboldened blue text highlights the differences between
them.
The content comprises a single red box located at x, y, z coordinates -3, 0, 0.
<X3D>
<Scene>
<Group>
<children>
<Transform DEF="ABox" translation="-3.0 0.0 0.0">
<children>
<Shape>
<geometry>
<Box/>
</geometry>
<appearance>
<Appearance DEF="App">
<material>
<Material diffuseColor="1.0 0.0 0.0"/>
</material>
</Appearance>
</appearance>
</Shape>
</children>
</Transform>
</children>
</Group>
</Scene>
</X3D>
To convert the X3D to XMT-A an InitialObjectDescriptor is added into a Header section, and a Body
section is added with a Replace command to contain the Scene. The InitialObjectDescriptor contains information to
locate the BIFS stream and configuration for the BIFS decoder.
<XMT-A>
<Header>
<InitialObjectDescriptor>
<ProfDescr>
<esDescr>
<ES_Descriptor>
<decConfigDescr>
<DecoderConfigDescriptor bufferSizeDB="auto"
objectTypeIndication="1" streamType="3">
<decSpecificInfo>
<BIFSConfig nodeIDbits="auto"
pixelWidth="480" pixelHeight="320"
pixelMetric="true" routeIDbits="auto"/>
</decSpecificInfo>
</DecoderConfigDescriptor>
</decConfigDescr>
<slConfigDescr>
<SLConfigDescriptor timeStampLength="auto"
timeStampResolution="auto"
useAccessUnitStartFlag="true"/>
</slConfigDescr>
</ES_Descriptor>
</esDescr>
</ProfDescr>
</InitialObjectDescriptor>
</Header>
<Body>
<Replace>
<Scene>
<Group>
<children>
<Transform DEF="ABox" translation="-3.0 0.0 0.0">
<children>
<Shape>
<geometry>
<Box/>
</geometry>
<appearance>
<Appearance DEF="App">
<material>
<Material diffuseColor="1.0 0.0 0.0"/>
</material>
</Appearance>
</appearance>
</Shape>
</children>
</Transform>
</children>
</Group>
</Scene>
</Replace>
</Body>
</XMT-A>
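The structural part of this conversion can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference converter: the function name `x3d_to_xmta` is hypothetical, schema namespaces are ignored, and the header is passed in by the caller (an empty `<Header>` stands in when none is given, whereas a real file would need an InitialObjectDescriptor like the one above).

```python
import xml.etree.ElementTree as ET

# Grouping nodes the text names as acceptable roots of an MPEG-4 scene graph.
TOP_LEVEL_NODES = {"Group", "OrderedGroup", "Layer2D", "Layer3D"}


def x3d_to_xmta(x3d_text, header=None):
    """Wrap the <Scene> of an X3D document in an XMT-A Header/Body skeleton."""
    x3d_root = ET.fromstring(x3d_text)
    scene = x3d_root.find("Scene")
    if scene is None:
        raise ValueError("no <Scene> element found")

    nodes = list(scene)
    # If the scene lacks a single grouping node at the root, add a <Group>.
    if not (len(nodes) == 1 and nodes[0].tag in TOP_LEVEL_NODES):
        group = ET.Element("Group")
        children = ET.SubElement(group, "children")
        for node in nodes:
            scene.remove(node)
            children.append(node)
        scene.append(group)

    xmta = ET.Element("XMT-A")
    # Placeholder header only; the caller should supply a real one.
    xmta.append(header if header is not None else ET.Element("Header"))
    body = ET.SubElement(xmta, "Body")
    replace = ET.SubElement(body, "Replace")
    replace.append(scene)
    return xmta
```

Calling `ET.tostring(x3d_to_xmta(text), encoding="unicode")` on the X3D example yields the same `<Body>` content as the XMT-A listing, because the scene nodes are moved across unchanged.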
5.2 Interoperability with SMIL
A cross-standard interoperability experiment of XMT-O with SMIL 2.0 was conducted at the W3C SYMM (SMIL 2.0) Working
Group in Cambridge, MA, USA, in March 2001. We used a set of SMIL 2.0 test cases and performed the following steps to
verify interoperability:
- Select a SMIL example, and render it on a SMIL player.
- Replace the SMIL header of the selected file with the XMT-O header (see below).
- Validate the XMT-O file against the XMT-O schema.
- Compile (or encode) the XMT-O file to an XMT-A file, and encode it to mp4.
- Render the mp4 on an MPEG-4 player.
- Verify that there are no practically observable differences between the two playbacks.
We successfully verified interoperability of XMT-O with SMIL 2.0 on a subset of the
SMIL files, to the extent that the supported functionalities were compatible.
A further cross-standard interoperability experiment of XMT-O with SMIL 2.0 was conducted at the
56th MPEG meeting in Singapore in March 2001, as follows:
- Select an XMT-O example with MPEG-4 specific functionalities (see below).
- Replace the XMT-O header of the selected file with the SMIL header (see below).
- Render the SMIL file on a SMIL player.
- Compile (or encode) the XMT-O file to an XMT-A file, and encode it to mp4.
- Render the mp4 on an MPEG-4 player.
- Verify that there are no practically observable differences between the two playbacks.
This experiment was also successful. The SMIL player used for this experiment was able to ignore
MPEG-4 specific elements and rendered compatible functionalities as expected.
The following example shows SMIL -> XMT-O conversion. The emboldened example text highlights the
differences, which are minimal.
The content is a slideshow presentation comprising three images in a sequence. Each image is
shown for a maximum of 10s and a minimum of 3s. The slideshow plays for 30s without any user interaction (3
images of 10s). If the user clicks an image after 3s, the activateEvent (the event raised by the click) ends the
image immediately and moves the presentation to the next one. If the image is clicked before the minimum of 3s, it
continues until it has been shown for 3s, and only then does the presentation move to the next one.
<smil>
<body>
<seq>
<img src="Slide1.jpg" max="10s" min="3s" end="activateEvent" />
<img src="Slide2.jpg" max="10s" min="3s" end="activateEvent" />
<img src="Slide3.jpg" max="10s" min="3s" end="activateEvent" />
</seq>
</body>
</smil>
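The timing rule described for each image can be written as a small function. This is an illustrative sketch of the behaviour stated in the text, not player code: the function names are hypothetical, and click times are assumed to be measured in seconds from the moment the image starts.

```python
def image_end_time(click_time=None, min_s=3.0, max_s=10.0):
    """Seconds after its start at which one image ends.

    click_time is None when the user never clicks the image.
    """
    if click_time is None:
        return max_s
    # A click cannot end the image before min_s, and the image never outlasts max_s.
    return min(max(click_time, min_s), max_s)


def slideshow_duration(click_times):
    """Total running time of the sequence, one click_time entry per image."""
    return sum(image_end_time(t) for t in click_times)


print(slideshow_duration([None, None, None]))  # 30.0, no interaction
print(slideshow_duration([1.0, 5.0, None]))    # 18.0, i.e. 3 + 5 + 10
```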
To convert the SMIL example above to XMT-O the <smil> element is replaced by an
<XMT-O> element. Schema namespace definitions have been omitted for clarity.
<XMT-O>
<body>
<seq>
<img src="Slide1.jpg" max="10s" min="3s" end="activateEvent" />
<img src="Slide2.jpg" max="10s" min="3s" end="activateEvent" />
<img src="Slide3.jpg" max="10s" min="3s" end="activateEvent" />
</seq>
</body>
</XMT-O>
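The header swap used in both experiments amounts, for this simple example, to renaming the outer element. A minimal sketch follows, assuming (as in the listings above) that schema namespace definitions are omitted; the function name `swap_root` is hypothetical, and a real conversion would also have to rewrite the namespace references.

```python
import xml.etree.ElementTree as ET


def swap_root(document_text, new_root_tag):
    """Return the document with its outer element renamed and its content untouched."""
    root = ET.fromstring(document_text)
    root.tag = new_root_tag
    return ET.tostring(root, encoding="unicode")
```

For instance, `swap_root(smil_text, "XMT-O")` gives the SMIL to XMT-O direction, and `swap_root(xmto_text, "smil")` gives the reverse direction used in the second experiment.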
6. CONCLUSION
This paper described the Extensible MPEG-4 Textual Format (XMT) framework. The XMT framework
consists of two levels of textual syntax and semantics: the XMT-A format, which provides a one-to-one textual format
representation of the MPEG-4 binary constructs, and the XMT-O format which provides a high-level abstraction to
content authors so they can exchange the content with other authors while preserving the original intent. The XMT-A
provides interoperability between VRML/X3D and MPEG-4, and the XMT-O provides interoperability between SMIL and
MPEG-4.
7. REFERENCES
| [1] | ISO/IEC FDIS 14496, Information Technology - Generic Coding of Audio-Visual Objects - Part 1: Systems, Part 2: Visual, Part 3: Audio, Part 6: DMIF, International Organization for Standardization, 1998. |
| [2] | ISO/IEC FDIS 14772:200x, Information Technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML). |
| [3] | "Synchronized Multimedia Integration Language (SMIL) 1.0 Specification" W3C Recommendation, http://www.w3.org/TR/REC-smil/ (June 1998). |
| [4] | ISO/IEC FDIS 14772-1:1997, Information Technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML) - Part 1: Functional specification and UTF-8 encoding. |
| [5] | "MPEG-7: Context, Objectives and Technical Roadmap, V.12", ISO/IEC JTC1/SC29/WG11 N2861 (July 1999). |
| [6] | "Scalable Vector Graphics (SVG) 1.0 Specification", W3C Working Draft, http://www.w3.org/TR/2000/03/WD-SVG-20000303/index.html (March 2000). |

|