This paper presents an overview of Flextime.
MPEG-4 Flexible Timing Standard (Flextime)
|
Michelle Kim
IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7709
mykim@us.ibm.com
|
Steve Wood
IBM T.J Watson Research
19 Skyline Drive
Hawthorne, NY. 10532 USA
1-914-784-7309
woodsp@us.ibm.com
|
ABSTRACT
This paper describes the declarative, flexible timing approach for streaming multimedia
playback, as standardized in MPEG-4, which allows authors to define adaptive presentation behavior in response to
network delivery delays, thus allowing a seamless playback experience for the end user.
Categories and Subject Descriptors
[Multimedia networking and system support]: Audio/video streaming, Internet protocols, multimedia servers, QoS
General Terms
Algorithms, Standardization, Languages
Keywords
Flextime, MPEG-4, SMIL, Timing, XMT
1. INTRODUCTION
Streaming media solutions, to be effective in a best-effort (non-QoS) Internet environment, need
to be able to cope with network delivery delays across the media streams over the duration of the presentation
playback. The media streams may all be delivered from a single server, or, in a more advanced case, a client may be
composing streams from multiple servers that are widely dispersed over the network geography.
To operate better in such a best-effort network environment, Flextime defines a flexible,
relative-timing-based system so that presentation playback can be altered to accommodate network delays and avoid
disruptions to the end user's experience. In Flextime the playback of media streams can be delayed and/or their
playback durations altered in response to network delay. Content that includes Flextime allows the author to specify
both the relative timing and the bounds of the flexibility by which the playback may be altered. Flextime
can thus be seen in some ways as application-level QoS over the presentation playback.
This paper presents Flextime, as standardized in MPEG-4, and contrasts a fixed timing model, the
traditional MPEG base timing model, against this flexible timing approach.
2. FIXED VERSUS RELATIVE TIMING
MPEG-4 [3] is a composite visual/audio multimedia standard with the capacity to play multiple
media streams, which are received over a wide variety of network transports, in a single presentation. MPEG-4
Systems defines a framework to describe and manage these streams, and has an updateable scene that can combine and
present these many media streams as a compelling, rich, interactive, dynamic composition. Each media stream
must be buffered and decoded, and the resultant output (composition units) forwarded to a rendering device for
presentation and composition to the display device.
To manage the resources and synchronization for these streams MPEG-4, in the Systems part of the
standard, defines a theoretical buffer and timing model to which an implementation must be conformant so that servers
and clients can operate together.
This theoretical model is based on precise clock timing and management of buffer memory resources
using that timing; buffer occupancy being based on decoding times such that once a decodable unit has reached its
decoding time it is extracted, decoded, and the buffer space freed for subsequent data. End-to-end delay is assumed
to be constant in this model. Clocks can be synchronized, as needed, between the transmitter and receiver using
reference clock time signals from the transmission source; the receiver applies phase-locked loop techniques during
playback with the aim of eliminating any clock drift.
This fixed timing model is a traditional model that MPEG has used [4] and has been designed
primarily for broadcast applications to support guaranteed playback in receivers over extended periods of time.
Clocks are synchronized this way to avoid buffer overrun or under-run problems. The delivery medium for such
broadcasts would be cable, satellite or regular UHF channels. Media streaming has, however, become increasingly
popular over other delivery mediums, in particular IP packet based networks where loss, jitter and unpredictable
delays mean that such rigid timing models are no longer so directly applicable. In addition, with interactive
content, users can also encounter variable response times that may adversely influence their experience with the
content leading to negative reactions in such an environment.
Extra work must therefore be done to alleviate the effects of such delivery problems. A popular
technique includes adding buffering to absorb the variable delivery delays and jitter. This approach works well
where maximum delay is deterministic but adds delay and affects perceived responsiveness. The buffer memory is
sometimes reduced to minimize this extra resultant delay; this improves the end-user experience but comes at a risk
of creating discontinuities in the media playback if delay increases. The buffering can also be made adaptive to the
network conditions and so on.
The synchronization problems get worse when media content stored on a number of widely
distributed servers is used in a single presentation. A client player would aim to integrate such streaming content
into a single seamless presentation. Here the traditional fixed timing model does not easily address synchronization
among a collection of uncoupled media transmission sources. Using a fixed timing model requires considerable effort
to insert content from a source on one time-base into another.
The problems described above, which arise from unpredictable delivery delay and integration from
multiple sources, can instead be dealt with using a timing model that allows flexibility in
synchronization. Instead of using fixed timing, this model uses declarative relative timing, expressing
timing relationships between the various media elements that make up the presentation and allowing flexibility in
their playback duration. Such a flexible timing model is now part of the MPEG-4 standard and is called Flextime, or the
Advanced Synchronization model.
The Flextime model augments the traditional MPEG-4 timing model with two new key features:
- The start and end times of media objects are made relative to other media objects in the presentation, rather than
having absolute starting and ending times. Media objects are thus connected to one another by a relationship.
- A media object can have a flexible duration, with bounded constraints for minimum and
maximum allowed playback time. This is in contrast to a fixed duration defined by absolute start and end times.
The connection relationships of the media together with their duration flexibility can be
regarded as application level quality of service that expresses the resultant acceptable presentation behavior in the
presence of network delays. This behavior, expressed by an author in a declarative syntax, is transmitted as part of
the presentation. A client player can use these media connection relationships to alter (flex) the playback
dynamically so as to adapt the timing of the ongoing presentation in response to delays in the delivery environment.
Such flexibility allows media playback to be adjusted to maintain a seamless presentation, without discontinuities,
and stay within the application author's expressed intentions and expectation. The impact of this flexible timing
functionality will be most felt in Internet applications and enhanced broadcast applications where multiple sources
need to be synchronized at the consumer device.
3. FLEXTIME MODEL
In the FlexTime model, the timing of media objects is not absolute; rather it is defined, by
means of a relationship, to be relative to other media objects. FlexTime defines three temporal relationships for
media objects: CoStart, Meet and CoEnd. These relations are based on James Allen's
temporal algebra [1].
| CoStart: | to start playback of two or more objects at the same time |
| Meet: | to start playback of a media object when another ends |
| CoEnd: | to make the end of the playback of two or more media objects occur at the same time |
The Flextime model allows the content author to express synchronization among MPEG-4 objects,
streams or stream segments, by assigning these temporal relationships among them. Flextime also includes a null
media object to model a simple time delay.
For each media object a flexible playback duration is defined by Flextime. This flexible
duration is expressed using three components: an optimal (or preferred) duration, representing the optimal or natural
playback duration for the media, plus a maximum playback duration and a minimum playback duration. A media object's
playback duration can then be stretched or shrunk within these author specified constrained limits.
Thus, when using Flextime, a media presentation can be considered as a set of connected flexible
media objects. If the media objects are viewed as springs, connected by rigid wires according to the relationships,
then the presentation can be viewed as an elastic stretchable entity. Each spring (media object) in the set will
have bounded constraints for its minimum and maximum size. Then at playback time the media objects' durations can be
adjusted according to the conditions experienced. With the entire presentation expressed this way a client can adapt
the playback aiming always to stay close to the preferred playback duration but stretching or shrinking to
accommodate the effects of network delivery conditions.
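The spring analogy can be made concrete with a small Python sketch. The class name FlexDuration, its field names and the fit method are hypothetical, introduced only for illustration; they are not part of the MPEG-4 standard. The sketch simply clamps a requested duration into the author-specified bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlexDuration:
    """Author-specified playback bounds for one media object, in seconds.

    Hypothetical helper type; the names are assumptions made for this sketch.
    """
    min_s: float
    opt_s: float
    max_s: float

    def __post_init__(self):
        if not (self.min_s <= self.opt_s <= self.max_s):
            raise ValueError("expected min <= opt <= max")

    def fit(self, requested: float) -> float:
        """Clamp a requested duration into [min_s, max_s], the spring's travel."""
        return max(self.min_s, min(self.max_s, requested))


image = FlexDuration(4, 6, 11)
print(image.fit(7))    # inside the bounds, so the request is honoured
print(image.fit(15))   # beyond the maximum, so the result is capped
```

A player would aim for opt_s and call something like fit only when delivery conditions force a different duration.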
4. FLEXTIME SCHEDULING
In FlexTime, each media object m is associated with a triple of lengths, or durations, (min, opt,
max), where min <= opt <= max, such that m may be presented over a time interval whose length falls in the
range bounded by min, a minimum, and max, a maximum, and opt is the most desirable length. We will call this
triple the "three spring constants", by analogy with the spring model described above.
The media objects may be connected to each other using temporal relations as described in the
FlexTime Model above. A set of connected media objects, each with its own spring constants, can therefore be
seen as a constraint system, in this case a temporal constraint system. Such a constraint system can be evaluated for
consistency to examine whether all the requirements can be satisfied. If there is at least one way to present it, we
say it is consistent. If there is no way to satisfy all the constraints, the author will need to be informed. This
constraint system can also be used at run time. We describe in this section an algorithm that can be used to check
the consistency of our constraint system. We will also describe an algorithm that can be used to provide runtime
scheduling of flexible objects.
4.1 Consistency Check
If X and Y are two time points, a constraint on their temporal distance can be represented as a
binary constraint of the form c1 <= X - Y <= c2, which gives rise to a set of linear inequalities on the time
points.
For example, given a media object whose start time is S1 and end time is E1, a corresponding binary
constraint can be denoted by:
min <= E1 - S1 <= max, where min is the minimum and max is the maximum permissible
duration.
If this object "meets" another object whose start time is S2, then this relationship can be
specified by:
0 <= S2 - E1 <= 0.
A network of such constraints consists of time points and a set of temporal relationships among
them. This special class of linear inequalities can be conveniently represented in a graph. For example, the
inequality above, min <= E1 - S1 <= max, can be represented with two vertices E1 and S1 which are connected by
two directed edges: one from S1 to E1 labeled +max and one from E1 to S1 labeled -min.
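As a minimal sketch of this graph representation, the Python fragment below turns a duration constraint and a Meet relation into weighted directed edges. The function names and the (u, v, w) edge convention, read as "v - u <= w", are assumptions made for illustration.

```python
def duration_edges(start, end, lo, hi):
    """Edges for lo <= end - start <= hi in a distance graph.

    An edge (u, v, w) encodes the inequality v - u <= w.
    """
    return [(start, end, hi), (end, start, -lo)]


def meet_edges(end_first, start_second):
    """Edges for 0 <= start_second - end_first <= 0, i.e. the Meet relation."""
    return duration_edges(end_first, start_second, 0, 0)


# An object lasting between 4 and 11 seconds that meets a second object.
edges = duration_edges("S1", "E1", 4, 11) + meet_edges("E1", "S2")
for u, v, w in edges:
    print(f"{u} -> {v}: {w}")
```

Because Meet forces two time points to coincide, an implementation could equally merge them into one vertex, which is what the labels p1, p2, p3 in the scheduling example amount to.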
This graph representation can be solved by applying a shortest-paths algorithm. As in Rina
Dechter [6], we solve this system by constructing a distance graph given a set of linear inequalities and applying to
it the Floyd-Warshall all-pairs-shortest-paths algorithm, as shown below.
1. for i := 1 to n do d_ii ← 0;
2. for i, j := 1 to n do d_ij ← a_ij;
3. for k := 1 to n do
4.    for i, j := 1 to n do
5.       d_ij ← min {d_ij, d_ik + d_kj};
Here a_ij is the distance between points i and j as given in the distance graph, and d_ij is the shortest
distance between the two points i and j thus computed.
A given network is consistent if and only if its distance graph has no negative cycles. This
algorithm, which has O(n^3) complexity, also produces, if the system is consistent, two sets of possible start
points for all the objects: earliest possible times and latest possible times. A more detailed treatment of the
Flextime model (or Elastic Time) can be found in [2].
Note that in this algorithm optimal durations have been ignored. However, if there is no
delivery problem over the network, optimal durations can be used for all the objects, provided the resulting distance
graph is consistent. Consistency can be checked by building another distance graph, this time using the optimal value
for both min and max for each object, and by applying the shortest-path algorithm as above.
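The consistency check can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: time points are numbered from 0, an edge (u, v, w) means "v - u <= w", and the three duration triples are those of the image, video and text used in the scheduling example of Section 4.2.

```python
import math


def shortest_distances(n, edges):
    """Floyd-Warshall over n time points; an edge (u, v, w) means v - u <= w."""
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def is_consistent(d):
    """A negative cycle through point i shows up as a negative d[i][i]."""
    return all(d[i][i] >= 0 for i in range(len(d)))


def edges_for(bounds):
    """bounds: list of (start, end, lo, hi) meaning lo <= end - start <= hi."""
    out = []
    for s, e, lo, hi in bounds:
        out += [(s, e, hi), (e, s, -lo)]
    return out


# Points 0, 1, 2 stand for p1, p2, p3; each entry is (start, end, (min, opt, max)).
triples = [(0, 1, (4, 6, 11)), (1, 2, (8, 10, 12)), (0, 2, (13, 16, 19))]

d = shortest_distances(3, edges_for([(s, e, t[0], t[2]) for s, e, t in triples]))
print(is_consistent(d))
# Earliest and latest time of each point relative to p1: -d[i][0] and d[0][i].
print([(-d[i][0], d[0][i]) for i in range(3)])

# Second pass with min = max = opt: do the optimal durations fit together?
d_opt = shortest_distances(3, edges_for([(s, e, t[1], t[1]) for s, e, t in triples]))
print(is_consistent(d_opt))
```

The triple loop is what gives the cubic cost in the number of time points; for the small graphs of a typical presentation this is unlikely to matter.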
4.2 Progressive Runtime Scheduling
In actual playback, a playback duration is chosen for each object that lies within the range of its possible
durations, based on the network conditions and the given constraints. The problem of determining an acceptable
start time and duration for each object is described here. The method is progressive in time: it resolves the absolute
start times as time advances, making it adaptive to the changing environment.
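Let us consider the presentation in Example 1. This presentation contains an image, a video, and a
text segment connected accordingly, each with its spring constants. The labels p1, p2, p3 indicate connection time
points. We'll assume p1=0 without loss of generality.
Example 1 (figure not reproduced; durations as used in the text): the image (4, 6, 11) runs from p1 to p2, the video (8, 10, 12) from p2 to p3, and the text (13, 16, 19) from p1 to p3.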
Suppose that the player first receives the image and the text at time p1 and renders them. If
there is no delay, the image will disappear after 6 seconds, the video will start, and it will end together with the
text at p3, which is 16 seconds after p1. But suppose the video arrives late, with a delay of 1 second. Since the max
duration of the image is 11, the image can stay for one more second until the video starts, so p2=7. The player then
attempts to resolve the durations of the video and the text. It does this using the constraints:
- p2 + 8 <= p3 <= p2 + 12;
- p1 + 13 <= p3 <= p1 + 19.
Assuming that p1=0 and knowing that p2=7, we obtain the tightest bounds on p3: 15 <= p3 <= 19.
With this we can recalculate the new minimum and maximum possible durations for both the video and
the text. If the newly defined min/max durations are such that the optimal duration falls out of range, it needs to be
reset to a feasible value, either the min or the max, whichever is closer to the previous optimal value. In this
case the optimal durations are still well within the range, so they need not be readjusted.
- video = (8, 10, 12)
- text = (15, 16, 19)
Finally, we can use these (new) optimal durations to determine the time at point p3. Note that
in our example the time at p3 can be either 17 (if the video is played for a preferred duration) or 16 (if the text
is rendered for a preferred duration). An appropriate choice can be made depending on the nature of the application.
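The arithmetic of this step can be reproduced with a short Python sketch. The helper names intersect and retime are hypothetical; the numbers are those of the example, with p1 = 0 and p2 = 7.

```python
def intersect(a, b):
    """Intersection of two closed intervals (lo, hi)."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("constraints cannot all be satisfied")
    return lo, hi


def retime(triple, start, end_range):
    """New (min, opt, max) for an object starting at `start` and ending in end_range."""
    lo, hi = end_range[0] - start, end_range[1] - start
    opt = max(lo, min(hi, triple[1]))   # keep opt, or move it to the nearer bound
    return lo, opt, hi


p1, p2 = 0, 7                           # p2 = 7 because the video arrived 1 s late
video, text = (8, 10, 12), (13, 16, 19)

# Each object that ends at p3 contributes one interval for p3.
p3_range = intersect((p2 + video[0], p2 + video[2]), (p1 + text[0], p1 + text[2]))
new_video = retime(video, p2, p3_range)
new_text = retime(text, p1, p3_range)

print(p3_range, new_video, new_text)
print(p2 + new_video[1], p1 + new_text[1])   # the two candidate times for p3
```

The last line yields the two candidate values for p3 discussed above, one per object that ends there.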
We can also define a cost function, or an error criterion, and formulate our problem as a
minimization problem of the objective function. Our general cost function can be:
C = (X_1 - opt_1)^2 + ... + (X_n - opt_n)^2
where X_i is the duration chosen for object i and opt_i is its optimal duration.
We can solve this minimization problem using the quadratic programming technique.
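For the single unknown p3 of Example 1 the minimization has a closed form, which the following Python sketch uses in place of a general quadratic programming solver. The function name is hypothetical, and the shortcut is valid only for this one-variable case; a presentation with several unresolved time points would need a real QP solver.

```python
def best_end_time(targets, lo, hi):
    """Minimize sum((x - t)**2 for t in targets) subject to lo <= x <= hi.

    With one unknown the cost is a convex parabola, so the constrained
    minimum is the mean of the targets clamped into [lo, hi].
    """
    mean = sum(targets) / len(targets)
    return max(lo, min(hi, mean))


# p3 would be 17 if the video kept its optimal duration and 16 if the text did;
# the feasible range for p3 is 15 to 19.
p3 = best_end_time([17, 16], 15, 19)
print(p3)
```

Under this cost function the player would split the difference between the two preferred end times rather than favour either object.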
Note that the absolute time of a label becomes known only when the playback of a particular media object has
started or finished. The resolution of time labels therefore proceeds progressively as
the presentation advances.
5. FLEXTIME IN MPEG-4
It is important to note that there are two categories of media objects in MPEG-4. The timing
and rendering of an MPEG-4 media object that uses an elementary stream, such as video or audio, is not determined by
the stream alone, but also by the corresponding nodes in the scene that are required for its presentation. In contrast,
the timing and rendering of an MPEG-4 media object that does not use a stream, such as text or a rectangle, is
determined only by the corresponding nodes and their timing.
The Flextime, or Advanced Synchronization, model is expressed in MPEG-4 using TemporalGroup and
TemporalTransform nodes in the scene. There is also a SegmentDescriptor to identify segments within media streams.
With these tools media objects, or segments thereof, can be connected via relationships and given flexible
durations.
The TemporalGroup node specifies the temporal relationship between the set of its children, the
TemporalTransforms, which represent the media objects. A TemporalGroup may itself be a child of another
TemporalGroup allowing nested expression of relationships. The TemporalGroup can examine the temporal properties of
its children, and when the temporal relationships are met, start and stop playback accordingly.
The relationships in the TemporalGroup are coStart, coEnd and meet fields. If coStart is true,
all child nodes are activated simultaneously and playback commences. When meet is true, the child nodes are
activated one after another in sequence; when one node ends and is deactivated, the next node in the list is
activated and started. When coEnd is true the child nodes are all deactivated together and the playback of the set
of media objects ends. A node may be a single media object in a TemporalTransform or a group of them under another
TemporalGroup.
The TemporalTransform provides for synchronization of nodes within the scene to a delivered media
stream, or a media stream segment, and also supports flexible transformation of scene time. The TemporalTransform
contains children nodes in the scene whose time base is to be made flexible. It supports slowing down, speeding up,
freezing or shifting of the scene time for the rendering of its children nodes.
The TemporalTransform node has fields that specify the flexible duration bounds and mode
preferences for any duration adjustment. The optimalDuration field specifies the optimal playback duration of the
objects that are controlled by this node. To specify minimum and maximum durations there is a scalability field,
which specifies the maximum ratios by which this object is allowed to shrink or stretch. If the preferred duration
is known, the ratio determines the absolute values of the minimum and maximum duration. Otherwise, for unknown
durations, the scalability field dictates the ratio by which the time bases controlled by the node are allowed to
scale. In the TemporalTransform the author specifies the preferred modes of stretching (e.g., linear stretching or
holding on to the last access unit) or shrinking (e.g., linear shrinking or stop rendering) using the stretchMode and
shrinkMode fields respectively.
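How ratios translate into absolute bounds when the optimal duration is known can be illustrated in Python. This sketch assumes that the two ratios are multiplicative factors applied to the optimal duration (for example 0.8 and 1.2); the exact encoding of the scalability field is not given in this paper, so both the function and its parameter names are hypothetical.

```python
def duration_bounds(optimal, max_shrink_ratio, max_stretch_ratio):
    """Return (min, max) durations from an optimal duration and two ratios.

    Assumption: a ratio r means the duration may become optimal * r, with
    max_shrink_ratio <= 1 <= max_stretch_ratio.
    """
    if not (0 < max_shrink_ratio <= 1 <= max_stretch_ratio):
        raise ValueError("expected 0 < shrink <= 1 <= stretch")
    return optimal * max_shrink_ratio, optimal * max_stretch_ratio


# A 10 s video allowed to shrink to 80% or stretch to 120% of its length.
print(duration_bounds(10, 0.8, 1.2))
```

Under this reading, a video with an optimal duration of 10 s and bounds of 8 s and 12 s corresponds to ratios of 0.8 and 1.2.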
An MPEG-4 player using the Flextime model can:
- Compensate for various network delays by supporting a timed wait, for the initial arrival of a stream, before the
player starts rendering/playing the node(s) associated with it.
- Compensate for various network delays by supporting a timed wait for the arrival of the stream segment.
- Synchronize multiple media/BIFS nodes with media streams of unknown length.
- Slow down or speed up the rendering/playback speed of portions of streams to re-adjust out-of-sync situations
caused by unknown length, uncontrolled arrival time or variation.
6. SMIL TIMING AND FLEXTIME
SMIL 2.0 [5] is a W3C recommendation for multimedia presentation. SMIL 2.0 defines a set of XML
modules where media spatial placement is specified using a layout construct, while temporal placement is specified
using time containers and synchronization arcs. The high level textual format in MPEG-4 re-uses the SMIL timing
module. That will be covered next but we provide here a focused look at SMIL timing, concentrating on the relative
timing and flexibility that is also part of the MPEG-4 textual format.
SMIL has the following two basic time containers, <par> and <seq>. The <par>
container plays its contained children in parallel and the <seq> container plays its children one after the
other in a sequence. There is a correspondence between the <par> and <seq> containers in SMIL and the
TemporalGroup in MPEG-4. A <par> corresponds to a TemporalGroup where its children are specified to CoStart
and CoEnd. A <seq> corresponds to a TemporalGroup where its children are specified to Meet.
To create a complete presentation in SMIL the <par> and <seq> time containers can be
nested as required.
Now in Flextime there is a bounded playback duration for a media object, represented by the
minimum and maximum durations. SMIL 2.0 has min and max attributes that can be applied to timed objects. Min and
max in SMIL likewise serve to bound the active duration to the specified minimum and maximum values.
The <par> and <seq> containers, plus the min and max attributes are part of the
fundamental basic timing in SMIL.
As a more advanced feature, SMIL also contains a Time Manipulations module that allows speed of
playback to be specified for time objects. There is a correspondence here to the speed field in the MPEG-4 Systems
TemporalTransform, which adjusts the in scene playback speed of media objects.
So both SMIL timing and MPEG-4 Flextime are expressions of relative timing rather than absolute
timing. SMIL timing has relationships expressed by containers and synchronization arcs; Flextime has relationships
expressed by a TemporalGroup node that contains children connected together by coStart, meet and coEnd
relationships.
7. XMT FLEXTIME REPRESENTATION
The XMT (eXtensible MPEG Textual) format is an XML representation of MPEG-4 Systems as defined by
MPEG. The XMT has two levels, XMT-A and XMT-O.
XMT-A: This is a one to one representation of the binary constructs that are defined in MPEG-4
Systems. Using XMT-A a presentation can be created that directly specifies each MPEG-4 tool that should form the
presentation. Each stream descriptor must be defined along with commands to create the nodes in the scene. At this
level TemporalGroup nodes and TemporalTransform nodes are directly coded to create the required flexible timing.
XMT-O: This is a higher level of abstraction for MPEG-4 that incorporates modules as defined by
SMIL, in particular here the timing module. The intent of XMT-O is to provide content interoperability and with
these higher-level constructs represent more of the author's intentions of what the presentation should be rather
than how it should be done. An analogy here might be that XMT-O, when compared to XMT-A, is somewhat like comparing
C++ to assembler. When writing in C++ the programmer can use higher-level constructs that can later be converted to
appropriate machine level code. This allows alternate conversions to machine code representations and alleviates
having to concern oneself with details of machine level code. In the same way an author in XMT-O can express
high-level constructs and later have them compiled to the appropriate MPEG-4 Systems binary format; this includes the
timing constructs.
In re-using SMIL the XMT-O includes the <par> and <seq> containers. From the
previous section we know these can be mapped to MPEG-4 Systems TemporalGroup, with any contained media objects as
children under TemporalTransform nodes. The TemporalTransform contains the corresponding min and max attribute
values, plus it also has the speed attribute.
XMT-O has one addition in the time manipulations area. This is an attribute called flexBehavior
through which the Flextime preferred modes of stretching and shrinking are specified. This attribute can be seen in
the example that follows.
8. EXAMPLE
The following example shows how flexible timing is specified in MPEG-4, using XMT-O, along with a
pictorial representation of the media objects and their connections. A scenario will then be presented that shows
how playback will operate in the presence of delay.
The example, in XMT-O, describes a sequence of two media objects, an image that is followed by a
video, a subset of the Example 1 we have used above. The sequence is bounded by minimum and maximum playback times,
as is each media object.
<seq min="13s" dur="16s" max="19s">
<image src="Image.jpg" min="4s" dur="6s" max="11s"
flexBehavior="hold; stop"/>
<video src="Video.m4v" min="8s" dur="10s" max="12s"
flexBehavior="linear; hold: linear; stop"/>
</seq>
This can be represented pictorially as follows. Using Flextime relationships to express the
connections we say that the Image and Video Meet. The minimum, optimal and maximum durations of the <seq>
container are represented by a null object that has a coStart connection to the Image and a coEnd connection to the
Video. The min, opt and max durations are depicted as the list of three numbers in parentheses e.g. (4, 6, 11) for
the Image.
To convert this into MPEG-4 nodes we can group the Image and the Video under one TemporalGroup
node with its meet field set true. The Image and Video will be children of two TemporalTransform nodes where the
min, opt and max durations will be specified for each. This TemporalGroup will then be contained in a second
TemporalGroup that has a single TemporalTransform representing, as a null media object, the <seq> min, opt, and
max delays. This second TemporalGroup will have both coStart and coEnd fields set true.
The expected behavior of this example is as follows. In the ideal case, where the media arrive
in time, both media play for their optimal durations so that the Image plays for 6 seconds followed by the Video for
10 seconds; the total playback duration being 16 seconds. Now if there are delays then there is the flexibility that
either or both the durations of the Image and Video can be altered within their respective bounds. Any adjustments
however must take into account the 13s minimum and 19s maximum of the sequence itself.
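The interplay between the children's bounds and those of the sequence can be sketched in Python. The function seq_range is a hypothetical helper: since children that Meet play back to back, their minimum and maximum durations add up, and the container's own bounds then cap the result.

```python
def seq_range(children, container):
    """Feasible total duration of a sequence of (min, opt, max) children.

    The sums of the children's minima and maxima are intersected with the
    container's own minimum and maximum.
    """
    lo = max(sum(c[0] for c in children), container[0])
    hi = min(sum(c[2] for c in children), container[2])
    if lo > hi:
        raise ValueError("children cannot satisfy the container bounds")
    return lo, hi


image, video, seq = (4, 6, 11), (8, 10, 12), (13, 16, 19)
print(seq_range([image, video], seq))
print(image[1] + video[1] == seq[1])   # do the optimal durations add up?
```

Here the children alone could span 12 s to 23 s, so it is the container that narrows the sequence to between 13 s and 19 s.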
Let's examine some scenarios using the example. Consider the case where the sequence starts on
time in the context of being a fragment of a larger presentation. The image is specified to play for an optimal
duration of 6s. At some point in time the transmission of the video will be initiated and the video data will start
to arrive. If the video arrives on time then 6s into the sequence, when the image should optimally end, the video's
playback can be started. This is the ideal situation where the data is available on time when needed.
Suppose though the video is late in arriving. To ensure there are no gaps in the presentation
the active duration of the image can be stretched; in this example we can stretch to a maximum duration of 11s. Now,
the video playback length is 10s and the overall sequence container's max duration must not exceed 19s. Thus if the
video arrives after 6s, but before 9s, the video can be played at normal speed and remain within the limits of the
max duration of the sequence. If the video arrives after 9s, but before 11s, the video can still be played but its
active duration must be shrunk, so that the 19s maximum is not exceeded, by either linearly increasing the playback
speed or by just simply truncating the video.
The methods for shrinking are listed in the preferred shrink modes of the flexBehavior attribute and
in this case are 'linear, stop'. The mode is chosen by the player according to the author's preference, as expressed
by that list, and according to the player's capability to shrink the particular media in those modes. So if the
image gets stretched to the max duration of 11s then the video can be played faster, in this case to 125% of normal
playback speed, so that it plays at its min duration of 8s in order that the sequence can finish in time. If the
player cannot speed up the video playback linearly, it can simply stop the video short as the author has expressed
'stop' as the second preferred mode.
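The late-arrival scenarios above can be summarized in a small Python sketch. The function play_sequence is hypothetical and models only the 'linear' shrink mode; it assumes the image is never shortened below its optimal duration and that the arrival time is measured in seconds from the start of the sequence.

```python
def play_sequence(video_arrival, image=(4, 6, 11), video=(8, 10, 12), seq_max=19):
    """Return (image_duration, video_duration, video_speed) for one arrival time."""
    # Hold the image for its optimal time, or longer if the video is late.
    image_dur = max(image[1], video_arrival)
    if image_dur > image[2]:
        raise ValueError("video arrived too late to avoid a gap")
    # Give the video its optimal duration unless the sequence maximum forbids it.
    video_dur = min(video[1], seq_max - image_dur)
    if video_dur < video[0]:
        raise ValueError("sequence maximum cannot be met")
    # Linear shrinking: play the optimal-length content in less time.
    return image_dur, video_dur, video[1] / video_dur


for arrival in (5, 8, 10, 11):
    print(arrival, play_sequence(arrival))
```

A player that cannot change playback speed would instead keep the speed at 1.0 and cut the video off after video_dur seconds, which is the 'stop' mode.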
So it can be seen that, despite the delays that occurred, the presentation made to the end user
was seamless with no discontinuities. The author chooses the flexibility for each media object as applicable, so
that there exists a range of playback possibilities that can accommodate delays. In the example the author
allowed the video to be stretched or shrunk; perhaps it could have been a panning view of a rainforest following an
image of an animal living there as part of some educational material on rainforest life. Altering the playback here
was acceptable to the author of this example.
In the above analysis of the video arriving late, it might not be clear why there is a
possibility of showing the image for a shorter duration than 6s. However, in the context of a larger presentation
earlier delays may now mean that the presentation is behind its optimal schedule. In this case the image duration
can be shrunk, as can that of the video, to meet any higher-level goals that may be expressed for the presentation.
For example there may be an overall par container with min and max values for the presentation. The entire
presentation is then only flexible within those upper limits.
9. REFERENCES
| [1] | Maintaining Knowledge about Temporal Intervals, J. Allen, CACM, 26(11), 1983. |
| [2] | Multimedia Documents with Elastic Time, M. Kim and J. Song, Proceedings of ACM Multimedia, 1995. |
| [3] | ISO/IEC. 14496-1:2001. Amendment 2. Text of 2nd Extension to the 2nd Edition of MPEG-4 Systems. FPDAM, Doc. ISO/MPEG N4268. Sydney MPEG Meeting, July 2001b, 2001 |
| [4] | Digital Video: an Introduction to MPEG-2, B. Haskell, A. Puri, A. Netravali, International Thomson Publishing, 1997. |
| [5] | W3C. Synchronized Multimedia Integration Language (SMIL 2.0). W3C Recommendation, http://www.w3.org/TR/smil20/. August 2001 |
| [6] | Temporal constraint networks, Rina Dechter et al., Artificial Intelligence 49, 1991. |
