While generalized minimax search is an efficient procedure
for successively generating optimal price tables of greater
and greater depth, it suffers from some theoretical
inconsistencies. First, the generation of an n-step
optimal price table for a given seller is based on the
assumption that the opponent will be using an (n-1)-step
policy. This is problematic if one believes that the
opponent is using a search algorithm of equal depth.
Furthermore, there is no way that both sellers
could be correct in assuming that the opponent
is using lesser-depth search. Second, there is the problem
that successive steps in the predicted trajectory are
based on progressively shallower depth search, until finally at
the end of the trajectory, the predicted pricing is myopic.
Such predicted trajectories are unlikely to match actual
trajectories, as it seems reasonable to assume that agents
will use a given fixed depth search every time they are
called upon to set a price.
As a way of overcoming these potential problems, we suggest
that generalizing the formalism of Dynamic Programming from
single-agent Markov decision problems to
two-player arbitrary-sum games is a promising area of research.
Such an approach could
lead to the development of algorithms that yield optimal
policies for both players that are fully accurate and
self-consistent. Furthermore, DP-like approaches can be
extended to large state spaces in which it is not feasible
to use lookup tables for all possible states in the state
space. This has been shown by numerous works in the field
of Reinforcement Learning, in which the lookup tables of DP
are replaced by compact state representations and nonlinear
function approximators, and the full sweeps through the state
space of DP are replaced by following actual trajectories
generated by agent policies.
Within the context of the price-quality and information-filtering
models mentioned previously, we have investigated a number
of heuristic generalizations of DP, which attempt to do
simultaneous, self-consistent optimization of the optimal
price curves for both sellers. So far we have only examined
the case of two-step lookahead, i.e., each seller optimizes
two-step utility and also models the other seller as optimizing
two-step utility. Within this context, we have
examined a simple alterating policy iteration approach
in which we start with
initial curves
and
and alternately optimize
and
assuming that the
other one is held fixed. Empirically, this approach was not
found to converge to a unique, self-consistent solution.
We have also looked at an asynchronous stochastic version
of the above algorithm, in which one element at a time
is chosen at random from each optimal price table, and only
that element is optimized. This approach was not found to
converge, either.
However, we have found empirical convergence with an approach
based on an incremental version of DP, in which
the prices are moved towards the computed optimal prices
by small amounts. (This somewhat resembles a gradient-style
minimization of the amount of inconsistency between the two
optimal price curves at each time step.) When the asynchronous,
stochastic algorithm mentioned above was combined with
incremental price adjustments, it quickly converged in the
price-quality model to the exact same curves (apart from very
small random fluctuations of less than 1%) as obtained by
3 or more lookahead steps in the generalized minimax procedure.
A plot of the DP-generated optimal price curves is shown below
in figure 5.
We note that visually this plot appears identical
to the lower right-hand plot in figure 2, and leads to
fixed-point dynamics rather than cyclic price wars.
Figure 5: Two-step optimal prices for seller 1 vs. seller 2
calculated by stochastic, asynchronous, incremental DP
in price-quality model.
The same stochastic, asynchronous, incremental DP
algorithm was tested in the information-filtering model, and
when the number of consumers in the simulation was large
(10,000 to 250,000) it gave similarly good approximate
convergence within small fluctuations to stationary
optimal price curves. However, with only 1000 consumers
in the model, the fluctuations were larger and it was not
entirely clear whether the algorithm was converging to a
stationary solution. We conjecture that this behavior may be
due to differing degrees of smoothness in the utility
landscapes. The analytically-defined landscapes in the
price-quality model are very smooth,
and are also smooth in the filtering model
as the number of consumers becomes infinitely large.
However, the landscapes become much more ragged with a
small finite-size consumer population, and such raggedness
may limit the ability of the DP-like approach to converge to
a stationary solution.
An example of results using the DP-like approach
in the information-filtering model is shown in figure 6.
This figure plots the two-step optimal price curves for
seller 1 and seller 2 for the same numerically generated
utilities as were used in figures 3 and 4.
We do find good approximate convergence to stationary curves
in this model. The small random fluctuations are somewhat larger
than those seen in figure 5; most likely
this is due to a coarser resolution in the optimal price tables
(.005 vs. .002). In this particular model, we can see that
the DP-generated optimal price curves still lead to a price war
that actually has a slightly larger amplitude than in the myopic case
shown previously in figure 3.
Hence there are at least some cases in which the use of lookahead
in the sellers' pricing strategy can actually exacerbate the price-war
dynamics. Determining under what conditions price wars are
amplified by lookahead, and under what conditions they are
diminished or eliminated, is an important topic for further research.
Figure 6: Two-step optimal prices for seller 1 vs. seller 2
calculated by stochastic, asynchronous, incremental DP
in the same information-filtering model as shown in
figures 3 and 4.