In prior work (Kephart et al., 1998; Sairamesh and Kephart, 1998)
it has been shown that
in large-scale economies of software agents, the potential
exists for unending cycles of disastrous competitive
``wars'' in price/product space. Such price and niche wars
are disastrous not only for the sellers in such economies
(e.g. information sellers in an information filtering
economy), but they can also be disastrous for the consumers
as well. Price wars have the potential to be
more rampant in agent economies than what we have observed
in human economies due to a number of differences between
agent and human economic players. Among such differences
are: (1) greater ability of humans to predict long-term
consequences of their price-setting actions; (2) reduced
frictional effects such as consumer inertia in agent economies;
and (3) reduced localization effects due
to much greater connectivity offered by the Internet.
In this paper, we focus specifically on the use of foresight
or predictive capability by software agents to anticipate
retaliatory responses that will be made by other agents.
Our intuition is that this is the most important factor in
models studied so far that generate price wars.
In (Kephart et al., 1998), it was shown
that price-war phenomena are generally obtained in agent
economies when the agents can do a good job of optimizing
their immediate short-term reward or utility, but have no
capability to predict system behavior beyond the current
time step. Such ``myopic optimal'' agents may be tempted
to engage in price-undercutting behavior to obtain an immediate
short-term advantage. However, if the other agents retaliate by
further undercutting, then the agent may be worse off in the
long term than if it had kept its price at a high level
and not attempted any undercutting.
Our intuition is that if agents can be endowed with some
sort of predictive capability that could anticipate
longer-term consequences of current pricing behavior,
then this could allow the agents to realize the futility
of undercutting. Implementing foresight mechanisms
on a large scale throughout an agent economy might then
lead to radically different system behavior,
and price wars might be greatly reduced or eliminated.
In considering algorithmic approaches to agent foresight,
one set of issues that arises has to do with which type of
algorithms will enable agents to make the most accurate
predictions, given the constraints of a limited information set
and limited computational resources.
An additional set of issues has to do with what sort of collective
system consequences result when a significant fraction of the
agents base their actions on predictions. Will a society of
machine learners converge to something
like a game-theoretic solution (for example, a Nash equilibrium),
or will they just endlessly
chase one another's tails? The former question has been
studied, for example, in (Milgrom and Roberts, 1991) and
(Foster and Vohra, 1997), while the latter issues are beginning
to be addressed, for example, in
(Hu and Wellman, 1996) and (Vidal and Durfee, 1998).
In the present work, we are primarily concerned with issues
of the depth and accuracy of agent lookahead. That is to say,
how far ahead in time can an agent reliably predict those aspects
of the future system behavior that are relevant to their
current decisions, and how much lookahead is necessary to
avoid pathological behavior? Is shallow lookahead sufficient,
or is it necessary for agents to engage in deep lookahead in
order to avoid price wars? Likewise, we would like to know
how accurately an agent can predict, and how much accuracy is
needed to avoid pathological behavior? It may be the case that
only a coarse prediction is needed to avoid price wars. (For
example, will any other agent set a price lower than mine on
the next time step?)
Our proposed algorithms for agent foresight are designed to
avoid the classic problem of infinite recursion of opponent
models. That is to say, when modeling other agents, one needs to
take into account the fact that those other agents are themselves
using models of other agents, and that those models need to take into
account that the other agents are using models, etc..
This can lead not only to logical problems in setting up the
agent models, but also to greatly increasing levels of
computational complexity with the depth of recursion.
For example, in the work of (Vidal and Durfee, 1998), a recursive
modeling scheme is proposed in which 0-level agents do no
opponent modeling, 1-level agents model the other agents as
being 0-level agents, 2-level agents model the other agents
as being 1-level agents, etc.. In this scheme, the computational
requirements greatly increase with the level of modeling,
and furthermore, there is no adequate way for an agent to
model other agents as being at the same level of depth.
We consider two basic heuristic approaches to avoiding an infinite
recursion of opponent models. The first approach is adapted
from the domain of two-player zero-sum games such as chess, in which
full-width minimax search to a fixed finite depth has been found
to be an effective algorithm. In this case, the infinite
recursion is cut off by the finite depth of the search.
Minimax search is only guaranteed to find optimal moves if
the search goes all the way to the end of the game; however,
it does seem to work well in practice for searches of lesser
depth. For example, the chess machines Deep Thought and
Deep Blue give the impression of generating sophisiticated
positional understanding as an emergent property from deep
searches plus simple positional knowledge built into the
evaluation function.
The second approach that we explore is to adapt algorithms
from the fields of Dynamic Programming (DP) and
Reinforcement Learning (RL); such algorithms have been found
to work well for single agents in stationary Markov environments
(i.e. Markov Decision Problems, or MDPs).
The basic idea of DP/RL is ``Policy Iteration'' (Bertsekas, 1995),
in which one starts with an initial policy, computes the value function
induced by that policy, and then computes an improved policy
that is greedy with respect to that value function. Policy
iteration is guaranteed to converge to the optimal agent policy
for single-agent MDPs. Recently there has been some work
generalizing DP-type algorithms to two-player Markov games.
For example, (Littman, 1994) introduced an algorithm called
minimax-Q for two-player zero-sum games in which the players
alternately take turns moving, which is guaranteed to converge
to the optimal policies for both players. Unfortunately,
we can't directly use minimax-Q in agent economies because
the agent utilities are not strictly zero-sum. We investigate
in this paper several heuristic DP-like approaches which
can be used in arbitrary-sum games.
As a general caveat, we should point out that both classes
of algorithms mentioned above seek deterministic
optimal policies. (Here ``optimal'' is used in the game-theoretic
sense of the best worst-case behavior against all possible
opponent strategies.) However, it may the case that such
deterministic policies do not exist. For example, in the
game of rock-paper-scissors,
any deterministic policy can be defeated, and the best policies
are non-deterministic and cannot be computed by these methods.