Notes of Multi-agent deep reinforcement learning: a survey

Citation:
Gronauer, Sven, and Klaus Diepold. “Multi-Agent Deep Reinforcement Learning: A Survey.” Artificial Intelligence Review 55.2 (2022): 895–943. Print.

Links:
Notes of Multi-agent deep reinforcement learning: a survey (1)
Notes of Multi-agent deep reinforcement learning: a survey (2)
Notes of Multi-agent deep reinforcement learning: a survey (3)
...

Introduction

  • What is a multi-agent system?
    A multi-agent system describes multiple distributed entities—so-called agents—which
    take decisions autonomously and interact within a shared environment (Weiss 1999).

  • How do agents work?
    Each agent seeks to accomplish an assigned goal, which may require a broad set of skills to build intelligent behavior. Depending on the task, an intricate interplay between agents can emerge, in which agents collaborate or compete to outperform their opponents.

  • Why RL?
    Specifying intelligent behavior a priori through programming is a tough, if not impossible, task for complex systems. Therefore, agents must be able to adapt and learn over time by themselves.

  • RL history
    The ML epoch
    Stone and Veloso (2000)
    analyzed multi-agent systems from a machine learning perspective and classified the reviewed literature according to heterogeneous and homogeneous agent structures as well as communication skills.
    discussed issues associated with each classification.

    Shoham et al. (2003)
    criticized the problem statement of MARL as ill-posed and, in the authors’ opinion, unclear, and called for more grounded research.
    proposed a coherent research agenda which includes four directions for future research.

    Yang and Gu (2004)
    reviewed algorithms and pointed out that the main difficulty lies in the generalization to continuous action and state spaces and in the scaling to many agents.

    Busoniu et al. (2008)
    presented selected algorithms and discussed benefits as well as challenges of MARL. Benefits include computational speed-ups and the possibility of experience sharing between agents. In contrast, drawbacks are the specification of meaningful goals, the non-stationarity of the environment, and the need for coherent coordination in cooperative games.
    posed challenges such as the exponential increase of computational complexity with the number of agents and the alter-exploration problem, where agents must balance the acquisition of new knowledge against the exploitation of current knowledge.

    Matignon et al. (2012b)
    identified challenges for the coordination of independent learners that arise in fully cooperative Markov Games, such as non-stationarity, stochasticity, and shadowed equilibria.
    analyzed conditions under which algorithms can address such coordination issues.

    Tuyls and Weiss (2012)
    accounted for the historical developments of MARL and evoked non-technical challenges.
    criticized that the intersection of RL techniques and game theory dominates multi-agent learning, which may render the scope of the field too narrow and limit investigations to simplistic problems such as grid worlds.
    claimed that scalability to large numbers of agents and to large, continuous spaces is the holy grail of this research domain.

    The DL epoch
    Nguyen et al. (2020)
    presented five technical challenges including non-stationarity, partial observability, continuous spaces, training schemes, and transfer learning.
    discussed possible solution approaches alongside their practical applications.
    Hernandez-Leal et al. (2019)
    concentrated on four categories including the analysis of emergent behaviors, learning communication, learning cooperation, and agent modeling.
    Further survey literature focuses on one particular sub-field of MADRL:
    Oroojlooyjadid and Hajinezhad (2019)
    reviewed recent works in the cooperative setting.
    Da Silva and Costa (2019) and Da Silva et al. (2019)
    focused on knowledge reuse.
    Lazaridou and Baroni (2020)
    reviewed the emergence of language and connected two perspectives, which comprise the conditions under which language evolves in communities and the ability to solve problems through dynamic communication.
    Zhang et al. (2019)
    focused on MARL algorithms and presented challenges from a mathematical perspective.

Background

  • What is traditional reinforcement learning problem?
    The traditional reinforcement learning problem (Sutton and Barto 1998) is concerned with learning a control policy that optimizes a numerical performance measure by making decisions in stages.

  • MDP definition
    Markov decision process (MDP)
    A Markov decision process is formalized by the tuple {(X,U,P,R,\gamma)}
    where
    {X} and {U} are the state and action space
    {P:X \times U \rightarrow P(X)} is the transition function describing the probability of a state transition
    {R:X \times U \times X \rightarrow \mathbb{R}} is the reward function providing immediate feedback to the agent
    {\gamma \in [0,1)} describes the discount factor
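
    As a concrete illustration of the tuple above, the following is a minimal sketch (my own toy construction, not taken from the survey): it builds a small random MDP and solves it with value iteration. All sizes, the random transition and reward tables, and the choice of solver are assumptions made purely for illustration.

```python
# Minimal MDP sketch (toy assumption, not from the survey):
# X and U are small finite sets, P[x, u] is a distribution over next states,
# R[x, u, x'] is the immediate reward, gamma is the discount factor.
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (|X|, |U|, |X|)
R = rng.standard_normal((n_states, n_actions, n_states))          # shape (|X|, |U|, |X|)

# Value iteration: V(x) <- max_u sum_x' P(x'|x,u) * [R(x,u,x') + gamma * V(x')]
V = np.zeros(n_states)
for _ in range(200):
    Q = np.einsum("xus,xus->xu", P, R + gamma * V[None, None, :])
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy policy derived from the converged values
print("V* =", V, "policy =", policy)
```

    The deep RL methods covered by the survey replace such explicit tables with function approximators, but the objective being optimized is the same.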

  • Single-agent RL Optimization process
    Notes for Multi-agent deep reinforcement learning: a survey (extra:Model)

  • MG definition
    Markov Games (MG)
    The Markov Game is an extension of the MDP and is formalized by the tuple {(N, X, \{U^i\}, P, \{R^i\}, \gamma)}
    where
    {N=\{1,...,n\}} denotes the set of {n > 1} interacting agents
    X is the set of states observed by all agents
    {U=\{U^1,U^2,\ldots,U^n\}} denotes the joint action space of agents {i\in N}
    {P:X \times U \rightarrow P(X)} is the transition probability function
    {R=\{R^1,R^2,\ldots,R^n\}} denotes the reward functions of the agents {i\in N}
    {\gamma \in [0,1)} describes the discount factor
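
    To make the difference from the single-agent MDP explicit, here is a minimal sketch of a two-agent Markov Game (my own toy construction, not code from the survey): the transition probabilities and each agent's reward now depend on the joint action. The tensor shapes, the random values, and the helper `step` are illustrative assumptions.

```python
# Toy two-agent Markov Game (illustrative assumption, not from the survey).
import numpy as np

n_agents, n_states, n_actions = 2, 4, 2
rng = np.random.default_rng(1)

# P[x, u1, u2] is a distribution over next states x';
# R[i, x, u1, u2] is the reward of agent i for that transition.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
R = rng.standard_normal((n_agents, n_states, n_actions, n_actions))

def step(x, joint_action):
    """Sample x' ~ P(.|x, u^1, u^2) and return the per-agent rewards."""
    u1, u2 = joint_action
    x_next = rng.choice(n_states, p=P[x, u1, u2])
    return x_next, R[:, x, u1, u2]

x = 0
x, rewards = step(x, (0, 1))
print("next state:", x, "rewards per agent:", rewards)
```

    How the individual reward functions {R^i} relate to each other determines the taxonomy discussed next.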

  • Multi-agent RL Optimization process
    Notes for Multi-agent deep reinforcement learning: a survey (extra:Model)

  • MARL category
    Based on the taxonomy of the reward structure (see the toy classification sketch after this list):

    1. Fully cooperative setting
      All agents receive the same reward {R = R^1 = \cdots = R^n} for state transitions. In such an equally-shared reward setting, agents are motivated to collaborate and to avoid individual failures in order to maximize the team's performance. More generally, we speak of cooperative settings when agents are encouraged to collaborate but do not receive an equally-shared reward.
    2. Fully competitive setting
      Such a problem is described as a zero-sum Markov Game where the sum of rewards equals zero for any state transition, i.e. {\sum^{n}_{i=1}R^i(x,u,x')=0}. Each agent seeks to maximize its own individual reward while minimizing the rewards of the others. In a loose sense, we refer to competitive games when agents are encouraged to outperform their opponents but the sum of rewards does not equal zero.
    3. Mixed setting
      Also known as a general-sum game, the mixed setting is neither fully cooperative nor fully competitive and, thus, does not impose restrictions on the agents' goals.
    Based on the taxonomy of the information available to the agents:
    4. Independent learners
      The agent ignores the existence of other agents and cannot observe the rewards and selected actions of others.
    5. Joint-action learners
      The agent observes the actions taken by all other agents a posteriori.
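
    The sketch referenced above is a toy check of the reward-structure taxonomy (my own illustration, not code from the survey): identical rewards give the fully cooperative case, rewards that sum to zero give the fully competitive case, and everything else is mixed. The function name `classify_rewards` and the example tables are assumptions.

```python
# Toy classification of reward tables by the taxonomy above
# (illustrative assumption, not from the survey).
import numpy as np

def classify_rewards(R):
    """R has shape (n_agents, n_transitions): R[i, t] is agent i's reward on transition t."""
    if np.allclose(R, R[0]):             # R^1 = ... = R^n on every transition
        return "fully cooperative"
    if np.allclose(R.sum(axis=0), 0.0):  # sum_i R^i(x, u, x') = 0 (zero-sum)
        return "fully competitive"
    return "mixed (general-sum)"

shared   = np.tile([1.0, -0.5, 2.0], (3, 1))     # three agents, identical rewards
zero_sum = np.array([[1.0, -2.0], [-1.0, 2.0]])  # two agents, rewards cancel out
general  = np.array([[1.0,  0.0], [ 0.5, 0.5]])  # neither restriction holds
for name, rewards in [("shared", shared), ("zero_sum", zero_sum), ("general", general)]:
    print(name, "->", classify_rewards(rewards))
```
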
  • MARL challenges

    1. Agents update their policies during the learning process.
      Results in: Non-stationarity (a toy illustration follows this list)
      A single agent faces a moving target problem when the transition probability function changes
      {P(x'|x,u,\pi^1,...,\pi^n)\neq P(x'|x,u,\overline{\pi}^1,...,\overline{\pi}^n),}

      due to the co-adaptation {\pi^i \neq \overline{\pi}^i \ \ \ \exists i \in N} of agents.

    2. Agents converge to sub-optimal solutions or can get stuck between different solutions.
      Results in: Pareto-selection problem
      Shadowed equilibrium definition:
      A joint policy {\overline{\pi}} is shadowed by another joint policy {\hat{\pi}} in a state x if and only if
      {V_{\pi^i,\overline{\pi}^{-i}}(x) < \min_{j,\pi^j}V_{\pi^j,\hat{\pi}^{-j}}(x)\ \ \ \exists i,\pi^i}

      Furthermore, in cooperative settings where each agent's action performs relatively well when paired with arbitrary actions from the other agents,
      Results in: Relative overgeneralization
      A sub-optimal Nash equilibrium in the joint action space is preferred over the optimal solution.

    3. For complex systems, complete information might not be perceivable.
      Results in: Partial observability
      POMG definition:
      Partially observable Markov Games
      The POMG is mathematically denoted by the tuple
      {(N,X,\{U^i\},\{\sigma^i\},P,\{R^i\},\gamma)}

      where
      {N:} the set of agents, {N=\{1,...,n\}}
      {X:} the state space, {X=\{x_1,x_2,\ldots\}}
      {U=\{U^1,U^2,\ldots,U^n\}:} the joint action space of agents {i\in N}
      {\sigma = \{\sigma^1,\sigma^2,\ldots,\sigma^n\}:} the observation spaces of agents {i\in N}
      {P:} the transition probability function
      {R=\{R^1,R^2,\ldots,R^n\}:} the reward functions of the agents {i\in N}
      {\gamma \in [0,1):} the discount factor

      In a cooperative task with a shared reward function, the POMG is known as a decentralized partially observable Markov decision process (Dec-POMDP).
      The history of interactions becomes meaningful
      {\Rightarrow} Inferring good policies becomes more complex.
      {\Rightarrow} Agents usually employ history-dependent policies
      {\pi^i_t:\{\sigma^i\}_{t > 0}\rightarrow P(u^i)}

    4. In the fully cooperative setting with a shared reward function, an individual agent cannot deduce the impact of its own actions on the team's success.
      Results in: Credit assignment problem
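
    The toy illustration referenced in challenge 1 (my own construction, not code from the survey): from agent 1's point of view, the effective transition kernel is obtained by marginalizing out agent 2's policy, so it shifts whenever agent 2 updates its policy, which is exactly the moving-target problem stated above. The helper `effective_transition` and all numbers are assumptions.

```python
# Non-stationarity / moving-target illustration (toy assumption, not from the survey).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))  # P[x, u1, u2] -> dist over x'

def effective_transition(pi2):
    """Agent 1's view: P(x'|x, u1) = sum_{u2} pi2(u2|x) * P(x'|x, u1, u2)."""
    return np.einsum("xabs,xb->xas", P, pi2)

pi2_before = np.full((n_states, n_actions), 0.5)  # agent 2: uniform policy
pi2_after  = np.tile([0.9, 0.1], (n_states, 1))   # agent 2 after learning: prefers action 0

P1_before = effective_transition(pi2_before)
P1_after  = effective_transition(pi2_after)
print("max change in agent 1's transition kernel:",
      np.max(np.abs(P1_before - P1_after)))       # > 0: the environment moved under agent 1
```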


Title: Notes of Multi-agent deep reinforcement learning: a survey (1)
Author: Departure
URL: https://www.unreachablecity.club/articles/2024/09/20/1726841213866.html