
Markov decision process

In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s;[1] a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes.[2] They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov as they are an extension of Markov chains.

At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward R_a(s, s').


The probability that the process moves into its new state s' is influenced by the chosen action. Specifically, it is given by the state transition function P_a(s, s'). Thus, the next state s' depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.


Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process reduces to a Markov chain.

A Markov decision process is a 4-tuple (S, A, P_a, R_a), where:

S is a set of states called the state space,

A is a set of actions called the action space (alternatively, A_s is the set of actions available from state s),

P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s' at time t + 1,

R_a(s, s') is the immediate reward (or expected immediate reward) received after transitioning from state s to state s', due to action a.
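For concreteness, the following is a minimal Python sketch of how such a 4-tuple can be stored and sampled for a small finite MDP; the two states, two actions, and all probability and reward values are invented purely for illustration.

```python
import random

# Hypothetical two-state, two-action MDP stored as a 4-tuple (S, A, P, R).
S = ["s0", "s1"]                      # state space
A = ["wait", "move"]                  # action space
# P[a][s][s2] = probability that action a in state s leads to state s2
P = {
    "wait": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.7, "s1": 0.3}},
}
# R[a][s][s2] = immediate reward for moving from s to s2 under action a
R = {
    "wait": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": -1.0, "s1": 5.0}, "s1": {"s0": 2.0, "s1": -1.0}},
}

def step(s, a):
    """Sample the next state from P_a(s, .) and return (next state, reward)."""
    next_states = list(P[a][s])
    probs = [P[a][s][s2] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[a][s][s_next]

print(step("s0", "move"))
```

Because step looks only at the current state s and the chosen action a, trajectories generated this way satisfy the Markov property described above.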

Computational complexity

Algorithms for finding optimal policies with time complexity polynomial in the size of the problem representation exist for finite MDPs. Thus, decision problems based on MDPs are in computational complexity class P.[11] However, due to the curse of dimensionality, the size of the problem representation is often exponential in the number of state and action variables, limiting exact solution techniques to problems that have a compact representation. In practice, online planning techniques such as Monte Carlo tree search can find useful solutions in larger problems, and, in theory, it is possible to construct online planning algorithms that can find an arbitrarily near-optimal policy with no computational complexity dependence on the size of the state space.[12]
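The polynomial-time guarantees cited above rest on linear-programming formulations; in practice, finite MDPs are also commonly solved by dynamic programming. The sketch below shows one such method, value iteration, written for the nested-dictionary representation used in the earlier example; the discount factor and tolerance are arbitrary illustration values, and this is not the specific polynomial-time algorithm referenced.

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Compute an (approximately) optimal value function and greedy policy
    for a finite MDP given as nested dictionaries of the form used above."""
    V = {s: 0.0 for s in S}
    while True:
        # Bellman optimality update: best expected reward plus discounted value.
        V_new = {
            s: max(
                sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in S)
                for a in A
            )
            for s in S
        }
        diff = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if diff < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {
        s: max(A, key=lambda a: sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                                    for s2 in S))
        for s in S
    }
    return V, policy

# Usage, with S, A, P, R defined as in the earlier sketch:
# V, policy = value_iteration(S, A, P, R)
```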

In the theory of learning automata, a stochastic automaton consists of:

a set x of possible inputs,

a set Φ = { Φ1, ..., Φs } of possible internal states,

a set α = { α1, ..., αr } of possible outputs, or actions, with r ≤ s,

an initial state probability vector p(0) = ≪ p1(0), ..., ps(0) ≫,

a computable function A which after each time step t generates p(t + 1) from p(t), the current input, and the current state, and

a function G: Φ → α which generates the output at each time step.
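A rough Python sketch of such an automaton follows; the linear reward-style update standing in for A, the output map G, and the "reward" input are all invented for illustration, and serve only to show how p(t + 1) is generated from p(t), the current input, and the current state.

```python
import random

# Hypothetical stochastic automaton with s = 3 internal states and
# r = 2 outputs (actions), so that r <= s.
states = ["phi1", "phi2", "phi3"]                           # internal states Φ
G = {"phi1": "alpha1", "phi2": "alpha2", "phi3": "alpha2"}  # output map G: Φ -> α
p = {"phi1": 1.0, "phi2": 0.0, "phi3": 0.0}                 # initial probability vector p(0)

def A(p, x, current_state, lam=0.1):
    """Illustrative linear reward-style update: if the input x signals a
    reward, shift probability mass toward the state that was just active;
    otherwise leave p unchanged.  (A stand-in for a real update rule.)"""
    if x != "reward":
        return dict(p)
    return {phi: prob + lam * (1.0 - prob) if phi == current_state
            else prob * (1.0 - lam)
            for phi, prob in p.items()}

# One time step: sample the current state from p(t), emit G(state),
# read an input x, and form p(t + 1).
state = random.choices(states, weights=[p[phi] for phi in states])[0]
output = G[state]
x = "reward"                                                # made-up input signal
p = A(p, x, state)
print(output, p)
```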

A continuous-time Markov decision process is specified by the following components (in one common notation):

S: state space;

A: action space;

q(s' | s, a): transition rate function, giving the rate at which the process jumps from state s to state s' when action a is applied;

R(s, a): a reward function, giving the reward rate accrued while the process is in state s under action a.
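As an illustration of these components, the sketch below simulates a tiny continuous-time model by drawing exponentially distributed holding times from the transition rate function; the states, actions, rates, reward rates, and policy are all made up, and this is only one plausible reading of the components listed above.

```python
import random

# Hypothetical continuous-time MDP with two states and two actions.
S = ["up", "down"]                       # state space
A = ["repair", "idle"]                   # action space
# q[a][s][s2] = rate of jumping from s to s2 (s2 != s) under action a
q = {
    "repair": {"up": {"down": 0.1}, "down": {"up": 2.0}},
    "idle":   {"up": {"down": 0.5}, "down": {"up": 0.2}},
}
# r[a][s] = reward accrued per unit time while in state s under action a
r = {"repair": {"up": 1.0, "down": -2.0}, "idle": {"up": 1.0, "down": -1.0}}

def simulate(s, policy, horizon=10.0):
    """Simulate one trajectory: hold in each state for an exponentially
    distributed time (rate = total outflow rate), accrue reward, then jump."""
    t, total_reward = 0.0, 0.0
    while t < horizon:
        a = policy(s)
        rates = q[a][s]
        total_rate = sum(rates.values())
        holding = min(random.expovariate(total_rate), horizon - t)
        total_reward += r[a][s] * holding
        t += holding
        if t < horizon:
            s = random.choices(list(rates), weights=list(rates.values()))[0]
    return total_reward

# Example policy: repair when the system is down, otherwise idle.
print(simulate("up", policy=lambda s: "repair" if s == "down" else "idle"))
```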

Constrained Markov decision processes (CMDPs) are extensions to Markov decision processes (MDPs). There are three fundamental differences between MDPs and CMDPs:[17]

There are multiple costs incurred after applying an action instead of one.

CMDPs are solved with linear programs only, and dynamic programming does not work.

The final policy depends on the starting state.


The method of Lagrange multipliers applies to CMDPs. Many Lagrangian-based algorithms have been developed.
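A rough sketch of the Lagrangian idea, under several assumptions: the constrained cost is folded into the reward with a multiplier λ, the resulting unconstrained MDP is solved (here by reusing the value_iteration helper sketched earlier), and λ is adjusted by projected dual ascent. The cost table C, the budget, the step size, and the iteration count are illustrative choices, not a particular published algorithm.

```python
def policy_value(S, P, X, policy, gamma=0.9, tol=1e-8):
    """Expected discounted value of a table X[a][s][s2] (reward or cost)
    under a fixed deterministic policy, computed by iteration."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: sum(P[policy[s]][s][s2] * (X[policy[s]][s][s2] + gamma * V[s2])
                        for s2 in S)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

def lagrangian_cmdp(S, A, P, R, C, s0, budget, gamma=0.9, lr=0.5, iters=50):
    """Illustrative dual ascent: solve the unconstrained MDP with reward
    R - lam * C, then raise lam if the policy's discounted cost from the
    start state s0 exceeds the budget (and lower it otherwise)."""
    lam, policy = 0.0, None
    for _ in range(iters):
        # Lagrangian reward table R - lam * C.
        R_lam = {a: {s: {s2: R[a][s][s2] - lam * C[a][s][s2] for s2 in S}
                     for s in S} for a in A}
        _, policy = value_iteration(S, A, P, R_lam, gamma=gamma)
        cost = policy_value(S, P, C, policy, gamma=gamma)[s0]
        lam = max(0.0, lam + lr * (cost - budget))   # projected dual ascent on λ
    return policy, lam

# Usage (with a hypothetical cost table C shaped like R):
# policy, lam = lagrangian_cmdp(S, A, P, R, C, s0="s0", budget=1.0)
```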


There are a number of applications for CMDPs; they have recently been used in motion planning scenarios in robotics.[19]

Bellman, R. E. (2003) [1957]. Dynamic Programming (Dover paperback ed.). Princeton, NJ: Princeton University Press. ISBN 978-0-486-42809-3.

Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Vol. 2. MA: Athena.

Derman, C. (1970). Finite state Markovian decision processes. Academic Press.

Feinberg, E. A.; Shwartz, A., eds. (2002). Handbook of Markov Decision Processes. Boston, MA: Kluwer. ISBN 9781461508052.

Guo, X.; Hernández-Lerma, O. (2009). Continuous-Time Markov Decision Processes. Stochastic Modelling and Applied Probability. Springer. ISBN 9783642025464.

Meyn, S. P. (2007). Control Techniques for Complex Networks. Cambridge University Press. ISBN 978-0-521-88441-9. Archived from the original on 19 June 2010. The appendix contains an abridged "Meyn & Tweedie", archived from the original on 18 December 2012.

Puterman, M. L. (1994). Markov Decision Processes. Wiley.

Ross, S. M. (1983). Introduction to Stochastic Dynamic Programming (PDF). Academic Press.

Sutton, R. S.; Barto, A. G. (2017). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.

Tijms, H. C. (2003). A First Course in Stochastic Models. Wiley. ISBN 9780470864289.

Learning to Solve Markovian Decision Processes, by Satinder P. Singh.