This assignment introduces the idea of Markov Decision Processes (MDPs) as a way of formalizing what it means to make optimal decisions in probabilistic domains. MDPs also generalize the idea of a single goal state to rewards, positive or negative, that can accumulate at various states.

Take the grid world from last week’s Colab notebook (or another domain if you have something you really prefer).

Add some amount of probabilistic behavior and reward to this environment and model it as a Markov Decision Process (MDP). See this week’s Colab notebook for an example of a betting game modeled as an MDP.

For example: maybe the environment is slippery, and actions sometimes don’t have the desired effects. Maybe some squares give negative reward some percentage of the time (traps?). Maybe all squares give negative reward some percentage of the time (meteorite?). Maybe some walls are electrified? Etc.
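For instance, a "slippery" environment is often modeled by having the intended move succeed most of the time and veer perpendicular otherwise. A minimal sketch, assuming a 4x4 grid, an 80/10/10 split, and function names of my own choosing:

```python
# Illustrative "slippery" transition model: the intended action succeeds with
# probability 0.8; with probability 0.1 each, the agent slips to one of the
# two perpendicular directions. Bumping a wall leaves the agent in place.

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERPENDICULAR = {
    "up": ("left", "right"), "down": ("left", "right"),
    "left": ("up", "down"), "right": ("up", "down"),
}
SIZE = 4  # assumed grid size

def clamp(state, delta):
    """Apply a move, staying inside the grid (hitting a wall = stay put)."""
    r, c = state
    dr, dc = delta
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

def slip_outcomes(state, action, p_slip=0.2):
    """Return {successor_state: probability} for one slippery action."""
    a, b = PERPENDICULAR[action]
    probs = {}
    for act, p in [(action, 1 - p_slip), (a, p_slip / 2), (b, p_slip / 2)]:
        succ = clamp(state, ACTIONS[act])
        probs[succ] = probs.get(succ, 0.0) + p  # merge outcomes that coincide
    return probs
```

Note that outcomes are merged: slipping "up" from the top-left corner hits a wall, so that probability mass folds back onto the current square.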

Write down how this would be modeled as an MDP:

• States
• Actions in each state
• Transition function, i.e., the probability that an action in a state produces a given successor state
• Reward function, i.e., which transitions produce a reward, and how much?

You don’t have to code this, just model the problem.

Do you have a guess as to what the optimal value function and policy should look like?
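One way to check such a guess is to run value iteration on a toy version of the problem. A sketch, assuming a 4-cell corridor with a terminal goal on the right, a -0.04 step cost, and discount 0.9 (all of these numbers are illustrative):

```python
# Value iteration on an assumed 4-cell corridor MDP. Cell 3 is terminal
# (reward +1 for entering it); every other transition costs -0.04.

GAMMA = 0.9          # discount factor (assumed)
N = 4                # cells 0..3; cell 3 is terminal
ACTIONS = [-1, +1]   # move left / move right

def step(s, a):
    """Deterministic move, clipped to the corridor."""
    return min(max(s + a, 0), N - 1)

def value_iteration(iters=100):
    """Return (V, policy) after iterating the Bellman optimality update."""
    V = [0.0] * N  # V[N-1] stays 0: terminal state
    for _ in range(iters):
        for s in range(N - 1):
            qs = []
            for a in ACTIONS:
                s2 = step(s, a)
                r = 1.0 if s2 == N - 1 else -0.04
                qs.append(r + GAMMA * V[s2])
            V[s] = max(qs)
    policy = [
        max(ACTIONS, key=lambda a: (1.0 if step(s, a) == N - 1 else -0.04)
                                    + GAMMA * V[step(s, a)])
        for s in range(N - 1)
    ]
    return V, policy
```

In this toy case the greedy policy points right everywhere and the values increase toward the goal, which is the shape you should expect your guess to take.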

Format: 12 pt Times New Roman, single-spaced, as long as needed.

MDP modeling exercise