Outlook | Temp (F) | Humidity (%) | Windy? | Class |
sunny | 75 | 70 | true | Play |
sunny | 80 | 90 | true | Don't Play |
sunny | 85 | 85 | false | Don't Play |
sunny | 72 | 95 | false | Don't Play |
sunny | 69 | 70 | false | Play |
overcast | 72 | 90 | true | Play |
overcast | 83 | 78 | false | Play |
overcast | 64 | 65 | true | Play |
overcast | 81 | 75 | false | Play |
rain | 71 | 80 | true | Don't Play |
rain | 65 | 70 | true | Don't Play |
rain | 75 | 80 | false | Play |
rain | 68 | 80 | false | Play |
rain | 70 | 96 | false | Play |
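For reference, here is a minimal sketch of the dataset encoded directly as Python records (the field names simply follow the column headers above; nothing beyond the table itself is assumed):

```python
from collections import Counter

# Weather dataset from the table above, one dict per row.
weather_data = [
    {"outlook": "sunny",    "temp_f": 75, "humidity": 70, "windy": True,  "play": True},
    {"outlook": "sunny",    "temp_f": 80, "humidity": 90, "windy": True,  "play": False},
    {"outlook": "sunny",    "temp_f": 85, "humidity": 85, "windy": False, "play": False},
    {"outlook": "sunny",    "temp_f": 72, "humidity": 95, "windy": False, "play": False},
    {"outlook": "sunny",    "temp_f": 69, "humidity": 70, "windy": False, "play": True},
    {"outlook": "overcast", "temp_f": 72, "humidity": 90, "windy": True,  "play": True},
    {"outlook": "overcast", "temp_f": 83, "humidity": 78, "windy": False, "play": True},
    {"outlook": "overcast", "temp_f": 64, "humidity": 65, "windy": True,  "play": True},
    {"outlook": "overcast", "temp_f": 81, "humidity": 75, "windy": False, "play": True},
    {"outlook": "rain",     "temp_f": 71, "humidity": 80, "windy": True,  "play": False},
    {"outlook": "rain",     "temp_f": 65, "humidity": 70, "windy": True,  "play": False},
    {"outlook": "rain",     "temp_f": 75, "humidity": 80, "windy": False, "play": True},
    {"outlook": "rain",     "temp_f": 68, "humidity": 80, "windy": False, "play": True},
    {"outlook": "rain",     "temp_f": 70, "humidity": 96, "windy": False, "play": True},
]

# Sanity check on the class distribution: 9 "Play" and 5 "Don't Play".
print(Counter(row["play"] for row in weather_data))  # Counter({True: 9, False: 5})
```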
The diagram below shows a gridworld domain in which the agent starts at the upper left location. The upper and lower rows are both "one-way streets," since only the actions shown by arrows are available. Actions that attempt to move the agent into a wall (the outer borders, or the thick black wall between all but the leftmost cell of the top and bottom rows) leave the agent in the same state with probability 1 and have reward -2. If the agent tries to move right from the upper right or lower right location, it is teleported with probability 1 to the far left end of the corresponding row, with the reward marked on the diagram. All other actions have the expected effect (moving up, down, left, or right) with probability .9, and leave the agent in the same state with probability .1. These actions all have reward -1, except for the transitions that are marked. (Note that a marked transition gives the indicated reward only if the action succeeds in moving the agent in that direction.)
(a) MDP (10 pts)
Give the MDP for this domain, but only for the state transitions starting from each of the states in the top row, by filling in a state-action-state transition table (showing only the state transitions with non-zero probability). Refer to each state by its row and column index, so the upper left state is [1,1] and the lower right state is [2,4]. To get you started, here are the first few lines of the table:
State s | Action a | New state s' | p(s'|s,a) | r(s,a,s') |
[1,1] | Up | [1,1] | 1.0 | -2 |
[1,1] | Right | [1,1] | 0.1 | -1 |
[1,1] | Right | [1,2] | 0.9 | +20 |
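The table maps naturally onto a data structure. Here is a minimal sketch that encodes only the three sample rows above as a dictionary from (state, action) pairs to lists of (next state, probability, reward) outcomes; the remaining rows of your answer would be added in the same format:

```python
# Transition model keyed by (state, action); each value lists the possible
# outcomes as (next_state, probability, reward) triples.  Only the three
# sample rows from the table above are encoded here.
transitions = {
    ((1, 1), "Up"):    [((1, 1), 1.0, -2)],
    ((1, 1), "Right"): [((1, 1), 0.1, -1),
                        ((1, 2), 0.9, +20)],
}

# The outcome probabilities for each (state, action) pair must sum to 1.
for (s, a), outcomes in transitions.items():
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9, (s, a)
```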
(b) Value function (8 pts)
Suppose the agent follows a randomized policy π (where each available action in any given state has equal probability) and uses a discount factor of γ=.85. Given the partial value function (Vπ; Uπ in Russell & Norvig's terminology) shown below, fill in the missing Vπ values. Show and explain your work.
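For reference, the missing entries satisfy the Bellman expectation equation for a fixed policy; with π uniform over the actions available in each state and γ = .85, this is the relation to solve for each unknown value:

```latex
% Bellman expectation equation for a fixed policy \pi with discount \gamma.
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
             \left[ r(s, a, s') + \gamma \, V^{\pi}(s') \right]
```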
(c) Policy (7 pts)
Given the value function Vπ computed in (b), what new policy π' would policy iteration produce at the next iteration? Show your answer as a diagram (arrows on the grid) or as a state-action table.
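For reference, a minimal sketch of the policy-improvement step in Python (it assumes a transitions dictionary in the (state, action) -> [(next state, probability, reward), ...] format sketched in part (a), and a dict V of the Vπ values from part (b); both names are placeholders):

```python
# One-step policy improvement: for each state, choose the action that maximizes
# the expected one-step return under the current value function V.
def improved_policy(transitions, V, gamma=0.85):
    # Collect the actions available in each state from the transition model.
    actions_in = {}
    for (s, a) in transitions:
        actions_in.setdefault(s, []).append(a)

    # Q^pi(s, a) computed by a one-step lookahead on V.
    def q(s, a):
        return sum(p * (r + gamma * V[s_next])
                   for s_next, p, r in transitions[(s, a)])

    # Greedy policy: pi'(s) = argmax_a Q^pi(s, a).
    return {s: max(acts, key=lambda a: q(s, a)) for s, acts in actions_in.items()}
```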
(a) Iterated Prisoner's Dilemma (IPD) (4 pts)
 | C | D |
C | 3,3 | 0,5 |
D | 5,0 | 1,1 |
(b) Rock-Paper-Scissors (4 pts)
Also called "Roshambo," each player chooses to present one of three objects: rock, paper, or scissors. Rock breaks (beats) scissors; paper covers (beats) rock; scissors cuts (beats) paper. Nobody wins (it's a tie) if both players pick the same object.
 | R | P | S |
R | 0,0 | -1,+1 | +1,-1 |
P | +1,-1 | 0,0 | -1,+1 |
S | -1,+1 | +1,-1 | 0,0 |
(c) Chicken (4 pts)
Two drivers are headed for a one-lane bridge. If they both swerve out of the way, they "tie" (nobody scores). If one swerves and the other drives straight on, the "chicken" loses a point and the gutsy driver gains a point. If neither swerves, they both lose big.
 | Straight | Swerve |
Straight | -10,-10 | +1,-1 |
Swerve | -1,+1 | 0,0 |
State the Nash equilibria of your game and explain why they are equilibria. Also indicate which strategy profiles maximize social welfare in your game. Will rational players maximize social welfare in your game?
Do you think that the "tit-for-tat" strategy, which has been used successfully with the IPD, would work well for a player in your game? Why or why not?
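For checking the Nash-equilibrium question, here is a minimal sketch of a brute-force search for pure-strategy equilibria of a two-player game, illustrated with the Chicken payoffs from part (c) (the function and variable names are just placeholders):

```python
# Brute-force search for pure-strategy Nash equilibria of a two-player game.
# payoffs[(row_strategy, col_strategy)] = (row player's payoff, column player's payoff).
def pure_nash_equilibria(payoffs, row_strats, col_strats):
    equilibria = []
    for r in row_strats:
        for c in col_strats:
            u_row, u_col = payoffs[(r, c)]
            # A profile is an equilibrium if neither player has a profitable
            # unilateral deviation.
            row_ok = all(payoffs[(r2, c)][0] <= u_row for r2 in row_strats)
            col_ok = all(payoffs[(r, c2)][1] <= u_col for c2 in col_strats)
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

# Chicken, as given in part (c).
chicken = {
    ("Straight", "Straight"): (-10, -10),
    ("Straight", "Swerve"):   (+1, -1),
    ("Swerve",   "Straight"): (-1, +1),
    ("Swerve",   "Swerve"):   (0, 0),
}
print(pure_nash_equilibria(chicken, ["Straight", "Swerve"], ["Straight", "Swerve"]))
# -> [('Straight', 'Swerve'), ('Swerve', 'Straight')]
```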