CS 285 Lecture 4, Part 3.srt
1
00:00:01,280 --> 00:00:05,600
all right in the next portion of this lecture
2
00:00:03,840 --> 00:00:08,880
i'm going to introduce the notion of value functions
3
00:00:07,200 --> 00:00:11,840
which are very useful mathematical objects
4
00:00:10,480 --> 00:00:13,670
both for designing reinforcement learning algorithms
5
00:00:11,840 --> 00:00:17,920
and for conceptually thinking about the reinforcement learning objective
6
00:00:18,240 --> 00:00:23,110
so as i mentioned earlier
7
00:00:21,520 --> 00:00:28,000
the reinforcement learning objective can be defined as an expectation
8
00:00:25,110 --> 00:00:29,190
it's an expectation of a sum of rewards
9
00:00:28,000 --> 00:00:32,800
with respect to the trajectory distribution
10
00:00:30,080 --> 00:00:37,440
or equivalently a sum over time of the expected reward
11
00:00:33,840 --> 00:00:37,440
for every state action marginal
12
00:00:37,920 --> 00:00:47,920
now one of the things we could do with this expectation is we can actually write it out recursively
13
00:00:46,480 --> 00:00:51,280
so you know how we can apply the chain rule of probability to factorize the
14
00:00:49,600 --> 00:00:55,440
trajectory distribution as a product of many distributions
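For reference, the chain-rule factorization of the trajectory distribution that the lecture is referring to, in the notation of the earlier slides, is roughly:

```latex
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```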
15
00:00:52,710 --> 00:01:01,350
in the same way we can apply the chain rule and write out an expected value with respect to that distribution
16
00:00:58,710 --> 00:01:03,280
as a series of nested expectations
17
00:01:01,350 --> 00:01:09,280
so the outermost expectation here would be over p of s1
18
00:01:06,960 --> 00:01:12,000
inside of it we have an expected value with respect to a1
19
00:01:10,000 --> 00:01:14,790
distributed according to pi of a1 given s1
20
00:01:13,040 --> 00:01:19,040
and now since we have an expectation for both s1 and a1
21
00:01:16,080 --> 00:01:21,920
we can put in the first reward r of s1 comma a1
22
00:01:20,150 --> 00:01:26,400
and notice that this inner expectation the one over a1 is conditioned on s1
23
00:01:24,880 --> 00:01:28,000
i have a bunch of blank space here
24
00:01:26,400 --> 00:01:32,400
because i'm going to need to put in all the other rewards
25
00:01:29,200 --> 00:01:34,790
but we already have r of s1a1
26
00:01:32,400 --> 00:01:36,000
now we add to that all the other rewards
27
00:01:34,790 --> 00:01:37,750
but those require
28
00:01:36,000 --> 00:01:41,200
putting in another expectation now over
29
00:01:37,750 --> 00:01:43,750
s2 distributed according to p of s2 given s1 a1
30
00:01:41,680 --> 00:01:45,750
so this expectation is conditioned on s1 a1
31
00:01:43,750 --> 00:01:47,920
and inside of that
32
00:01:45,750 --> 00:01:49,750
we have another expectation over a2
33
00:01:47,920 --> 00:01:53,110
distributed according to pi of a2 given s2
34
00:01:50,470 --> 00:01:57,110
and now since we have both s2 and a2 we can put in r of s2 a2
35
00:01:54,000 --> 00:02:00,240
and then we add to that the expected value over s3
36
00:01:58,790 --> 00:02:02,000
inside of which is the expected value over a3
37
00:02:00,240 --> 00:02:05,360
inside of it is r of s3 a3 and so on and so on
38
00:02:04,880 --> 00:02:10,080
and we have these nested expectations
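Written out, the nested form being described looks roughly like this (a sketch in the lecture's notation):

```latex
E_{s_1 \sim p(s_1)}\Big[ E_{a_1 \sim \pi(a_1 \mid s_1)}\big[ r(s_1, a_1)
  + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\big[ E_{a_2 \sim \pi(a_2 \mid s_2)}[ r(s_2, a_2) + \dots ] \big] \,\big|\, s_1 \big] \Big]
```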
39
00:02:08,560 --> 00:02:13,760
now at first it kind of seems like we just wrote a very concise
40
00:02:11,760 --> 00:02:20,080
expected value of trajectories as a really really messy set of nested expectations
41
00:02:17,840 --> 00:02:23,920
but one thing that we could think about is well
42
00:02:21,440 --> 00:02:28,080
what if we had some function that told us
43
00:02:25,040 --> 00:02:32,160
the stuff that goes inside of the second expectation
44
00:02:30,160 --> 00:02:39,200
what if we had some function that tells us r of s1 comma a1 plus the expected value over s2 etc etc
45
00:02:36,000 --> 00:02:42,950
so what if we knew this part
46
00:02:39,200 --> 00:02:49,280
so let's uh define a symbol for this let's say that q of s1 comma a1
47
00:02:46,870 --> 00:02:54,870
is equal to r of s1 comma a1 plus the expectation over s2 of the expectation over a2 of r of s2 a2 etc
48
00:02:53,760 --> 00:02:57,440
so basically just this middle part
49
00:02:55,680 --> 00:03:00,640
the part that goes inside the second set of square brackets
50
00:02:58,800 --> 00:03:03,360
i'm just going to call that q of s1 comma a1
51
00:03:03,840 --> 00:03:11,840
then we can write our original rl objective as simply the expected value
52
00:03:08,800 --> 00:03:16,800
over s1 of the expected value over a1 of q of s1 comma a1
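As a sketch of the definition just described (writing the RL objective as J, a label not used in this clip):

```latex
Q(s_1, a_1) = r(s_1, a_1) + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\big[ E_{a_2 \sim \pi(a_2 \mid s_2)}[ r(s_2, a_2) + \dots ] \big],
\qquad
J = E_{s_1 \sim p(s_1)}\big[ E_{a_1 \sim \pi(a_1 \mid s_1)}[ Q(s_1, a_1) ] \big]
```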
53
00:03:15,200 --> 00:03:19,840
so it's just a little bit of symbolic manipulation and a little bit of definition
54
00:03:20,080 --> 00:03:26,790
but the important point about this definition is that
55
00:03:23,510 --> 00:03:32,150
if you knew q of s1 comma a1 then optimizing the policy at the first
56
00:03:29,510 --> 00:03:34,150
time step would be very easy
57
00:03:32,150 --> 00:03:37,510
so if you had access to q of s1 comma a1
58
00:03:35,200 --> 00:03:41,200
and you needed to select the policy pi of a1 given s1
59
00:03:39,040 --> 00:03:46,950
you would just select the policy for which this expected value is largest
60
00:03:44,400 --> 00:03:49,680
you could simply test every action and just
61
00:03:47,440 --> 00:03:56,560
assign 100 percent probability to the best one the one with the largest value for q
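A minimal sketch of this greedy selection for a discrete action space, assuming q_values is an array with one entry per action (hypothetical helper, not from the lecture):

```python
import numpy as np

def greedy_policy(q_values):
    """Return pi(a1 | s1) that puts probability 1 on the action
    with the largest value, assuming q_values[a] = Q(s1, a)."""
    pi = np.zeros(len(q_values))
    pi[np.argmax(q_values)] = 1.0
    return pi
```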
62
00:03:53,920 --> 00:04:00,560
so this basic idea can be extended to a more general concept
63
00:03:58,790 --> 00:04:02,310
so this is the simple rule
64
00:04:00,560 --> 00:04:04,000
that i said you know a simple way to get
65
00:04:02,310 --> 00:04:09,760
pi here is just assign a probability of one to the argmax
66
00:04:07,040 --> 00:04:10,790
so the more general principle is
67
00:04:09,760 --> 00:04:16,000
what we're going to call the q function
68
00:04:14,400 --> 00:04:21,680
so the q function can be defined at other time steps not just time step one
69
00:04:18,320 --> 00:04:25,360
and the definition is this
70
00:04:21,680 --> 00:04:30,400
q pi of st comma a t and i say q pi because it depends on pi
71
00:04:26,880 --> 00:04:37,120
q pi of st comma at is equal to the sum over all time steps from t
72
00:04:33,520 --> 00:04:42,880
until the end capital t of the expected value
73
00:04:38,630 --> 00:04:48,240
of the reward at that future time step
74
00:04:42,880 --> 00:04:48,240
conditioned on starting in st comma at.
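In symbols, the definition being given is:

```latex
Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}\big[ r(s_{t'}, a_{t'}) \mid s_t, a_t \big]
```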
75
00:04:49,440 --> 00:04:52,720
so what that means is basically if you
76
00:04:50,960 --> 00:04:55,910
start in st comma at and then roll out your policy
77
00:04:54,080 --> 00:05:00,240
for the rest of time this will be the expected sum of rewards
78
00:04:58,720 --> 00:05:05,750
a closely related quantity that we can also define is something called the value function
79
00:05:03,280 --> 00:05:09,190
the value function is defined in much the same way only it's conditioned
80
00:05:07,360 --> 00:05:10,960
only on a state rather than a state and action
81
00:05:09,190 --> 00:05:14,080
so the value function says if you start in state st
82
00:05:12,320 --> 00:05:17,190
and then roll out your policy what will
83
00:05:14,080 --> 00:05:19,280
be your expected total reward
84
00:05:17,190 --> 00:05:21,440
and the value function can also be written
85
00:05:19,280 --> 00:05:25,360
as the expected value over actions of the q function
86
00:05:22,160 --> 00:05:30,240
because if the q function tells you the expected total reward
87
00:05:27,600 --> 00:05:35,030
if you start in st comma a t then taking the expectation of that with respect to a t
88
00:05:32,720 --> 00:05:40,720
will give you the expected total reward if you start in st
89
00:05:38,400 --> 00:05:45,750
so now one observation we could make is the expectation of the value function
90
00:05:42,800 --> 00:05:49,030
at state s1 is the entirety of the reinforcement learning objective
91
00:05:47,440 --> 00:05:54,630
for the same reason that the expected value with respect to s1 and a1 of q of s1 comma a1
92
00:05:52,630 --> 00:05:56,960
was the rl objective on the previous slide
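Putting the last few statements together, in the same notation as above:

```latex
V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi}\big[ r(s_{t'}, a_{t'}) \mid s_t \big]
           = E_{a_t \sim \pi(a_t \mid s_t)}\big[ Q^\pi(s_t, a_t) \big],
\qquad
E_{s_1 \sim p(s_1)}\big[ V^\pi(s_1) \big] \text{ is the RL objective.}
```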
93
00:05:57,030 --> 00:06:02,400
okay so at this point i would like everyone to pause for a minute
94
00:06:00,800 --> 00:06:06,080
and think about these definitions of q functions and value functions
95
00:06:04,470 --> 00:06:11,280
you might want to flip back to the previous slide if something here is unclear
96
00:06:08,240 --> 00:06:12,720
take a moment to think about that and
97
00:06:11,280 --> 00:06:17,360
if something about these definitions is unclear please make sure to write a question in the comments
98
00:06:18,240 --> 00:06:24,080
all right let's continue
99
00:06:22,310 --> 00:06:26,630
so what are q functions and value functions good for
100
00:06:24,080 --> 00:06:29,600
well i provided some intuition for this a couple slides ago
101
00:06:28,160 --> 00:06:32,960
when i talked about how once you have a q function
102
00:06:31,190 --> 00:06:34,400
for at least the first time step
103
00:06:32,960 --> 00:06:36,960
you can recover a better policy for the first time step
104
00:06:37,910 --> 00:06:42,720
so one idea is that if we have a policy pi
105
00:06:40,080 --> 00:06:46,840
and if we can figure out its full q function q pi s comma a
106
00:06:45,910 --> 00:06:50,240
then we can improve pi
107
00:06:46,840 --> 00:06:56,160
for example we can pick a new policy pi prime that assigns a probability of 1
108
00:06:54,080 --> 00:06:57,440
to a given action if that action is the
109
00:06:56,160 --> 00:06:59,590
argmax of q pi s a
110
00:06:57,440 --> 00:07:03,360
and we can do this on not just the first time step
111
00:07:00,630 --> 00:07:05,120
but on all of the time steps and in fact
112
00:07:03,360 --> 00:07:08,960
we can show that this policy is at least as good as pi
113
00:07:06,160 --> 00:07:09,360
and probably better
114
00:07:08,960 --> 00:07:12,560
don't worry if it's not obvious to you right now why this is true
115
00:07:10,800 --> 00:07:15,520
we'll cover this in much more detail later
116
00:07:13,680 --> 00:07:19,520
but this is the basis of a class of methods called policy iteration algorithms
117
00:07:17,680 --> 00:07:22,560
which themselves can be used to derive q-learning algorithms
118
00:07:22,960 --> 00:07:30,080
and crucially it doesn't matter what pi is you can always improve it in this way
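A minimal tabular sketch of this improvement step, assuming q_pi is a (num_states, num_actions) array of Q^pi values (hypothetical names, not from the lecture):

```python
import numpy as np

def improve_policy(q_pi):
    """Policy improvement: for each state s, the new deterministic policy
    picks argmax_a Q^pi(s, a); this policy is at least as good as pi."""
    return np.argmax(q_pi, axis=1)
```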
119
00:07:27,750 --> 00:07:35,440
another idea which we will use in the next lecture when we talk about policy gradients
120
00:07:33,120 --> 00:07:37,840
is you can use this to compute a gradient
121
00:07:36,160 --> 00:07:43,030
to increase the probability of a good action a
122
00:07:40,000 --> 00:07:50,630
so the intuition is that if q pi s a is larger than v of s
123
00:07:47,120 --> 00:07:50,630
then a is better than average
124
00:07:52,400 --> 00:08:00,080
because remember that the v pi of s is just the expected value of q pi s a under pi of a given s
125
00:07:59,190 --> 00:08:05,280
by this definition v pi of s is how you will do on average
126
00:08:03,190 --> 00:08:08,310
when you use your policy from state s
127
00:08:06,470 --> 00:08:10,000
so if you can do better than average
128
00:08:08,310 --> 00:08:12,240
if you can choose an action a
129
00:08:10,000 --> 00:08:13,750
so that q pi s a is larger than v pi of s
130
00:08:12,240 --> 00:08:17,280
then you will do better you'll do better than average
131
00:08:14,800 --> 00:08:18,630
under your old policy so one thing you
132
00:08:17,280 --> 00:08:20,720
could do is you could modify
133
00:08:18,630 --> 00:08:24,560
pi of a given s to increase the probability of actions
134
00:08:22,870 --> 00:08:27,360
whose value under the q function is larger
135
00:08:24,560 --> 00:08:28,720
than the value at that state
136
00:08:27,360 --> 00:08:34,950
and you can actually use this to get a gradient based update rule on pi
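The better-than-average quantity being described here is commonly written as an advantage (standard terminology, not named in this clip):

```latex
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t),
\qquad
A^\pi(s_t, a_t) > 0 \;\Rightarrow\; a_t \text{ is better than average under } \pi.
```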
137
00:08:30,630 --> 00:08:36,710
these ideas are very important in rl
138
00:08:34,950 --> 00:08:39,200
and we'll revisit them again and again in the next few lectures
139
00:08:38,000 --> 00:08:42,470
when we talk about model free reinforcement learning algorithms
140
00:08:43,360 --> 00:08:51,040
all right so in the anatomy of the reinforcement learning algorithm
141
00:08:48,800 --> 00:08:57,440
the green box is typically where you would use or where you would learn
142
00:08:55,680 --> 00:08:58,880
q functions or value functions
143
00:08:57,440 --> 00:09:02,240
so q functions and value functions fundamentally are objects that evaluate
144
00:09:00,560 --> 00:09:04,560
how good your policy currently is
145
00:09:02,240 --> 00:09:07,760
so you would typically fit them or learn them in the green box
146
00:09:05,830 --> 00:09:12,480
and then use them in the blue box to improve the policy