CS 285 Lecture 4, Part 3.srt
1
00:00:01,280 --> 00:00:05,600
all right in the next portion of this lecture
2
00:00:03,840 --> 00:00:08,880
i'm going to introduce the notion of value functions
3
00:00:07,200 --> 00:00:11,840
which are very useful mathematical objects
4
00:00:10,480 --> 00:00:13,670
both for designing reinforcement learning algorithms
5
00:00:11,840 --> 00:00:17,920
and for conceptually thinking about the reinforcement learning objective
6
00:00:18,240 --> 00:00:23,110
so as i mentioned earlier
7
00:00:21,520 --> 00:00:28,000
the reinforcement learning objective can be defined as an expectation
8
00:00:25,110 --> 00:00:29,190
it's an expectation of a sum of rewards
9
00:00:28,000 --> 00:00:32,800
with respect to the trajectory distribution
10
00:00:30,080 --> 00:00:37,440
or equivalently a sum over time of the expected reward
11
00:00:33,840 --> 00:00:37,440
for every state action marginal
12
00:00:37,920 --> 00:00:47,920
now one of the things we could do with this expectation is we can actually write it out recursively
13
00:00:46,480 --> 00:00:51,280
so you know how we can apply the chain rule of probability to factorize the
14
00:00:49,600 --> 00:00:55,440
trajectory distribution as a product of many distributions
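For reference, the chain-rule factorization of the trajectory distribution that the lecture is referring to, in the notation of the earlier slides, is roughly:

```latex
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```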
15
00:00:52,710 --> 00:01:01,350
in the same way we can apply the chain rule and write out an expected value with respect to that distribution
16
00:00:58,710 --> 00:01:03,280
as a series of nested expectations
17
00:01:01,350 --> 00:01:09,280
so the outermost expectation here would be over p of s1
18
00:01:06,960 --> 00:01:12,000
inside of it we have an expected value with respect to a1
19
00:01:10,000 --> 00:01:14,790
distributed according to pi of a1 given s1
20
00:01:13,040 --> 00:01:19,040
and now since we have an expectation for both s1 and a1
21
00:01:16,080 --> 00:01:21,920
we can put in the first reward r of s1 comma a1
22
00:01:20,150 --> 00:01:26,400
and notice that this inner expectation the one over a1 is conditioned on s1
23
00:01:24,880 --> 00:01:28,000
i have a bunch of blank space here
24
00:01:26,400 --> 00:01:32,400
because i'm going to need to put in all the other rewards
25
00:01:29,200 --> 00:01:34,790
but we already have r of s1a1
26
00:01:32,400 --> 00:01:36,000
now we add to that all the other rewards
27
00:01:34,790 --> 00:01:37,750
but those require
28
00:01:36,000 --> 00:01:41,200
putting in another expectation now over
29
00:01:37,750 --> 00:01:43,750
s2 distributed according to p of s2 given s1 a1
30
00:01:41,680 --> 00:01:45,750
so this expectation is conditioned on s1 a1
31
00:01:43,750 --> 00:01:47,920
and inside of that
32
00:01:45,750 --> 00:01:49,750
we have another expectation over a2
33
00:01:47,920 --> 00:01:53,110
distributed according to pi of a2 given s2
34
00:01:50,470 --> 00:01:57,110
and now since we have both s2 and a2 we can put in r of s2 a2
35
00:01:54,000 --> 00:02:00,240
and then we add to that the expected value over s3
36
00:01:58,790 --> 00:02:02,000
inside of which is the expected value over a3
37
00:02:00,240 --> 00:02:05,360
inside of it is r of s3 a3 and so on and so on
38
00:02:04,880 --> 00:02:10,080
and we have these nested expectations
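Written out, the nested form being described looks roughly like this (a sketch in the lecture's notation):

```latex
E_{s_1 \sim p(s_1)}\Big[ E_{a_1 \sim \pi(a_1 \mid s_1)}\big[ r(s_1, a_1)
  + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\big[ E_{a_2 \sim \pi(a_2 \mid s_2)}[ r(s_2, a_2) + \dots ] \big] \,\big|\, s_1 \big] \Big]
```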
39
00:02:08,560 --> 00:02:13,760
now at first it kind of seems like we just wrote a very concise
40
00:02:11,760 --> 00:02:20,080
expected value of trajectories as a really really messy set of nested expectations
41
00:02:17,840 --> 00:02:23,920
but one thing that we could think about is well
42
00:02:21,440 --> 00:02:28,080
what if we had some function that told us
43
00:02:25,040 --> 00:02:32,160
the stuff that goes inside of the second expectation
44
00:02:30,160 --> 00:02:39,200
what if we had some function that tells us r of s1 comma a1 plus the expected value over s2 etc etc
45
00:02:36,000 --> 00:02:42,950
so what if we knew this part
46
00:02:39,200 --> 00:02:49,280
so let's uh define a symbol for this let's say that q of s1 comma a1
47
00:02:46,870 --> 00:02:54,870
is equal to r of s1 comma a1 plus the expectation over s2 of the expectation over a2 of r of s2 a2 etc
48
00:02:53,760 --> 00:02:57,440
so basically just this middle part
49
00:02:55,680 --> 00:03:00,640
the part that goes inside the second set of square brackets
50
00:02:58,800 --> 00:03:03,360
i'm just going to call that q of s1 comma a1
51
00:03:03,840 --> 00:03:11,840
then we can write our original rl objective as simply the expected value
52
00:03:08,800 --> 00:03:16,800
over s1 of the expected value over a1 of q of s1 comma a1
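As a sketch of the definition just described (writing the RL objective as J, a label not used in this clip):

```latex
Q(s_1, a_1) = r(s_1, a_1) + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\big[ E_{a_2 \sim \pi(a_2 \mid s_2)}[ r(s_2, a_2) + \dots ] \big],
\qquad
J = E_{s_1 \sim p(s_1)}\big[ E_{a_1 \sim \pi(a_1 \mid s_1)}[ Q(s_1, a_1) ] \big]
```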
53
00:03:15,200 --> 00:03:19,840
so it's just a little bit of symbolic manipulation and a little bit of definition
54
00:03:20,080 --> 00:03:26,790
but the important point about this definition is that
55
00:03:23,510 --> 00:03:32,150
if you knew q of s1 comma a1 then optimizing the policy at the first
56
00:03:29,510 --> 00:03:34,150
time step would be very easy
57
00:03:32,150 --> 00:03:37,510
so if you had access to q of s1 comma a1
58
00:03:35,200 --> 00:03:41,200
and you needed to select the policy pi of a1 given s1
59
00:03:39,040 --> 00:03:46,950
you would just select the policy for which this expected value is largest
60
00:03:44,400 --> 00:03:49,680
you could simply test every action and just
61
00:03:47,440 --> 00:03:56,560
assign 100 percent probability to the best one the one with the largest value for q
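A minimal sketch of this greedy selection for a discrete action space, assuming q_values is an array with one entry per action (hypothetical helper, not from the lecture):

```python
import numpy as np

def greedy_policy(q_values):
    """Return pi(a1 | s1) that puts probability 1 on the action
    with the largest value, assuming q_values[a] = Q(s1, a)."""
    pi = np.zeros(len(q_values))
    pi[np.argmax(q_values)] = 1.0
    return pi
```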
62
00:03:53,920 --> 00:04:00,560
so this basic idea can be extended to a more general concept
63
00:03:58,790 --> 00:04:02,310
so this is the simple rule
64
00:04:00,560 --> 00:04:04,000
that i said you know a simple way to get
65
00:04:02,310 --> 00:04:09,760
pi here is just assign a probability of one to the argmax
66
00:04:07,040 --> 00:04:10,790
so the more general principle is
67
00:04:09,760 --> 00:04:16,000
what we're going to call the q function
68
00:04:14,400 --> 00:04:21,680
so the q function can be defined at other time steps not just time step one
69
00:04:18,320 --> 00:04:25,360
and the definition is this
70
00:04:21,680 --> 00:04:30,400
q pi of st comma a t and i say q pi because it depends on pi
71
00:04:26,880 --> 00:04:37,120
q pi of st comma at is equal to the sum over all time steps from t
72
00:04:33,520 --> 00:04:42,880
until the end capital t of the expected value
73
00:04:38,630 --> 00:04:48,240
of the reward at that future time step
74
00:04:42,880 --> 00:04:48,240
conditioned on starting in st comma at.
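In symbols, the definition being given is:

```latex
Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}\big[ r(s_{t'}, a_{t'}) \mid s_t, a_t \big]
```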
75
00:04:49,440 --> 00:04:52,720
so what that means is basically if you
76
00:04:50,960 --> 00:04:55,910
start in st comma at and then roll out your policy
77
00:04:54,080 --> 00:05:00,240
for the rest of time this will be the expected sum of rewards
78
00:04:58,720 --> 00:05:05,750
a closely related quantity that we can also define is something called the value function
79
00:05:03,280 --> 00:05:09,190
the value function is defined in much the same way only it's conditioned
80
00:05:07,360 --> 00:05:10,960
only on a state rather than a state and action
81
00:05:09,190 --> 00:05:14,080
so the value function says if you start in state st
82
00:05:12,320 --> 00:05:17,190
and then roll out your policy what will
83
00:05:14,080 --> 00:05:19,280
be your expected total reward
84
00:05:17,190 --> 00:05:21,440
and the value function can also be written
85
00:05:19,280 --> 00:05:25,360
as the expected value over actions of the q function
86
00:05:22,160 --> 00:05:30,240
because if the q function tells you the expected total reward
87
00:05:27,600 --> 00:05:35,030
if you start in st comma a t then taking the expectation of that with respect to a t
88
00:05:32,720 --> 00:05:40,720
will give you the expected total reward if you start in st
89
00:05:38,400 --> 00:05:45,750
so now one observation we could make is the expectation of the value function
90
00:05:42,800 --> 00:05:49,030
at state s1 is the entirety of the reinforcement learning objective
91
00:05:47,440 --> 00:05:54,630
for the same reason that the expected value with respect to s1 and a1 of q of s1 comma a1
92
00:05:52,630 --> 00:05:56,960
was the rl objective on the previous slide
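Putting the last few statements together, in the same notation as above:

```latex
V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi}\big[ r(s_{t'}, a_{t'}) \mid s_t \big]
           = E_{a_t \sim \pi(a_t \mid s_t)}\big[ Q^\pi(s_t, a_t) \big],
\qquad
E_{s_1 \sim p(s_1)}\big[ V^\pi(s_1) \big] \text{ is the RL objective.}
```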
93
00:05:57,030 --> 00:06:02,400
okay so at this point i would like everyone to pause for a minute
94
00:06:00,800 --> 00:06:06,080
and think about these definitions of q functions and value functions
95
00:06:04,470 --> 00:06:11,280
you might want to flip back to the previous slide if something here is unclear
96
00:06:08,240 --> 00:06:12,720
take a moment to think about that and
97
00:06:11,280 --> 00:06:17,360
if something about these definitions is unclear please make sure to write a question in the comments
98
00:06:18,240 --> 00:06:24,080
all right let's continue
99
00:06:22,310 --> 00:06:26,630
so what are q functions and value functions good for
100
00:06:24,080 --> 00:06:29,600
well i provided some intuition for this a couple slides ago
101
00:06:28,160 --> 00:06:32,960
when i talked about how once you have a q function
102
00:06:31,190 --> 00:06:34,400
for at least the first time step
103
00:06:32,960 --> 00:06:36,960
you can recover a better policy for the first time step
104
00:06:37,910 --> 00:06:42,720
so one idea is that if we have a policy pi
105
00:06:40,080 --> 00:06:46,840
and if we can figure out its full q function q pi s comma a
106
00:06:45,910 --> 00:06:50,240
then we can improve pi
107
00:06:46,840 --> 00:06:56,160
for example we can pick a new policy pi prime that assigns a probability of 1
108
00:06:54,080 --> 00:06:57,440
to a given action if that action is the
109
00:06:56,160 --> 00:06:59,590
argmax of q pi s a
110
00:06:57,440 --> 00:07:03,360
and we can do this on not just the first time step
111
00:07:00,630 --> 00:07:05,120
but on all of the time steps and in fact
112
00:07:03,360 --> 00:07:08,960
we can show that this policy is at least as good as pi
113
00:07:06,160 --> 00:07:09,360
and probably better
114
00:07:08,960 --> 00:07:12,560
don't worry if it's not obvious to you right now why this is true
115
00:07:10,800 --> 00:07:15,520
we'll cover this in much more detail later
116
00:07:13,680 --> 00:07:19,520
but this is the basis of a class of methods called policy iteration algorithms
117
00:07:17,680 --> 00:07:22,560
which themselves can be used to derive q-learning algorithms
118
00:07:22,960 --> 00:07:30,080
and crucially it doesn't matter what pi is you can always improve it in this way
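A minimal tabular sketch of this improvement step, assuming q_pi is a (num_states, num_actions) array of Q^pi values (hypothetical names, not from the lecture):

```python
import numpy as np

def improve_policy(q_pi):
    """Policy improvement: for each state s, the new deterministic policy
    picks argmax_a Q^pi(s, a); this policy is at least as good as pi."""
    return np.argmax(q_pi, axis=1)
```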
119
00:07:27,750 --> 00:07:35,440
another idea which we will use in the next lecture when we talk about policy gradients
120
00:07:33,120 --> 00:07:37,840
is you can use this to compute a gradient
121
00:07:36,160 --> 00:07:43,030
to increase the probability of a good action a
122
00:07:40,000 --> 00:07:50,630
so the intuition is that if q pi s a is larger than v of s
123
00:07:47,120 --> 00:07:50,630
then a is better than average
124
00:07:52,400 --> 00:08:00,080
because remember that the v pi of s is just the expected value of q pi s a under pi of a given s
125
00:07:59,190 --> 00:08:05,280
by this definition v pi of s is how you will do on average
126
00:08:03,190 --> 00:08:08,310
when you use your policy from state s
127
00:08:06,470 --> 00:08:10,000
so if you can do better than average
128
00:08:08,310 --> 00:08:12,240
if you can choose an action a
129
00:08:10,000 --> 00:08:13,750
so that q pi s a is larger than v pi of s
130
00:08:12,240 --> 00:08:17,280
then you will do better you'll do better than average
131
00:08:14,800 --> 00:08:18,630
under your old policy so one thing you
132
00:08:17,280 --> 00:08:20,720
could do is you could modify
133
00:08:18,630 --> 00:08:24,560
pi of a given s to increase the probability of actions
134
00:08:22,870 --> 00:08:27,360
whose value under the q function is larger
135
00:08:24,560 --> 00:08:28,720
than the value at that state
136
00:08:27,360 --> 00:08:34,950
and you can actually use this to get a gradient based update rule on pi
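The better-than-average quantity being described here is commonly written as an advantage (standard terminology, not named in this clip):

```latex
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t),
\qquad
A^\pi(s_t, a_t) > 0 \;\Rightarrow\; a_t \text{ is better than average under } \pi.
```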
137
00:08:30,630 --> 00:08:36,710
these ideas are very important in rl
138
00:08:34,950 --> 00:08:39,200
and we'll revisit them again and again in the next few lectures
139
00:08:38,000 --> 00:08:42,470
when we talk about model free reinforcement learning algorithms
140
00:08:43,360 --> 00:08:51,040
all right so in the anatomy of the reinforcement learning algorithm
141
00:08:48,800 --> 00:08:57,440
the green box is typically where you would use or where you would learn
142
00:08:55,680 --> 00:08:58,880
q functions or value functions
143
00:08:57,440 --> 00:09:02,240
so q functions and value functions fundamentally are objects that evaluate
144
00:09:00,560 --> 00:09:04,560
how good your policy currently is
145
00:09:02,240 --> 00:09:07,760
so you would typically fit them or learn them in the green box
146
00:09:05,830 --> 00:09:12,480
and then use them in the blue box to improve the policy