<!doctype html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>
Open Sourcing Zuul 2 - Junkman
</title>
<link href="atom.xml" rel="alternate" title="Junkman" type="application/atom+xml">
<link rel="stylesheet" href="asset/css/foundation.min.css" />
<link rel="stylesheet" href="asset/css/docs.css" />
<script src="asset/js/vendor/modernizr.js"></script>
<script src="asset/js/vendor/jquery.js"></script>
<script src="asset/highlightjs/highlight.pack.js"></script>
<link href="asset/highlightjs/styles/github.css" media="screen, projection" rel="stylesheet" type="text/css">
<script>hljs.initHighlightingOnLoad();</script>
<script type="text/javascript">
function before_search(){
var searchVal = 'site:panlw.github.io ' + document.getElementById('search_input').value;
document.getElementById('search_q').value = searchVal;
return true;
}
</script>
</head>
<body class="antialiased hide-extras">
<div class="marketing off-canvas-wrap" data-offcanvas>
<div class="inner-wrap">
<nav class="top-bar docs-bar hide-for-small" data-topbar>
<section class="top-bar-section">
<div class="row">
<div style="position: relative;width:100%;"><div style="position: absolute; width:100%;">
<ul id="main-menu" class="left">
<li id=""><a target="_self" href="index.html">Home</a></li>
<li id=""><a target="_self" href="archives.html">Archives</a></li>
</ul>
<ul class="right" id="search-wrap">
<li>
<form target="_blank" onsubmit="return before_search();" action="http://google.com/search" method="get">
<input type="hidden" id="search_q" name="q" value="" />
<input tabindex="1" type="search" id="search_input" placeholder="Search"/>
</form>
</li>
</ul>
</div></div>
</div>
</section>
</nav>
<nav class="tab-bar show-for-small">
<a href="javascript:void(0)" class="left-off-canvas-toggle menu-icon">
<span> Junkman</span>
</a>
</nav>
<aside class="left-off-canvas-menu">
<ul class="off-canvas-list">
<li><a href="index.html">HOME</a></li>
<li><a href="archives.html">Archives</a></li>
<li><a href="about.html">ABOUT</a></li>
<li><label>Categories</label></li>
<li><a href="Infra.html">Infra</a></li>
<li><a href="Coding.html">Coding</a></li>
<li><a href="Modeling.html">Modeling</a></li>
<li><a href="Archtecting.html">Architecting</a></li>
</ul>
</aside>
<a class="exit-off-canvas" href="#"></a>
<section id="main-content" role="main" class="scroll-container">
<script type="text/javascript">
$(function(){
$('#menu_item_index').addClass('is_active');
});
</script>
<div class="row">
<div class="large-8 medium-8 columns">
<div class="markdown-body article-wrap">
<div class="article">
<h1>Open Sourcing Zuul 2</h1>
<div class="read-more clearfix">
<span class="date">2018/6/21</span>
<span>posted in </span>
<span class="posted-in"><a href='Microservice.html'>Microservice</a></span>
<span class="comments">
</span>
</div>
</div><!-- article -->
<div class="article-content">
<blockquote>
<p><a href="https://medium.com/netflix-techblog/open-sourcing-zuul-2-82ea476cb2b3">https://medium.com/netflix-techblog/open-sourcing-zuul-2-82ea476cb2b3</a></p>
</blockquote>
<p>We are excited to announce the open sourcing of <a href="https://github.com/netflix/zuul/">Zuul 2</a>, Netflix’s cloud gateway. We use Zuul 2 at Netflix as the front door for all requests coming into Netflix’s cloud infrastructure. Zuul 2 significantly improves the architecture and features that allow our gateway to handle, route, and protect Netflix’s cloud systems, and helps provide our 125 million members the best experience possible. The Cloud Gateway team at Netflix runs and operates more than 80 clusters of Zuul 2, sending traffic to about 100 (and growing) backend service clusters, which amounts to more than 1 million requests per second. Nearly all of this traffic is from customer devices and browsers that enable the discovery and playback experience you are likely familiar with.</p>
<p>This post gives an overview of Zuul 2, provides details on some of the interesting features we are releasing today, and discusses some of the other projects that we’re building with Zuul 2.</p>
<h3 id="toc_0">How Zuul 2 Works</h3>
<p>For context, here’s a high-level diagram of Zuul 2’s architecture:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/0*ycjEWsSKCaPemEg3." alt="High-level diagram of the Zuul 2 architecture"/></p>
<p>The Netty handlers on the front and back of the filters are mainly responsible for handling the network protocol, web server, connection management and proxying work. With those inner workings abstracted away, the filters do all of the heavy lifting. The inbound filters run before proxying the request and can be used for authentication, routing, or decorating the request. The endpoint filters can either be used to return a static response or proxy the request to the backend service (or origin as we call it). The outbound filters run after a response has been returned and can be used for things like gzipping, metrics, or adding/removing custom headers.</p>
<p>Zuul’s functionality depends almost entirely on the logic that you add in each filter. That means you can deploy it in multiple contexts and have it solve different problems based on the configurations and filters it is running.</p>
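<p>To make the inbound, endpoint, and outbound stages described above concrete, here is a minimal Python sketch of that flow. It is not Zuul's API (Zuul itself is a Java project with its own filter interfaces); <code>Request</code>, <code>Response</code>, <code>run_filters</code>, and the three example filters are hypothetical names used only for illustration.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


# Hypothetical filter signatures, not Zuul's actual Java interfaces.
InboundFilter = Callable[[Request], Request]
EndpointFilter = Callable[[Request], Response]
OutboundFilter = Callable[[Response], Response]


def run_filters(request: Request,
                inbound: List[InboundFilter],
                endpoint: EndpointFilter,
                outbound: List[OutboundFilter]) -> Response:
    """Run every inbound filter, then one endpoint, then every outbound filter."""
    for inbound_filter in inbound:
        request = inbound_filter(request)
    response = endpoint(request)
    for outbound_filter in outbound:
        response = outbound_filter(response)
    return response


def add_request_id(request: Request) -> Request:
    # Inbound: decorate the request before it reaches the endpoint.
    request.headers["x-request-id"] = "demo-1"
    return request


def static_health_endpoint(request: Request) -> Response:
    # Endpoint: return a static response instead of proxying to an origin.
    return Response(status=200, body="ok")


def add_gateway_header(response: Response) -> Response:
    # Outbound: add a custom header after the response has been produced.
    response.headers["x-gateway"] = "sketch"
    return response
```

<p>Calling <code>run_filters(Request(path="/health"), [add_request_id], static_health_endpoint, [add_gateway_header])</code> yields the static 200 response with the extra header attached. Swapping the endpoint for one that proxies to an origin would leave the inbound and outbound stages unchanged, which is the point of the separation.</p>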
<p>We use Zuul at the entry point of all external traffic into Netflix’s cloud services, and we’ve started using it for routing internal traffic as well. For that internal use, we deploy the same core with substantially less functionality (i.e., fewer filters). This still lets us leverage load balancing, self service routing, and resiliency features for internal traffic.</p>
<h3 id="toc_1">Open Source</h3>
<p>The Zuul code that’s running today is the most stable and resilient version of Zuul yet. The various phases of evolving and refactoring the codebase have paid dividends and we couldn’t be happier to share it with you.</p>
<p>Today we are releasing many <a href="https://github.com/Netflix/zuul/wiki/Core-Features">core features</a>. Here are the ones we’re most excited about:</p>
<h4 id="toc_2">Server Protocols</h4>
<ul>
<li> <strong>HTTP/2</strong> — full server support for inbound HTTP/2 connections</li>
<li> <strong>Mutual TLS</strong> — allow for running Zuul in more secure scenarios</li>
</ul>
<h4 id="toc_3">Resiliency Features</h4>
<ul>
<li> <strong>Adaptive Retries</strong> — the core retry logic that we use at Netflix to increase our resiliency and availability</li>
<li> <strong>Origin Concurrency Protection </strong>— configurable concurrency limits to protect your origins from getting overloaded and protect other origins behind Zuul from each other</li>
</ul>
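<p>Origin concurrency protection can be pictured as a cap on in-flight requests per origin. The Python sketch below shows that general technique under stated assumptions; it is not Zuul's implementation, and the class name, the single shared limit, and the origin names are all made up for illustration.</p>

```python
import threading
from typing import Dict


class OriginConcurrencyLimiter:
    """Caps in-flight requests per origin (illustrative, not Zuul's code)."""

    def __init__(self, max_per_origin: int):
        self.max_per_origin = max_per_origin
        self._in_flight: Dict[str, int] = {}
        self._lock = threading.Lock()

    def try_acquire(self, origin: str) -> bool:
        # Returns False when the origin is already at its limit, so the
        # caller can reject the request instead of piling more load on.
        with self._lock:
            current = self._in_flight.get(origin, 0)
            if current >= self.max_per_origin:
                return False
            self._in_flight[origin] = current + 1
            return True

    def release(self, origin: str) -> None:
        # Call once the proxied request has completed or failed.
        with self._lock:
            current = self._in_flight.get(origin, 0)
            self._in_flight[origin] = max(0, current - 1)
```

<p>Because the counters are kept per origin, one saturated origin stops accepting new requests without using up capacity that requests to other origins need.</p>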
<h4 id="toc_4">Operational Features</h4>
<ul>
<li> <strong>Request Passport </strong>— track all the lifecycle events for each request, which is invaluable for debugging async requests</li>
<li> <strong>Status Categories</strong> — an enumeration of possible success and failure states for requests that are more granular than HTTP status codes</li>
<li> <strong>Request Attempts</strong> — track proxy attempts and status of each, particularly useful for debugging retries and routing</li>
</ul>
<p>We are also working on some features that will be <a href="https://github.com/Netflix/zuul/wiki/Coming-Soon">coming soon</a>, including:</p>
<ul>
<li> <strong>Websocket/SSE </strong>— support for side-channel push notifications</li>
<li> <strong>Throttling and rate-limiting</strong> — protection from malicious client connections and requests, helping defend against volumetric attacks</li>
<li> <strong>Brownout filters</strong> — for disabling certain CPU-intensive features when Zuul is overloaded</li>
<li> <strong>Configurable routing</strong> — file-based routing configuration, instead of having to create routing filters in Zuul</li>
</ul>
<p>We would love to hear from you and see all the new and interesting applications of Zuul. For instructions on getting started, please visit our <a href="https://github.com/Netflix/zuul/wiki/Getting-Started-2.0">wiki page</a>.</p>
<h3 id="toc_5">Leveraging Zuul 2 at Netflix</h3>
<p>Internally, there are several major features that we’ve been working on but have not open sourced yet. Each one deserves its own blog post, but let’s go over them briefly.</p>
<h3 id="toc_6">Self Service Routing</h3>
<p>The feature most widely used by our partners is self service routing. We provide an application and API for users to create routing rules based on any criteria in the request URL, path, query params, or headers. We then publish these routing rules to all the Zuul instances.</p>
<p>The main use case is for routing traffic to a specific test or staging cluster. However, there are many use cases for real production traffic. For example:</p>
<ul>
<li> Services needing to shard their traffic create routing rules that map certain paths or prefixes to separate origins</li>
<li> Developers onboard new services by creating a route that maps a new hostname to their new origin</li>
<li> Developers run load tests by routing a percentage of existing traffic to a small cluster and ensuring applications will degrade gracefully under load</li>
<li> Teams refactoring applications migrate to a new origin slowly by creating rules mapping traffic gradually, one path at a time</li>
<li> Teams test changes (canary testing) by sending a small percentage of traffic to an instrumented cluster running the new build</li>
<li> If teams need to test changes requiring multiple consecutive requests on their new build, they run sticky canary tests that route the same users to their new build for brief periods of time</li>
<li> Security teams create rules that reject “bad” requests based on path or header rules across all Zuul clusters</li>
</ul>
<p>As you can see we use self service routing extensively and are increasing the customizability and scope of routes to allow for even more use cases.</p>
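<p>The routing rules themselves are not open sourced, so the following Python sketch is only a guess at the shape such rules could take: match on a path prefix or header, optionally send only a percentage of matching traffic, and keep the choice sticky per user. <code>RoutingRule</code>, <code>bucket</code>, <code>pick_origin</code>, and the origin names are hypothetical.</p>

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RoutingRule:
    origin: str
    path_prefix: str = ""                      # empty prefix matches any path
    header: Optional[Tuple[str, str]] = None   # (name, value) that must match
    percent: int = 100                         # share of matching traffic


def bucket(user_id: str) -> int:
    # Stable bucket in the range 0..99, so a given user always falls on the
    # same side of a percentage split (a "sticky" canary).
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_origin(rules: List[RoutingRule], default_origin: str, path: str,
                headers: Dict[str, str], user_id: str) -> str:
    """Return the origin of the first rule that matches, else the default."""
    for rule in rules:
        if not path.startswith(rule.path_prefix):
            continue
        if rule.header is not None:
            name, value = rule.header
            if headers.get(name) != value:
                continue
        if bucket(user_id) >= rule.percent:
            continue  # this user is outside the rule's traffic share
        return rule.origin
    return default_origin
```

<p>With a rule such as <code>RoutingRule(origin="playback-canary", path_prefix="/playback", percent=5)</code>, roughly five percent of users would be sent to the canary cluster, and each of them would stay there for as long as the rule exists.</p>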
<h3 id="toc_7">Load Balancing for Resiliency</h3>
<p>Another major feature we’ve worked on is making load balancing to origins more intelligent. We are able to route around failures, slowness, GC issues, and various other things that crop up often when running large numbers of nodes. The goal of this work is to increase resiliency, availability, and quality of service for all Netflix services.</p>
<p>We have several cases that we handle:</p>
<h4 id="toc_8">Cold Instances</h4>
<p>When new origin instances start up, we send them a reduced amount of traffic for some time, until they’re warmed up. This was an issue we observed for applications with large codebases and huge metaspace usage. It takes a significant amount of time for these apps to JIT their code and be ready to handle a large amount of traffic.</p>
<p>We also generally bias the traffic to older instances and if we happen to hit a cold instance that throttles, we can always retry on a warm one. This gives us an order of magnitude improvement in availability.</p>
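<p>One simple way to picture the warm-up behaviour is a weight that grows with instance age and feeds a weighted random pick. The Python sketch below assumes a linear ramp, a 300-second warm-up window, and a 10% starting weight; none of these values come from the post, and the function names are hypothetical.</p>

```python
import random
from typing import Dict


def warmup_weight(age_seconds: float, warmup_seconds: float = 300.0,
                  floor: float = 0.1) -> float:
    """Linear ramp from `floor` (brand-new instance) to 1.0 (fully warm)."""
    if warmup_seconds == 0:
        return 1.0
    progress = min(1.0, max(0.0, age_seconds / warmup_seconds))
    return floor + (1.0 - floor) * progress


def pick_instance(ages_by_instance: Dict[str, float], rng=random) -> str:
    # Weighted random choice: cold instances still get some traffic,
    # but far less than instances that have finished warming up.
    names = list(ages_by_instance)
    weights = [warmup_weight(ages_by_instance[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

<p>With these assumed defaults, a freshly started instance receives about a tenth of the traffic of a warm peer, and the difference disappears after five minutes.</p>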
<h4 id="toc_9">High Error Rates</h4>
<p>Errors happen all the time and for varying reasons, whether it’s because of a bug in the code, a bad instance, or an invalid configuration property being set. Fortunately, as a proxy, we can detect errors reliably — either we get a 5xx error or there are connectivity problems to the service.</p>
<p>We track error rates for each origin and if the error rate is high enough, it implies the entire service is in trouble. We throttle retries from devices and disable internal retries to allow the service to recover. Moreover, we also track successive failures per instance and blacklist the bad ones for a period of time.</p>
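<p>As a rough illustration of the two mechanisms in this section, the Python sketch below tracks successive failures per instance, blacklists an instance for a fixed period, and switches retries off when an origin-wide error rate crosses a threshold. The thresholds, durations, and names are assumptions chosen for the example, not values used at Netflix.</p>

```python
from typing import Dict


class InstanceHealth:
    """Blacklists instances after several failures in a row (illustrative)."""

    def __init__(self, max_successive_failures: int = 3,
                 blacklist_seconds: float = 30.0):
        self.max_successive_failures = max_successive_failures
        self.blacklist_seconds = blacklist_seconds
        self._failures: Dict[str, int] = {}
        self._blacklisted_until: Dict[str, float] = {}

    def record(self, instance: str, status: int, now: float) -> None:
        # Any 5xx status counts as a failure; a caller could also report
        # connectivity problems as a synthetic 5xx.
        if status >= 500:
            count = self._failures.get(instance, 0) + 1
            self._failures[instance] = count
            if count >= self.max_successive_failures:
                self._blacklisted_until[instance] = now + self.blacklist_seconds
                self._failures[instance] = 0
        else:
            self._failures[instance] = 0  # a success resets the streak

    def is_available(self, instance: str, now: float) -> bool:
        return now >= self._blacklisted_until.get(instance, 0.0)


def retries_enabled(error_count: int, request_count: int,
                    threshold: float = 0.2) -> bool:
    # When the origin-wide error rate is high, stop retrying so the
    # struggling service gets room to recover.
    if request_count == 0:
        return True
    return threshold > error_count / request_count
```

<p>Passing <code>now</code> explicitly keeps the sketch deterministic; real code would more likely read a monotonic clock.</p>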
<h4 id="toc_10">Overloaded Instances</h4>
<p>With the above approaches we send less traffic to servers in a cluster that are throttling or refusing connections, and lessen the impact by retrying those failed requests on other servers.</p>
<p>We’re now rolling out an additional approach where we aim to avoid overloading servers in the first place. This is achieved by allowing origins to signal to Zuul their current utilization, which Zuul then uses as a factor in its load-balancing choices — leading to reduced error rates, retries, and latency.</p>
<p>The origins add a header to all responses stating their utilization as a percentage, along with a target utilization they would like to have across the cluster. Calculating the percentage is completely up to each application and engineers can use whatever metric suits them best. This allows for a general solution as opposed to us trying to come up with a one-size-fits-all approach.</p>
<p>With this functionality in place, we assign each instance a score (a combination of instance utilization and other factors like the ones above) and make a choice-of-two load-balancing selection: pick two instances at random and send the request to the one with the better score.</p>
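<p>Choice-of-two selection is easy to sketch: sample two instances at random and keep the one with the better score. In the Python example below, the scoring formula and its weights are invented for illustration (the post does not say how the score is computed), and a lower score is treated as better.</p>

```python
import random
from typing import Dict


def score(utilization_pct: float, target_pct: float,
          recent_errors: int = 0) -> float:
    # Lower is better. Running above the target utilization is penalised,
    # as are recent errors; the weights here are arbitrary.
    overload = max(0.0, utilization_pct - target_pct)
    return utilization_pct + 2.0 * overload + 10.0 * recent_errors


def choose_instance(scores: Dict[str, float], rng=random) -> str:
    # Choice-of-two: sample two distinct instances uniformly at random
    # and keep the one with the lower score.
    names = list(scores)
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return min((first, second), key=scores.get)
```

<p>Comparing only two candidates avoids scanning the whole cluster for every request, and the worst-scoring instance is never selected because it loses every comparison it takes part in.</p>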
<h3 id="toc_11">Anomaly Detection and Contextual Alerting</h3>
<p>As we grew from just a handful of origins to a new world where anyone can quickly spin up a container cluster and put it behind Zuul, we found there was a need to automatically detect and pinpoint origin failures.</p>
<p>With the help of <a href="https://medium.com/netflix-techblog/stream-processing-with-mantis-78af913f51a6">Mantis real time event streaming</a>, we built an anomaly detector that aggregates error rates per service and notifies us in real time when services are in trouble. It takes all of the anomalies in a given time window and creates a timeline of all the origins in trouble. We then create a contextual alert email with the timeline of events and services affected. This allows an operator to quickly correlate these events and orient themselves to debug a specific app or feature, and ultimately find the root cause.</p>
<p>In fact, it was so useful that we expanded it to send notifications to the origin teams themselves. We’ve also added more internal applications, other than Zuul, and can build a much more extensive timeline of events. This has been a huge help during production incidents and helps operators quickly detect and fix problems before they cascade into massive outages.</p>
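<p>The detector described here runs on Mantis and is not public. Purely to illustrate the idea of aggregating error rates per service in a time window and turning the anomalies into a timeline, here is a small batch-style Python sketch; the event format, the 60-second window, and the thresholds are assumptions.</p>

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Event = Tuple[float, str, bool]  # (timestamp in seconds, service, is_error)


def find_anomalies(events: Iterable[Event], window_seconds: int = 60,
                   error_rate_threshold: float = 0.5,
                   min_requests: int = 10) -> List[Tuple[int, str, float]]:
    """Return a sorted timeline of (window_start, service, error_rate)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp, service, is_error in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        key = (window_start, service)
        totals[key] += 1
        if is_error:
            errors[key] += 1

    timeline = []
    for (window_start, service), total in totals.items():
        if min_requests > total:
            continue  # too few requests to judge this window
        rate = errors[(window_start, service)] / total
        if rate >= error_rate_threshold:
            timeline.append((window_start, service, rate))
    return sorted(timeline)
```

<p>Sorting by window start gives the ordering an operator needs to see which service got into trouble first.</p>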
<p>We hope to open source as many of the above features as we can. Keep watching the tech blog for more depth on them in the future. If you want to help us solve these kinds of problems, please check out our <a href="https://jobs.netflix.com/teams/engineering?team=Product%20Engineering&amp;organization=Engineering">jobs site</a>.</p>
<p>By Arthur Gonigberg (@agonigberg), Mikey Cohen (@moldfarm), Michael Smith (@kerumai), Gaya Varadarajan (@gaya3varadhu), Sudheer Vinukonda (@apachesudheerv), Susheel Aroskar (@susheelaroskar)</p>
</div>
<div class="row">
<div class="large-6 columns">
<p class="text-left" style="padding:15px 0px;">
<a href="15298011985624.html"
title="Previous Post: Ten One-Liner Programs in Java">« Ten One-Liner Programs in Java</a>
</p>
</div>
<div class="large-6 columns">
<p class="text-right" style="padding:15px 0px;">
<a href="15293292942972.html"
title="Next Post: Understanding RPC in Depth: Protocols">Understanding RPC in Depth: Protocols »</a>
</p>
</div>
</div>
<div class="comments-wrap">
<div class="share-comments">
<script type="text/javascript" src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-5ae58078c0d7b2ab"></script>
</div>
</div>
</div><!-- article-wrap -->
</div><!-- large 8 -->
<div class="large-4 medium-4 columns">
<div class="hide-for-small">
<div id="sidebar" class="sidebar">
<div id="site-info" class="site-info">
<div class="site-a-logo"><img src="./asset/img/logo.jpg" /></div>
<h1>Junkman</h1>
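<div class="site-des">The name “Junkman” comes from a story about machine learning in Kevin Kelly’s “Out of Control” (how the “collecting machine” goes about its collecting work).</div>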
<div class="social">
<a target="_blank" class="github" href="https://github.com/panlw/" title="GitHub">GitHub</a>
<a target="_blank" class="rss" href="atom.xml" title="RSS">RSS</a>
</div>
</div>
<div id="site-categories" class="side-item ">
<div class="side-header">
<h2>Categories</h2>
</div>
<div class="side-content">
<p class="cat-list">
<a href="Infra.html"><strong>Infra</strong></a>
<a href="Coding.html"><strong>Coding</strong></a>
<a href="Modeling.html"><strong>Modeling</strong></a>
<a href="Archtecting.html"><strong>Architecting</strong></a>
</p>
</div>
</div>
<div id="site-categories" class="side-item">
<div class="side-header">
<h2>Recent Posts</h2>
</div>
<div class="side-content">
<ul class="posts-list">
<li class="post">
<a href="15517999043443.html">The Art of Crafting Architectural Diagrams</a>
</li>
<li class="post">
<a href="15517997955971.html">Why Do We Need Software Architecture Diagrams?</a>
</li>
<li class="post">
<a href="15516128677869.html">DNS Servers That Offer Privacy and Filtering</a>
</li>
<li class="post">
<a href="15516123108194.html">Airbnb's Migration from Monolith to Services</a>
</li>
<li class="post">
<a href="15516097487470.html">Events As First-Class Citizens</a>
</li>
</ul>
</div>
</div>
</div><!-- sidebar -->
</div><!-- hide for small -->
</div><!-- large 4 -->
</div><!-- row -->
<div class="page-bottom clearfix">
<div class="row">
<p class="copyright">Copyright © 2015
Powered by <a target="_blank" href="http://www.mweb.im">MWeb</a>,
Theme used <a target="_blank" href="http://github.com">GitHub CSS</a>.</p>
</div>
</div>
</section>
</div>
</div>
<script src="asset/js/foundation.min.js"></script>
<script>
$(document).foundation();
function fixSidebarHeight(){
var w1 = $('.markdown-body').height();
var w2 = $('#sidebar').height();
if (w1 > w2) { $('#sidebar').height(w1); };
}
$(function(){
fixSidebarHeight();
})
$(window).load(function(){
fixSidebarHeight();
});
</script>
<script src="asset/chart/all-min.js"></script><script type="text/javascript">$(function(){ var mwebii=0; var mwebChartEleId = 'mweb-chart-ele-'; $('pre>code').each(function(){ mwebii++; var eleiid = mwebChartEleId+mwebii; if($(this).hasClass('language-sequence')){ var ele = $(this).addClass('nohighlight').parent(); $('<div id="'+eleiid+'"></div>').insertAfter(ele); ele.hide(); var diagram = Diagram.parse($(this).text()); diagram.drawSVG(eleiid,{theme: 'simple'}); }else if($(this).hasClass('language-flow')){ var ele = $(this).addClass('nohighlight').parent(); $('<div id="'+eleiid+'"></div>').insertAfter(ele); ele.hide(); var diagram = flowchart.parse($(this).text()); diagram.drawSVG(eleiid); } });});</script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" } }});</script>
</body>
</html>