
Reverse proxy startup health check behavior results in 503 errors #6410

Open
elee1766 opened this issue Jun 19, 2024 · 5 comments
Labels
discussion 💬 The right solution needs to be found

Comments

@elee1766
Contributor

elee1766 commented Jun 19, 2024

Currently, a remote is marked unhealthy if no active health checks to the remote have been done.

This causes the reverse proxy to return 503 before a health check has completed, even if the remote is truly healthy, in the time between the config load completing and the first health check.
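
Here's a rough sketch of that behavior (not Caddy's actual code; the type and field names are made up for illustration), assuming an upstream whose health defaults to unhealthy until its first active check completes:

```go
package main

import "fmt"

type upstream struct {
	checked bool // has any active health check completed for this upstream yet?
	passing bool // result of the most recent active health check
}

// healthy mirrors the current behavior: an upstream with no health history
// is treated the same as one that failed its last check.
func (u *upstream) healthy() bool {
	return u.checked && u.passing
}

func main() {
	u := &upstream{} // config just loaded, no active checks have run yet
	if !u.healthy() {
		fmt.Println("503 Service Unavailable: no upstreams available")
	}
}
```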

There are a few solutions to the problem, but we have not decided which is correct.

  1. block/hold requests if all remotes have no health history
  2. block the provisioning of Caddy until one health check round has completed (see #6407, "changes healthcheck to block on provision until first success (and also block properly in general)")
  3. set the default active health state of an uninspected remote to healthy
  4. allow the configuration of 3, with a sane default
@mholt
Member

mholt commented Jun 19, 2024

(As clarified in Slack, this is about active health checks.)

My vote is currently for no. 1.

2 is a NO from me because blocking provisioning can make config reloads slow, and we strive to keep them fast and lightweight.

3 is a NO from me because, if the proxy is started before the backends, we can't assume the backends are healthy right away. IMO, active health checks should assume unhealthy unless proven otherwise by a passing health check (compared to passive health checks, which assume healthy until proven otherwise).

Number 1 is nice because it allows the server/config to start quickly, and the requests don't have to fail (even if they are delayed briefly). We also don't have bad status information. I imagine health checks -- especially passing ones -- happen very quickly, so the blocking will be nearly instantaneous, probably less than 1/4 second.
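
For illustration, here's a minimal sketch of what option 1 could look like (not a proposed implementation; the names and the timeout are invented): requests wait on a channel that is closed once the first round of active health checks finishes.

```go
package main

import (
	"fmt"
	"time"
)

type healthChecker struct {
	firstRoundDone chan struct{} // closed once the first round of active checks finishes
}

func newHealthChecker() *healthChecker {
	hc := &healthChecker{firstRoundDone: make(chan struct{})}
	go func() {
		// ... perform the first round of active health checks here ...
		time.Sleep(100 * time.Millisecond) // stand-in for the real checks
		close(hc.firstRoundDone)
	}()
	return hc
}

// waitReady blocks a request until the first round completes or the timeout
// expires, so the proxy never 503s purely for lack of health history.
func (hc *healthChecker) waitReady(timeout time.Duration) bool {
	select {
	case <-hc.firstRoundDone:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	hc := newHealthChecker()
	if hc.waitReady(2 * time.Second) {
		fmt.Println("proxying request")
	} else {
		fmt.Println("timed out waiting for first health check")
	}
}
```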

@ottenhoff
Contributor

Note that health_passes 3 means that after failing, an upstream node needs to pass three successive health checks to become healthy again.
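
For anyone unfamiliar with the option, a tiny sketch of those semantics (illustrative only, not Caddy's implementation):

```go
package main

import "fmt"

type upstream struct {
	healthy         bool
	consecutivePass int
}

const passesRequired = 3 // corresponds to `health_passes 3`

// recordCheck applies one active health check result: any failure resets the
// streak, and the upstream only becomes healthy again after passesRequired
// consecutive passes.
func (u *upstream) recordCheck(passed bool) {
	if !passed {
		u.healthy = false
		u.consecutivePass = 0
		return
	}
	u.consecutivePass++
	if u.consecutivePass >= passesRequired {
		u.healthy = true
	}
}

func main() {
	u := &upstream{}
	u.recordCheck(false) // fails: unhealthy, streak reset
	u.recordCheck(true)
	u.recordCheck(true)
	u.recordCheck(true) // third consecutive pass: healthy again
	fmt.Println("healthy:", u.healthy)
}
```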

I'm okay with 1 as long as the current behavior remains, where a health check is fired immediately and the block is near-instantaneous.

I believe other load balancers like Nginx (paid) assume that all listed upstreams are healthy after a reload/restart and don't take them out of the mix until the health checks fail.

@elee1766
Contributor Author

I believe other load balancers like Nginx (paid) assume that all listed upstreams are healthy after a reload/restart and don't take them out of the mix until the health checks fail.

Basically correct. During investigation I found that nginx plus and traefik set the initial state of a backend to healthy when no health checks have been made to it. However, they do preserve history across restarts to the same hosts (as does Caddy, I believe).

@mholt
Member

mholt commented Jun 19, 2024

Basically correct. During investigation I found that nginx plus and traefik set the initial state of a backend to healthy when no health checks have been made to it.

I didn't think about what other servers do when we implemented health checks, but this is surprising to me... it feels wrong for active health checks to assume a healthy backend without checking first. Marking them as healthy when you don't actually know seems... misleading?

However, they do preserve history across restarts to the same hosts (as does Caddy, I believe).

Caddy preserves the health status across reloads but if the process quits then the memory is cleared. We don't persist it to storage as of yet.

@elee1766
Contributor Author

elee1766 commented Jun 19, 2024

Marking them as healthy when you don't actually know seems... misleading?

I think the argument can be made that marking them unhealthy is equally misleading. The remote is in superposition: since it has not been observed, it's in a third distinct state that is currently handled as the unhealthy case. It seems existing implementations tip the scale slightly in favor of the healthy state; my guess is that this is in order to have a faster time to first response.
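
Put differently, there are really three states, and the question is just how the "unknown" one is treated (sketch below; the names are made up for illustration, not taken from Caddy's code):

```go
package main

import "fmt"

type healthState int

const (
	healthUnknown healthState = iota // no active check has run yet
	healthHealthy
	healthUnhealthy
)

// usable decides whether an upstream may receive traffic. Treating Unknown as
// usable matches what nginx plus / traefik reportedly do; treating it as
// unusable matches Caddy's current behavior.
func usable(s healthState, optimistic bool) bool {
	switch s {
	case healthHealthy:
		return true
	case healthUnknown:
		return optimistic
	default:
		return false
	}
}

func main() {
	fmt.Println(usable(healthUnknown, true))  // optimistic: true
	fmt.Println(usable(healthUnknown, false)) // pessimistic: false
}
```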
