
analyzer, log-firehose, or trimmer crash may result in a functional deadlock (aka. postmortem on the 2021-08-05 incident) #519

JustAnotherArchivist (Contributor) opened this issue on Aug 5, 2021

Last night, various things on the control node crashed. I haven't been able to establish the exact chain of events, but I believe that first either Redis or the trimmer had some sort of minor issue, which caused the trimmer to die with an error. This meant that the job logs accumulated, and eventually Redis was slaughtered by the OOM killer. Unfortunately, the RDB file already had these accumulated logs, which meant that restarting Redis would simply lead to another OOM kill immediately.

As I understand it, the same thing could happen even with the trimmer still working fine if the analyzer or the log-firehose crashed, because that also breaks trimming: the analyzer stops updating last_analyzed_log_entry and the firehose stops updating last_broadcasted_log_entry, respectively, so the trimmer never considers newer entries safe to delete.
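To make that dependency concrete, here's a minimal sketch of the invariant the trimmer relies on. The key names, hash layout, and sorted-set structure below are illustrative assumptions, not the actual schema; the point is only that an entry can be dropped once both consumers have moved past it, so a single stalled consumer stops all trimming:

    # Sketch only; "$JOB", "${JOB}_log" and the hash layout are assumptions.
    JOB=0123456789abcdef01234567    # hypothetical job ID
    ANALYZED=$(redis-cli hget "$JOB" last_analyzed_log_entry)
    BROADCASTED=$(redis-cli hget "$JOB" last_broadcasted_log_entry)
    # Only entries both consumers have processed are safe to delete. If either
    # marker stops advancing, SAFE stops advancing and the log grows unbounded.
    SAFE=$(( ANALYZED < BROADCASTED ? ANALYZED : BROADCASTED ))
    redis-cli zremrangebyscore "${JOB}_log" 0 "$SAFE"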

I'm not sure what the solution for this is – other than redesigning the entire log system so that messages don't go into Redis in the first place (which is planned).


I'll also use this to document the fix:

  1. Stop all (or at least most) pipelines' SSH connections to prevent them from immediately spamming Redis with further log lines.
  2. Stop anything else memory-intensive that's still running and not really needed (dashboard, websocket, cogs).
  3. Restart Redis and hope that it doesn't OOM. (If it does, free more RAM or temporarily add swap; see the sketch after this list.)
  4. Run the analyzer and the firehose manually for all jobs.
    • This is to update the two fields mentioned above so that the trimmer can do its job. The analyzer will do its usual thing, the firehose will send the log messages into the void (since the dashboard WebSocket server isn't running), but that's fine.
    • The normal way to run these is with updates-listener, but that wouldn't work because the pipelines are disconnected, so no job IDs are being pushed to the updates channel.
    • The grep for job IDs in the commands below is obviously not perfect, but since most job IDs are 24 or 25 characters long, it should catch enough of them to trim things down and get out of the OOM zone. (A SCAN-based variant is sketched after this list.)
    • redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/analyze-logs
    • redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 FIREHOSE_SOCKET_URL=tcp://127.0.0.1:12345 plumbing/log-firehose
    • Naturally, it should be possible to just update the broadcasted key directly, but I didn't look into how to do that.
  5. Run the trimmer manually: redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/trim-logs >/dev/null
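For step 3, a hedged example of temporarily adding swap (assumes root access and enough free space on the root filesystem; size and path are arbitrary):

    fallocate -l 4G /swapfile-temp                # size is arbitrary
    chmod 600 /swapfile-temp
    mkswap /swapfile-temp
    swapon /swapfile-temp
    # ... restart Redis and run the recovery steps ...
    swapoff /swapfile-temp && rm /swapfile-temp   # clean up afterwards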
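For steps 4 and 5, a possible variant of the key listing: KEYS '*' blocks Redis and buffers the whole key list in a single reply, which isn't great on an instance already on the edge of OOM, while redis-cli --scan iterates the key space incrementally. Untested in this exact setup, but the flag is standard redis-cli; for example, for the trimmer:

    redis-cli --scan | grep -P '^[0-9a-z]{24,25}$' | sort -u | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/trim-logs >/dev/null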