
[IBCDPE-1004] Move back to celery executor #28

Merged
merged 4 commits into from
Aug 21, 2024

Conversation

Contributor

@BryanFauble BryanFauble commented Aug 21, 2024

Problem:

  1. Using the kubernetes executor requires setting up a remote log storage solution (like S3) to write task logs to.
  2. Because of the volatile nature of running the cluster on spot instances, I was seeing components of airflow go down whenever their node was taken out of the cluster.

Solution:

  1. Fall back to the celery executor.
  2. Bump up replicas for the core airflow components to allow scheduling and running of tasks in high availability. The pod anti-affinity should also cause the pods to be scheduled on different nodes, giving us more uptime.
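The replica and anti-affinity changes described above could be expressed in Helm values along these lines. This is a hedged sketch, not the actual diff from this PR: the key names (`scheduler.replicas`, `scheduler.affinity`, `workers.replicas`) are assumptions and should be checked against the values.yaml of the chart version in use.

```yaml
# Illustrative values override for an Airflow Helm chart.
# Key names are assumptions; verify against your chart's values.yaml.
executor: "CeleryExecutor"

scheduler:
  replicas: 2
  affinity:
    podAntiAffinity:
      # Prefer spreading scheduler pods across nodes so a single spot
      # eviction does not take out every replica at once.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                component: scheduler
            topologyKey: kubernetes.io/hostname

workers:
  replicas: 2
```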

Testing:

  1. I verified this on the sandbox cluster (screenshots omitted).

@BryanFauble BryanFauble marked this pull request as ready for review August 21, 2024 17:51
@BryanFauble BryanFauble requested a review from a team as a code owner August 21, 2024 17:51
@@ -2010,9 +2010,9 @@ limits: []

 # This runs as a CronJob to cleanup old pods.
 cleanup:
-  enabled: true
+  enabled: false
Contributor Author


This is not needed when we are running the celery executor, since the worker pods are controlled differently than when pods are started per-task through the kubernetes executor.

@philerooski

Are there any downsides to having replicas? Cost? Something else?

I'm also wondering if this solution will guarantee that airflow never goes down, or if it merely makes it less likely. Why can't we / would we not want to run the scheduler or other critical Airflow components on a non-spot instance?

@BryanFauble
Contributor Author

Good questions @philerooski

> Why can't we / would we not want to run the scheduler or other critical Airflow components on a non-spot instance?

We can run them on reserved instances; however, there is the added cost of keeping those reserved instances running.

> Are there any downsides to having replicas? Cost? Something else?
> I'm also wondering if this solution will guarantee that airflow never goes down, or if it merely makes it less likely.

We are able to run these airflow components in master-master mode. Essentially, each airflow component works independently doing the same things, but when a task is picked up for execution it depends on postgres locking that database row. Once postgres locks the row, no other master node can pick up that task.

Running multiple copies of an application is standard within kubernetes, even when not on spot instances. In our case, since we are running spot instances, it is relatively "cheap" to just run more nodes and let pods be evicted when a node goes down.

Having more replicas on the cluster does mean more resource usage, and airflow takes significantly more resources than anything else running on the cluster. This means other processes could be starved for resources if airflow usage is very high. We could take this further by allocating nodes specifically to run the airflow infrastructure; this is called a "share nothing" approach to provisioning kubernetes resources.
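The "share nothing" idea mentioned above could look something like the following. This is hypothetical: the `workload=airflow` label/taint and the key names are illustrative assumptions, not configuration from this PR, and assume dedicated nodes have been labeled and tainted ahead of time.

```yaml
# Hypothetical: dedicate a node group to Airflow core components.
# Assumes nodes were prepared with, e.g.:
#   kubectl label nodes <node-name> workload=airflow
#   kubectl taint nodes <node-name> workload=airflow:NoSchedule
scheduler:
  nodeSelector:
    workload: airflow
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "airflow"
      effect: "NoSchedule"
```

With a taint plus matching toleration, only Airflow pods can land on those nodes, so heavy Airflow usage cannot starve the rest of the cluster.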

For this change, I expect it will reduce the chance of an airflow outage; however, there is no guarantee that the chance is 0%. We will need to look back and see.

@BryanFauble BryanFauble merged commit 3472cbd into main Aug 21, 2024
3 of 4 checks passed
@BryanFauble BryanFauble deleted the move-back-to-celery-executor branch August 21, 2024 18:16
@philerooski

Thanks for the detailed answer @BryanFauble , I'm learning a lot 😄

If the tasks are maintained by the database, then Airflow going down completely seems more like an inconvenience (tasks can't be scheduled) rather than a catastrophe (we lose our intermediate processing results). And if we're losing spot instances, then we probably don't have much compute available to do work regardless.
