[IBCDPE-1004] Move back to celery executor #28
Conversation
@@ -2010,9 +2010,9 @@ limits: []

# This runs as a CronJob to cleanup old pods.
cleanup:
-  enabled: true
+  enabled: false
This is not needed when we are running the celery executor, since the worker pods are controlled in a different way than when pods are started per-task through the kubernetes executor.
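For context, the executor switch and the cleanup toggle are values changes along these lines. This is a sketch against the community apache-airflow Helm chart; the key names (`executor`, `workers.replicas`, `cleanup.enabled`) are assumptions and may not match this deployment's values file exactly:

```yaml
# Sketch of the relevant values (assumed keys from the community
# apache-airflow Helm chart, not the exact file in this PR):
executor: CeleryExecutor

workers:
  replicas: 2          # long-lived celery worker pods run the tasks

cleanup:
  enabled: false       # per-task pod cleanup only matters for the
                       # KubernetesExecutor, which launches a pod per task
```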
Are there any downsides to having replicas? Cost? Something else? I'm also wondering if this solution will guarantee that airflow never goes down, or if it merely makes it less likely. Why can't we / would we not want to run the scheduler or other critical Airflow components on a non-spot instance?
Good questions @philerooski
We can run them on reserved instances; however, there is the added cost of keeping those reserved instances running.
We are able to run these airflow components in master-master mode. Each airflow component works independently, doing the same things; when a task is picked up for execution, it depends on postgres locking that database row. Once postgres locks it, no other master node can pick up the task.

Running multiple copies of an application is standard within kubernetes, even when not on spot instances. In our case, since we are running spot instances, it is relatively "cheap" to just run more nodes and let pods be evicted when a node goes down. Having more replicas on the cluster does mean more resource usage, and airflow takes significantly more resources than anything else running on the cluster, so other processes could starve for resources if airflow usage is very high. We could take this further by allocating nodes specifically to run the airflow infra; this is called a "share nothing" approach to provisioning kubernetes resources.

I am expecting that this will reduce the chance of an airflow outage; however, there is no guarantee that it is 0%. We will need to look back and see.
Thanks for the detailed answer @BryanFauble , I'm learning a lot 😄 If the tasks are maintained by the database, then Airflow going down completely seems more like an inconvenience (tasks can't be scheduled) rather than a catastrophe (we lose our intermediary processing results). And if we're losing spot instances then we probably don't have much compute available to do work, regardless.
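The row-locking behavior discussed above can be sketched in miniature. On Postgres, Airflow's HA schedulers rely on `SELECT ... FOR UPDATE SKIP LOCKED`; the snippet below is a hypothetical simulation of the same outcome (only one replica wins the row) using an atomic conditional `UPDATE` in sqlite, since sqlite has no `SKIP LOCKED`. All names here (`task` table, `scheduler`, `claims`) are invented for illustration:

```python
import sqlite3
import threading

# One shared "metadata database" with a single queued task row.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, state TEXT)")
db.execute("INSERT INTO task VALUES (1, 'queued')")
db.commit()

lock = threading.Lock()  # serialize access to the one shared connection
claims = []              # which replica actually claimed the task

def scheduler(name):
    # Each replica attempts to claim the row; the conditional UPDATE is
    # atomic, so it succeeds for exactly one of them (rowcount == 1).
    with lock:
        cur = db.execute(
            "UPDATE task SET state='running' WHERE id=1 AND state='queued'"
        )
        if cur.rowcount == 1:
            claims.append(name)
        db.commit()

threads = [
    threading.Thread(target=scheduler, args=(f"replica-{i}",))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(claims))  # -> 1: only one replica picked up the task
```

The same guarantee is what makes running multiple scheduler masters safe: losing a replica on a spot node costs availability, not correctness, because the claim lives in the database.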
Problem:
Solution:
Testing: