Send important executor logs to task logs #40468

Open · wants to merge 13 commits into main

Conversation

@vincbeck (Contributor) commented Jun 27, 2024

If the executor fails to start a task, the user sees no logs in the UI because the task never started. This PR leverages the TaskContextLogger implemented in #32646 to forward the important error messages to the task logs when an executor fails to execute a task.

cc @o-nikolas
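
For illustration, here is a minimal sketch of the forwarding pattern (hedged: `TaskContextLogger` comes from #32646, but the constructor argument, the `ti=` keyword, and the `fail_task` helper shown here are assumptions for this sketch, not the exact diff in this PR):

```python
# A minimal sketch, assuming the TaskContextLogger API from #32646.
# The component_name argument, the ti= keyword, and fail_task() are
# illustrative assumptions, not the exact code in this PR.
from airflow.executors.base_executor import BaseExecutor
from airflow.utils.log.task_context_logger import TaskContextLogger


class MyExecutor(BaseExecutor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Messages sent through this logger are appended to the task's own
        # log, so they appear in the UI even if the task never started.
        self.task_context_logger = TaskContextLogger(component_name="MyExecutor")

    def fail_task(self, ti, reason: str) -> None:
        # Keep the executor-level log line, and forward the same message
        # to the task log so the user can see why the task never ran.
        self.log.error("Executor failed to start task %s: %s", ti.key, reason)
        self.task_context_logger.error(
            "Executor failed to start task %s: %s", ti.key, reason, ti=ti
        )
```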


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@o-nikolas (Contributor) left a comment

Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

Resolved review threads: airflow/executors/base_executor.py · airflow/utils/log/task_context_logger.py (outdated)
@vincbeck (Author) commented Jun 27, 2024

> Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

Yep, my plan is to start with one executor, gather feedback, and address it. Once the direction is settled, I'll create another PR for the BatchExecutor.

@ferruzzi (Contributor) commented

Had a look and I don't have anything extra to add; I like the direction this is going. I'd need to see some examples to have a real opinion on the question of one logger versus one per executor, but otherwise I think it looks good after the change to session creation that Niko mentioned.

```diff
-        self.log.error(
-            "could not queue task %s (still running after %d attempts)", key, attempt.total_tries
-        )
+        self.task_context_logger.error(
+            "could not queue task %s (still running after %d attempts)",
```
Contributor:

I think it's confusing for a user to see that a task couldn't be queued because it's currently running. Is there any way to make this log message more useful/intuitive?

Author (@vincbeck):

Do you have any suggestions?

Contributor:

No, but I've run into this when troubleshooting and it always confuses me 😅

Author (@vincbeck):

You're the perfect candidate to create a meaningful error message then :)

Author (@vincbeck):

I updated the error message; please let me know if that looks better to you.

Contributor:

What about...

Could not queue task %s. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#still-running-on-executor

And add a blurb in troubleshooting.rst explaining what the comment says (triggerer race condition or the task is soon going to be marked failed)?
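
In code, that suggestion would read roughly like this (a sketch: the `#still-running-on-executor` anchor does not exist yet and would only be valid once the matching blurb lands in troubleshooting.rst; the `ti=` keyword is assumed from the #32646 API):

```python
# Sketch of the suggested message; the docs anchor is hypothetical until
# the corresponding section is added to troubleshooting.rst.
self.task_context_logger.error(
    "Could not queue task %s. Learn more: "
    "https://airflow.apache.org/docs/apache-airflow/stable/"
    "troubleshooting.html#still-running-on-executor",
    key,
    ti=ti,
)
```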

Author (@vincbeck):

That works too. Thanks for the proposal!

@ferruzzi (Contributor) commented Jul 2, 2024

Sorry for the late response. I was thinking about this a bit, and it looks like you've settled on something, but an alternative just came to me. Take it or leave it, of course, but I'll suggest an option:

[Could not | Failed to] queue task %s after %d attempts; executor [reports | notes | claims | states] task [is | as] [currently | already | " "] running.

I'm still juggling the phrasing, but something along the lines of:

Failed to queue task %s after %d attempts; executor reports task is currently running.

Perhaps with the executor name/ID in there.
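
Putting that together with the existing arguments, the call might read as follows (a sketch; using `type(self).__name__` as the executor identifier is my assumption, not part of the quoted proposal, and the `ti=` keyword is assumed from the #32646 API):

```python
# Sketch of the proposed wording with an executor identifier included;
# type(self).__name__ and the ti= keyword are assumptions.
self.task_context_logger.error(
    "Failed to queue task %s after %d attempts; executor %s reports task "
    "is currently running.",
    key,
    attempt.total_tries,
    type(self).__name__,
    ti=ti,
)
```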

Author (@vincbeck):

I like this proposal. @RNHTTR, what do you think? TBH, I have nothing against writing a section in troubleshooting.rst about this special case, but since I have never encountered it, I don't feel confident describing it (I'm not sure how it happens). Or maybe you could describe it? That'd be helpful.

@RNHTTR (Contributor) commented Jul 3, 2024

> I don't feel confident describing it (I'm not sure how it happens)

This is my problem too, which is why I'm hoping we can come up with something more useful or just not surface this log at all. Usually when I encounter this, I mostly ignore it and look for something else that's meaningful.

In my opinion, "executor reports task is currently running" is still tricky for users. If I understand correctly, it's reporting the state of the task in, for example, Celery, which is different from the Airflow state of the task. I think surfacing this log as written will only confuse users more than if it didn't show up at all.

```diff
@@ -386,14 +385,16 @@ def attempt_task_runs(self):
                 )
                 self.pending_tasks.append(ecs_task)
             else:
-                self.log.error(
```
Contributor:

@syedahsn, you should have a review of this if you get a chance; the failure-reason handling has been modified here.
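
For context, the modified branch likely follows the same substitution pattern as in base_executor.py. A hedged sketch (the names `failure_reasons`, `task_key`, and `ti` are illustrative assumptions, not necessarily those in the real ECS executor):

```python
# Hedged sketch of the failure branch in attempt_task_runs() after this
# change; failure_reasons, task_key, and ti are assumed names.
if failure_reasons:
    # Previously only the executor log recorded why run_task failed:
    #     self.log.error("Unable to run task %s: %s", task_key, failure_reasons)
    # Forwarding to the task log makes the reason visible in the UI:
    self.task_context_logger.error(
        "ECS task for %s could not be started: %s", task_key, failure_reasons, ti=ti
    )
```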

Labels: area:Executors-core, area:logging, area:providers, provider:amazon-aws