fix: adding retries and workaround to mlflow integration to alleviate… #22718

JPercivall · 2024-06-25T22:22:51Z

… 429 issues (22633)

Summary & Motivation

The MLflow QPS limit for searches is very low, so a flow with multiple ops running in parallel in Docker containers quickly hits it. This PR adds retries and a secondary method that avoids searching for the run ID multiple times in a run.
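
To illustrate the second part (avoiding repeated run ID searches), here is a minimal sketch, not the code in this PR. `RunIdResolver` and `search_fn` are hypothetical names; `search_fn` stands in for whatever rate-limited MLflow search call the integration makes.

```python
from typing import Callable, Optional


class RunIdResolver:
    """Hypothetical helper: look up an MLflow run ID at most once per process."""

    def __init__(
        self,
        search_fn: Callable[[str], Optional[str]],
        run_id: Optional[str] = None,
    ) -> None:
        # search_fn is an assumed stand-in for the rate-limited search request.
        self._search_fn = search_fn
        # A run ID passed in up front means no search is ever issued.
        self._run_id = run_id
        self.search_calls = 0

    def resolve(self, run_name: str) -> Optional[str]:
        # Search only while no run ID is known; once one is found it is reused.
        if self._run_id is None:
            self.search_calls += 1
            self._run_id = self._search_fn(run_name)
        return self._run_id
```

With this shape, an op that already knows its run ID passes it in and never triggers a search, while an op that does not know it pays for one search and reuses the result afterwards.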

How I Tested These Changes

I added unit tests. I also deployed a version locally and tested it with a sensor and jobs that I created to interface with mlflow. I tested both passing the run_id into the resource dynamically and letting the resource create it.

sryza

Thanks for the contribution @JPercivall. And for your patience in us getting to it.

A couple notes:

I unblocked the build; it looks like CI is complaining about a couple type-checking issues
We have a backoff function inside dagster._utils.backoff that we use for retries. Here's an example of how it's used: https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-aws/dagster_aws/ecs/launcher.py#L648.
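
For readers unfamiliar with the pattern, the following is a rough standard-library sketch of retrying with exponential backoff. It is not the `dagster._utils.backoff` implementation and does not claim to match its signature; `retry_with_backoff`, `delay_generator`, and `TooManyRequests` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Iterator, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the tracking server."""


def delay_generator(base: float = 0.1, factor: float = 2.0) -> Iterator[float]:
    # Yields base, base * factor, base * factor**2, ... without end.
    delay = base
    while True:
        yield delay
        delay *= factor


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 4,
    delays: Optional[Iterator[float]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # Call fn; on a listed exception, wait for the next delay and try again.
    # After max_retries retries the last exception is re-raised to the caller.
    delays = delays if delays is not None else delay_generator()
    retries = 0
    while True:
        try:
            return fn()
        except retry_on:
            if retries >= max_retries:
                raise
            retries += 1
            sleep(next(delays))
```

The `sleep` parameter is injectable only so the behavior can be exercised in tests without real waiting.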

Lastly, is my understanding accurate that there are a few separate changes in here?

Adding retry logic when searching for runs
Adding mlflow run ID
Adding mlruns to .gitignore

If we were making these changes, we would probably split them into separate PRs. Not required, but a suggestion to consider.

JPercivall · 2024-06-28T16:41:55Z

Hey @sryza! Thank you for unblocking the CI and I will get on fixing the type issues.

Ah, I knew y'all had imported tenacity but didn't realize you already had a backoff util. I will update.

Yeah, the changes are relatively distinct, but they all relate to the original #22633 issue, so I figured I'd put them in one. I'm happy to separate them into multiple PRs.

I should have these changes together by the end of next week.

Thanks!

fix: adding retries and workaround to mlflow integration to alleviate…

5d63ad8

… 429 issues (22633)

JPercivall mentioned this pull request Jun 26, 2024

mlflow integration causes 429 error responses from Databricks when running multiple parallel jobs/ops #22633

Open

garethbrickman linked an issue Jun 26, 2024 that may be closed by this pull request

mlflow integration causes 429 error responses from Databricks when running multiple parallel jobs/ops #22633

Open

sryza reviewed Jun 28, 2024
