RedshiftDataOperator fails when return_sql_result is true, and SQL statements are provided #40427

Open
1 of 2 tasks
thesuperzapper opened this issue Jun 26, 2024 · 1 comment
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@thesuperzapper
Contributor

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

Affects all current versions of apache-airflow-providers-amazon, including 8.24.0

Apache Airflow version

N/A - all versions

Operating System

NA

Deployment

Other

Deployment details

No response

What happened

RedshiftDataOperator fails when a list of multiple SQL statements is passed as sql and return_sql_result is set to True. The task raises the following error:

An error occurred (ValidationException) when calling the GetStatementResult operation: BatchExecuteStatement result can only be retrieved with sub-statement id.

What you think should happen instead

No response

How to reproduce

Run a RedshiftDataOperator like this:

from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator
run_query = RedshiftDataOperator(
    task_id="run_query",
    aws_conn_id="MY_AWS_CONNECTION",
    #
    # Redshift parameters
    cluster_identifier="MY_REDSHIFT_CLUSTER",
    region="MY_REGION",
    database="MY_DB",
    sql=[
        "SELECT 1;",
        "SELECT 2;",
        "SELECT 3;",
    ],
    #
    # Redshift Data API parameters
    return_sql_result=True,
    statement_name="SOME_NAME",
    secret_arn=(
        "arn:aws:secretsmanager"
        ":MY_REGION:MY_ACCOUNT"
        ":secret:MY_DATA_API_CREDENTIAL_SECRET"
    ),
    #
    # Trigger parameters
    wait_for_completion=True,
    poll_interval=10,
)

Anything else

To fix this, we need to record the length of the sql list and call get_statement_result once for each sub-statement ID, because you can only get one statement's result at a time.

For example, if there are three statements, you would get the results for the following sub-statement IDs, numbered starting from :1:

  • d9b6c0c9-0747-4bf4-b142-e8883122f766:1
  • d9b6c0c9-0747-4bf4-b142-e8883122f766:2
  • d9b6c0c9-0747-4bf4-b142-e8883122f766:3
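As a minimal sketch of the numbering scheme above, the sub-statement IDs could be derived from the parent statement ID and the number of statements. The helper name sub_statement_ids is hypothetical and is not part of the provider:

```python
def sub_statement_ids(statement_id: str, num_statements: int) -> list[str]:
    """Build the sub-statement IDs for a batch, numbered from :1.

    Hypothetical helper for illustration only; assumes the ID format
    "<statement_id>:<n>" shown in the example above.
    """
    return [f"{statement_id}:{n}" for n in range(1, num_statements + 1)]


if __name__ == "__main__":
    for sub_id in sub_statement_ids("d9b6c0c9-0747-4bf4-b142-e8883122f766", 3):
        print(sub_id)
```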

There are two different places where we call get_statement_result that need to be updated:

  1. For non-deferred mode, in execute()
  2. For deferred mode, in execute_complete()
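One possible shape for the shared logic that both call sites could use is sketched below. It is not the provider's implementation: the function name get_batch_results is hypothetical, and client is assumed to be a boto3 redshift-data client (or anything exposing the same describe_statement and get_statement_result calls). Instead of counting the sql list, this variant reads the sub-statement IDs from the SubStatements field of the describe_statement response, and it skips sub-statements that report no result set, which is an extra assumption beyond what the issue describes.

```python
from typing import Any


def get_batch_results(client: Any, statement_id: str) -> list[list[Any]]:
    """Return one list of records per sub-statement that has a result set.

    Sketch only: `client` is assumed to behave like a boto3
    "redshift-data" client.
    """
    description = client.describe_statement(Id=statement_id)
    sub_statements = description.get("SubStatements")

    if sub_statements:
        # Batch execution: results must be fetched per sub-statement ID.
        ids = [s["Id"] for s in sub_statements if s.get("HasResultSet")]
    else:
        # Single statement: the parent ID is the one to query.
        ids = [statement_id]

    results: list[list[Any]] = []
    for sub_id in ids:
        records: list[Any] = []
        kwargs: dict[str, Any] = {"Id": sub_id}
        while True:
            page = client.get_statement_result(**kwargs)
            records.extend(page.get("Records", []))
            token = page.get("NextToken")
            if not token:
                break
            # Follow pagination until the service stops returning a token.
            kwargs["NextToken"] = token
        results.append(records)
    return results
```

Reading the IDs from describe_statement avoids having to carry the statement count from execute() into execute_complete(), at the cost of one extra API call.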

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@thesuperzapper thesuperzapper added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jun 26, 2024
@vatsrahul1001 vatsrahul1001 added provider:amazon-aws AWS/Amazon - related issues good first issue and removed needs-triage label for new issues that we didn't triage yet labels Jun 26, 2024
@isatyamks

Hello @thesuperzapper,

I have made some changes to the RedshiftDataOperator to address the issue with handling multiple SQL statements. Please review my pull request #40443 and let me know if my approach is correct.
