stop rotating on partition change when rotate.interval.ms is set #715

Open
wants to merge 1 commit into base: master


@schizhov schizhov commented Feb 6, 2024

Problem

Committing open files on partition change creates a lot of small files when records belonging to different partitions are interleaved. We have a use case where we aggregate raw events into sessions spanning 5-15 minutes, with the session time being the time of the first event in the session. We use hourly partitioning and observe up to a 10x increase in the number of files per hour because of this.
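To illustrate the effect, here is a minimal, self-contained sketch. It is not connector code: the class, the method names, and the hourly partition helper are hypothetical, and it only counts how many files would be committed if the open file is closed every time the partition of the incoming record differs from the previous one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; none of these names exist in kafka-connect-s3.
final class PartitionChangeRotationDemo {
    static final long HOUR_MS = 3_600_000L;

    // Assumption: hourly partitioning derived directly from the record timestamp.
    static long hourlyPartition(long timestampMs) {
        return Math.floorDiv(timestampMs, HOUR_MS);
    }

    // Counts files when the open file is committed on every partition change.
    static int filesWhenRotatingOnPartitionChange(List<Long> timestampsMs) {
        int files = 0;
        Long current = null;
        for (long ts : timestampsMs) {
            long partition = hourlyPartition(ts);
            if (current == null || current != partition) {
                files++;            // a new file is started for this run of records
                current = partition;
            }
        }
        return files;
    }

    // Lower bound: one file per partition that actually received records.
    static int distinctPartitions(List<Long> timestampsMs) {
        Set<Long> partitions = new HashSet<>();
        for (long ts : timestampsMs) {
            partitions.add(hourlyPartition(ts));
        }
        return partitions.size();
    }

    public static void main(String[] args) {
        // Records alternating between hour 10 and hour 11.
        List<Long> interleaved = Arrays.asList(
            10 * HOUR_MS + 1, 11 * HOUR_MS + 1,
            10 * HOUR_MS + 2, 11 * HOUR_MS + 2,
            10 * HOUR_MS + 3, 11 * HOUR_MS + 3);
        System.out.println(filesWhenRotatingOnPartitionChange(interleaved));
        System.out.println(distinctPartitions(interleaved));
    }
}
```

For the six alternating records in `main`, rotating on partition change yields 6 files, while only 2 partitions are involved.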

The issue has been reported multiple times previously:

And even had an attempted fix:

Solution

Removing rotation on partition changes makes the semantics of rotate.interval.ms similar to those of flush.size.
It now defines a constraint not for a single file, but for a "segment" of a stream:
records are accumulated in their respective partitions until the partition time advances at least rotate.interval.ms past the timestamp of the first message in the "segment", at which point all files are flushed.
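The "segment" semantics described above can be sketched roughly as follows. This is a simplified illustration, not the patched connector code: the class and method names are hypothetical, the partition time is assumed to be the record timestamp, partitioning is assumed to be hourly, and each open file is modeled only as a record count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in kafka-connect-s3.
final class SegmentRotationSketch {
    private static final long HOUR_MS = 3_600_000L;

    private final long rotateIntervalMs;
    // One open "file" per partition, modeled as partition -> record count.
    private final Map<Long, Integer> openFiles = new HashMap<>();
    // Number of files written by each flush, in flush order.
    private final List<Integer> flushedFileCounts = new ArrayList<>();
    // Timestamp of the first record in the current segment, or null if none.
    private Long segmentStartMs = null;

    SegmentRotationSketch(long rotateIntervalMs) {
        this.rotateIntervalMs = rotateIntervalMs;
    }

    void write(long timestampMs) {
        // Rotate only when time has advanced far enough from the segment start,
        // never merely because the record belongs to a different partition.
        if (segmentStartMs != null && timestampMs - segmentStartMs >= rotateIntervalMs) {
            flushAll();
        }
        if (segmentStartMs == null) {
            segmentStartMs = timestampMs;
        }
        openFiles.merge(Math.floorDiv(timestampMs, HOUR_MS), 1, Integer::sum);
    }

    // Flushes every open file at once and starts a new segment.
    void flushAll() {
        if (!openFiles.isEmpty()) {
            flushedFileCounts.add(openFiles.size());
        }
        openFiles.clear();
        segmentStartMs = null;
    }

    int openFileCount() {
        return openFiles.size();
    }

    List<Integer> flushedFileCounts() {
        return flushedFileCounts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        SegmentRotationSketch sketch = new SegmentRotationSketch(HOUR_MS);
        // 10:50, 11:05, 10:55, 11:10 interleave two hourly partitions; 11:55 ends the segment.
        long[] minutesOfDay = {650, 665, 655, 670, 715};
        for (long m : minutesOfDay) {
            sketch.write(m * minute);
        }
        System.out.println(sketch.flushedFileCounts());
        System.out.println(sketch.openFileCount());
    }
}
```

In this sketch the four interleaved records end up in two files that are flushed together once a record arrives at least one interval after the first one, instead of producing a new file at every partition switch.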

Testing

We have been running the patched version in our staging environment for more than a week now with constant consistency checks and have not seen any issues with either the number of files per hour or the consistency of the results.

Finally, the documentation for 'rotate.interval.ms' might need to be adjusted. I would appreciate any advice on how to do that.

Rotating on partition change creates a lot of small files when record time is not monotonic.
@schizhov schizhov requested a review from a team as a code owner February 6, 2024 17:14

cla-assistant bot commented Feb 6, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


schizhov commented Feb 7, 2024

Jenkins build complains about missing 204-ccs artifacts, which I don't see in the dependency tree locally:

[ERROR] Failed to execute goal on project kafka-connect-s3: Could not resolve dependencies for project io.confluent:kafka-connect-s3:jar:10.6.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.kafka:connect-runtime:jar:test:7.6.0-204-ccs, org.apache.kafka:kafka_2.12:jar:7.6.0-204-ccs, org.apache.kafka:kafka_2.12:jar:test:7.6.0-204-ccs: org.apache.kafka:connect-runtime:jar:test:7.6.0-204-ccs was not found in https://s3-us-west-2.amazonaws.com/confluent-snapshots/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of confluent-snapshots has elapsed or updates are forced -> [Help 1]

I would appreciate any guidance on how to fix this.
