Objects count mismatches during listing & upon migration on follower site #29

Open
saif-0987 opened this issue Jun 2, 2024 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@saif-0987

saif-0987 commented Jun 2, 2024

We are evaluating the Chorus tool for migration in our infrastructure and encountered an issue: the number of objects reported on the main storage differs from the count on the follower storage.
Our initial assessment showed that while the main storage contained a total of 433,773 objects, Chorus listed only 433,673. We confirmed this with the chorctl dash command, and only 433,610 objects appeared on the follower storage site.

Additionally, we observed a discrepancy in bucket size after the replication jobs to the follower site completed. Our bucket was originally reported as 112.2 gibibytes (GiB) in size. After replication completed, the size at the follower site was only 108.9 GiB.
We tried a couple of buckets and got the same result: the object count mismatches during listing, and the size differs once the job completes.
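
For reference, the gaps implied by the figures above can be worked out as follows. This is only a small arithmetic sketch; the variable names are illustrative and not part of Chorus.

```python
# Figures quoted in the report above.
main_objects = 433_773       # objects reported on the main storage
chorus_listed = 433_673      # objects listed by Chorus (chorctl dash)
follower_objects = 433_610   # objects seen on the follower storage
main_gib = 112.2             # bucket size on the main storage, GiB
follower_gib = 108.9         # bucket size on the follower storage, GiB

# Objects on the main storage that Chorus did not list.
missing_from_listing = main_objects - chorus_listed
# Objects Chorus listed that did not show up on the follower.
missing_after_copy = chorus_listed - follower_objects
# Total object gap between main and follower.
total_object_gap = main_objects - follower_objects
# Size gap, rounded to the precision of the reported figures.
size_gap_gib = round(main_gib - follower_gib, 1)

print(missing_from_listing, missing_after_copy, total_object_gap, size_gap_gib)
```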

At present, we're uncertain about the best course of action to address these discrepancies.

Regards,
Mohammad Saif

Please find a couple of screenshots below for reference:

Objects count at main storage:
image

During Migration:
image

Main storage site:
image

Follower storage site:
image

Upon completion:
image

Objects count on follower site:
image

@saif-0987 changed the title from "Objects count mismatch while listing on Main storage" to "Objects count mismatch while listing & upon migration on follower site" on Jun 2, 2024
@saif-0987 changed the title from "Objects count mismatch while listing & upon migration on follower site" to "Objects count mismatchs during listing & upon migration on follower site" on Jun 2, 2024
@saif-0987 changed the title from "Objects count mismatchs during listing & upon migration on follower site" to "Objects count mismatches during listing & upon migration on follower site" on Jun 2, 2024
@arttor added the bug (Something isn't working) label on Jun 3, 2024
@arttor self-assigned this on Jun 3, 2024
@arttor
Collaborator

arttor commented Jun 3, 2024

@saif-0987 thank you for reporting the issue. I have marked it as a high-priority bug and will try to reproduce and fix it ASAP.

@arttor
Collaborator

arttor commented Jun 4, 2024

@saif-0987 unfortunately, I was not able to reproduce the issue.
I tried running migrations with 100k+ objects using both a fake S3 and a real RGW, with a flat bucket and with buckets containing "folders".
Can you please provide more details?

For example, is there a chance that the Chorus user doesn't have permission to read some of the objects, or are there any objects with zero size?

Can you please check whether there are any errors in the Chorus worker logs, or whether the skipped objects have anything in common? You can find missing objects with the rclone check command.

@saif-0987
Author

saif-0987 commented Jun 5, 2024

@arttor, thank you for your response. We used the warp tool to generate the objects, which are stored both at the top level of the bucket and under prefixes/folders. We also pushed some objects using s3cmd.
We didn't apply any permissions or bucket policies to this bucket, and none of the objects appear to have zero size.

Unfortunately, we could not find any errors in the Chorus worker logs.
However, when we execute the following command, it throws a "Resources exhausted" error.

image

Please let us know if you need any further information; we would be happy to help.

Regards,
Mohammad Saif

@arttor
Collaborator

arttor commented Jun 5, 2024

@saif-0987 thank you. Can you please check for missing objects with rclone?

rclone check <src>:bkt1 <dest>:bkt1 --size-only --one-way --missing-on-dst - --error -
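
Conceptually, a one-way, size-only check compares each source key against the destination and reports keys that are missing or whose size differs. The sketch below illustrates that idea on two plain dictionaries; it is not rclone's implementation, and the function name and sample listings are hypothetical.

```python
def one_way_size_check(src_sizes, dst_sizes):
    """Compare two {key: size} listings in one direction (source -> destination).

    Returns (missing_on_dst, size_differs), each a sorted list of keys.
    Keys that exist only on the destination are ignored, as in a one-way check.
    """
    missing_on_dst = sorted(k for k in src_sizes if k not in dst_sizes)
    size_differs = sorted(
        k for k, size in src_sizes.items()
        if k in dst_sizes and dst_sizes[k] != size
    )
    return missing_on_dst, size_differs


# Hypothetical listings; in practice they would come from listing each bucket.
src = {"a.txt": 10, "dir/b.bin": 2048, "empty": 0}
dst = {"a.txt": 10, "dir/b.bin": 1024, "extra": 5}

missing, differs = one_way_size_check(src, dst)
print(missing)  # ['empty']
print(differs)  # ['dir/b.bin']
```

Note that a comparison of this kind only sees what an object listing returns, so anything that is counted in bucket statistics but is not a listed object would not show up as a difference.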

@saif-0987
Author

saif-0987 commented Jun 6, 2024

@arttor Thanks. We executed the above command, and it reports no difference between the source and destination.

image

But in our case:

  • The bucket size differs between the two sides.
  • The number of objects differs between the two sides, and also on the Chorus dashboard.

@arttor
Collaborator

arttor commented Jun 6, 2024

Now this is really strange. Are you using the same credentials in Chorus and rclone? If both tools use the same credentials and rclone says the buckets are equal, then the objects from the source RGW that were not synced are probably not accessible to this user.

@saif-0987
Author

@arttor Yes, we are using the same credentials for both chorus and rclone.

@arttor
Collaborator

arttor commented Jun 11, 2024

@saif-0987 can you please try again with the v0.5.5 release?
There was an issue with listing zero-size objects (fixed in #33).

@saif-0987
Author

@arttor Thanks for the response. We will try the v0.5.5 release and report back.

@saif-0987
Author

saif-0987 commented Jun 15, 2024

@arttor We tried the https://github.com/clyso/chorus/releases/tag/v0.5.5 release, but we still see discrepancies in object counts. This time, however, the object count listed by Chorus matched what appeared on the follower site, whereas previously these two also differed.
Additionally, the reported sizes still differ between the two sites, as we saw earlier.

Main site:
image
image

Follower Site:

image
image

chorctl dash :
image

@arttor
Collaborator

arttor commented Jun 17, 2024

@saif-0987 unfortunately, I am still not able to reproduce this issue. Is it possible to find out which objects were not synced?

@arttor
Collaborator

arttor commented Jun 19, 2024

@saif-0987 radosgw-admin shows incomplete multipart uploads as separate objects in its statistics. Each completed part of an incomplete upload is counted as an object and increases the used bytes.
Please check whether you have incomplete multipart uploads in your source bucket. You can use s3cmd or another S3 client for this:

s3cmd multipart s3://bkt1

If you delete the incomplete uploads from the source, the object and byte counts should become the same in radosgw-admin and Chorus. This also explains why rclone says that the buckets are equal.
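
To estimate how much of the gap such uploads account for, the parts of every unfinished upload can be counted and summed through the S3 API. The sketch below is one possible way to do that, assuming a boto3-style S3 client (the `list_multipart_uploads` and `list_parts` paginators); the function name, endpoint URL, and bucket name are hypothetical, and the result is only an estimate of what the bucket statistics include.

```python
def incomplete_multipart_usage(client, bucket):
    """Return (part_count, total_bytes) for parts of unfinished multipart uploads.

    `client` is expected to behave like a boto3 S3 client.
    """
    part_count = 0
    total_bytes = 0
    uploads = client.get_paginator("list_multipart_uploads")
    parts = client.get_paginator("list_parts")
    for page in uploads.paginate(Bucket=bucket):
        # Pages with no unfinished uploads have no "Uploads" key.
        for upload in page.get("Uploads", []):
            for parts_page in parts.paginate(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            ):
                for part in parts_page.get("Parts", []):
                    part_count += 1
                    total_bytes += part["Size"]
    return part_count, total_bytes


# Example usage (assumes boto3 is installed; endpoint and bucket are placeholders):
# import boto3
# s3 = boto3.client("s3", endpoint_url="https://rgw.example.com")
# print(incomplete_multipart_usage(s3, "bkt1"))
```

If the returned part count and byte total are close to the differences seen in the statistics, that supports the explanation above.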

@saif-0987
Author

saif-0987 commented Jun 19, 2024

@arttor Thanks for your support. It appears that we have successfully resolved the issue. After deleting the multipart uploads from the source site, we checked the bucket size and the number of objects. Now the object count differs by only 1, and the bucket sizes match on both sides.
The output is below.

Main:
image
image

Follower:
image
image

@arttor
Collaborator

arttor commented Jun 19, 2024

@saif-0987 I don't think I can help further without knowing which object was not copied. If the sizes are the same, you should look for an object with zero size.

@arttor
Collaborator

arttor commented Jul 3, 2024

@saif-0987 do you have any updates, or does it make sense to close the issue?

@saif-0987
Author

@arttor Actually, I was trying to identify the objects that have zero size, but have not been able to so far. rclone didn't work for this, so I am now trying the S3 API. We will get back to you shortly.
Also, please let us know if you can help me find them.

Thanks
Mohammad Saif
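
One possible way to look for such objects through the S3 API is to list both buckets and keep the zero-byte source keys that are absent on the follower. The sketch below assumes boto3-style S3 clients (the `list_objects_v2` paginator); the function names are hypothetical and this is not something Chorus provides.

```python
def list_sizes(client, bucket):
    """Return {key: size} for every object in `bucket` (boto3-style S3 client)."""
    sizes = {}
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket yields a page without a "Contents" key.
        for obj in page.get("Contents", []):
            sizes[obj["Key"]] = obj["Size"]
    return sizes


def zero_size_missing_on_dst(src_client, dst_client, bucket):
    """Return sorted zero-byte source keys that are absent on the destination."""
    src = list_sizes(src_client, bucket)
    dst = list_sizes(dst_client, bucket)
    return sorted(k for k, size in src.items() if size == 0 and k not in dst)


# Example usage (assumes boto3 is installed; endpoints and bucket are placeholders):
# import boto3
# main = boto3.client("s3", endpoint_url="https://main.example.com")
# follower = boto3.client("s3", endpoint_url="https://follower.example.com")
# print(zero_size_missing_on_dst(main, follower, "bkt1"))
```

Dropping the `size == 0` condition would report every source key missing on the follower, which may be useful if the missing object turns out not to be empty.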
