Add logic to skip already-indexed files #5

Open
allaway opened this issue Jul 12, 2024 · 1 comment

Comments

allaway commented Jul 12, 2024

A major use case of mine for this workflow is indexing files in non-Tower buckets that are attached to Synapse. A workflow like this is much more time-efficient and needs less babysitting than indexing on a single EC2 instance, especially as datasets get large (>1 TB).

However, I've occasionally had to re-run this workflow on the same bucket after additional data was added. Each re-run re-downloads and re-indexes the entire bucket. It would be helpful to have one or both of the following features to make this more time- and cost-efficient:

  • add a modified/created date parameter that skips re-indexing any files dated before the value entered in the param
  • add an option to skip any S3 keys that are already indexed in the target Synapse project/folder
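
A rough sketch of how the two options above could combine into a single filter step. This is not the workflow's actual code: `S3Object`, `select_objects_to_index`, `modified_after`, and `indexed_keys` are hypothetical names used for illustration, and it assumes the bucket listing (key plus last-modified timestamp) and the set of keys already present in the Synapse project/folder have been fetched beforehand.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class S3Object:
    """Hypothetical record for one listed object: its key and last-modified time."""
    key: str
    last_modified: datetime


def select_objects_to_index(
    objects: Iterable[S3Object],
    modified_after: Optional[datetime] = None,
    indexed_keys: Optional[Set[str]] = None,
) -> List[S3Object]:
    """Return only the objects that still need to be downloaded and indexed.

    modified_after: if given, skip objects last modified before this date.
    indexed_keys: if given, skip objects whose key is already indexed.
    Both filters are optional, so either feature can be used on its own.
    """
    already_indexed = indexed_keys or set()
    selected = []
    for obj in objects:
        if modified_after is not None and obj.last_modified < modified_after:
            continue  # older than the cutoff date, assume it was indexed earlier
        if obj.key in already_indexed:
            continue  # key already present in the target project/folder
        selected.append(obj)
    return selected


# Example with made-up keys and dates.
listing = [
    S3Object("data/a.bam", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    S3Object("data/b.bam", datetime(2024, 6, 1, tzinfo=timezone.utc)),
    S3Object("data/c.bam", datetime(2024, 7, 1, tzinfo=timezone.utc)),
]
to_index = select_objects_to_index(
    listing,
    modified_after=datetime(2024, 5, 1, tzinfo=timezone.utc),
    indexed_keys={"data/b.bam"},
)
print([obj.key for obj in to_index])  # ['data/c.bam']
```

Filtering on the listing before any download is what would save the time and cost here, since skipped keys are never fetched from the bucket.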
Collaborator

BWMac commented Jul 15, 2024

Hi @allaway, just linking the Jira ticket we have tracking this issue: https://sagebionetworks.jira.com/browse/IBCDPE-692
