Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce header section to toml workflows #260

Open
MartinKl opened this issue Jun 26, 2024 · 2 comments
Open

Introduce header section to toml workflows #260

MartinKl opened this issue Jun 26, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@MartinKl
Copy link
Collaborator

There should be an optional header section, that (potentially) holds the following information:

  • which version of annatto the workflow was successfully run with. This would require a write-back to the file in case we want annatto to set this automatically. Technically it could also be a key only to be set by users, since a successful run does not imply that the workflow produces the desired output across all versions
  • the memory switch (ANNATTO_IN_MEMORY is currently an environment variable) could be placed there. There are pros and cons: On the one hand, this would certainly be more user-friendly, but could on the other hand make workflows less portable for large corpora, i. e. another user running the same workflow on the same data might not have the same memory available to run the process in it.
@MartinKl MartinKl added the enhancement New feature or request label Jun 26, 2024
@thomaskrause
Copy link
Member

thomaskrause commented Jun 26, 2024

I would argue that the ANNATTO_IN_MEMORY needs more discoverability, but this can also be achieved by adding it as a command line argument in addition to using the environment variable. This would make it appear in the --help description, and we could add a good explanation of what it does. The clap command line argument parser supports the env field that could be added to the argument.

// [...]
#[clap(env="ANNATTO_IN_MEMORY")]
use_memory: bool,
// [...]

See also the builder API for env

We default to on disk mode because even if this can be much slower, we want to make sure the conversion does not fail. A "hint" in the header section would probably mean something like "this is a small corpus and would most likely fit in your main memory". But since we know nothing about the environment the workflow is run (could be even a very resource limited CI job) we should not include this in the workflow in my opinion.

There is however the downside, that there could be a similar situation like in Rust with the --release compile flag, where people discover that their programs get magically faster by applying it. So we really have to work on the discoverability, e.g. with the flag and/or with a troubleshooting guide. Also, the idea of "hints" in addition to "warnings" in the console output is flowing in my head. If annatto discovers that something could work better (like running faster because the corpus is small), we could output "hints" to the user. Warnings would indicate that something is wrong (which it isn't) and would be the wrong way to approach this. But "hints" could help with discoverability when workflows and annatto become more complex.

@MartinKl
Copy link
Collaborator Author

I agree, just one hint:

As #228 is unresolved, running a workflow on disk can provide incorrect / incomplete output data. That would weaken the argument for defaulting to disk mode, which I nevertheless agree it should be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants