
Set up Continuous Integration for build scripts #63

Open
peterdesmet opened this issue Oct 22, 2020 · 4 comments


peterdesmet (Member) commented Oct 22, 2020

I'm starting this issue after chat discussions following @baskaufs talk in a TDWG 2020 session.

Thanks to @baskaufs we can now manage controlled vocabularies as CSV files, which are then transformed into the necessary files (JSON-LD, Markdown) using scripts. As far as I'm aware, those build scripts need to be run locally by someone (@baskaufs) and the output then pushed to [a server, the html directory, I'm not really sure] to become available. It would be nice if those scripts could run automatically (continuous integration) when commits are made (e.g. deploying to a staging server for non-master branches and to production for master).

@jmbarrios @mjy @MattBlissett any feedback on how to set this up best?
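For illustration, the branch-based trigger described above could be sketched as a GitHub Actions workflow. This is only a sketch: the script names, deploy commands, and requirements file below are placeholder assumptions, not the actual rs.tdwg.org build steps.

```yaml
# Hypothetical workflow; all commands are placeholders.
name: Build vocabulary files
on:
  push:
    branches: ['**']

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install -r requirements.txt
      # Placeholder for the CSV -> JSON-LD/Markdown transformation scripts
      - run: python build_vocabularies.py
      - name: Deploy to staging
        if: github.ref != 'refs/heads/master'
        run: ./deploy.sh staging      # placeholder deploy step
      - name: Deploy to production
        if: github.ref == 'refs/heads/master'
        run: ./deploy.sh production   # placeholder deploy step
```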

mjy commented Oct 22, 2020

@LocoDelAssembly inasmuch as we want to integrate the reference to terms in our importer, I'd be happy to have you prioritize a little time helping out here to build out the CI (if you want!).

Ping also @jlpereira, who might consume the outcome of automated efforts in our issues:
SpeciesFileGroup/taxonworks#1845
SpeciesFileGroup/taxonworks#1766

jmbarrios (Member) commented

I'm not fully aware of which steps need to be executed to generate the Markdown files.

Normally, you can attach a Jenkins server to execute those deploy tasks.

Could you explain the build process to me?
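The Jenkins approach mentioned above could be sketched as a minimal declarative pipeline. Again, this is hypothetical: the stage commands are placeholders, not the actual TDWG build or deploy tasks.

```groovy
// Hypothetical Jenkinsfile; commands are placeholders.
pipeline {
  agent any
  stages {
    stage('Build') {
      steps {
        // Placeholder: CSV -> JSON-LD/Markdown transformation
        sh 'python build_vocabularies.py'
      }
    }
    stage('Deploy') {
      when { branch 'master' }
      steps {
        sh './deploy.sh production'  // placeholder deploy task
      }
    }
  }
}
```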

MattBlissett (Member) commented

The build uses GBIF's continuous integration server, and an overview of the process is given here: https://github.com/tdwg/rs.tdwg.org/blob/master/DEPLOYMENT.md#deployment

It's building the master branch and deploying it to http://rs-test.tdwg.org. If that's not sufficient, we'll need to think about what is needed -- it does need to be clear to users what is deployed to http://rs-test.tdwg.org or similar.

baskaufs (Contributor) commented

I'm in favor of doing this in the long run, but in the short term I don't think it's ready for prime time. There are several issues:

  1. There are currently multiple permutations of what needs to happen depending on whether the vocabulary is new vs. a modification, a TDWG-minted vocabulary vs. borrowed terms, etc. Right now the variations are handled by a human running different cells in a Jupyter notebook; the script would need to be modified to accept configuration options from the command line to select which parts to run. Not a huge problem, but it hasn't been done yet.
  2. One of the reasons for running the script locally is to be able to use Git's diff functions to make sure that the script is outputting the right content. Initially this was done to detect bugs in the script, although I've run the script enough times that most or all of them are probably gone. However, it also catches bad output caused by bad input. There is no validation run on the spreadsheets that people supply to make changes to the vocabularies, so it's pretty easy for people to make mistakes like shifting cells into the wrong column, including content that isn't supposed to be there, failing to set the configuration correctly for the type of spreadsheet, etc. So it would probably be pretty dangerous to just feed in spreadsheets and let them flow through the whole process without some intermediate quality control checks.
  3. The situation is complicated somewhat by the fact that slightly different inputs are required depending on whether it's just a new term, a new namespace, or an entirely new vocabulary.
  4. There is also a whole part of the process that isn't scripted yet: updating the metadata for documents associated with standards. Right now, that is also done by hand. Eventually this needs to be scripted in a way similar to how it's done for the vocabularies themselves, but that hasn't happened yet.
  5. The other complication is that the metadata side of things is handled in the rs.tdwg.org repo, but the actual standards documents are managed by the separate maintenance groups (Audubon Core and Darwin Core at the moment, but probably others in the future). So the actual document build scripts are in the ac and dwc repos, not rs.tdwg.org.
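The intermediate quality control described in point 2 could eventually be automated as a pre-build check. A minimal sketch, assuming hypothetical column names and rules (the real rs.tdwg.org term lists use their own headers):

```python
import csv

# Hypothetical required columns; placeholders, not the actual schema.
REQUIRED_COLUMNS = {"term_localName", "label", "definition"}

def validate_term_csv(path):
    """Return a list of error messages for obvious spreadsheet mistakes."""
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            errors.append(f"missing columns: {sorted(missing)}")
            return errors
        for i, row in enumerate(reader, start=2):  # row 1 is the header
            for col in REQUIRED_COLUMNS:
                if not (row[col] or "").strip():
                    errors.append(f"row {i}: empty {col}")
            if row.get(None):  # extra cells shifted past the last column
                errors.append(f"row {i}: more cells than header columns")
    return errors
```

A CI job could run a check like this on every pull request and fail before the build scripts ever see a malformed spreadsheet, catching the shifted-cell and missing-value errors mentioned above.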

There are probably more things that I'm not thinking of. None of these are insurmountable obstacles, but this hopefully gives you an idea why it probably isn't feasible to set up continuous integration right now.

I will definitely be thinking about how to streamline the process to try to get us to the point where the human-mediated steps are taken out of the system. They are where nearly all of the errors are occurring at this point. So it's great to know that y'all are willing to help with this as it becomes more feasible in the future.
