Additional checks for vague date ranges required? #23

Open
sacrevert opened this issue Feb 7, 2020 · 12 comments

@sacrevert

sacrevert commented Feb 7, 2020

Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)

  • Should the automated aggregation process include some sort of flag for early records that are unlikely, in reality, to be resolved to a single year?
  • What checks could be done?
  • For example, it’s not clear to me why the GBIF record linked above has a date but also claims “no verbatim date data”; is this contradictory?
@qgroom
Contributor

qgroom commented Feb 7, 2020

I think there might be an issue already raised with GBIF related to this. Last time I checked they couldn't handle date ranges in the eventDate field.

@qgroom qgroom closed this as completed Feb 7, 2020
@sacrevert
Author

sacrevert commented Feb 7, 2020

The GBIF guidance suggests otherwise, unless you mean that there is currently a bug report open.
https://www.gbif.org/data-quality-requirements-occurrences#dcEventDate

Couldn't your process do some error checking to compare the interpreted and original event dates? For example, in the case above, there is clearly an error in the interpretation of the originally supplied date. Seems like a fairly important issue for modelling trends in IAS, which presumably the aggregated dataset is going to be used for.
[screenshot]
The checking could be conditional on the record date being assigned to the 1st January of any given year.
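Something along these lines, as a rough sketch in R (the eventDate column name and the 1950 cut-off year are just assumptions for illustration, not part of your pipeline):

```r
# Rough sketch: flag records whose interpreted date is 1 January of an early
# year, which is often the footprint of a truncated or unknown date range.
library(dplyr)
library(lubridate)

flag_suspect_dates <- function(occ, max_year = 1950) {
  occ %>%
    mutate(
      eventDate = as.Date(eventDate),
      suspect_date = !is.na(eventDate) &
        month(eventDate) == 1 &
        day(eventDate) == 1 &
        year(eventDate) <= max_year
    )
}
```

Flagged records could then go to a (small) manual review queue rather than straight into the cube.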

@qgroom
Contributor

qgroom commented Feb 7, 2020

We could, but I have a suspicion that the original might not be available in rgbif.

@qgroom qgroom reopened this Feb 7, 2020
@qgroom
Contributor

qgroom commented Feb 7, 2020

Closed in error

@sacrevert
Author

The simplest approach would be just to manually check any record resolved to a day where that day was Julian day 0. This would at least exclude the most egregious errors. Over the past 6 years at BRC I have never seen an automated pipeline that didn't benefit from some manual checks or intervention.

@qgroom
Contributor

qgroom commented Feb 8, 2020

I agree about manual checks, but we do need to keep this to a minimum for what we envision. In the case of Belgian data we are also the publishers of most of the data, so some problems can and should be fixed in the publication pipelines too.

@sacrevert
Author

Looks like the original data are available through the RESTful API at least: http://api.gbif.org/v1/occurrence/477065724/verbatim
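e.g. a quick sketch with httr and jsonlite (the verbatim response keys Darwin Core terms by their full URIs, if I'm reading it correctly):

```r
# Fetch the verbatim (as-published) record for one occurrence from the GBIF API
library(httr)
library(jsonlite)

key <- 477065724
resp <- GET(paste0("https://api.gbif.org/v1/occurrence/", key, "/verbatim"))
verbatim <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Publisher-supplied date range, e.g. "1700-01-01/2009-02-04"
verbatim[["http://rs.tdwg.org/dwc/terms/eventDate"]]
```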

@damianooldoni
Contributor

damianooldoni commented Feb 11, 2020

Thanks @sacrevert for your observation. Screening observations by querying the verbatim API endpoint is practically impossible when processing millions of occurrences, as it means millions of queries.

About the parsing of eventDate: the link you sent (dcEventDate) mentions the following:

For the levels of information that are unknown, avoid padding and instead end the value, to limit ambiguity of interpretation. If, for example, only year and month are known, represent this as 2016-04, not as 2016-04-01.

The eventDate "1700-01-01/2009-02-04" is correct according to the ISO standard. There are still parsing issues on the GBIF side.

Once the GBIF issue is solved, we can think about assigning the occurrence randomly to a specific year and adding the column min_date_uncertainty, something very similar to the processing of spatial information.
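A minimal sketch of what that could look like, assuming the verbatim eventDate comes through as an ISO 8601 interval (the function name here is just a placeholder):

```r
# Sketch: draw a random year within a verbatim date range and keep the range
# width (in days) as min_date_uncertainty, mirroring the spatial randomisation.
library(lubridate)

assign_random_year <- function(event_date_verbatim) {
  bounds <- as.Date(strsplit(event_date_verbatim, "/", fixed = TRUE)[[1]])
  years <- seq(year(bounds[1]), year(bounds[2]))
  list(
    year = years[sample.int(length(years), 1)],
    min_date_uncertainty = as.integer(difftime(bounds[2], bounds[1], units = "days"))
  )
}

assign_random_year("1700-01-01/2009-02-04")
```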

@peterdesmet
Member

Last time I checked they couldn't handle date ranges in the eventDate field.
...
there is clearly an error in the interpretation of the originally supplied date.

GBIF does now "handle" date ranges, by taking the first date of the range (see gbif/portal-feedback#652 (comment)). That is already an improvement over ignoring the date altogether, which was the case before.

@sacrevert
Author

It's up to you guys really; I was just pointing out that early dates resolved to single years are often wrong, and this was obvious within about 10 seconds of looking at your "occurrence cube". My personal opinion is that extremely vague dates should not be arbitrarily assigned to single years, particularly if one is ultimately going to be producing trends for policy or broader ecological use or interpretation. Either the records should be ignored, or presented with the full known range, so that later they can either be excluded or known to fall within a particular date range for modelling.

I suppose randomly assigning a year is one potential solution, although I would personally choose to exclude such data points, as they don't add any information and are liable to be misinterpreted by any uninitiated users downstream. This would assume that the dates were missing completely at random (in the statistical jargon), which is also unlikely, as the missingness is probably correlated with the true date of collection.

@peterdesmet
Member

@sacrevert completely agree. I just suggested to GBIF to flag such records, so we can exclude them in the future: gbif/gbif-api#4 (comment)

@damianooldoni
Contributor

damianooldoni commented Feb 11, 2020

Thanks @sacrevert for your comment. If these data are not that useful, adding a temporal uncertainty column just makes data processing more computationally demanding with no benefit for the researcher, and makes the output larger and less readable.
As soon as we can find a way to filter out these data quickly, I will be more than happy to make a new version of the occurrence cubes. I hope the flagging solution proposed by @peterdesmet will be accepted very soon.
