Additional checks for vague date ranges required? #23

Open
sacrevert opened this issue Feb 7, 2020 · 12 comments

@sacrevert

sacrevert commented Feb 7, 2020

Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)

  • Should the automated aggregation process include some sort of flag for early records that are unlikely, in reality, to be resolved to a single year?
  • What checks could be done?
  • For example, it’s not clear to me why the GBIF record linked above has a date but also claims “no verbatim date data”; is this contradictory?
@qgroom
Contributor

qgroom commented Feb 7, 2020

I think there might be an issue already raised with GBIF related to this. Last time I checked they couldn't handle date ranges in the eventDate field.

@qgroom qgroom closed this as completed Feb 7, 2020
@sacrevert
Author

sacrevert commented Feb 7, 2020

The GBIF guidance suggests otherwise, unless you mean that there is currently a bug report open.
https://www.gbif.org/data-quality-requirements-occurrences#dcEventDate

Couldn't your process do some error checking to compare the interpreted and original event dates? For example, in the case above, there is clearly an error in the interpretation of the originally supplied date. Seems like a fairly important issue for modelling trends in IAS, which presumably the aggregated dataset is going to be used for.
[screenshot]
The checking could be conditional on the record date being assigned to the 1st January of any given year.
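Something along these lines, as a rough sketch in R (the eventDate column name and the 1950 cut-off year are just assumptions for illustration, not part of your pipeline):

```r
# Rough sketch: flag records whose interpreted date is 1 January of an early
# year, which is often the footprint of a truncated or unknown date range.
library(dplyr)
library(lubridate)

flag_suspect_dates <- function(occ, max_year = 1950) {
  occ %>%
    mutate(
      eventDate = as.Date(eventDate),
      suspect_date = !is.na(eventDate) &
        month(eventDate) == 1 &
        day(eventDate) == 1 &
        year(eventDate) <= max_year
    )
}
```

Flagged records could then go to a (small) manual review queue rather than straight into the cube.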

@qgroom
Contributor

qgroom commented Feb 7, 2020

We could, but I have a suspicion that the original might not be available in rgbif.

@qgroom qgroom reopened this Feb 7, 2020
@qgroom
Contributor

qgroom commented Feb 7, 2020

Closed in error

@sacrevert
Author

The simplest approach would be just to manually check any record resolved to a day where that day was Julian day 0. This would at least exclude the most egregious errors. Over the past 6 years at BRC I have never seen an automated pipeline that didn't benefit from some manual checks or intervention.

@qgroom
Contributor

qgroom commented Feb 8, 2020

I agree about manual checks, but we do need to keep this to a minimum for what we envision. In the case of Belgian data we are also the publishers of most of the data, so some problems can and should be fixed in the publication pipelines too.

@sacrevert
Author

Looks like the original data are available through the RESTful API at least: http://api.gbif.org/v1/occurrence/477065724/verbatim
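e.g. a quick sketch with httr and jsonlite (the verbatim response keys Darwin Core terms by their full URIs, if I'm reading it correctly):

```r
# Fetch the verbatim (as-published) record for one occurrence from the GBIF API
library(httr)
library(jsonlite)

key <- 477065724
resp <- GET(paste0("https://api.gbif.org/v1/occurrence/", key, "/verbatim"))
verbatim <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Publisher-supplied date range, e.g. "1700-01-01/2009-02-04"
verbatim[["http://rs.tdwg.org/dwc/terms/eventDate"]]
```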

@damianooldoni
Contributor

damianooldoni commented Feb 11, 2020

Thanks @sacrevert for your observation. Screening observations by querying the verbatim API endpoint is practically impossible when processing millions of occurrences, as it means millions of queries.

About the parsing of eventDate: the link you sent (dcEventDate) mentions the following:

For the levels of information that are unknown, avoid padding and instead end the value, to limit ambiguity of interpretation. If, for example, only year and month are known, represent this as 2016-04, not as 2016-04-01.

The eventDate "1700-01-01/2009-02-04" is correct according to the ISO standard. There are still parsing issues on the GBIF side.

Once the GBIF issue is solved, we can think about assigning the occurrence randomly to a specific year and adding the column min_date_uncertainty, something very similar to the processing of spatial information.
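A minimal sketch of what that could look like, assuming the verbatim eventDate comes through as an ISO 8601 interval (the function name here is just a placeholder):

```r
# Sketch: draw a random year within a verbatim date range and keep the range
# width (in days) as min_date_uncertainty, mirroring the spatial randomisation.
library(lubridate)

assign_random_year <- function(event_date_verbatim) {
  bounds <- as.Date(strsplit(event_date_verbatim, "/", fixed = TRUE)[[1]])
  years <- seq(year(bounds[1]), year(bounds[2]))
  list(
    year = years[sample.int(length(years), 1)],
    min_date_uncertainty = as.integer(difftime(bounds[2], bounds[1], units = "days"))
  )
}

assign_random_year("1700-01-01/2009-02-04")
```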

@peterdesmet
Member

Last time I checked they couldn't handle date ranges in the eventDate field.
...
there is clearly an error in the interpretation of the originally supplied date.

GBIF does now "handle" date ranges, by taking the first date of the range (see gbif/portal-feedback#652 (comment)). That is already an improvement over ignoring the date altogether, which was the case before.

@sacrevert
Author

It's up to you guys really; I was just pointing out that early dates resolved to single years are often wrong, and this was obvious within about 10 seconds of looking at your "occurrence cube". My personal opinion is that extremely vague dates should not be arbitrarily assigned to single years, particularly if one is ultimately going to be producing trends for policy or broader ecological use or interpretation. Either the records should be ignored, or presented with the full known range, so that later they can either be excluded or known to fall within a particular date range for modelling.

I suppose randomly assigning a year is one potential solution, although I would personally choose to exclude such data points, as they don't add any information and are liable to be misinterpreted by any uninitiated users downstream. This would assume that the dates were missing completely at random (in the statistical jargon), which is also unlikely, as the missingness is probably correlated with the true date of collection.

@peterdesmet
Member

@sacrevert completely agree. I just suggested to GBIF to flag such records, so we can exclude them in the future: gbif/gbif-api#4 (comment)

@damianooldoni
Contributor

damianooldoni commented Feb 11, 2020

Thanks @sacrevert for your comment. If these data are not that useful, adding a temporal uncertainty column just makes data processing more computationally demanding with no benefit for the researcher, and makes the output larger and less readable.
As soon as we can find a way to filter out these data quickly, I will be more than happy to make a new version of the occurrence cubes. I hope the flagging solution proposed by @peterdesmet will be accepted very soon.
