bioRxiv PDFs not being shown #426

Open
thewilkybarkid opened this issue Nov 15, 2021 · 14 comments
Labels
bug Something isn't working

Comments

@thewilkybarkid
Member

bioRxiv preprints (such as https://prereview.org/preprints/doi-10.1101-2021.11.10.468081) are showing the abstract text rather than the PDF, even though the license allows its use.

thewilkybarkid added the bug label on Nov 15, 2021
@thewilkybarkid
Member Author

thewilkybarkid commented Nov 15, 2021

Looking at

export async function resolvePreprint(
  handle: string,
): Promise<PreprintMetadata> {
  log.debug('Resolving preprint with handle:', handle);
  const isDoi = doiRegex().test(handle);
  const isArxiv = identifiersArxiv.extract(handle)[0];
  const resolvers = [];
  const baseUrlArxivHtml = 'https://arxiv.org/abs/';
  const baseUrlDoi = 'https://doi.org/';
  // as a last resort check Google Scholar
  resolvers.push(
    searchGoogleScholar(handle).catch(err =>
      log.error('Not found on Google Scholar: ', err),
    ),
  );
  // check crossref if nothing is found on official sites
  if (isDoi) {
    resolvers.push(
      searchCrossRef(handle).catch(err =>
        log.warn('Not found on CrossRef: ', err),
      ),
    );
  }
  // fetch data based on publication type (DOI / arXiv)
  if (isDoi || isArxiv) {
    // checks if the publication is DOI or arXiv
    let url: string, type: string;
    if (isDoi) {
      log.debug('Resolving preprint with a DOI');
      url = `${baseUrlDoi}${handle}`;
      type = 'doi';
    } else {
      log.debug('Resolving preprint with an arXivId');
      url = `${baseUrlArxivHtml}${handle}`;
      type = 'arxiv';
    }
    resolvers.push(
      scrapeUrl(url, handle, type).catch(err =>
        log.warn('No metadata found via scrape: ', err),
      ),
    );
  }
  const results = await Promise.all(resolvers);
  const metadata: PreprintMetadata = _.merge({}, ...results);
  log.debug('Finalized preprint metadata:', metadata);
  return metadata;
}
it tries to scrape the information from the preprint page itself, falls back to the Crossref API if that doesn't work, and as a last resort tries Google Scholar.
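For context on why the "last resort" Google Scholar result is pushed first: `_.merge` gives later sources precedence, so the scraped metadata (pushed last) wins for any overlapping fields. A minimal sketch, using illustrative field names rather than the real `PreprintMetadata` shape:

```ts
import _ from 'lodash';

// Later sources override earlier ones for overlapping fields, so the
// resolver pushed last (the page scrape) has the highest priority.
const fromGoogleScholar = { title: 'Title from Scholar' };
const fromCrossref = { title: 'Title from Crossref', abstract: '…' };
const fromScrape = { title: 'Title from the page', pdfUrl: 'https://example.com/preprint.pdf' };

console.log(_.merge({}, fromGoogleScholar, fromCrossref, fromScrape));
// => { title: 'Title from the page', abstract: '…', pdfUrl: 'https://example.com/preprint.pdf' }
```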

I've tried this preprint locally, and the bioRxiv page fails due to hitting a Cloudflare CAPTCHA. The Crossref API resolves, but it doesn't seem to contain information about the PDF.

@thewilkybarkid
Member Author

I can replicate seeing the CAPTCHA with `curl -L http://dx.doi.org/10.1101/2021.11.10.468081`

@thewilkybarkid
Member Author

The Crossref API entry is https://api.crossref.org/works/10.1101/2021.11.10.468081: it doesn't contain information about the PDF/HTML views, nor a means to access them.

Likewise, the bioRxiv API doesn't provide the details (https://api.biorxiv.org/details/biorxiv/10.1101/2021.11.10.468081). It does, however, return a link to the JATS XML version, which in turn has a broken link to the PDF.
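For reference, this is roughly how the bioRxiv details API can be queried; treat the `collection` and `jatsxml` field names as assumptions based on the response I'm seeing right now rather than a documented contract:

```ts
// Sketch only, assuming a fetch implementation is available (e.g. node-fetch).
// Returns the JATS XML link for the latest version of a bioRxiv preprint,
// or undefined if the API has nothing for the DOI.
async function biorxivJatsUrl(doi: string): Promise<string | undefined> {
  const response = await fetch(`https://api.biorxiv.org/details/biorxiv/${doi}`);
  if (!response.ok) {
    return undefined;
  }
  const body = await response.json();
  // `collection` appears to contain one entry per version, oldest first.
  const versions = body.collection ?? [];
  const latest = versions[versions.length - 1];
  return latest?.jatsxml;
}

// e.g. await biorxivJatsUrl('10.1101/2021.11.10.468081')
// The PDF link inside that JATS XML is the one that's currently broken.
```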

@thewilkybarkid
Member Author

I've just tried out the Google Scholar integration, and it can find links to PDFs... but there's presumably an indexing lag, so this particular preprint isn't available there yet.

@thewilkybarkid
Member Author

I'm also not sure if PREreview refreshes its local data (e.g. what happens with a bioRxiv preprint whose title changes in a later version?).

@thewilkybarkid
Member Author

Looks like Sciety had the same bioRxiv problem (sciety/sciety#1200).

@thewilkybarkid
Member Author

Europe PMC doesn't have the information available either (https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=DOI:10.1101/2021.11.10.468081&resultType=core&format=json); it even says the preprint isn't open access.

The Europe PMC page (https://europepmc.org/article/PPR/PPR419421) does have the PDF link, but it says it comes from Unpaywall.

@thewilkybarkid
Member Author

The Unpaywall API does have the PDF link: https://api.unpaywall.org/v2/10.1101/[email protected]

Looking at recent bioRxiv articles (i.e. ones first published yesterday or today), there is a bit of a lag between publication and appearing on Unpaywall (I think there's a delay between publication and availability on the Crossref API, and then in turn on the Unpaywall API).
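If we do end up using the Unpaywall API, the lookup itself is simple; a minimal sketch, assuming the v2 response shape I'm seeing (`best_oa_location.url_for_pdf`) and that we pass whatever contact email we register with Unpaywall:

```ts
// Sketch only: asks Unpaywall for the best open-access PDF link for a DOI.
// Unpaywall requires a contact email as a query parameter.
async function unpaywallPdfUrl(doi: string, email: string): Promise<string | undefined> {
  const response = await fetch(
    `https://api.unpaywall.org/v2/${doi}?email=${encodeURIComponent(email)}`,
  );
  if (!response.ok) {
    // e.g. a 404 while Unpaywall hasn't indexed the preprint yet
    return undefined;
  }
  const body = await response.json();
  return body.best_oa_location?.url_for_pdf ?? undefined;
}
```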

@thewilkybarkid
Member Author

The sample article is now appearing on Google Scholar, but only via the Europe PMC entry, which doesn't produce a PDF link.

@thewilkybarkid
Member Author

thewilkybarkid commented Nov 16, 2021

Options I can see:

One or more of:

  1. Ask bioRxiv to give PDF information to Crossref
  2. Ask bioRxiv to add PDF information to their API and use that
  3. Ask bioRxiv for a way past the bot detection
  4. Use the Unpaywall API (and accept the delay; see the sketch after this list)

Or:

  1. Do nothing.
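A rough idea of how option 4 could slot into `resolvePreprint` as an extra resolver. `searchUnpaywall` is a hypothetical helper that doesn't exist in the codebase yet (it would wrap the Unpaywall lookup sketched earlier), and the `contentUrl`/`contentEncoding` field names and `UNPAYWALL_EMAIL` constant are assumptions for illustration:

```ts
// Hypothetical helper: maps an Unpaywall lookup onto the metadata fields we
// could fill. UNPAYWALL_EMAIL stands in for whatever contact address we configure.
async function searchUnpaywall(doi: string): Promise<Partial<PreprintMetadata>> {
  const pdfUrl = await unpaywallPdfUrl(doi, UNPAYWALL_EMAIL);
  return pdfUrl ? { contentUrl: pdfUrl, contentEncoding: 'application/pdf' } : {};
}

// Inside resolvePreprint, pushed before scrapeUrl so that scraped metadata
// still takes precedence via _.merge when the bioRxiv page is readable:
if (isDoi) {
  resolvers.push(
    searchUnpaywall(handle).catch(err =>
      log.warn('Not found on Unpaywall: ', err),
    ),
  );
}
```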

@dasaderi
Member

I'd say 4 is the quickest option. Do you know what the delay is? And would this option allow for (re)fetching the PDF later when available, if it was first imported when not available?

@thewilkybarkid
Member Author

Do you know what the delay is?

Hard to know, and it probably changes per article. Right now (10:40 UTC, 17 November 2021), for the most recent article I can find available, everything happened on 16 November (times all UTC):

So in that case it's a matter of hours. But for the next article published:

it is not yet available on the Unpaywall API (https://api.unpaywall.org/v2/10.1101/[email protected] returns a not-found error).

From Sciety's experience, we know in some cases data doesn't appear in the Crossref API for quite a while (sciety/sciety#1199, sciety/sciety#664).

With this very limited data, I'd say between hours and days. (So quicker than Google Scholar, which IIRC can take weeks.)

The only guaranteed way to get the information is from the bioRxiv page itself, which machines currently aren't allowed to read.

And would this option allow for (re)fetching the PDF later when available, if it was first imported when not available?

I need to dig more into the code, but I suspect there isn't any re-fetching of information. If that's the case, it's possibly more valuable to add re-fetching first while asking bioRxiv for assistance.

@thewilkybarkid
Member Author

Looks like requesting the bioRxiv page is working again, so PDFs are appearing.

https://prereview.org/preprints/doi-10.1101-2021.11.10.468081 is still showing just the abstract, so there is work to do in re-fetching information.

@dasaderi
Member

Yay! Thanks @thewilkybarkid!
