This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Investigate how to follow a link in a job-toot and index the body of that link as well #7

Open
berkes opened this issue Aug 9, 2022 · 0 comments

berkes commented Aug 9, 2022

A possible candidate for the link to follow is the toot's "card", when one is present.

We'd need

  • A sane timeout, to avoid hanging when the host of the vacancy is unavailable or blocking us.
  • Content-type checking: plain text and HTML only. PDF support can come later. Anything else should be discarded.
  • A length check. Anything longer than X bytes should be chopped off. 500 kB? The timeout will catch many of these too, but a very fast host might still serve us megabytes that we then choke on.
  • A sanitizer or semantic text analyzer, so we can parse HTML in a somewhat sane way and remove things like menus, footers and sidebars. What FLOSS options are there for this?
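
To make the four requirements above concrete, here is a rough sketch in Python using only the standard library. It is not the project's implementation. Every name in it (`fetch_body`, `extract_text`, `media_type`, the constants, the `User-Agent` string) is hypothetical, the 5 second timeout is an assumed value, and the tag-based filter is a crude stand-in for a real sanitizer or readability-style extractor.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

TIMEOUT_SECONDS = 5                 # assumed value; the issue only asks for a "sane" timeout
MAX_BYTES = 500 * 1024              # the 500 kB cap floated in the issue
ALLOWED_TYPES = {"text/plain", "text/html"}   # PDF deliberately left out for now
# Assumption: page chrome lives in these elements; real pages are often messier.
SKIPPED_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


def media_type(content_type_header):
    """Reduce 'text/html; charset=utf-8' to 'text/html'; a missing header gives ''."""
    return (content_type_header or "").split(";", 1)[0].strip().lower()


class MainTextExtractor(HTMLParser):
    """Collect text nodes, ignoring everything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document without menus, footers, sidebars."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_body(url):
    """Fetch a linked vacancy page; return indexable text, or None if the type is unsupported."""
    request = Request(url, headers={"User-Agent": "job-indexer-sketch"})
    with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        kind = media_type(response.headers.get("Content-Type"))
        if kind not in ALLOWED_TYPES:
            return None                      # discard anything that is not TXT or HTML
        raw = response.read(MAX_BYTES)       # chop off everything past the cap
    # Assumption: UTF-8; a real version would honour the declared charset.
    text = raw.decode("utf-8", errors="replace")
    return extract_text(text) if kind == "text/html" else text
```

Two caveats on this sketch: the `timeout` passed to `urlopen` applies to individual blocking socket operations rather than to the whole download, so a host that trickles data slowly would need an additional overall deadline, and cutting the body at a byte limit can leave truncated HTML, which `html.parser` tolerates but which may lose the tail of the vacancy text.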