Global deduplication for specific URLs #443

Open
JustAnotherArchivist opened this issue May 20, 2020 · 3 comments

Comments

@JustAnotherArchivist
Contributor

While global deduplication for everything in ArchiveBot is not feasible, we should consider adding it for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely, and yet are regrabbed needlessly and repeatedly. Two examples come to mind:

  • CBC radio recordings/podcasts: ignore ^https?://mp3\.cbc\.ca/ and ^https?://podcast-a\.akamaihd\.net/mp3/ (pending further investigation into whether the latter also hosts non-CBC content)
  • Fast Company videos: ignore ^https?://content\.jwplatform\.com/videos/

Currently, these ignores are typically added manually when someone notices them. I know we've grabbed some of those URLs thousands of times, while others were never covered before. Because the content on these hosts doesn't change over time, ignoring them if they've ever been grabbed before by some AB job should be fine. However, job starting URLs should not be checked against the dedupe list so that they can be saved again if needed – specifically, this means that URL table entries with level = 0 would always be retrieved.

An implementation would probably keep the dedupe DB and the list of URL patterns to be checked against it on the control node. The latter is pushed to the pipelines (and updated if it changes), and the pipeline then queries the DB on encountering a matching URL. Still to be decided is whether the pipeline should be able to add entries to the DB directly or whether they should come from the CDXs in the AB collection. The latter is more trustworthy (and also covers the unfortunate case where archives are lost between retrieval and IA upload) but adds a delay which can still lead to repeated retrieval. Alternatively, pipelines could add a temporary entry which gets dropped after a few days if it isn't confirmed by the CDXs.
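For illustration, a minimal sketch of the pipeline-side check, assuming a hypothetical HTTP dedupe service on the control node (the endpoint, parameter names, and the `seen` response field are made up for this example and are not part of ArchiveBot):

```python
import re
import requests  # assumed available on the pipeline

# Patterns pushed from the control node; refreshed whenever they change there.
DEDUPE_PATTERNS = [
    re.compile(r'^https?://mp3\.cbc\.ca/'),
    re.compile(r'^https?://content\.jwplatform\.com/videos/'),
]

DEDUPE_DB_ENDPOINT = 'http://control-node.example/dedupe'  # hypothetical


def should_skip(url, level):
    """Return True if this URL should be skipped as a global duplicate.

    Job starting URLs (level == 0) are never skipped so that they can
    always be re-archived on request.
    """
    if level == 0:
        return False
    if not any(p.search(url) for p in DEDUPE_PATTERNS):
        return False
    # Ask the control node whether any previous AB job already grabbed this URL.
    resp = requests.get(DEDUPE_DB_ENDPOINT, params={'url': url}, timeout=10)
    resp.raise_for_status()
    return resp.json().get('seen', False)
```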

@JustAnotherArchivist
Contributor Author

JustAnotherArchivist commented Jun 6, 2020

  • New York Times videos: ^https?://vp\.nyt\.com/ and ^https?://video1\.nytimes\.com/
  • Videos hosted on JW Player: ^https?://cdn\.jwplayer\.com/videos/ (Note that these are different from the content.jwplatform.com ignore above: cdn.jwplayer.com serves a variety of customer content, while content.jwplatform.com seems to only have FastCo videos.)

An alternative solution would be to dedupe based on data type or size, but that would require a new download every time and might slow down some crawls massively. If we go down this road, we should write revisit records for those; wpull already has support for that, and it would just have to be activated and the remote calls implemented through a custom URLTable.
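Very roughly, the custom URLTable idea could look something like the sketch below; the class and the `dedupe_client` calls are placeholders rather than wpull's actual interface, and only illustrate deciding between a full response record and a revisit record after re-downloading the body:

```python
import hashlib

# Placeholder interface: wpull's real URLTable API differs; this only
# illustrates the storage decision, since the body still has to be
# downloaded every time (which is why this approach can slow crawls down).
class RemoteDedupeTable:
    def __init__(self, dedupe_client):
        self.dedupe_client = dedupe_client  # hypothetical client talking to the control node

    def classify_response(self, url, body):
        """Return 'response' or 'revisit' as the WARC record type for this download."""
        digest = hashlib.sha1(body).hexdigest()
        previous = self.dedupe_client.lookup(url)  # hypothetical remote call
        if previous is not None and previous['digest'] == digest:
            return 'revisit'
        self.dedupe_client.record(url, digest)  # hypothetical remote call
        return 'response'
```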

@JustAnotherArchivist
Contributor Author

JustAnotherArchivist commented Nov 7, 2020

  • Washington Post videos: ^https?://d21rhj7n383afu\.cloudfront\.net/washpost-production/ and ^https?://videos\.posttv\.com/washpost-production/
  • USA Today: ^https?://videos\.usatoday\.net/Brightcove2/
  • AnyClip: ^https?://cdn([1-9]|1\d|20)\.anyclip\.com/.*\.mp4$
  • NYT, rarer than the above: ^https?://int\.nyt\.com/data/videotape/finished/.*\.mp4$ (e.g. job 2s9go7u96uf6eyhtlpemdpouu)
  • Wall Street Journal: ^https?://m\.wsj\.net/video/.*\.mp4$
  • ESPN: ^https?://media\.video-cdn\.espn\.com/.*\.mp4$ and ^https?://media\.video-origin\.espn\.com/.*\.mp4$
  • IGN: ^https?://assets\d+\.ign\.com/videos/zencoder/.*\.mp4$ and ^https?://s3\.amazonaws\.com/o\.videoarchive\.ign\.com/.*\.mp4$
  • MLB: ^https?://(cuts\.diamond|mlb-cuts-diamond)\.mlb\.com/FORGE/.*\.mp4$

@JustAnotherArchivist
Contributor Author

An alternative to a proper global dedupe (which would likely require changes in wpull because the URLTable methods aren't async): special ignores that send the matching URL to a logger. We would then regularly dedupe what the logger receives and run those URLs separately in !ao < jobs, e.g. weekly or monthly (automated).
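A minimal sketch of that periodic step (the file layout and the source of already-queued URLs are assumptions): collect the URLs the logger received, drop anything queued in earlier runs, and write the remainder as input for a weekly/monthly !ao < job:

```python
def build_ao_list(logged_path, seen_path, out_path):
    """Write deduplicated, not-yet-queued URLs from the logger to out_path."""
    # URLs already queued by earlier runs.
    with open(seen_path) as f:
        seen = set(line.strip() for line in f if line.strip())

    # URLs reported by the special ignores since the last run.
    with open(logged_path) as f:
        new_urls = sorted(set(line.strip() for line in f if line.strip()) - seen)

    # List suitable for feeding into an !ao < job.
    with open(out_path, 'w') as f:
        f.write('\n'.join(new_urls) + '\n')

    # Remember these URLs so the next run skips them.
    with open(seen_path, 'a') as f:
        f.write('\n'.join(new_urls) + '\n')

    return new_urls
```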

JustAnotherArchivist added a commit to JustAnotherArchivist/ArchiveBot that referenced this issue Nov 14, 2020
This igset is only intended as a temporary workaround until ArchiveTeam#443 is implemented properly.
Does not include the JW Player customer videos as those are not as frequent as the FastCo ones.
This was referenced Jan 24, 2021