Overwrite only if file changed mode #41

Open
afonari opened this issue Mar 17, 2020 · 2 comments

afonari commented Mar 17, 2020

Is it possible to only overwrite the file if the file changed since the last crawl?

@rajatomar788
Owner

I honestly don't think this is possible within my capacity. If anyone has suggestions, I can certainly implement them.

BradKML commented Apr 2, 2023

Answer: this is not possible by merely checking URLs. However, multimedia files likely do not change often, so a "do not update" list for multimedia would probably be more useful.
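
A rough sketch of what such a "do not update" list could look like. The extension set and the function name `should_skip_download` are made up for illustration and are not part of this project; the idea is only to skip a download when the URL looks like static media and a local copy already exists.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical "do not update" list: extensions assumed to change rarely.
STATIC_MEDIA_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp3", ".mp4", ".woff2",
}


def should_skip_download(url: str, local_path: Path) -> bool:
    """Return True if the URL looks like static media and a local copy exists."""
    # Use only the path component so query strings such as "?v=2" are ignored.
    extension = Path(urlparse(url).path).suffix.lower()
    return extension in STATIC_MEDIA_EXTENSIONS and local_path.exists()
```

Anything not on the list, such as HTML pages, would still be downloaded on every crawl.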

For text pages, it would be more useful to first check when the page was last modified. See here and here for reference. (The reported date could be inaccurate, however.)
In Python, the header can be read with urllib:

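from urllib.request import urlopen
# Last-Modified is optional; the lookup gives None if the server omits it.
with urlopen("http://example.com") as response:
    last_modified = response.headers["Last-Modified"]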
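
To turn that header into an overwrite decision, one option is to compare it with the modification time of the file already on disk. The sketch below assumes the header follows the usual HTTP date format; the function name `remote_is_newer` is hypothetical, and the fallback of re-downloading when the header is missing or unparseable is a design assumption, not something this project does.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def remote_is_newer(last_modified: Optional[str], local_mtime: float) -> bool:
    """Return True when the page should be downloaded again."""
    if not last_modified:
        return True  # header absent: cannot tell, so assume the page changed
    try:
        remote_time = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # unparseable date: assume the page changed
    if remote_time.tzinfo is None:
        # Dates without a usable zone are treated as UTC (an assumption).
        remote_time = remote_time.replace(tzinfo=timezone.utc)
    return remote_time.timestamp() > local_mtime
```

Here `local_mtime` would typically come from `os.path.getmtime(path)`. Since servers can report inaccurate dates, this is a heuristic rather than a guarantee.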

Others have recommended using a checksum instead, but that is risky on dynamically generated websites (especially those with ads) whose content constantly mutates (e.g. recommended reading lists): the checksum would differ on every crawl even when the main content has not changed.
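
For pages that are not dynamically generated, the checksum approach could look roughly like this. It is a minimal sketch, assuming the freshly downloaded content is already in memory as bytes; `write_if_changed` is a hypothetical helper name, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def write_if_changed(path: Path, new_content: bytes) -> bool:
    """Write new_content to path only if its SHA-256 differs from the file on disk.

    Returns True if the file was written, False if it was left untouched.
    """
    if path.exists():
        old_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new_digest = hashlib.sha256(new_content).hexdigest()
        if old_digest == new_digest:
            return False  # identical content: keep the existing file
    path.write_bytes(new_content)
    return True
```

Note that this still downloads the page on every crawl; it only avoids rewriting the file, so the file's modification time keeps reflecting the last real change.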

There is no perfect solution; one has to make a judgement call about which approach works better for a given site.
