
Python tool for archiving web pages through Internet Archive Wayback Machine

ocftw/wayback-machine-saver

 
 



Wayback Machine Saver

Python tool for archiving web pages through Internet Archive Wayback Machine

Getting Started

Prerequisites

Installation

It's recommended to use tools like pipx to install this command-line tool.

pipx install wayback-machine-saver

Usage

Save pages

Save the URLs listed in the input file to the Internet Archive Wayback Machine.

wayback-machine-saver save-pages FILENAME

Argument

  • FILENAME: path to the file that lists the URLs to save

e.g.,

https://example.com
https://another-example.com

Options

  • --deliminator TEXT [default: "\n"]
  • --error-log-filename TEXT [default: save-pages-error-log-"timestamp".csv]
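As an illustration of how the input file and the `--deliminator` option fit together, here is a minimal sketch that splits file content into URLs. The function name `parse_urls` is hypothetical, and dropping blank entries and surrounding whitespace is an assumption; the tool's own parsing may differ.

```python
from __future__ import annotations


def parse_urls(text: str, delimiter: str = "\n") -> list[str]:
    """Split raw file content on the delimiter into a list of URLs.

    Assumption: blank entries and surrounding whitespace are dropped.
    """
    return [part.strip() for part in text.split(delimiter) if part.strip()]


# Content shaped like the example input file above.
sample = "https://example.com\nhttps://another-example.com\n"
urls = parse_urls(sample)
print(urls)
```

With a different delimiter, for example `parse_urls("https://a.example,https://b.example", ",")`, the same function would handle a comma-separated file.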

Get latest archive urls

After the URLs have been saved, the Wayback Machine takes a snapshot of each page and assigns it a timestamp. You can reach the latest snapshot through http://web.archive.org/web/[Your URL], which redirects to http://web.archive.org/web/[timestamp]/[Your URL]. This command retrieves those redirected URLs.

wayback-machine-saver get-latest-archive-urls FILENAME

Argument

  • FILENAME: path to the file that lists the URLs to retrieve

e.g.,

https://example.com
https://another-example.com

Options

  • --deliminator TEXT [default: "\n"]
  • --output-filename TEXT [default: retrieved-urls-"timestamp".csv]
  • --error-log-filename TEXT [default: get-url-error-log-"timestamp".csv]
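The redirect described in this section can be sketched in a few lines. This is not the tool's implementation: the helper names are hypothetical, the standard library's `urllib` stands in for whatever HTTP client the tool uses, and the regular expression assumes the timestamp is a plain run of digits.

```python
from __future__ import annotations

import re
import urllib.request

WAYBACK_PREFIX = "http://web.archive.org/web/"

# Assumption: snapshot URLs look like .../web/<digits>/<original URL>.
_SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)/(.+)$")


def latest_snapshot_lookup_url(url: str) -> str:
    """Build the URL that redirects to the latest snapshot of `url`."""
    return f"{WAYBACK_PREFIX}{url}"


def split_snapshot_url(snapshot_url: str) -> tuple[str, str] | None:
    """Return (timestamp, original URL), or None if the URL has no timestamp."""
    match = _SNAPSHOT_RE.match(snapshot_url)
    return (match.group(1), match.group(2)) if match else None


def resolve_latest_snapshot(url: str, timeout: float = 10.0) -> str:
    """Follow the redirect and return the final URL (performs a network request)."""
    lookup = latest_snapshot_lookup_url(url)
    with urllib.request.urlopen(lookup, timeout=timeout) as response:
        return response.geturl()
```

Calling `resolve_latest_snapshot("https://example.com")` would contact web.archive.org, so the two pure helpers are kept separate to make the URL handling easy to check offline.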

Configuration

Wayback Machine Saver can be configured through environment variables. Run export VARIABLE=VALUE before running the tool to change its behavior.

  • WAYBACK_MACHINE_SAVER_RETRY_TIMES
    • number of times to retry (default: 3)
  • HTTPX_TIMEOUT
    • timeout for all GET operations (default: 10)
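For readers scripting around the tool, the following sketch shows one way these two variables and their documented defaults could be read. The function `load_settings` and the dictionary keys are hypothetical, and parsing the values as an integer and a float is an assumption about how the tool interprets them.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def load_settings(environ: Mapping[str, str] | None = None) -> dict[str, float]:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        # Assumption: the retry count is parsed as an integer.
        "retry_times": int(env.get("WAYBACK_MACHINE_SAVER_RETRY_TIMES", "3")),
        # Assumption: the timeout is parsed as a number.
        "timeout": float(env.get("HTTPX_TIMEOUT", "10")),
    }


print(load_settings({}))
print(load_settings({"WAYBACK_MACHINE_SAVER_RETRY_TIMES": "5", "HTTPX_TIMEOUT": "30"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the sketch easy to test without changing the real environment.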

Contributing

See Contributing

Authors

Wei Lee

Created from Lee-W/cookiecutter-python-template version 0.9.0
