
Webcrawler


A service for crawling websites (experimental).

Dependencies

Project setup

From the project root, in a shell, run:

  • make pull - to pull the latest images
  • make init - to install fresh dependencies
  • make up - to start the app containers

Now you can visit localhost:4000 in your browser.

  • make down - to stop the running containers
  • make help - for additional commands

How it works

  1. The user adds a new source URL -> a new async job is started
  2. Inside the job:
  • Normalize the URL (validate the scheme, remove the trailing slash, etc.)
  • Store the link in the DB; if the link already exists, exit
  • Parse links and metadata from the HTML
  • Store them in separate tables
  • Normalize the extracted links and check whether each one is relative or absolute
  • Check which links are external
  • For each non-external link -> schedule a new async job with a random delay (see the sketch after this list)
  3. That's it
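
The normalization and link-classification steps above can be sketched in code. The module below is a minimal, hypothetical illustration assuming an Elixir stack (suggested by the default Phoenix port 4000); it uses only the standard library, and every name in it (Webcrawler.Crawl, normalize/1, external?/2, schedule_crawl/1) is invented for this example rather than taken from the project.

```elixir
# Hypothetical sketch of the per-job steps; all names are illustrative.
defmodule Webcrawler.Crawl do
  # Validate the scheme and strip the trailing slash and fragment.
  # Returns {:ok, normalized_url} or :error for unsupported URLs.
  def normalize(url) when is_binary(url) do
    uri = URI.parse(url)

    if uri.scheme in ["http", "https"] and is_binary(uri.host) do
      path = String.trim_trailing(uri.path || "", "/")
      {:ok, URI.to_string(%{uri | path: path, fragment: nil})}
    else
      :error
    end
  end

  # A parsed link counts as external when it names a host
  # different from the host of the source URL.
  def external?(link, source_url) do
    URI.parse(link).host not in [nil, URI.parse(source_url).host]
  end

  # Re-schedule crawling of a non-external link after a random delay.
  # The real project may use a job library instead of a raw message.
  def schedule_crawl(url) do
    Process.send_after(self(), {:crawl, url}, :timer.seconds(Enum.random(1..30)))
  end
end
```

Under this layout the job would call normalize/1 on every extracted link, skip links that are already stored, and call schedule_crawl/1 only for those where external?/2 returns false.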

To see it in action, go to localhost:4000/crawl and enter any URL.

To see search results, visit localhost:4000/search.

Database schema

The default keyspace is storage.

Tables:

  • site_statistics - contains source URLs and counts of parsed links
  • sites - contains URLs and parsed HTML
  • sites_by_meta - contains URLs and parsed metadata

For LIKE-style search queries, a SASI index needs to be configured.
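
As an illustration only (the authoritative definitions live in schema.cql), a SASI index for substring matching could be declared roughly as follows; the indexed column url and the index name are assumptions, not the project's actual schema:

```cql
-- Hypothetical SASI index enabling LIKE '%...%' queries; real definitions are in schema.cql.
CREATE CUSTOM INDEX IF NOT EXISTS sites_url_idx ON storage.sites (url)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'};

-- Example LIKE-style query that such an index can serve.
SELECT url FROM storage.sites WHERE url LIKE '%example%';
```

SASI support also has to be enabled on the Cassandra side, which is why cassandra.yaml is referenced alongside schema.cql.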

See schema.cql and cassandra.yaml for more detail.

Useful links

License

MIT. Please see the license file for more information.