This tutorial covers the basics of web scraping and is intended for Python beginners, intermediate users, or anyone interested in building data mining bots. The content of this repository is free to use. Please see the note on licensing to learn more about copyright. Contributions are most welcome. Kindly open an issue or make a pull request for your contributions and feedback.
The tutorial is structured into five (5) sections, namely:
- Preamble
- Getting Started with Web Scraping
- Scraping JavaScript-rendered pages
- Scrapy Framework
- Optimization and Extensions (Moving Forward)

Preamble
- Installation and setup of Python Interpreter, IDE/Text Editor
- Web scraping theories (Introduction to web scraping, robots.txt, Sitemaps, legal policy and more)
- Crawling vs. scraping
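
As a small illustration of the robots.txt topic above, here is a minimal sketch that checks whether a URL may be fetched before scraping it. It uses the standard library's `urllib.robotparser`. The robots.txt content, the `my-bot` user agent, and the `example.com` URLs are assumptions made up for this example so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-bot" is a made-up user agent; it falls under the "*" rules here.
allowed = parser.can_fetch("my-bot", "https://example.com/products/")
blocked = parser.can_fetch("my-bot", "https://example.com/private/data")
delay = parser.crawl_delay("my-bot")  # seconds to wait between requests
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the real robots.txt instead of parsing an inline string.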

Getting Started with Web Scraping
- Inspecting DOM elements and network requests/responses with browser DevTools
- Introduction to Requests and BeautifulSoup
- Project 2-1: Extracting smartphone data from the Jumia Nigeria e-commerce website
https://www.jumia.com.ng/smartphones/

Scraping JavaScript-rendered pages
- Browser automation with Selenium
- Project 3-1: Scraping frequently bought products using Selenium
https://mall.industry.siemens.com/mall/en/us/Catalog/Product/3RV20111KA15
- API as an alternative (mimicking API calls)
- Project 3-2: Scraping frequently bought products using API requests
https://mall.industry.siemens.com/mall/en/us/Catalog/Product/3RV20111KA15
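
The idea of mimicking API calls can be sketched roughly as follows: find the JSON request in the browser's Network tab, rebuild it with similar headers, and parse the JSON body instead of the rendered HTML. The endpoint, headers, and payload below are hypothetical stand-ins, not the actual API of any site used in the projects, and the request is only built here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload: stand-ins for whatever the browser's
# Network tab shows, not a real API of any site named in this tutorial.
API_URL = "https://example.com/api/products?page=1"
SAMPLE_RESPONSE = (
    '{"products": [{"name": "Widget", "price": 9.5},'
    ' {"name": "Gadget", "price": 12.0}]}'
)


def build_request(url: str) -> Request:
    """Build a request carrying browser-like headers (not sent here)."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, headers=headers)


def parse_products(body: str) -> list:
    """Turn the JSON body into a list of (name, price) tuples."""
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data["products"]]


request = build_request(API_URL)
products = parse_products(SAMPLE_RESPONSE)
```

With a real endpoint, `urllib.request.urlopen(request)` would send the request, and the decoded response body would take the place of `SAMPLE_RESPONSE`.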

Scrapy Framework
- Installation and CLI-tools commands
- Framework components explained
- Learning XPath
- Project 4-1: Scraping free Computer Science courses & MOOCs from Class Central with infinite scroll
https://www.class-central.com/subject/cs
- Project 4-2: Simple web crawler
http://books.toscrape.com/
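
To give a feel for XPath before reaching Scrapy, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. Scrapy's own selectors (`response.xpath(...)`) accept full XPath 1.0 expressions, so treat this as a stand-in rather than Scrapy code. The markup is a hypothetical, well-formed snippet invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """\
<html>
  <body>
    <article class="product"><h3>Book A</h3><p class="price">10.00</p></article>
    <article class="product"><h3>Book B</h3><p class="price">12.50</p></article>
    <article class="ad"><h3>Sponsored</h3></article>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# .//article[@class='product'] selects every <article> descendant whose
# class attribute equals "product", skipping the "ad" entry.
products = root.findall(".//article[@class='product']")

# Paths inside each match are relative to that <article> element.
titles = [p.findtext("h3") for p in products]
prices = [float(p.findtext("p[@class='price']")) for p in products]
```

Real-world HTML is often not well-formed XML, which is one reason scraping tools rely on more forgiving HTML parsers than this one.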

Optimization and Extensions
- Rotating Proxies & User Agents
- Scheduling scraping tasks (cron jobs)
- Using Multithreading/Multiprocessing in web scraping
- Storing data to SQL/NoSQL databases
- Porting scripts to desktop/web apps and CLI tools
- Analysis, visualization, and modelling of mined data
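
Two of the topics above, multithreading and database storage, can be combined in a short sketch: fetch several pages concurrently with a thread pool, then write the results to SQLite. The URLs are hypothetical, and `fake_fetch` is a made-up placeholder for a real HTTP request plus parsing step, so the example runs without network access.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URLs; fake_fetch stands in for a real request plus parsing.
URLS = [f"https://example.com/item/{i}" for i in range(1, 6)]


def fake_fetch(url: str) -> tuple:
    """Pretend to scrape one page and return a (url, title) row."""
    item_id = url.rsplit("/", 1)[-1]
    return (url, f"Item {item_id}")


# Threads suit I/O-bound work such as waiting on network responses.
with ThreadPoolExecutor(max_workers=3) as pool:
    rows = list(pool.map(fake_fetch, URLS))  # map preserves input order

conn = sqlite3.connect(":memory:")  # pass a file path to persist the data
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT)")
# INSERT OR REPLACE keeps re-runs from failing on already-stored URLs.
conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

For CPU-bound work such as heavy parsing, `ProcessPoolExecutor` from the same module offers a similar interface backed by processes instead of threads.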