iifeoluwa/hn-scraper

HackerNews Scraper

Description


This project crawls the HackerNews website and scrapes data about the current top stories. The scraped stories are then written to STDOUT in JSON format.

HackerNews provides an API that enables clients to consume information about the top posts. For our use case, though, consuming the API would have proved inefficient because, in the worst case, we would need to make 100+ network requests to fetch the top 100 stories.

This solution makes a maximum of 4 network requests instead, because each HackerNews listing page shows 30 stories and the top 100 stories therefore span 4 pages.
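
The request counts above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes 30 stories per listing page and an API flow of one request for the list of top story IDs plus one request per story. The function names are hypothetical and are not taken from this project's code.

```javascript
// Assumption: each HackerNews listing page shows 30 stories.
const STORIES_PER_PAGE = 30;

// Number of listing pages that must be fetched to collect `posts` stories.
function pagesNeeded(posts) {
  return Math.ceil(posts / STORIES_PER_PAGE);
}

// Assumed API flow: one request for the top story IDs, then one per story.
function apiRequestsNeeded(posts) {
  return 1 + posts;
}

console.log(pagesNeeded(100));       // 4
console.log(apiRequestsNeeded(100)); // 101
```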

How To Run

  1. Download and install Node.js. Skip this step if you already have Node installed.

  2. Download and install Git. Skip this step as well if you already have Git installed on your computer.

  3. Open a command line window in a newly created folder and run the following command:

     git clone https://github.com/iifeoluwa/hn-scraper.git .

  4. From the same command line window, run npm install -g

After completing the steps above, you can run the tool from any command line window using hackernews. It also accepts a --posts argument that specifies the number of stories it should return.
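
Handling of the --posts argument could look roughly like the sketch below. It uses only Node built-ins as a stand-in for an argument parser such as Minimist, and it is not this project's actual code. The function name parsePosts, the accepted range of 1 to 100, and the default of 30 are all assumptions made for illustration.

```javascript
// Hypothetical bounds and default: the README only mentions the top 100
// stories, so these values are assumptions.
const MIN_POSTS = 1;
const MAX_POSTS = 100;
const DEFAULT_POSTS = 30;

// Reads `--posts <n>` or `--posts=<n>` from an argv-style array and
// returns the requested number of stories as an integer.
function parsePosts(argv) {
  let found = false;
  let raw;
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--posts') {
      found = true;
      raw = argv[i + 1]; // undefined when the value is missing
    } else if (argv[i].startsWith('--posts=')) {
      found = true;
      raw = argv[i].slice('--posts='.length);
    }
  }
  if (!found) return DEFAULT_POSTS;

  const posts = Number(raw); // NaN for a missing or non-numeric value
  if (!Number.isInteger(posts) || posts < MIN_POSTS || posts > MAX_POSTS) {
    throw new RangeError(
      `--posts must be an integer between ${MIN_POSTS} and ${MAX_POSTS}`
    );
  }
  return posts;
}

console.log(parsePosts(['--posts', '1'])); // 1
```

In a real command line entry point, the function would be called as parsePosts(process.argv.slice(2)).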

To run tests, run npm test from the project directory.

Sample Usage

hackernews --posts 1

// Writes to STDOUT
[ { title: 'Lambda School Announces $14M Series A Led by GV',
    uri: 'https://lambdaschool.com/blog/lambda-school-announces-14-million-series-a-led-by-gv/',
    author: 'tosh',
    points: '31',
    comments: '17',
    rank: '1' } ]
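
The sample above is shown in the notation Node uses when it prints objects (unquoted keys, single-quoted strings). If a consumer needs strict JSON, for example to pipe the output into another program, a list of stories can be serialized as in this minimal sketch. The helper name toJson is hypothetical, and the story data is copied from the sample above.

```javascript
// Story data copied from the sample output in this README.
const stories = [
  {
    title: 'Lambda School Announces $14M Series A Led by GV',
    uri: 'https://lambdaschool.com/blog/lambda-school-announces-14-million-series-a-led-by-gv/',
    author: 'tosh',
    points: '31',
    comments: '17',
    rank: '1',
  },
];

// Serializes the stories as strict JSON, indented by two spaces.
function toJson(list) {
  return JSON.stringify(list, null, 2);
}

process.stdout.write(toJson(stories) + '\n');
```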

Libraries Used

The following libraries were used to create this tool:

  • Got: A lightweight HTTP request library. It was chosen because the project only needs simple GET requests, and it is one of the lightest actively maintained libraries for making HTTP requests.
  • Cheerio: Parses the HTML document and extracts the needed data from it. It provides an expressive API that makes it easy to find specific information in documents.
  • Minimist: Parses the arguments passed to the hackernews tool, which makes it easier to handle and validate inputs.
  • joi: Enforces validation rules and ensures that only valid stories are returned.
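
To illustrate what validating stories involves, here is a minimal sketch that uses only Node built-ins rather than joi. The rules it applies (non-empty title and author of at most 256 characters, a parseable URI, and non-negative integer values for points, comments and rank) are assumptions chosen for the example; the project's actual rules are not stated in this README, and every function name below is hypothetical.

```javascript
// Hypothetical limit, chosen for illustration only.
const MAX_TEXT_LENGTH = 256;

function isNonEmptyText(value) {
  return typeof value === 'string'
    && value.trim().length > 0
    && value.length <= MAX_TEXT_LENGTH;
}

// Accepts numbers or numeric strings such as '31', as in the sample output.
function isNonNegativeInteger(value) {
  return /^\d+$/.test(String(value));
}

function isValidUri(value) {
  try {
    new URL(value); // the URL constructor throws on input it cannot parse
    return true;
  } catch (err) {
    return false;
  }
}

function isValidStory(story) {
  return isNonEmptyText(story.title)
    && isNonEmptyText(story.author)
    && isValidUri(story.uri)
    && isNonNegativeInteger(story.points)
    && isNonNegativeInteger(story.comments)
    && isNonNegativeInteger(story.rank);
}

const stories = [
  {
    title: 'Lambda School Announces $14M Series A Led by GV',
    uri: 'https://lambdaschool.com/blog/lambda-school-announces-14-million-series-a-led-by-gv/',
    author: 'tosh',
    points: '31',
    comments: '17',
    rank: '1',
  },
  // Rejected: the title is empty.
  { title: '', uri: 'https://example.com/', author: 'someone', points: '3', comments: '0', rank: '2' },
];

console.log(stories.filter(isValidStory).length); // 1
```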
