
Anagram finding MapReduce pipeline

A MapReduce pipeline that finds anagrams in a large dataset, implemented in Node.js on Google Cloud infrastructure.

Running

To run the pipeline, place the provided key.json file in the root directory of the project. This file contains the credentials for invoking the pipeline. The input is assumed to be every .txt file in the defined input directory (gs://mrcw/input/).

npm install
npm run start

This will trigger the pipeline for the collection of 100 ebooks provided for this coursework. The result is stored in the output directory (gs://mrcw/output/). The name of the output file and the execution time are logged to the console, along with an HTTP address from which the output file can be downloaded.

Input and Output buckets

  • Input: gs://mrcw/input/
  • Output: gs://mrcw/output/
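
A minimal sketch of how these inputs could be enumerated, assuming the @google-cloud/storage Node.js client; the bucket, prefix and key.json come from this README, while the function name and everything else is illustrative rather than the repository's actual code.

const { Storage } = require('@google-cloud/storage');

// List every object under gs://mrcw/input/ and keep only the .txt files,
// authenticating with the provided key.json credentials.
async function listInputFiles() {
  const storage = new Storage({ keyFilename: 'key.json' });
  const [files] = await storage.bucket('mrcw').getFiles({ prefix: 'input/' });
  return files.map((f) => f.name).filter((name) => name.endsWith('.txt'));
}

listInputFiles().then((names) => console.log(`Found ${names.length} input files`));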

Pipeline architecture

The pipeline is composed of four phases and two helper functions, which run in the following order:

  • Start: This phase is responsible for scanning the input directory and creating a reader job for each input file.

  • Reader: This phase reads the text input, splits it into words, strips non-alphabetic characters, and removes words that appear in the Stop Words list (sketched after this list). The output of this phase is a comma-separated list of words.

  • Mapper: This phase maps each word to a key-value pair, where the key is the word with its characters sorted alphabetically and the value is the original word. The output of this phase is one key-value pair per line.

  • Shuffler: This phase hashes each key (the sorted word) and routes the key-value pair to an output file determined by that hash, so that identical keys always reach the same reducer. The output of this phase is one key-value pair per line. Because each reducer needs every instance of its hash, the reducers are only triggered once all shufflers have finished, one reducer per hash index.

  • Reducer: This phase receives as input all shuffler output files with the same hash. The output is, for each key, the set of its anagrams, sorted alphabetically.

  • Cleaner: This phase joins all reducer outputs into a single file, places it in the output directory (gs://mrcw/output/), and cleans up all intermediate files.

The source code for each phase can be found in the src/index.js file.
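
To make the data flow concrete, here is an illustrative per-word sketch of the Reader, Mapper, Shuffler and Reducer logic in plain Node.js. The stop-word list and reducer count are assumed placeholders, not values from src/index.js.

// Reader: split text into lowercase alphabetic words and drop stop words.
const STOP_WORDS = new Set(['the', 'a', 'and', 'of']); // abbreviated example list
function readerTokens(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
}

// Mapper: the key is the word with its characters sorted alphabetically.
function mapWord(word) {
  return { key: word.split('').sort().join(''), value: word };
}

// Shuffler: a simple string hash decides which reducer receives the pair.
const NUM_REDUCERS = 8; // assumed; the real hash and count live in src/index.js
function reducerIndex(key) {
  let h = 0;
  for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % NUM_REDUCERS;
}

// Reducer: collect the distinct words for each key into a sorted set of anagrams.
function reduce(pairs) {
  const groups = new Map();
  for (const { key, value } of pairs) {
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(value);
  }
  return [...groups].map(([key, words]) => [key, [...words].sort()]);
}

For example, readerTokens('Listen, silent!') yields ['listen', 'silent'], both of which map to the key 'eilnst' and therefore end up at the same reducer.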

Infrastructure

The whole infrastructure is deployed on Google Cloud Platform.

  • Each phase of the pipeline is deployed as a separate Cloud Function.

  • Input, output and intermediate files are stored in Google Cloud Storage.

  • The coordination between jobs in the pipeline is done using Google Cloud Pub/Sub (see the sketch after this list).

  • For updating the Cloud Functions, Cloud Build is used with the provided cloudbuild.yaml file: a push to the master branch automatically triggers deployment of the new version. The required parameters are defined in the src/.env.gc.yaml file and exported to the functions as environment variables.
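
As an illustration of that coordination, the snippet below shows how one phase could publish a completion message that triggers the next; the topic name and message shape are hypothetical, only the use of Pub/Sub itself comes from this README.

const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// Hypothetical hand-off: a shuffler announces that it has finished, so the
// coordinator can start one reducer per hash index once all shufflers report in.
async function announceShufflerDone(shufflerId) {
  await pubsub.topic('shuffler-done').publishMessage({ json: { shufflerId } });
}

// A background Cloud Function subscribed to the topic receives the message;
// Pub/Sub delivers the payload base64-encoded in message.data.
exports.onShufflerDone = (message) => {
  const { shufflerId } = JSON.parse(Buffer.from(message.data, 'base64').toString());
  console.log(`Shuffler ${shufflerId} finished`);
  // ...count completions and trigger the reducers when every shuffler is done.
};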
