Implementation of a MapReduce pipeline to find anagrams in a large dataset. The pipeline is implemented in Node.js and runs on Google Cloud infrastructure.
To run the pipeline, place the provided `key.json` file in the root directory of the project. This file contains the credentials for invoking the pipeline. The input is assumed to be every `.txt` file in the defined input directory (`gs://mrcw/input/`).

```
npm install
npm run start
```
This will trigger the pipeline for the collection of 100 ebooks provided for this coursework. The result is stored in the output directory (`gs://mrcw/output/`). The name of the output file and the execution time are logged to the console, and an HTTP address for downloading the output file is also provided.
- Input: `gs://mrcw/input/`
- Output: `gs://mrcw/output/`
The pipeline is composed of 4 phases and 2 helper functions, which are, in order:

- Start: This phase scans the input directory and creates a `reader` job for each input file.
- Reader: This phase reads the text input, splits it into words, and removes words in the Stop Words list as well as non-alphabetic characters. The output of this phase is a comma-separated list of words.
- Mapper: This phase maps each word to a key-value pair, where the key is the word's characters sorted in alphabetical order and the value is the original word. The output of this phase is one key-value pair per line.
- Shuffler: This phase uses a hash of the key (the sorted word) to route each key-value pair to a specific output determined by the hash, so that identical keys are directed to the same reducer. The output of this phase is one key-value pair per line. Note that each reducer needs access to all instances of a given hash, so the reducers are only triggered once all shufflers have finished their jobs, one reducer per hash index.
- Reducer: This phase receives as input all files with the same hash generated by the shufflers. The output is a set of anagrams for each key, where the set is sorted alphabetically.
- Cleaner: This phase joins all reducer outputs into a single output file, places it in the output directory (`gs://mrcw/output/`), and cleans up all intermediate files.
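The Reader's tokenization step can be sketched as follows. This is an illustrative sketch, not the actual code from `src/index.js`: the `STOP_WORDS` set here is a tiny placeholder for the real Stop Words list, and the function name is invented.

```javascript
// Placeholder stop-word list; the real list is part of the project.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'of', 'to']);

// Split text into words, strip non-alphabetic characters, drop stop
// words, and emit a comma-separated list as described above.
function readerTokenize(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((w) => w.replace(/[^a-z]/g, ''))
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    .join(',');
}
```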
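The Mapper's key derivation can be sketched as below. The tab separator and function name are assumptions for illustration; the actual pair format is defined in the source.

```javascript
// Sketch of the Mapper phase: the key is the word's characters sorted
// alphabetically, the value is the original word (one pair per line).
function mapWord(word) {
  const key = word.split('').sort().join('');
  return `${key}\t${word}`; // separator is an assumed format
}
```

Anagrams such as `listen` and `silent` produce the same sorted key, which is what lets later phases group them together.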
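The Shuffler's partitioning idea can be sketched with a simple string hash that maps each sorted-word key to one of N reducer buckets, so identical keys always land in the same bucket. The hash function below is an assumption; the actual implementation may use a different one.

```javascript
// Sketch of the Shuffler's partitioning: a rolling 31-based string hash,
// reduced modulo the number of reducers. Deterministic, so the same key
// always routes to the same reducer index.
function bucketFor(key, numReducers) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep unsigned 32-bit
  }
  return hash % numReducers;
}
```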
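The Reducer's grouping step can be sketched as follows (function name and in-memory representation are illustrative): collect all values sharing a key, deduplicate them, and emit each group as an alphabetically sorted set of anagrams.

```javascript
// Sketch of the Reducer phase: pairs is an array of [sortedKey, word]
// entries; the result maps each key to its sorted set of anagrams.
function reduceAnagrams(pairs) {
  const groups = new Map();
  for (const [key, word] of pairs) {
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(word); // Set removes duplicate words
  }
  const out = {};
  for (const [key, words] of groups) {
    out[key] = [...words].sort(); // alphabetically sorted set
  }
  return out;
}
```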
The source code for each phase can be found in the `src/index.js` file.
The whole infrastructure is deployed on Google Cloud Platform.
- Each phase of the pipeline is deployed as a separate Cloud Function.
- Input, output and intermediate files are stored in Google Cloud Storage.
- The coordination between jobs in the pipeline is done using Google Cloud Pub/Sub.
- For updating the Cloud Functions, Cloud Build is used with the provided `cloudbuild.yaml` file. A push to the `master` branch triggers deployment of the new version automatically. The needed parameters are defined in the `src/.env.gc.yaml` file and exported as environment variables to the functions.