Bloom Filters in MapReduce

Project for the Cloud Computing course at the University of Pisa (MSc in Computer Engineering).

The aim of the project was to build Bloom filters over the average ratings of the movies listed in the IMDb datasets. In particular, one Bloom filter had to be built per average rating, where ratings are rounded to the closest integer value.
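How large each per-rating filter should be follows from the standard Bloom filter sizing formulas, which this README does not spell out. As a rough sketch (the class, the variable names and the example key count below are illustrative, not taken from the project), given the number of keys n for one rating and the desired false-positive probability p:

// Sketch of standard Bloom filter sizing (illustrative, not the project's code).
public final class BloomFilterSizing {

    // m = -(n * ln p) / (ln 2)^2 : number of bits in the filter
    static int numBits(long n, double p) {
        return (int) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2 : number of hash functions
    static int numHashFunctions(long n, int m) {
        return Math.max(1, (int) Math.round(((double) m / n) * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 50_000;      // hypothetical number of movies with a given rounded rating
        double p = 0.0001;    // the false-positive probability passed on the command line
        int m = numBits(n, p);
        int k = numHashFunctions(n, m);
        System.out.printf("n=%d, p=%.4f -> m=%d bits, k=%d hash functions%n", n, p, m, k);
    }
}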

The project consists of:

  1. an implementation of the MapReduce Bloom filter construction algorithm using the Hadoop framework
  2. an implementation of the MapReduce Bloom filter construction algorithm using the Spark framework

Some requirements

In the Hadoop implementation, we had to use the following classes:

  • org.apache.hadoop.mapreduce.lib.input.NLineInputFormat: an input format in which each split contains exactly N lines of input
  • org.apache.hadoop.util.hash.Hash.MURMUR_HASH: the hash function family to use
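As an illustration of how the MURMUR_HASH family can produce the k bit positions of a key (one position per seed), here is a minimal sketch; the Hash class and its methods are part of the Hadoop API, but the surrounding code is an assumption, not the project's implementation:

import org.apache.hadoop.util.hash.Hash;

// Sketch (not the project's code): derive k bit positions for a key with the
// Hadoop Murmur hash family, using the hash-function index as the seed.
public final class MurmurPositions {

    static int[] bitPositions(String key, int k, int m) {
        Hash murmur = Hash.getInstance(Hash.MURMUR_HASH);
        byte[] bytes = key.getBytes();
        int[] positions = new int[k];
        for (int i = 0; i < k; i++) {
            int h = murmur.hash(bytes, bytes.length, i);     // seed = i
            positions[i] = (h & Integer.MAX_VALUE) % m;      // clear the sign bit, map into [0, m)
        }
        return positions;
    }
}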

In the Spark implementation, we had to use/implement analogous classes.

Hadoop Execution

To start the execution on the namenode:

hadoop jar hadoop-bloom-filters-1.0-SNAPSHOT.jar it.unipi.hadoop.Main data.tsv output_dir 100000 0.0001 1

where:

  • data.tsv is the path of the input file on HDFS
  • output_dir is the name of the output directory
  • 100000 is the number of lines per split (the N used by NLineInputFormat)
  • 0.0001 is the chosen false-positive probability p
  • 1 is the version of job 2 to run. In version 1, the Mapper emits an array of Bloom filters, exploiting the in-mapper combining pattern (a sketch is shown after this list); in version 2, the Mapper emits the bit positions of the Bloom filter that are set to one.
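As a rough illustration of version 1 (in-mapper combining), the Mapper can keep one Bloom filter per rounded rating in memory, add every movie id it sees, and emit the filters only in cleanup(). The sketch below is an assumption about how such a Mapper could look, with made-up m and k values; it is not the project's exact code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

// Sketch of a version-1 mapper (hypothetical names and parameters, not the project's code).
public class InMapperCombinerMapper extends Mapper<LongWritable, Text, IntWritable, BloomFilter> {

    private BloomFilter[] filters;   // index = rounded average rating (1..10)

    @Override
    protected void setup(Context context) {
        filters = new BloomFilter[11];
        for (int rating = 1; rating <= 10; rating++)
            filters[rating] = new BloomFilter(/* m */ 1 << 20, /* k */ 7, Hash.MURMUR_HASH);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Expected line format: movieId \t averageRating \t numVotes
        String[] fields = value.toString().split("\t");
        if (fields.length < 3) return;
        try {
            int rating = (int) Math.round(Double.parseDouble(fields[1]));
            filters[rating].add(new Key(fields[0].getBytes()));
        } catch (NumberFormatException e) {
            // skip the header line and malformed records
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each per-rating filter once per mapper instead of once per record.
        for (int rating = 1; rating <= 10; rating++)
            context.write(new IntWritable(rating), filters[rating]);
    }
}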

Spark Execution

To start the execution with Spark:

spark-submit --master yarn main.py data.tsv aggregate_by_key false 0

where:

  • aggregate_by_key is the type of job 2 to run (a sketch of this approach is shown below)
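The project's Spark job itself is written in Python (main.py). Purely to illustrate the aggregate_by_key approach in the same language as the sketches above, here is an assumed Spark Java-API sketch: lines are mapped to (roundedRating, movieId) pairs, each partition sets the id's bits in a per-rating bit vector, and the partial vectors for the same rating are OR-ed together. The position function and all names here are hypothetical:

import java.util.BitSet;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch only (the repository's Spark job is main.py). Illustrates the aggregateByKey idea.
public final class SparkBloomSketch {

    // Hypothetical position function: k positions in [0, m) for a movie id.
    // The real job would use a Murmur-style hash family instead.
    static int[] positions(String id, int k, int m) {
        int[] pos = new int[k];
        for (int i = 0; i < k; i++)
            pos[i] = Math.floorMod(id.hashCode() + i * 0x9E3779B9, m);
        return pos;
    }

    static JavaPairRDD<Integer, BitSet> buildFilters(JavaSparkContext sc, String input,
                                                     int m, int k) {
        return sc.textFile(input)
                 // Drop the header and malformed lines; expected format: movieId \t averageRating \t numVotes
                 .filter(line -> line.matches("\\S+\t\\d+(\\.\\d+)?\t.*"))
                 // (roundedRating, movieId)
                 .mapToPair(line -> {
                     String[] f = line.split("\t");
                     int rating = (int) Math.round(Double.parseDouble(f[1]));
                     return new Tuple2<>(rating, f[0]);
                 })
                 // Start from an empty bit vector per rating, set bits for each id,
                 // then OR the partial vectors produced by different partitions.
                 .aggregateByKey(new BitSet(m),
                     (bits, id) -> { for (int p : positions(id, k, m)) bits.set(p); return bits; },
                     (a, b) -> { a.or(b); return a; });
    }
}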

Authors
