Skip to content

Utility for finding similar sentences on wikipedia

Notifications You must be signed in to change notification settings

seweissman/wikiduper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Wikiduper

Hadoop package for similar sentence detection on Wikipedia using minhash [1]. For output sets of similar sentences see "Data sets" section below.

Running the code

To produce similar sentence clusters, run the pipeline script:

scripts/run-pipeline.sh <wiki-dump-file> <output-prefix-string> <nHash> <bits> <k> <n> <l>

E.g.,

 scripts/run-pipeline.sh enwiki-20130708-pages-articles-multistream.xml enwiki-20130708-sentences 20 60 10 10 12

The pipeline script preprocess wikipedia data, runs the minhash algorithm, and performs a transitive closure to merge clusters that share at least one sentence. Finally, it maps clusters back to human-readable sentences.

the parameters of the pipeline script are as follows

  • wiki-dump-file - A wikipedia dump (enwiki), can be obtained from https://dumps.wikimedia.org/backup-index.html.
  • output-prefix-string - A prefix for naming output files
  • nHash - The size of the hash family. The hashes that make up the minhash signatures will be drawn from this pool.
  • bits - The number of bits in each hash.
  • k - The number of hashes in each signature.
  • n - The number of signatures to generate for each sentence.
  • l - The length of a shingle.

The output of the pipeline script is a single text file named with the scheme output-prefix-string-nHash-k-n-l-bits, e.g. enwiki-20130708-sentences-20-10-10-12-60. The text file contains key, value pairs where the key is a cluster number, and the value is a pair giving the title of the artible, followed by the sentence. The file should be grouped by cluster number.

Parameter tuning

Exact tuning of parameters for minhash is difficult. The authors have found parameters in the range of n=10, k=8, and l=10 to be decent. For fixed k, as n increases, the number of false positives increases, meaning you will have more bad sentence clusters. For fixed n, as k increases, the number of false negatives increases, which means that you will have fewer sentence clusters, but more accuracy. Best practice would be to allow for more false positives, but have a secondary processing step (e.g. an edit distance test) to weed out bad matches.

Running locally

For convenience, a script to run the pipeline locally is also provided. This script is scripts/run-pipeline-local.sh and takes the same arguments as above. The local version of the script uses the local version of hadoop that can be run out of the box without a cluster set-up. A smaller set of test data taken from a larger Wikipedia dump and suitable for use with the local version of the pipeline can be found in data/Maryland.xml.

License

Code licensed under Apache License v2.0 (http://www.apache.org/licenses/LICENSE-2.0)

Data sets

Similar sentences from English Wikipedia (20150304) DOI

Notes

  1. For a description of the minhash algorithm, see A. Rajaraman, J. Leskovec, and J. D. Ullman. Mining of Massive Datasets. January 2011.

About

Utility for finding similar sentences on wikipedia

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages