DataIntegrationDuplicateDetection

[Uni] Data Integration Exercise 3 - Data Deduplication

In this exercise we had to parse a dataset and find duplicate rows in it. The rows do not have to be exact duplicates; most are fuzzy duplicates (with typos, missing attributes, etc.). The main problem was the size of the dataset (94,000 rows), which made the brute-force approach very time- and memory-consuming. A handful of other algorithms had to be implemented to achieve a decent runtime.
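To give a feel for the problem: comparing every row with every other row in a 94,000-row dataset takes 94,000 × 93,999 / 2 ≈ 4.4 billion comparisons. The sketch below contrasts that brute-force approach with the sorted neighborhood method, one common way to cut the number of comparisons. It is only an illustration, not necessarily one of the algorithms used in this repository. The function names, the sample rows, the `difflib`-based similarity measure, and the `window` and `threshold` values are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def brute_force_pairs(rows, threshold=0.85):
    """Compare every row with every other row: n * (n - 1) / 2 comparisons."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if similarity(a, b) >= threshold
    }


def sorted_neighborhood_pairs(rows, window=3, threshold=0.85):
    """Sort the rows, then compare each row only with its next window - 1 neighbors."""
    order = sorted(range(len(rows)), key=lambda i: rows[i].lower())
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(rows[i], rows[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample rows; the real dataset has a different schema.
rows = [
    "John Smith, Berlin",
    "Jon Smith, Berlin",
    "Maria Garcia, Madrid",
    "Maria Garcia, Madird",
    "Peter Meyer, Hamburg",
]

print(brute_force_pairs(rows))
print(sorted_neighborhood_pairs(rows))
```

With a fixed window the number of comparisons grows roughly linearly with the number of rows (after an O(n log n) sort) instead of quadratically. The trade-off is that duplicates whose sort keys differ early in the string, for example a typo in the first character, can end up outside the window and be missed.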
