DataIntegrationDuplicateDetection

[Uni] Data Integration Exercise 3 - Data Deduplication

In this exercise we were supposed to parse a dataset and find duplicate rows in it. The rows don't have to be exact duplicates; they are usually fuzzy duplicates (with typos, missing attributes, etc.). The main problem was the size of the dataset (94,000 rows), which made the brute-force approach very time- and memory-consuming. A handful of other algorithms had to be implemented to achieve a decent runtime.
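To illustrate why brute force scales badly: comparing every row with every other row takes n(n-1)/2 comparisons, which is roughly 4.4 billion for 94,000 rows. The sketch below contrasts that with a sorted-neighborhood pass, one common way to cut the number of comparisons. It is only an illustration and not necessarily one of the algorithms used in this exercise; the function names, the `difflib`-based similarity measure, the threshold, the window size, and the sample rows are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity over the fields that are present in both rows."""
    scores = [
        SequenceMatcher(None, x, y).ratio()
        for x, y in zip(a, b)
        if x and y  # skip missing attributes instead of penalizing them
    ]
    return sum(scores) / len(scores) if scores else 0.0


def brute_force_pairs(rows, threshold=0.85):
    """Compare every row with every other row: n * (n - 1) / 2 comparisons."""
    return {
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if similarity(rows[i], rows[j]) >= threshold
    }


def sorted_neighborhood_pairs(rows, key, window=3, threshold=0.85):
    """Sort by a key, then compare each row only with its next window - 1 neighbors."""
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    found = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(rows[i], rows[j]) >= threshold:
                found.add((min(i, j), max(i, j)))
    return found


# Hypothetical sample rows (name, city), including a typo and a missing attribute.
rows = [
    ("John Smith", "Berlin"),
    ("Maria Garcia", "Madrid"),
    ("Jon Smith", "Berlin"),
    ("Maria Garcia", ""),
]

print(brute_force_pairs(rows))
print(sorted_neighborhood_pairs(rows, key=lambda r: r[0].lower()))
```

The sorted-neighborhood variant needs about n * (window - 1) comparisons after sorting, so it stays tractable on large inputs. The trade-off is that duplicates whose sort keys differ early (for example, a typo in the first letter) can end up outside the window and be missed.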