A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates
Journal Title Journal of Computers
Journal Abbreviation jcp
Publisher Group Academy Publisher
Website http://ojs.academypublisher.com
Title A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates
Authors Rahaman, G.M. Atiqur; Rahman, Ashiqur; Ripon, Kazi Shah Nawaz
Abstract Data mining algorithms generally assume that data will be clean and consistent. However, in practice, this is not always the case, and for this reason the detection and elimination of duplicate records is an important part of data cleaning. The presence of similar-duplicate records causes over-representation of data. If the database contains different representations of the same data, the results obtained from the data mining algorithm will be erroneous. The detection of similar-duplicate records is a difficult task, especially when the records are domain-independent. In this paper, we propose a novel domain-independent technique for better reconciling similar-duplicate records. We also introduce new ideas for making similar-duplicate detection algorithms faster and more efficient. In addition, a significant modification of the transitivity rule is proposed. Finally, we propose an algorithm that incorporates all these techniques for similar-duplicate detection in a domain-independent environment. The proposed method has been compared with other methods, and the experimental results confirm its superiority.
Publisher ACADEMY PUBLISHER
Date 2010-12-01
Source Journal of Computers Vol 5, No 12 (2010): Special Issue: Selected Papers of the IEEE International Conference on Compute
Rights Copyright © ACADEMY PUBLISHER - All Rights Reserved. To request permission, please check out URL: http://www.academypublisher.com/copyrightpermission.html.

 
