String Distances for Near-duplicate Detection

Authors: Iulia Dănăilă, Liviu P. Dinu, Vlad Niculae, and Octavia-Maria Șulea

Polibits, Vol. 45, pp. 21-25, 2012.

Abstract: Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, along with more popular string similarity measures such as the Levenshtein distance, together with a disjoint set data structure, for the problem of near-duplicate detection.
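
As an illustration of how a string distance and a disjoint set structure can work together, the following is a minimal sketch and not the authors' implementation. It assumes that two records count as near-duplicates when their Levenshtein distance is at most a threshold, and that near-duplicate groups are the connected components formed by such pairs. The names levenshtein, DisjointSet, cluster_near_duplicates and the max_distance parameter are hypothetical, and the exhaustive pairwise comparison is chosen only for simplicity.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


def cluster_near_duplicates(records, max_distance=2):
    """Group records whose pairwise distance is within max_distance (assumed threshold)."""
    ds = DisjointSet(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if levenshtein(records[i], records[j]) <= max_distance:
                ds.union(i, j)  # merge the two records' groups
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(ds.find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Jhon Smith", "Mary Jones"]
    for group in cluster_near_duplicates(names):
        print(group)
```

Because the disjoint set merges groups transitively, two records can end up in the same group even when their direct distance exceeds the threshold, as long as a chain of close pairs connects them. Any other distance function with the same signature, such as a Rank distance or a Smith-Waterman based measure, could be substituted for levenshtein in this sketch.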

Keywords: Near-duplicate detection, string similarity measures, database, data mining

PDF: String Distances for Near-duplicate Detection