|
- 2018
Levenshtein算法优化及在题库判重中的应用DOI: 10.3969/j.issn.1000-5641.2018.05.013 Keywords: 题库匹配, 文本相似度, Levenshtein编辑距离Key words: bank match text similarity Levenshtein edit distance Abstract: 摘要 为了解决Levenshtein距离算法在长文本和大规模匹配效率的不足,本文针对Levenshtein距离算法提出一种提前终止的优化策略.首先根据Levenshtein距离矩阵中元素内在的联系,归纳总结出一个递推关系式.再依据此递推关系式,提出一种提前终止策略,可提前判断两个文本是否满足预先设定的相似度阈值.经过多个学科题库判重实验的佐证,本文的提前终止策略能显著减少计算时间.Abstract:In order to overcome the disadvantages of the Levenshtein distance algorithm for long text and large-scale matching, we propose an early termination strategy for the Levenshtein distance algorithm. Firstly, according to the intrinsic relationship between elements in the Levenshtein distance matrix, we sum up a recurrence relation. Based on this relation, an early termination strategy is proposed to determine early-on whether two texts satisfy the predefined similarity threshold. Through several tests on different subjects, it is demonstrated that the early termination strategy can significantly reduce calculation time.
|