|
具有重叠集合约束的实体解析问题研究
|
Abstract:
本文研究了具有重叠集合约束的实体解析集合相似性连接问题。给定两个集合内元素为集合的集合以及一个常数c,找到数据集当中至少共享了c个共同元素的所有集合对。这一问题是许多领域诸如信息检索、数据挖掘和实体解析当中的基本操作。现有的方法都受到了O(n2)的限制,其中n是数据集总的大小。本文提出了一种算法复杂度为 的集合大小感知算法。集合大小感知算法根据集合大小将所有集合分为大集合与小集合并分别进行处理。本文通过现有的方法来处理大集合,对于小集合本文提出了集中启发式算法来提高算法性能。由于大集合与小集合的大小边界对于效率至关重要,本文还提出了一种有效的大小边界的选择方法来选择合适的大小边界。本文通过在真实数据集上的实验结果证明了该方法的有效性。
In this paper, the entity resolution set similarity connection problem with overlapping set constraints is studied. Given two sets whose elements are sets of sets and a constant c, find all sets that share at least c common elements in the data set. This problem is a fundamental operation in many fields such as information retrieval, data mining, and entity resolution. Existing methods are limited by O(n2), where n is the total size of the dataset. This paper presents a set size awareness algorithm with complexity. The set size sensing algorithm divides all sets into large sets and small sets according to the set size and processes them separately. In this paper, the existing methods are used to deal with large sets. For small sets, a centralized heuristic algorithm is pro-posed to improve the algorithm performance. Since the size boundary of large sets and small sets is very important for efficiency, this paper also proposes an effective size boundary selection method to select the appropriate size boundary. Experimental results on real data sets demonstrate the ef-fectiveness of the proposed method.
[1] | Satuluri, V. and Parthasarathy, S. (2012) Bayesian Locality Sensitive Hashing for Fast Similarity Search. Proceedings of the VLDB Endowment, 5, 430-441. https://doi.org/10.14778/2140436.2140440 |
[2] | Pennington, J., Socher, R. and Manning, C. (2014) Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empir-ical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, 1532-1543. https://doi.org/10.3115/v1/D14-1162 |
[3] | Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013) Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2, 3111-3119. |
[4] | Lund, K. and Burgess, C. (1996) Producing High-Dimensional Semantic Spaces from Lexical Co-Occurrence. Behavior Research Methods, Instruments, & Comput-ers, 28, 203-208. https://doi.org/10.3758/BF03204766 |
[5] | Bullinaria, J.A. and Levy, J.P. (2007) Extracting Se-mantic Representations from Word Co-Occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. https://doi.org/10.3758/BF03193020 |
[6] | Bayardo, R.J., Ma, Y. and Srikant, R. (2007) Scaling up All Pairs Similarity Search. In Proceedings of the 16th International Conference on World Wide Web, Association for Com-puting Machinery, New York, 131-140.
https://doi.org/10.1145/1242572.1242591 |
[7] | Li, C., Lu, J. and Lu, Y. (2008) Efficient Merging and Filtering Al-gorithms for Approximate String Searches. 2008 IEEE 24th International Conference on Data Engineering, Cancun, 7-12 April 2008, 257-266.
https://doi.org/10.1109/ICDE.2008.4497434 |
[8] | Wang, J., Li, G. and Feng, J. (2012) Can We Beat the Prefix Fil-tering?: An Adaptive Framework for Similarity Join and Search. In: Proceedings of the 2012 ACM SIGMOD Interna-tional Conference on Management of Data, Association for Computing Machinery, New York, 85-96. https://doi.org/10.1145/2213836.2213847 |
[9] | Malik, A.S., Boyko, O., Atkar, N. and Young, W.F. (2001) A Comparative Study of MR Imaging Profile of Titanium Pedicle Screws. Acta Radiologica, 42, 291-293. https://doi.org/10.1080/028418501127346846 |