- 2018
A Method for Extracting Chinese Multiword Expressions from Web Text
Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They are especially frequent in web text, where they pose a major challenge for word segmentation and downstream text understanding. This paper therefore proposes a two-layer extraction strategy for recognizing MWEs in web text. The first layer performs an initial extraction using an algorithm that combines left-right entropy with enhanced mutual information (LRE+EMI). The second layer takes the candidate list produced by the first layer, builds context and word-vector features, and uses an SVM classifier to separate MWEs from non-MWEs, further filtering the candidate list. In experiments on a corpus of 5,000 microblog posts, the F-score for MWEs was 84.92% after the first layer and 89.58% after the second layer, a substantial improvement over the baseline system. The results show that the two-layer strategy extracts MWEs from web text effectively and improves segmentation results.
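The first-layer idea can be illustrated with a minimal sketch. It is not the paper's LRE+EMI algorithm: it assumes pre-segmented input, scores only adjacent word pairs, and uses plain pointwise mutual information in place of the enhanced mutual information, whose exact formula is not given in the abstract. The function names `entropy` and `score_bigrams` and the boundary markers `<s>` and `</s>` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(sentences):
    """Return {(w1, w2): (pmi, branching_entropy)} for adjacent word pairs.

    pmi is standard pointwise mutual information (an assumption; the paper
    uses an enhanced variant). branching_entropy is the smaller of the
    left-neighbour and right-neighbour entropies of the pair.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # words seen immediately before the pair
    right_ctx = defaultdict(Counter)  # words seen immediately after the pair

    for words in sentences:
        unigrams.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bigrams[pair] += 1
            left = words[i - 1] if i > 0 else "<s>"
            right = words[i + 2] if i + 2 < len(words) else "</s>"
            left_ctx[pair][left] += 1
            right_ctx[pair][right] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        branching = min(entropy(left_ctx[pair]), entropy(right_ctx[pair]))
        scores[pair] = (pmi, branching)
    return scores
```

A pair would be kept as a candidate when both scores exceed thresholds chosen on held-out data; high mutual information suggests the words belong together, and high branching entropy suggests the pair occurs in varied contexts as a unit. The surviving candidates would then be passed to the second-layer classifier.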