-  2018 

Extraction of Chinese multiword expressions based on Web text

DOI: 10.6040/j.issn.1671-9352.1.2017.060

Keywords: multiword expressions (MWEs), left and right entropy, word segmentation, enhanced mutual information, SVM


Abstract:

Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They occur especially often in Web text, where they pose a major challenge for word segmentation and subsequent text understanding. This paper therefore proposes a two-layer extraction strategy for recognizing MWEs in Web text. In the first layer, an algorithm that combines left and right entropy with enhanced mutual information (LRE+EMI) performs an initial extraction of MWE candidates. In the second layer, an SVM classifier built on context and word-vector features classifies the candidates from the first layer as MWEs or non-MWEs, which further filters the candidate list. In experiments on a corpus of 5,000 microblog posts, the F-score reached 84.92% after the first layer and 89.58% after the second layer, a substantial improvement over the baseline system. The results indicate that the two-layer strategy extracts Web MWEs effectively and improves word segmentation results.
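The first-layer idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the formula for enhanced mutual information, so plain pointwise mutual information (PMI) is swapped in here, only two-token candidates are scored, and the function names, toy sentences, and thresholds (`min_pmi`, `min_entropy`) are all hypothetical. The sketch keeps a bigram when its tokens are strongly associated (high PMI) and when it appears in varied left and right contexts (high left and right entropy).

```python
import math
from collections import Counter, defaultdict


def branching_entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def score_bigrams(sentences):
    """Map each bigram to (pmi, left_entropy, right_entropy).

    `sentences` is a list of token lists, i.e. text that is already segmented.
    """
    unigrams, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in sentences:
        # Boundary markers give sentence-initial and sentence-final bigrams a neighbor.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(tokens)
        for i in range(1, len(padded) - 2):
            pair = (padded[i], padded[i + 1])
            bigrams[pair] += 1
            left[pair][padded[i - 1]] += 1
            right[pair][padded[i + 2]] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        # Plain PMI, used here as a stand-in for enhanced mutual information.
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[pair] = (pmi, branching_entropy(left[pair]), branching_entropy(right[pair]))
    return scores


def extract_candidates(sentences, min_pmi=1.0, min_entropy=0.5):
    """Keep bigrams with high PMI and varied contexts on both sides."""
    return sorted(
        pair
        for pair, (pmi, left_h, right_h) in score_bigrams(sentences).items()
        if pmi >= min_pmi and min(left_h, right_h) >= min_entropy
    )


# Toy, hypothetical corpus: "new york" recurs with different neighbors each time.
toy_sentences = [
    ["a", "new", "york", "b"],
    ["c", "new", "york", "d"],
    ["e", "new", "york", "f"],
    ["a", "b", "c", "d"],
]
candidates = extract_candidates(toy_sentences)
```

In this toy corpus only the pair `("new", "york")` passes both filters, because every other bigram has a single left or right neighbor and therefore zero entropy on that side. A second layer in the spirit of the abstract would then pass such candidates, with context and word-vector features, to an SVM classifier.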

