%0 Journal Article
%T Extraction of Chinese multiword expressions based on Web text (基于网络文本的汉语多词表达抽取方法)
%A GONG Shuang-shuang (龚双双)
%A CHEN Yu-feng (陈钰枫)
%A XU Jin-an (徐金安)
%A ZHANG Yu-jie (张玉洁)
%J Journal of Shandong University (Natural Science) (山东大学学报(理学版))
%D 2018
%R 10.6040/j.issn.1671-9352.1.2017.060
%X Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They occur especially often in Web text, where they pose a major challenge for word segmentation and downstream text understanding. This paper therefore proposes a two-layer extraction strategy for recognizing MWEs in Web text. The first layer performs an initial extraction of MWEs using an algorithm that combines left and right entropy with enhanced mutual information (LRE+EMI). The second layer takes the MWE candidate list produced by the first layer and filters it further: an SVM classifier, built on context and word-vector features, separates MWEs from non-MWEs. In experiments on a corpus of 5,000 microblog posts, the first layer achieves an F-score of 84.92% and the second layer an F-score of 89.58%, a substantial improvement over the baseline system. The results show that the two-layer extraction strategy extracts MWEs from Web text effectively and improves word segmentation results.
%K multiword expressions (MWEs)
%K left and right entropy
%K enhanced mutual information
%K word segmentation
%K SVM
%U http://lxbwk.njournal.sdu.edu.cn/CN/10.6040/j.issn.1671-9352.1.2017.060