Application Research of Computers (计算机应用研究), 2013
Approach to Webpage segmentation and information extraction for vertical Websites
Abstract:
After analyzing existing Webpage segmentation algorithms and the conditions under which they apply, this paper investigates a segmentation and information extraction method for vertical Webpages. Building on the DOM tree, it proposes the notion of content crowding level, segments a Webpage using segment tags obtained by a statistical method together with the mapping of cascading style sheets, and then extracts information from each segment using text recognition and prefix matching. To meet the requirements of an actual project, a page segmenter and information extractor for vertical Webpages was designed and implemented. The experimental results show that the proposed method performs well and meets those requirements.
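
As a rough illustration of the last two steps (segmenting by a segment tag, then extracting fields by prefix matching), a minimal Python sketch might look as follows. It is not the paper's implementation. The segment tag (`div`), the field prefixes, and the names `SegmentCollector`, `extract_fields`, and `segment_and_extract` are all hypothetical, and the content crowding level and the CSS mapping are not modeled because the abstract does not define them.

```python
from html.parser import HTMLParser

# Hypothetical field prefixes; the abstract does not list the real ones.
FIELD_PREFIXES = {"Price:": "price", "Brand:": "brand"}


class SegmentCollector(HTMLParser):
    """Collect the text fragments of each outermost element named segment_tag."""

    def __init__(self, segment_tag="div"):
        super().__init__()
        self.segment_tag = segment_tag  # assumed segment tag, not from the paper
        self.depth = 0                  # nesting depth inside segment_tag elements
        self.segments = []              # one list of text fragments per segment

    def handle_starttag(self, tag, attrs):
        if tag == self.segment_tag:
            if self.depth == 0:
                self.segments.append([])  # a new outermost segment begins
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.segment_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.segments[-1].append(text)


def extract_fields(fragments, prefixes=FIELD_PREFIXES):
    """Prefix matching: a fragment starting with a known prefix yields a field."""
    record = {}
    for fragment in fragments:
        for prefix, field in prefixes.items():
            if fragment.startswith(prefix):
                record[field] = fragment[len(prefix):].strip()
                break
    return record


def segment_and_extract(html, segment_tag="div"):
    """Split the page into segments, then extract one record per segment."""
    collector = SegmentCollector(segment_tag)
    collector.feed(html)
    collector.close()
    return [extract_fields(segment) for segment in collector.segments]
```

In this sketch each outermost `div` is treated as one segment, and text that matches no prefix is ignored. In the method described above, the segment tag would instead come from the statistical analysis and the style sheet mapping.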