|
A New Approach for Web Information ExtractionKeywords: information extraction , web page content extraction , removing noise content. Abstract: With the exponentially growing amount of information available on the Internet, an effective technique for users to discern the useful information from the unnecessary information is urgently required. Cleaning web pages for web data extraction becomes critical for improving performance of information retrieval and information extraction. So, we investigate to remove various noise patterns in Web pages instead of extracting relevant content from Web pages to get main content information. To solve this problem, we put forward an extracting main content method which firstly removes the usual noise and the candidate nodes without any main content information from web pages, and makes use of the relation of content text length, the length of anchor text and the number of punctuation marks to extract the main content. In this paper, we focus on removing noise and utilization of all kinds of content-characteristics, experiments show that this approach can enhance the universality and accuracy in extracting the body text of web pages
|