A Webpage Classification Algorithm Concerning Webpage Design Characteristics

Keywords: Tag-region, Webpage Classification, Webpage Design, Keyword Extraction, Knowledge Management

Abstract: Owing to the rapid growth of Internet technology, the number of web documents on the Internet has increased significantly. If webpages can be managed effectively, knowledge demanders (i.e., Internet users) can absorb and use knowledge documents efficiently; this has become a core topic in the era of information explosion. Highly accurate webpage classification helps Internet users find the knowledge they need and saves considerable search time. Unlike previous research, this paper explores webpage design characteristics for webpage classification. Specifically, to address the complexity of webpage structure, it analyzes design characteristics, including tag attributes and tag-region layout, to develop a webpage classification algorithm. Based on this analysis, the text contained in specific tag-regions can be identified. The keywords extracted from each tag-region are then weighted according to tag attributes and tag-region locations, and the categories of the target webpage are determined. Furthermore, based on the hyperlink tag, similar webpages with higher correlation can be collected to re-determine the categories of the target webpage. In addition to the classification algorithm, a web-based webpage classification system is developed to demonstrate the feasibility of the proposed model. The aim of this research is to analyze and use webpage design characteristics to improve the effectiveness of webpage classification.
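
To make the tag-weighted keyword idea concrete, the following is a minimal sketch in Python and not the algorithm developed in the paper. It assumes a small table of tag weights (`TAG_WEIGHTS`) and per-category keyword sets (`CATEGORY_KEYWORDS`); both are hypothetical values chosen for illustration, as are the names `TagRegionParser` and `classify`. Weighting by tag-region location and the hyperlink-based re-determination step are omitted.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not give the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0,
               "b": 2.0, "strong": 2.0, "a": 1.5, "p": 1.0}

# Hypothetical category vocabularies, for illustration only.
CATEGORY_KEYWORDS = {
    "sports": {"football", "league", "match"},
    "finance": {"stock", "market", "bank"},
}


class TagRegionParser(HTMLParser):
    """Accumulates a weight per word based on the tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags that carry a weight
        self.weights = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return  # text outside any weighted tag is ignored
        weight = max(TAG_WEIGHTS[t] for t in self.stack)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.weights[word] += weight


def classify(html):
    """Return (best_category_or_None, scores) for an HTML string."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    scores = {
        category: sum(parser.weights[word] for word in vocabulary)
        for category, vocabulary in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


page = ("<html><head><title>Football league news</title></head>"
        "<body><h1>Match report</h1>"
        "<p>The stock market was quiet.</p></body></html>")
category, scores = classify(page)
print(category, scores)
```

In this sketch a word inside nested tags takes the largest weight among its enclosing tags; summing the weights would be an equally plausible choice, and the abstract does not say which the paper uses.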