A Methodology for Enhancing Template Extraction Accuracy of Heterogeneous Web Pages

Keywords: Template Extraction, Clustering, MDL, Text-MAX, Text-Hash, Jaccard Coefficient, Dice Coefficient

Abstract:

Today, websites contain a large number of pages generated by filling common templates with content. The irrelevant terms in these templates degrade the accuracy of web applications, so template detection techniques have recently received a lot of attention as a way to improve that accuracy. To extract templates from such heterogeneous pages, we use different algorithms to measure the similarity of the underlying structure of the documents, so that the pages can be clustered and the templates extracted from the resulting clusters. We implement several algorithms for measuring similarity between web pages. Earlier work used the Text Hash and Text Max algorithms with the Jaccard coefficient, but that approach consumes more time and space. In this paper, we implement Text Hash and Text Max with both the Jaccard and the Dice coefficient. The Dice coefficient requires less time and space than the Jaccard coefficient.
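To make the two similarity measures named in the abstract concrete, the following is a minimal sketch in Python. It uses only the standard set definitions: Jaccard is |A ∩ B| / |A ∪ B| and Dice is 2|A ∩ B| / (|A| + |B|). The abstract does not say how a page is represented, so the sets of DOM-path strings below, the function names, and the convention of returning 1.0 for two empty sets are all assumptions made for illustration, not the authors' implementation.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = len(a | b)
    # Assumed convention: two empty sets are treated as identical.
    return len(a & b) / union if union else 1.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the intersection size over the sum of the set sizes."""
    total = len(a) + len(b)
    # Assumed convention: two empty sets are treated as identical.
    return 2 * len(a & b) / total if total else 1.0


# Hypothetical pages, each reduced to a set of DOM paths (illustrative data only).
page_a = {"html/body/div", "html/body/div/h1", "html/body/div/p", "html/body/footer"}
page_b = {"html/body/div", "html/body/div/h1", "html/body/table", "html/body/footer"}

print(jaccard(page_a, page_b))  # 3 shared paths out of 5 distinct paths -> 0.6
print(dice(page_a, page_b))     # 2 * 3 shared paths over 4 + 4 paths -> 0.75
```

Note that Dice needs only the intersection size and the two set sizes, whereas Jaccard also needs the size of the union. The two values are related by D = 2J / (1 + J), so they always rank page pairs in the same order.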
