%0 Journal Article
%T Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search
%A Y. Syed Mudhasir
%A J. Deepika
%A S. Sendhilkumar
%A G.S. Mahalakshmi
%J International Journal on Internet and Distributed Computing Systems
%D 2011
%I IJIDCS Press
%X Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the performance of a web search is greatly affected when search results are flooded with redundant information, i.e., near-duplicates. Such near-duplicates hold up other promising results from reaching users. Many of these near-duplicates come from untrusted websites and/or authors who host information on the web. Such near-duplicates may be eliminated by means of provenance. Thus, this paper proposes a novel approach to identifying such near-duplicates based on provenance. In this approach, a provenance model is built from the web pages returned as search results by an existing search engine. The proposed model combines both content-based and trust-based factors to classify the results as original or near-duplicates.
%K Web search
%K Near-duplicates
%K Provenance
%K Semantics
%K Trustworthiness
%U http://www.ijidcs.org/issues/v1n1/ijidcs-4.pdf