ELIMINATE DUPLICATE URLs USING MULTIPLE ALIGNMENT OF SEQUENCES

Jayashri Waman
Prof. Pankaj Agarkar

Abstract

Duplicate content forces search engines to waste time crawling every duplicate version of a page, and site owners must rely on the crawler to handle those versions as intended. Duplicate content is a substantive block of content, within a domain or across domains, that partially or completely matches other content. In most cases it is not deceptive in origin. Processing such duplicated data wastes resources and results in a poor user experience. We focus on removing links to duplicate content by examining the address of the web page, i.e. its URL. We convert URLs into sequences, build a multiple alignment of those sequences, and perform the duplicate-removal operations on the result. A URL tokenizer is also used to identify the web protocol and the top-level domain. This approach helps remove identical content from a set of web pages, so web crawlers can readily adopt it to improve indexing. The proposed method reduced the number of duplicate URLs more than the existing approach.
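
The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: tokenizing a URL and aligning the resulting token sequences. It makes several assumptions that are not from the article. The helper names (`tokenize_url`, `align`, `consensus`), the scoring values, and the example URLs are all hypothetical. It also uses a pairwise Needleman-Wunsch alignment as a simplified stand-in for a true multiple alignment of sequences.

```python
import re
from urllib.parse import urlsplit

GAP = "-"  # placeholder for a missing token in an aligned sequence (assumption)


def tokenize_url(url):
    """Split a URL into protocol, host labels, path segments and query tokens."""
    parts = urlsplit(url)
    tokens = [parts.scheme] if parts.scheme else []
    tokens += [t for t in parts.netloc.split(".") if t]      # e.g. www, example, com
    tokens += [t for t in parts.path.split("/") if t]        # path segments
    tokens += [t for t in re.split(r"[&=]", parts.query) if t]  # query keys and values
    return tokens


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists (Needleman-Wunsch); scores are illustrative."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP); i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]


def consensus(aligned_a, aligned_b):
    """Keep tokens that agree; mark every differing position with '*'."""
    return [x if x == y else "*" for x, y in zip(aligned_a, aligned_b)]


u1 = tokenize_url("http://www.example.com/story?id=42")
u2 = tokenize_url("http://example.com/story?id=42")
aligned_1, aligned_2 = align(u1, u2)
pattern = consensus(aligned_1, aligned_2)
```

In this sketch the two example URLs differ only by the `www` label, so the alignment places a gap opposite `www` and the consensus marks that single position as variable. A pattern of this kind could then be used to group candidate duplicate URLs, although how the article itself derives and applies such patterns is not stated in the abstract.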

Article Details

How to Cite
Waman, J., & Agarkar, P. P. (2015). ELIMINATE DUPLICATE URLs USING MULTIPLE ALIGNMENT OF SEQUENCES. Multidisciplinary Journal of Research in Engineering and Technology, 2(4), 806–811. https://doi.org/10.65521/mjret.v2i4.1180
Section
Articles
