ELIMINATE DUPLICATE URLs USING MULTIPLE ALIGNMENT OF SEQUENCES
Abstract
Duplicate content forces search engines to waste time crawling every duplicate version of a page, and site owners must rely on the engines to handle those versions the way they intend. Duplicate content is a substantive block of content, within a domain or across domains, that partially or completely matches other content. In most cases it is not deceptive in origin. Using such duplicated data wastes resources and results in a poor user experience. We focus on removing links to duplicate content by examining the address of the web page, i.e., its URL. We convert the URLs into sequences and apply multiple sequence alignment to them. A URL tokenizer is also used to identify the web protocol and the top-level domain. This approach helps remove identical content from a set of web pages. Web crawlers can readily adopt this approach, which makes better indexing possible. The proposed method reduced the number of duplicate URLs further than the existing approach.
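The tokenizing and alignment steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tokenizer or the alignment procedure, so everything here is an assumption: `tokenize_url` and `align_pair` are hypothetical names, the token scheme (protocol, host labels, path segments, `key=value` query pairs) is one plausible choice, and the pairwise alignment built on Python's standard `difflib.SequenceMatcher` is only a simplified stand-in for the multiple sequence alignment the paper refers to.

```python
from difflib import SequenceMatcher
from typing import List, Tuple
from urllib.parse import parse_qsl, urlsplit

GAP = "-"  # placeholder for a position with no matching token (assumed symbol)


def tokenize_url(url: str) -> List[str]:
    """Split a URL into protocol, host labels, path segments and query pairs."""
    parts = urlsplit(url)
    tokens: List[str] = []
    if parts.scheme:
        tokens.append(parts.scheme.lower())  # web protocol, e.g. "http"
    host = parts.hostname or ""              # hostname is already lowercased
    tokens.extend(label for label in host.split(".") if label)  # last label is the TLD
    tokens.extend(seg for seg in parts.path.split("/") if seg)
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.append(f"{key}={value}")
    return tokens


def align_pair(a: List[str], b: List[str]) -> Tuple[List[str], List[str]]:
    """Align two token sequences, padding unmatched positions with GAP."""
    aligned_a: List[str] = []
    aligned_b: List[str] = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        # Pad the shorter segment so both aligned rows keep the same length.
        aligned_a.extend(seg_a + [GAP] * (width - len(seg_a)))
        aligned_b.extend(seg_b + [GAP] * (width - len(seg_b)))
    return aligned_a, aligned_b


if __name__ == "__main__":
    u1 = tokenize_url("http://www.example.com/news/index.html")
    u2 = tokenize_url("http://example.com/news/index.html")
    for row in align_pair(u1, u2):
        print(row)
```

In this sketch, two URLs that differ only in an optional `www` label align everywhere except one gap column, which is the kind of pattern a de-duplication rule could be derived from. Whether the paper derives its rules this way is not stated in the abstract.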