How to normalize a URL in Java?

孤独总比滥情好 2020-12-09 01:27

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization p

8 Answers
  •  攒了一身酷
    2020-12-09 02:15

    Since you also want to identify URLs that refer to the same content, I found this paper from WWW 2007 pretty interesting: "Do Not Crawl in the DUST: Different URLs with Similar Text". It gives you a nice theoretical approach.
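
    Before going as far as DUST-style detection, plain syntactic normalization can be done with the JDK alone. Below is a minimal sketch, not a complete canonicalizer: the class name UrlNormalizer and the exact rule set (lowercase scheme and host, drop default ports, resolve "." and ".." segments, use "/" for an empty path, drop the fragment) are my own assumptions about what "normalized" should mean here. Only java.net.URI calls from the standard library are used.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: syntactic URL normalization using only java.net.URI.
final class UrlNormalizer {

    static String normalize(String url) {
        // URI.normalize() resolves "." and ".." path segments.
        URI uri = URI.create(url).normalize();

        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            // Relative or non-server-based URI: leave it to URI's own normalization.
            return uri.toString();
        }

        // Scheme and host are case-insensitive, so lowercase them.
        scheme = scheme.toLowerCase(Locale.ROOT);
        host = host.toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        if (("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }

        // Work on raw (still percent-encoded) components to avoid re-encoding.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment is intentionally dropped: it is not sent to the server.
        return sb.toString();
    }
}
```

    Calling UrlNormalizer.normalize("HTTP://Example.COM:80/a/./b/../c?x=1#frag") yields "http://example.com/a/c?x=1". The string is rebuilt by hand from the raw components because the multi-argument URI constructors always quote '%', which would double-encode existing escapes. Sorting query parameters or removing session IDs is deliberately left out, since those steps can change which resource is addressed.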
