How to classify URLs? What are URL features? How to select and extract features from a URL

Submitted by 人盡茶涼 on 2019-12-02 22:52:46

I assume you do not have access to the content the URL points to, so you can only extract features from the URL string itself. Otherwise it makes more sense to use the content.

Here are some features I will try. See this paper for more ideas:

  1. All URL components. For example, this page has the following URL:

    http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that occur in different parts of the URL should carry different weight in the classification. In this case, the last part of the path, once tokenized, contributes the strongest features for this page (e.g., classify, urls, select, extract, features). The components are:

 * stackoverflow
 * com
 * questions
 * 26456904
 * how to classify urls what are urls features how to select and extract features
  2. The length of the URL;
  3. n-grams (2-grams as examples below)
    • stackoverflow-com
    • com-questions
    • questions-26456904
    • 26456904-how
    • how-to
    • ....
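
The three feature types above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function names (`url_tokens`, `ngrams`, `url_features`) are made up for this example, and it assumes that the scheme, query string, and fragment are ignored, that tokens are split on any non-alphanumeric character, and that everything is lowercased.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens.

    Assumption: scheme, query string, and fragment are ignored.
    """
    parts = urlparse(url)
    text = parts.netloc + parts.path
    # Split on every run of non-alphanumeric characters (., /, -, _ ...).
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def ngrams(tokens, n=2):
    """Join each window of n consecutive tokens with a hyphen."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def url_features(url, n=2):
    """Collect the token, length, and n-gram features for one URL."""
    tokens = url_tokens(url)
    return {
        "tokens": tokens,
        "length": len(url),  # character length of the raw URL string
        "ngrams": ngrams(tokens, n),
    }


if __name__ == "__main__":
    url = ("http://stackoverflow.com/questions/26456904/"
           "how-to-classify-urls-what-are-urls-features-"
           "how-to-select-and-extract-features")
    features = url_features(url)
    print(features["tokens"])
    print(features["length"])
    print(features["ngrams"])
```

The resulting tokens and n-grams could then be fed to a bag-of-words style vectorizer, with the length kept as a separate numeric feature. To weight tokens by the part of the URL they come from, one option is to prefix each token with its component name (host or path) before vectorizing.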