Jsoup.parse() vs. Jsoup.parse() - or How does URL detection work in Jsoup?

Posted by 让人想犯罪 __ on 2019-12-20 01:34:35

Question


Jsoup has two HTML parse() methods:

  1. parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag."
  2. parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag."

I am having difficulty understanding the difference between the two:

  1. In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
  2. What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
  3. Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?

Answer 1:


It is used by, among others, Element#absUrl(), so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. For example:

for (Element link : document.select("a")) {
    System.out.println(link.absUrl("href"));
}

This is very useful if you also want to download and/or parse the linked resources.


In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?

Some (poorly written) websites declare a <link> or <script> with a relative URL before the <base> tag. And if the page has no <base> tag at all, the given baseUri is used to resolve the relative URLs of the entire document.
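To make the two cases concrete, here is a minimal sketch, assuming Jsoup is on the classpath. The class name BaseUriDemo, the helper firstLink, and the example.com / cdn.example.org URLs are hypothetical and chosen only for illustration. It parses the same link once without a <base> tag and once with one that appears before the link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BaseUriDemo {
    // Parses html with the given baseUri and returns the absolute URL
    // of the first <a href> in the document.
    static String firstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical URL the page was retrieved from.
        String pageUrl = "http://example.com/dir/page.html";

        // No <base> tag: the baseUri argument is the only source for resolution.
        String noBase = "<html><head></head><body>"
                + "<a href=\"page2.html\">next</a></body></html>";

        // <base href> declared before the link: it is used for that link.
        String withBase = "<html><head><base href=\"http://cdn.example.org/assets/\"></head>"
                + "<body><a href=\"page2.html\">next</a></body></html>";

        System.out.println(firstLink(noBase, pageUrl));
        System.out.println(firstLink(withBase, pageUrl));
    }
}
```

In the first call the link resolves against the page URL; in the second it resolves against the <base href> value instead.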


What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?

In order to return the right URL from Element#absUrl(). This is purely for the end user's convenience. Jsoup doesn't need it in order to parse the HTML successfully on its own.
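A small sketch of that point, again assuming Jsoup is on the classpath (the class name RawVsAbsolute and the helper hrefs are hypothetical): parsing works without any base URI, the raw attribute is still available, and only the absolute form is missing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class RawVsAbsolute {
    // Parses html WITHOUT a base URI and returns, for the first link,
    // { raw href attribute, absolute URL, link text }.
    static String[] hrefs(String html) {
        Element link = Jsoup.parse(html).select("a").first();
        return new String[] { link.attr("href"), link.absUrl("href"), link.text() };
    }

    public static void main(String[] args) {
        String[] result = hrefs("<a href=\"page2.html\">next</a>");
        System.out.println("raw href: " + result[0]);      // the attribute as written
        System.out.println("absolute: '" + result[1] + "'"); // empty: nothing to resolve against
        System.out.println("text:     " + result[2]);      // parsing itself succeeded
    }
}
```

absUrl() returns an empty string when it cannot build an absolute URL, so code that downloads linked resources should check for that case.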


Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?

The former. If it were the latter, the documentation would be lying. The baseUri must not be confused with <base href>.
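Why the full page URL is the natural thing to pass can be sketched with the standard library alone. This example uses java.net.URI rather than Jsoup, on the assumption that Jsoup follows the same standard relative-reference rules; the class name and URLs are hypothetical.

```java
import java.net.URI;

class FullUrlVsDirectory {
    // Resolves a relative reference against a base using java.net.URI.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Full page URL: the last path segment (page.html) is dropped
        // before the relative reference is appended.
        System.out.println(resolve("http://example.com/dir/page.html", "page2.html"));

        // Directory with a trailing slash: same result.
        System.out.println(resolve("http://example.com/dir/", "page2.html"));

        // Directory WITHOUT a trailing slash: "dir" is treated as the
        // last segment and dropped, which is usually not what was intended.
        System.out.println(resolve("http://example.com/dir", "page2.html"));
    }
}
```

Passing the URL the page was actually retrieved from avoids having to work out a "base directory" by hand, and avoids the trailing-slash pitfall shown in the third call.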



Source: https://stackoverflow.com/questions/7142187/jsoup-parse-vs-jsoup-parse-or-how-does-url-detection-work-in-jsoup
