Jsoup.clean() is not preserving relative URLs

Posted by 浪尽此生 on 2020-01-02 08:40:08

Question


I have tried:

Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp");
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true);

None of them works. When I clean a relative URL such as <a href="/test.xhtml">test</a>, the href attribute is removed and I get <a>test</a>.

I am using jsoup 1.8.2.

Any ideas?


Answer 1:


The problem most likely lies in how the clean method is called. If you pass a base URI, relative links should be preserved as expected:

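import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Passing a base URI as the second argument lets jsoup resolve the relative href.
String html = ""
        + "<a href=\"/test.xhtml\">test</a>"
        + "<invalid>stuff</invalid>"
        + "<h2>header1</h2>";
String cleaned = Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true));
System.out.println(cleaned);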

The above works and keeps the relative link. With String cleaned = Jsoup.clean(html, Whitelist.relaxed().preserveRelativeLinks(true)), which passes no base URI, the href attribute is removed.

Note the documentation of Whitelist.preserveRelativeLinks(true):

Note that when handling relative links, the input document must have an appropriate base URI set when parsing, so that the link's protocol can be confirmed. Regardless of the setting of the preserve relative links option, the link must be resolvable against the base URI to an allowed protocol; otherwise the attribute will be removed.
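
To illustrate why the base URI matters, here is a minimal sketch of the check the documentation describes: resolve the href against the base URI, then confirm the resulting protocol is allowed. It uses only java.net.URI and is not jsoup's actual implementation. The class RelativeLinkCheck and the method resolvesToAllowedProtocol are hypothetical names made up for this example.

```java
import java.net.URI;
import java.util.Set;

public class RelativeLinkCheck {
    // Hypothetical helper (not part of jsoup): resolves href against baseUri
    // and reports whether the resulting scheme is in the allowed set.
    // With an empty base URI a relative href has no scheme, so the check fails.
    static boolean resolvesToAllowedProtocol(String baseUri, String href, Set<String> allowed) {
        URI resolved = baseUri.isEmpty()
                ? URI.create(href)
                : URI.create(baseUri).resolve(href);
        String scheme = resolved.getScheme();
        return scheme != null && allowed.contains(scheme.toLowerCase());
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("http", "https", "mailto", "ftp");
        // With a base URI, "/test.xhtml" resolves to http://base.uri/test.xhtml.
        System.out.println(resolvesToAllowedProtocol("http://base.uri", "/test.xhtml", allowed)); // true
        // Without a base URI there is no protocol to confirm.
        System.out.println(resolvesToAllowedProtocol("", "/test.xhtml", allowed)); // false
    }
}
```

The second call mirrors the failing case from the question: without a base URI the relative link cannot be tied to an allowed protocol, which is consistent with the attribute being removed.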



Source: https://stackoverflow.com/questions/35563027/jsoup-clean-is-not-preserving-relative-urls
