How to save a Jsoup Document to an HTML file?

狂风中的少年 submitted on 2019-12-22 01:28:28

Question


I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:

myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();

How should I write this object to an HTML file? The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.

Some information inside a JavaScript element can be lost during parsing, for example the "timestamp" value in the source of an Instagram media page.


Answer 1:


The missing elements are most likely a result of Jsoup normalizing the document when it parses it.

To get the server's exact output without any normalization, read the raw response body instead of parsing it:

Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());



Answer 2:


Use doc.outerHtml().

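import java.io.File;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    // Fetch the page and parse the response into a Document
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();

    // Write the document's full HTML to a file, encoded as UTF-8
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");
}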

Don't forget to catch exceptions. For a quick and easy way to save files in UTF-8, add the Apache commons-io library as a dependency or download it.
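
If adding commons-io is not an option, the same UTF-8 write can probably be done with the JDK alone. The following is only a minimal sketch, assuming Java 7 or later: the class name HtmlFileWriter and the method saveHtml are hypothetical names chosen for illustration, and the hard-coded HTML string stands in for whatever doc.outerHtml() or the raw response body would return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlFileWriter {

    // Hypothetical helper: writes the given HTML string to the target path as UTF-8.
    // The file is created if it does not exist and overwritten if it does.
    public static void saveHtml(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for doc.outerHtml(); includes a non-ASCII character to exercise the encoding.
        String html = "<html><body><p>caf\u00e9</p></body></html>";

        // A temporary file is used here so the sketch does not clutter the working directory.
        Path target = Files.createTempFile("page", ".html");
        saveHtml(html, target);

        // Read the bytes back as UTF-8 and check that nothing was altered on the way.
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(html));
    }
}
```

With a real Document, the call would look like saveHtml(doc.outerHtml(), Paths.get("filename.html")). Naming the charset explicitly avoids depending on the platform default encoding.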



Source: https://stackoverflow.com/questions/24696766/how-to-save-a-jsoup-document-to-an-html-file
