HTML page to XHTML with TagSoup

Submitted by 亡梦爱人 on 2019-12-07 12:36:55

Question


Sorry if this is too simple, but I couldn't find a tutorial or any documentation for the Java version of TagSoup.

Basically I want to download an HTML webpage from the internet and turn it into XHTML, contained in a string. How can I do this with TagSoup?

Thanks!


Answer 1:


Something like this:

wget -O - example.com/bad.html | java -jar tagsoup.jar

Or, from Java:

To parse HTML:

  • Create an instance of org.ccil.cowan.tagsoup.Parser
  • Provide your own SAX2 ContentHandler
  • Provide an InputSource referring to the HTML
  • And parse()!
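
The steps above can be sketched as follows. This is a minimal sketch, not the answerer's code: the class name `SaxToString` and the method `toXml` are made up for illustration, and instead of a hand-written ContentHandler it borrows the JDK's identity `Transformer`, which installs its own handler on the reader and serializes the SAX events to a string. So that the block runs without extra jars, `main` uses the JDK's built-in SAX parser as a stand-in; the assumption is that with TagSoup on the classpath you would pass `new org.ccil.cowan.tagsoup.Parser()` instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Hypothetical helper: serializes whatever a SAX2 XMLReader parses into a String. */
final class SaxToString {

    static String toXml(XMLReader reader, InputSource input) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        // SAXSource lets the transformer set its own ContentHandler on the
        // reader and call reader.parse(input) for us.
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader (assumption): the JDK parser only accepts well-formed
        // input. With tagsoup.jar on the classpath, use
        // `new org.ccil.cowan.tagsoup.Parser()` here to accept messy HTML.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        InputSource input = new InputSource(
                new StringReader("<html><body><p>hi</p></body></html>"));
        System.out.println(toXml(reader, input));
    }
}
```

With TagSoup available, a call such as `SaxToString.toXml(new org.ccil.cowan.tagsoup.Parser(), new InputSource("http://example.com/bad.html"))` should fetch the page and return it as an XHTML string, since an `InputSource` built from a system ID leaves opening the URL to the parser.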



Answer 2:


The code below downloads a web page with Apache HttpClient and parses it with TagSoup through XOM:

        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet("http://streak.espn.go.com/en/?date=20120824");
        HttpResponse response = client.execute(request);

        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != 200) {
            throw new IOException("Invalid response from server: " + status.toString());
        }

        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();

        try
        {
            XMLReader parser = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");

            // Use the TagSoup parser to build an XOM document from the HTML stream
            Document doc = new Builder(parser).build(inputStream);

            // Serialize the document to an XHTML string
            String xhtml = doc.toXML();
        }
        catch (IOException | SAXException | ParsingException e)
        { ... }


Source: https://stackoverflow.com/questions/1589176/html-page-to-xhtml-with-tagsoup
