How to parse HTML using XPath with Saxon-HE from the command line?

Submitted by 。_饼干妹妹 on 2019-12-06 14:38:21

I don't think you need any HTTP client for this. You can read the file using the doc() function, or supply it as the primary input document, provided you configure Saxon to parse it with an HTML SAX parser rather than an XML parser. If you put John Cowan's TagSoup on the classpath, then invoking Saxon with

-x:org.ccil.cowan.tagsoup.Parser -s:myrealLife.html

should do the trick.
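As a rough sketch, the full command line might look like the following. The jar file names and the query are assumptions for illustration (adjust them to whatever versions you have), while -x:, -s: and -qs: are standard options of net.sf.saxon.Query. The query uses the *: wildcard because, as far as I know, TagSoup by default reports elements in the XHTML namespace.

```shell
# Assumed jar locations (hypothetical file names); override via environment.
SAXON_JAR="${SAXON_JAR:-saxon9he.jar}"
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.2.1.jar}"

# Build the command as positional parameters so it can be inspected or run.
#   -x:  SAX parser class used for the source document (TagSoup here)
#   -s:  the HTML file supplied as the primary input document
#   -qs: an inline query; *:title matches title in any namespace
set -- java -cp "$SAXON_JAR:$TAGSOUP_JAR" net.sf.saxon.Query \
  -x:org.ccil.cowan.tagsoup.Parser \
  -s:myrealLife.html \
  '-qs://*:title/string()'

# Run only when both jars are present; otherwise just show the command.
if [ -f "$SAXON_JAR" ] && [ -f "$TAGSOUP_JAR" ]; then
  "$@"
else
  printf 'Would run: %s\n' "$*"
fi
```

On Windows the classpath separator is ; rather than :, so the -cp value would need adjusting there.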

I think you can also use validator.nu, which is rather more up-to-speed with HTML5 than TagSoup, but I haven't tried it myself.

If you look at the documentation for the EXPath HTTP Client, you will see that if you retrieve HTML with it and the server responds with an HTML Internet Media Type, the HTML is automatically tidied up into well-formed XML for you; see http://expath.org/spec/http-client#d2e517.

As such you will not need to write any Java code to achieve your goal.

Your XQuery is incorrect, as you are trying to use eXist-db's HTTP Client, whereas you state that you want to use the EXPath HTTP Client. So you should change your XQuery to this:

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

(: URL of the HTML page to fetch :)
let $url := 'http://stackoverflow.com'
(: send a GET request with the EXPath HTTP Client :)
let $response := http:send-request(
    <http:request href="{$url}" method="get"/>
)
return
    <echo-results>
        {$response}
    </echo-results>

However, you will also need to convince Saxon to load and use the EXPath HTTP Client module. By default, Saxon does not have native support for the HTTP Client; see http://saxonica.com/documentation/index.html#!functions.

You can find the EXPath HTTP Client implementation for Saxon at https://code.google.com/p/expath-http-client/downloads/list. If you download the latest Zip file, it contains a README file that tells you how to use it with Saxon.
