How to parse HTML using XPath with Saxon-HE on the command line?

I use Saxon-HE 9.6, and it's great for playing with XPath 3 when parsing well-formed XML files.

But I would like to know how to combine expath-http-client (or any other working solution) with Saxon so that I can parse realLife©®™ (possibly broken) HTML. (Java is not my strongest skill.)

I searched Google for many hours without finding a working solution. I tried something like this:

xquery_file.xsl:

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

let $url := 'http://stackoverflow.com'
let $response := http:send-request(
   <http:request href="{$url}" method="get"/>
) return
    <echo-results>
        {$response}
    </echo-results>

Shell commands, the first taken from the README of expath-http-client-saxon-0.10.0:

saxon --repo /usr/share/java/expath/repo -xsl:sample/simple-get.xsl -it:main

saxon --repo /usr/share/java/expath/repo -xsl:xquery_file.xsl -it:main

Neither works. I get: Transformation failed: Unknown configuration property http://saxon.sf.net/feature/repo

Ideally, what I ultimately want is to query a URL directly from the command line with an XPath expression, without an XQuery file (if possible). I'm pretty sure some XML/Java/XPath guru out there has the solution I'm looking for.

/usr/share/java/expath/repo contains:

/usr/share/java/expath/repo
├── expath-http-client-saxon-0.10.0
│   ├── cxan.xml
│   ├── expath-http-client-saxon
│   │   ├── jar
│   │   │   ├── expath-http-client-java.jar
│   │   │   └── expath-http-client-saxon.jar
│   │   ├── lib
│   │   │   ├── apache-mime4j-0.6.jar
│   │   │   ├── commons-codec-1.4.jar
│   │   │   ├── commons-logging-1.1.1.jar
│   │   │   ├── httpclient-4.0.1.jar
│   │   │   ├── httpcore-4.0.1.jar
│   │   │   └── tagsoup-1.2.jar
│   │   ├── xq
│   │   │   └── expath-http-client-saxon.xq
│   │   └── xsl
│   │       └── expath-http-client-saxon.xsl
│   ├── expath-pkg.xml
│   └── saxon.xml
└── hello-1.1
    ├── expath-pkg.xml
    └── hello
        ├── hello.xq
        └── hello.xsl

EDIT:

My best attempt (Linux-based solution):

java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
    net.sf.saxon.Query \
    -x:org.ccil.cowan.tagsoup.Parser \
    -s:myrealLife.html \
    -qs:'//*:body'

This works, but now I am trying to figure out how to set the default namespace so that I can query directly with, for example, //a

EDIT 2

I have created a whole GitHub project based on this post; check https://github.com/sputnick-dev/saxon-lint

I don't think you need any HTTP client for this. You can read the file using the doc() function, or supply it as the primary input document, provided you configure it to be parsed using an HTML SAX parser rather than an XML parser. If you put John Cowan's TagSoup on the classpath, then invoking Saxon with

-x:org.ccil.cowan.tagsoup.Parser -s:myrealLife.html

should do the trick.
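As a minimal sketch of how the command above could be extended so that unprefixed names such as //a match: assuming TagSoup reports elements in the XHTML namespace (which would explain why //*:body matches in the question's attempt), the query string can begin with an XQuery default element namespace declaration. The helper names xhtml_query and query_html are hypothetical, and the jar paths are simply the ones used in the question.

```shell
#!/bin/sh
# Assumption: TagSoup places HTML elements in the XHTML namespace.
XHTML_NS='http://www.w3.org/1999/xhtml'

# Hypothetical helper: prefix an XPath expression with an XQuery prolog
# declaration so that unprefixed names such as "a" refer to XHTML elements.
xhtml_query() {
    printf 'declare default element namespace "%s"; %s' "$XHTML_NS" "$1"
}

# Hypothetical helper: run the expression against an HTML file, using the
# same classpath and options as the command in the question.
# Usage: query_html FILE XPATH
query_html() {
    java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
        net.sf.saxon.Query \
        -x:org.ccil.cowan.tagsoup.Parser \
        -s:"$1" \
        -qs:"$(xhtml_query "$2")"
}
```

For example, `query_html myrealLife.html '//a'` would evaluate //a with the declaration in place. Writing `//*:a` with no declaration at all is a simpler alternative when only a few element names are involved.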

I think you can also use validator.nu, which is rather more up-to-speed with HTML5 than TagSoup, but I haven't tried it myself.

If you look at the documentation for the EXPath HTTP Client, you will see that when you retrieve HTML with it and the server responds with an HTML Internet Media Type, the HTML is automatically tidied up into well-formed XML for you; see http://expath.org/spec/http-client#d2e517.

As such you will not need to write any Java code to achieve your goal.

Your XQuery is incorrect, as you are trying to use eXist-db's HTTP Client, whereas you state that you want to use the EXPath HTTP Client. So you should change your XQuery to this:

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

let $url := 'http://stackoverflow.com'
let $response := http:send-request(
   <http:request href="{$url}" method="get"/>
) return
    <echo-results>
        {$response}
    </echo-results>

However, you will also need to convince Saxon to load and use the EXPath HTTP Client module. By default, Saxon does not have native support for the HTTP Client; see http://saxonica.com/documentation/index.html#!functions.

You can find the EXPath HTTP Client implementation for Saxon here: https://code.google.com/p/expath-http-client/downloads/list. If you download the latest Zip file, the README file inside tells you how to use it with Saxon.

Source: https://stackoverflow.com/questions/27820173/how-to-parse-html-using-xpath-with-saxon-he-in-command-line

Tags: java, xml, xpath, xquery, saxon