how to extract an XPATH from an html page with Saxon-PE commandline

依然范特西╮ 提交于 2020-01-06 08:14:12

问题


I would like to extract the XPATH //DIV[@id="ps-content"] out from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file)

I would like to do it with a single line of command-line with one of the best parsers, like Saxon-PE or BaseX.

So far the shortest solution that I (seemed to have) found is with these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml"
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"

but all what it returns is this, that is not my expected block of html code:

<?xml version="1.0" encoding="UTF-8"?>

My questions are two:

  • what's wrong with my command-lines? why they doesn't return the expected block of html code as defined by my XPATH?
  • since Saxon-PE has embedded TagSoup capability (see https://www.odesk.com/leaving-odesk?ref=http%253A%252F%252Fsaxonica.com%252Fdocumentation9.4-demo%252Fhtml%252Fextensions%252Ffunctions%252Fparse-html.html), how can I integrate my two lines into a single line?

回答1:


I found the correct command-line to launch the query without TagSoup:

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:"//*:div[@id='ps-content']"

Note that inverting the type of quotes like this doesn't work (in Win7):

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:'//*:div[@id="ps-content"]'

Does anyone know how to add the TagSoup preprocess in the same command-line?




回答2:


My last failed attempts to integrate TagSoup in the same command-line:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar
 net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('
page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/ver
sion
Static error(s) in query

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar
 net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('pag
e.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/ver
sion
Static error(s) in query

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar
 net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.h
tml;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not
 exist
Query processing failed: Run-time errors were reported

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar
 net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.h
tml';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query


来源:https://stackoverflow.com/questions/17013688/how-to-extract-an-xpath-from-an-html-page-with-saxon-pe-commandline

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!