How to extract an XPath from an HTML page with the BaseX command line

不问归期 posted on 2019-12-24 10:46:13

Question


I would like to extract the element matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it in a single command line with one of the best parsers, like BaseX or Saxon-PE.

So far, the shortest solution I (seem to have) found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"

but all it returns is an empty line instead of the block of HTML code I expected.

I have two questions:

  • What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
  • Since BaseX has embedded TagSoup support (see http://docs.basex.org/wiki/Parsers#HTML_Parser), how can I combine my two lines into a single one?

Answer 1:


There are two problems with your query:

  1. TagSoup adds namespaces

    Either declare the namespace (declaring it as the default element namespace seems reasonable, as you are probably only dealing with XHTML):

    basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
    

    or use *: as a namespace wildcard on each element:

    basex -ipage.xhtml "//*:div[@id='ps-content']"
    
  2. XML/XQuery is case-sensitive

    I already corrected this in the queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) yield the expected result.
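
The two effects can also be reproduced outside BaseX. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (Python 3.8 or later for the {*} wildcard); the SAMPLE document is a hypothetical stand-in for the TagSoup output, not the real Amazon page, and ElementTree's path syntax is only a subset of XPath, so the queries are approximations of the BaseX ones.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for page.xhtml: TagSoup puts every element
# into the XHTML namespace and writes element names in lower case.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div id="ps-content"><p>Book description</p></div></body>'
    '</html>'
)

root = ET.fromstring(SAMPLE)

# Original query: upper-case name, no namespace.
upper = root.findall(".//DIV[@id='ps-content']")

# Lower-case name, but still no namespace.
bare = root.findall(".//div[@id='ps-content']")

# Namespace bound to a prefix (comparable to declaring it in XQuery).
qualified = root.findall(".//x:div[@id='ps-content']", {"x": XHTML_NS})

# Namespace wildcard (comparable to *:div in XQuery).
wildcard = root.findall(".//{*}div[@id='ps-content']")

print(len(upper), len(bare), len(qualified), len(wildcard))
```

Only the last two lookups find the element: the first fails on both case and namespace, and the second fails on the namespace alone.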


TagSoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i did not work for me in combination with TagSoup, but you can use doc(...) instead.




Answer 2:


I finally found the right command-line:

basex "declare option db:parser 'html'; doc('page.html')//*:div[@id='ps-content']"

Note: swapping the quote types like this does not work on my Windows 7 machine:

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
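
The Windows command prompt does not treat single quotes as quoting characters, which is likely why the second form fails there. One way to avoid depending on the shell's quoting rules might be to start BaseX from a small script that passes the query as a single argument. The sketch below assumes a basex launcher is on the PATH; the helper name basex_command is hypothetical, and whether the launcher script forwards the argument unchanged on Windows is not verified here.

```python
import shutil
import subprocess


def basex_command(html_path, div_id):
    """Build the argument list for the BaseX query used in this answer.

    Hypothetical helper: html_path and div_id are inserted verbatim,
    so they must not contain single quotes.
    """
    query = (
        "declare option db:parser 'html'; "
        f"doc('{html_path}')//*:div[@id='{div_id}']"
    )
    # The whole query is one list element, so no shell has to parse it.
    return ["basex", query]


if __name__ == "__main__":
    cmd = basex_command("page.html", "ps-content")
    exe = shutil.which(cmd[0])  # assumption: `basex` is on the PATH
    if exe is None:
        print("basex not found on PATH; command would be:", cmd)
    else:
        result = subprocess.run([exe, cmd[1]], capture_output=True, text=True)
        print(result.stdout)
```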


Source: https://stackoverflow.com/questions/17014152/how-to-extract-an-xpath-from-an-html-page-with-basex-commandline
