Webharvest If and null test

余生颓废 submitted on 2019-12-06 23:42:28

I ran into the same problem as you: the example from the official Web-Harvest user manual does not work because of the doubled single quotes.

As a workaround I use: variable.toString().length() > 0

Here is your code with the workaround applied:

<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
        <html-to-xml><http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/></html-to-xml>
    </xpath>
</var-def>

<var-def name="productTruth">
    <case>
        <if condition="${googleResults.toString().length() > 0}"><var name="googleResults"/></if>
        <else>
            <!-- Fall back to the second XPath only when the first one returned nothing -->
            <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                <html-to-xml><http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/></html-to-xml>
            </xpath>
        </else>
    </case>
</var-def>

Also, a few notes on your code in general:

1) Downloading the page is the most time- and memory-consuming part of Web-Harvest. If the information you want is not found by the first XPath, you end up downloading the page again (re-running the HTTP request). Save the result of the HTTP request in a variable instead, and you can re-query it without repeating the download. This also limits the number of times you hit the source server, which becomes an issue if you have multiple pages to scrape.

    <!-- Download and clean the page once, then reuse the result -->
    <var-def name="pagetext">
        <html-to-xml><http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/></html-to-xml>
    </var-def>

    <var-def name="googleResults">
        <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
            <var name="pagetext"/>
        </xpath>
    </var-def>

    <var-def name="productTruth">
        <case>
            <if condition="${googleResults.toString().length() > 0}">
                <var name="googleResults"/>
            </if>
            <else>
                <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                    <var name="pagetext"/>
                </xpath>
            </else>
        </case>
    </var-def>

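2) You can avoid the whole conditional by changing the XPath so that it selects text nodes at any depth below the b element, which covers both b/div/text() and b/text():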


    <var-def name="pagetext">
        <html-to-xml><http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/></html-to-xml>
    </var-def>

    <var-def name="googleResults">
        <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()">
            <var name="pagetext"/>
        </xpath>
    </var-def>
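
To see why the descendant-or-self axis covers both layouts, here is a minimal sketch run against a small inline sample instead of the live page. The sample markup and the variable names sample and titles are made up for illustration, and it assumes that a var-def body can hold literal XML text wrapped in CDATA.

```xml
<!-- Hypothetical sample: one title nested in b/div, one directly in b -->
<var-def name="sample"><![CDATA[
<ol>
  <li><div><b><div>nested title</div></b></div></li>
  <li><div><b>plain title</b></div></li>
</ol>
]]></var-def>

<!-- A single expression intended to pick up the text of both list items -->
<var-def name="titles">
    <xpath expression="//ol/li/div//b/descendant-or-self::text()">
        <var name="sample"/>
    </xpath>
</var-def>
```

Because text() only matches text nodes, the "self" part of the axis never selects the b element itself. The expression returns every text node below b, whether or not there is a div in between.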

You may use the XPath test normalize-space(.) != '' instead of ${googleResults != null}.
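
One way to read this suggestion is to move the emptiness test into the XPath itself, so that whitespace-only text nodes are never returned. The sketch below assumes the page has already been stored in the pagetext variable, as in the note above. It has not been checked against a live page.

```xml
<var-def name="googleResults">
    <!-- The predicate drops text nodes that are empty or contain only whitespace -->
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()[normalize-space(.) != '']">
        <var name="pagetext"/>
    </xpath>
</var-def>
```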

To filter the contents of a defined variable, for example to exclude strings containing certain symbols or numbers, use starts-with(), ends-with(), matches() or contains(), whichever fits your needs and is supported by Web-Harvest.

Take the element <b>dfsdffsnavindfds</b> as an example:

  1. /b[starts-with(text(), 'd')] -- checks whether the text starts with the character 'd'
  2. /b[ends-with(text(), 's')] -- checks whether the text ends with the character 's'
  3. /b[contains(text(), 'navin')] -- checks whether the text contains the string 'navin'
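
As a rough sketch, these predicates could be used inside a Web-Harvest xpath element as follows. The variable names sample, startsWithD and hasNavin are hypothetical, and the inline CDATA sample assumes that a var-def body may contain literal XML text. Only starts-with() and contains() are shown because they are part of XPath 1.0. ends-with() and matches() are XPath 2.0 functions, so whether they work depends on the XPath engine behind your Web-Harvest version.

```xml
<!-- Hypothetical inline sample holding the element from the example -->
<var-def name="sample"><![CDATA[<b>dfsdffsnavindfds</b>]]></var-def>

<!-- Returns the text of b only if it starts with 'd' -->
<var-def name="startsWithD">
    <xpath expression="/b[starts-with(text(), 'd')]/text()">
        <var name="sample"/>
    </xpath>
</var-def>

<!-- Returns the text of b only if it contains 'navin' -->
<var-def name="hasNavin">
    <xpath expression="/b[contains(text(), 'navin')]/text()">
        <var name="sample"/>
    </xpath>
</var-def>
```

Either result can then be checked with the .toString().length() > 0 workaround to find out whether the predicate matched.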

For more information, see http://www.w3schools.com/xpath/xpath_functions.asp
