Webharvest If and null test

余生颓废 submitted on 2019-12-06 23:42:28
user3616725

I ran into the same problem: the example from the official Web-Harvest user manual does not work because of the two adjacent single quotes.

As a workaround I use: variable.toString().length() > 0

Here is your code with that check applied:

<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
        <html-to-xml>
            <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
        </html-to-xml>
    </xpath>
</var-def>

<var-def name="productTruth">
    <case>
        <if condition="${googleResults.toString().length() > 0}">
            <var name="googleResults"/>
        </if>
        <else>
            <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                <html-to-xml>
                    <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
                </html-to-xml>
            </xpath>
        </else>
    </case>
</var-def>

Also, a few notes on your code in general:

1) Downloading the page is the most time- and memory-consuming part of a Web-Harvest run. If the first XPath does not find the information you want, your code downloads the page again (it re-runs the http request). Save the result of the http request in a variable instead; you can then query it as often as you like without repeating the download. This also limits how often you hit the source server, which matters when you have multiple pages to scrape.

    <var-def name="pagetext">
        <html-to-xml>
            <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
        </html-to-xml>
    </var-def>

    <var-def name="googleResults">
        <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
            <var name="pagetext"/>
        </xpath>
    </var-def>

    <var-def name="productTruth">
        <case>
            <if condition="${googleResults.toString().length() > 0}">
                <var name="googleResults"/>
            </if>
            <else>
                <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                    <var name="pagetext"/>
                </xpath>
            </else>
        </case>
    </var-def>

2) You can avoid the whole conditional by changing the XPath so that it selects text nodes at any depth under the b element:

//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()

    <var-def name="pagetext">
        <html-to-xml>
            <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
        </html-to-xml>
    </var-def>

    <var-def name="googleResults">
        <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()">
            <var name="pagetext"/>
        </xpath>
    </var-def>

You can also test for non-blank content in the XPath with normalize-space(.) != '' instead of using ${googleResults != null}.
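
As a rough sketch of that idea, the blank check can be moved into the XPath itself so that only non-blank text nodes come back. The variable name nonBlankResults is made up for illustration, and pagetext is assumed to hold the parsed page as in the earlier snippets:

```xml
<!-- Sketch only: nonBlankResults is a hypothetical name; pagetext is assumed
     to be defined earlier as the html-to-xml output of the downloaded page. -->
<var-def name="nonBlankResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b//text()[normalize-space(.) != '']">
        <var name="pagetext"/>
    </xpath>
</var-def>
```

If nothing matches, the variable should simply come back empty, which the toString().length() > 0 check shown earlier can detect.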

To filter values by their string content (for example, to exclude strings containing certain symbols or digits), use starts-with(), ends-with(), matches(), or contains(), whichever fits your needs and is supported by your Web-Harvest version. Note that ends-with() and matches() require XPath 2.0.

For example, to test the element <b>dfsdffsnavindfds</b>:

  1. /b[starts-with(text(), 'd')] -- checks whether the text starts with the character 'd'
  2. /b[ends-with(text(), 's')] -- checks whether the text ends with the character 's'
  3. /b[contains(text(), 'navin')] -- checks whether the text contains the string 'navin'

For more information look at http://www.w3schools.com/xpath/xpath_functions.asp
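
One way to try such predicates inside a Web-Harvest config is to run them against a small inline document. This is a sketch under assumptions: the variable name matchTest and the wrapping root element are invented for illustration, and it assumes the literal CDATA text is handed to the xpath processor as its XML input. It sticks to starts-with() and contains(), which are available in XPath 1.0.

```xml
<!-- Sketch only: matchTest and the <root> wrapper are hypothetical.
     The two predicates are chained, so the b element is selected only
     if its text starts with 'd' AND contains 'navin'. -->
<var-def name="matchTest">
    <xpath expression="/root/b[starts-with(text(), 'd')][contains(text(), 'navin')]">
        <![CDATA[<root><b>dfsdffsnavindfds</b></root>]]>
    </xpath>
</var-def>
```

Chaining predicates this way keeps each condition readable; the same tests could also be combined with "and" inside a single predicate.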
