How to parse an HTML table against a list of variables using lxml?

I am trying to parse an HTML table using lxml. The expression rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches every value, but I want to extract a cell's contents only when the cell's label starts with one of the variables in my config file. For instance, if a <td> starts with 'Street 1', I want to grab the contents of the <span class="boldred"> inside that <td>. That way I end up with a tuple of tuples (which also takes care of the None values) that I can store in the database.

lxml_parse.py

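import lxml.html as lh

# lh.parse() accepts a filename directly, so no explicit open()/close() is needed
outhtml = lh.parse('test.htm')

# Text of every <span class="boldred"> that is a direct child of a table cell
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)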

test.htm

<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>

Output:

$ python lxml_parse.py 
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against a list of variables is where I am having problems:

import lxml.html as lh

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']

outhtml = lh.parse('test.htm')

# My attempt: one (label, values) pair per desired variable
myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print(myresultset)

I am aiming to produce this dictionary:

{'City:': 'NYC', 
 'Zip:': '10022', 
 'Street 1:': '2100 5th Ave', 
 'Country:': 'USA', 
 'State:': 'NY', 
 'Street 2:': 'Ste 202'}

You can use the code below, which builds a dictionary of label/value pairs. It is then easy to query the dictionary for the values you want:

import lxml.html as lh

test = '''<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>'''

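outhtml = lh.fromstring(test)

# Labels: non-empty text nodes that sit directly inside each <td>
ks = [k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip()]
# Values: text of each <span class="boldred">, in document order
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')

# Pair labels and values by position
result = dict(zip(ks, vs))

print(result)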
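
One caveat with the zip() approach above: it only works while the two lists line up one-to-one. If a cell has a label but no boldred span, every later value shifts onto the wrong key. Below is a minimal sketch of a per-cell variant, assuming the label is always the first text inside the <td>. The SAMPLE markup, its <table> wrapper, the "County" cell and the name cells_to_dict are all made up for illustration:

```python
import lxml.html as lh

# Hypothetical sample: the "County" cell has a label but no boldred span.
# The <table> wrapper is an assumption added to keep the markup well-formed.
SAMPLE = '''<table>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        County:<br />
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>
</table>'''


def cells_to_dict(tree):
    """Pair each cell's label with its own boldred value (None if missing)."""
    result = {}
    for td in tree.xpath('//tr/td'):
        # td.text is the text before the first child element, i.e. the label
        label = (td.text or '').strip()
        if not label:
            continue  # skip empty spacer cells
        values = td.xpath('span[@class="boldred"]/text()')
        result[label] = values[0].strip() if values else None
    return result


result = cells_to_dict(lh.fromstring(SAMPLE))
print(result)
```

Because the label and the value are read from the same <td>, a missing value becomes None for that key instead of shifting its neighbours.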

lxml_tempsofsol.py:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

outhtml = lh.parse('test.htm')

# Interpolate each label into the XPath; [0] takes the first match
# and raises IndexError if a label is not found
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)

for each in myresultset:
    print(each)

Output:

$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')

I was searching for the same thing and found your question but no "right" answer, so I'll add a few points:

To refer to variables in XPath, use the $var syntax.
In lxml, variables are passed as keyword arguments to xpath().
Using child::* is wrong because the text you are searching for sits directly inside the <td>; child::* selects child elements only, whereas text() already selects the text child nodes.
You need the contains() XPath function because the label text is surrounded by whitespace.

Taking those into account, your corrected code looks like this:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

outhtml = lh.parse('test.htm')

# $var in the expression is bound to the var=... keyword argument
myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var)) for var in desiredvars]
print(myresultset)
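
If you want the values as a dictionary with None for labels that are not on the page (as the question asks), the same $var query can be wrapped in a small helper. This is only a sketch: SAMPLE is a cut-down copy of test.htm wrapped in a <table> for the example, and lookup and QUERY are illustrative names, not part of lxml:

```python
import lxml.html as lh

# Cut-down sample modelled on test.htm; the <table> wrapper is an
# assumption added so the fragment is well-formed on its own.
SAMPLE = '''<table>
<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
</tr>
</table>'''

# $var is an XPath variable; lxml binds it from the keyword argument
QUERY = '//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()'


def lookup(tree, labels):
    """Map each label to its first matching value, or None if absent."""
    result = {}
    for label in labels:
        matches = tree.xpath(QUERY, var=label)
        result[label] = matches[0].strip() if matches else None
    return result


tree = lh.fromstring(SAMPLE)
values = lookup(tree, ['Street 1', 'State', 'Zip'])
print(values)
```

Here 'Zip' is deliberately absent from SAMPLE, so it maps to None rather than raising an IndexError.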

Source: https://stackoverflow.com/questions/10642513/how-to-parse-html-table-against-a-list-of-variables-using-lxml

Tags: html, lxml, python