How to parse HTML table against a list of variables using lxml?

Posted by 风格不统一 on 2019-12-04 19:19:05

The aim is to produce this dictionary:

{'City:': 'NYC', 
 'Zip:': '10022', 
 'Street 1:': '2100 5th Ave', 
 'Country:': 'USA', 
 'State:': 'NY', 
 'Street 2:': 'Ste 202'}

You can use the code below. Once the dictionary is built, it is easy to look up the values you want:

import lxml.html as lh

test = '''<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>'''

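outhtml = lh.fromstring(test)
# Labels: the non-empty text nodes that sit directly inside each <td>
ks = [k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip() != '']
# Values: the text of each <span class="boldred">, in document order
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')

# Pairing by position assumes every labelled cell holds exactly one boldred span
result = dict(zip(ks, vs))

print(result)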
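
To tie this back to the list of variables in the question, the dictionary can be queried label by label. The following is only a minimal sketch: it assumes the `result` dictionary shown above (hard-coded here so the snippet runs on its own) and takes `desiredvars` from the question. The `lookup` helper is a hypothetical name introduced for illustration.

```python
# The dictionary produced by the code above, hard-coded so this runs standalone.
result = {
    'Street 1:': '2100 5th Ave',
    'Street 2:': 'Ste 202',
    'City:': 'NYC',
    'State:': 'NY',
    'Country:': 'USA',
    'Zip:': '10022',
}

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']


def lookup(table, name):
    """Hypothetical helper: the scraped keys keep their trailing colon, so add it back."""
    return table.get(name + ':')  # None if the label is absent


selected = [(var, lookup(result, var)) for var in desiredvars]

for pair in selected:
    print(pair)
```

Using `dict.get` rather than indexing means a label that is missing from the page yields `None` instead of raising `KeyError`.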

lxml_tempsofsol.py:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

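# lh.parse() accepts a filename directly, so no explicit open()/close() is needed
outhtml = lh.parse('test.htm')

# Pair each label with its first matching value; [0] raises IndexError if a label is not found
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)

for each in myresultset:
    print(each)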

Output:

$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')

I was searching for the same thing and found your question without a "right" answer, so I'll add a few points:

  • To refer to variables in XPath, use the $var syntax.
  • In lxml, variables are passed as keyword arguments to xpath().
  • Using child::* is wrong, since you are searching for text directly within <td/>; text() already selects text child nodes.
  • You need the contains() XPath function because the text nodes carry surrounding whitespace.

Taking those into account, your corrected code looks like this:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

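# lh.parse() accepts a filename directly
outhtml = lh.parse('test.htm')

# $var is bound by the var=... keyword argument; each xpath() call returns a list of matches
myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var)) for var in desiredvars]
print(myresultset)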
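
For anyone who wants to try the `$var` approach without a test.htm file, here is a self-contained sketch. It assumes a trimmed copy of the question's table inlined as a string. The `first_match` helper and the extra `'Phone'` label are hypothetical, added only to show what happens when nothing matches.

```python
import lxml.html as lh

# A trimmed copy of the table from the question, inlined so no test.htm is needed.
html = '''<table>
<tr>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>
</table>'''

doc = lh.fromstring(html)

# $var is bound through the keyword argument, so no string formatting or quoting is needed.
query = '//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()'


def first_match(tree, label):
    """Hypothetical helper: return the first matching value, or None if there is none."""
    hits = tree.xpath(query, var=label)
    return hits[0] if hits else None


found = [(var, first_match(doc, var)) for var in ['City', 'Zip', 'Phone']]
print(found)
```

Note that in XPath 1.0, contains(text(), $var) only tests the first text node of each <td>, which is enough here because the label comes first in every cell.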