Retrieving tail text from HTML

Question

Python 2.7 using lxml

I have some awkwardly structured HTML that looks like this:

<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>

So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.

So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.

I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?

Answer 1:

This should work:

from lxml import etree

# Parse the file with lxml's lenient HTML parser.
parser = etree.HTMLParser()
with open('./test.html', 'r') as f:
    tree = etree.fromstring(f.read(), parser)

my_dict = {}

for b in tree.iter('b'):
    # The street line is the tail text of the <br> that follows each <b>.
    street = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = street

print(my_dict)

This code prints:

{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

(You may want to strip the quotation marks out!)

Rather than using XPath, you can parse the document with one of lxml's parsers and navigate the resulting tree directly. The parser turns the HTML into an element tree, and each element provides an iter() method that takes a tag name and yields every matching element beneath it. Here, iterating over the <b> elements gives you each name; from there, getnext() moves to the following <br>, whose tail holds the text that comes after it (the street line in this case). This is covered under the "Elements contain text" heading of the lxml.etree tutorial.
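To collect every line of the address rather than only the first one, the same sibling navigation can be continued until the next <b> is reached. The following is a minimal sketch, not the answerer's code: the SAMPLE string, the clean() helper and addresses_by_name() are names made up for illustration, and the markup is wrapped in a table only so that the <td> is parsed in a valid context. It assumes the address text always follows a <br>.

```python
from lxml import html

# Same markup as in the question, wrapped in <table><tr> (an assumption made
# here so the <td> sits in a valid context for the parser).
SAMPLE = """
<table><tr><td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td></tr></table>
"""


def clean(text):
    """Drop newlines, surrounding whitespace and the literal quote marks."""
    return (text or '').replace('\n', '').strip().strip('"').strip()


def addresses_by_name(root):
    """Map each <b> name to the text chunks found before the next <b>."""
    result = {}
    for b in root.iter('b'):
        parts = []
        for sibling in b.itersiblings():
            if sibling.tag == 'b':
                break  # the next person's record starts here
            chunk = clean(sibling.tail)
            if chunk:
                parts.append(chunk)
        result[clean(b.text)] = parts
    return result


tree = html.fromstring(SAMPLE)
print(addresses_by_name(tree))
```

Stopping at the next <b> is what keeps John's lines separate from Sally's; joining each list with ', ' would give a single address string per name.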

Answer 2:

Why not walk the children of each td with getchildren()? For example:

from lxml import html

s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""

records = []
cur_record = -1  # index of the record currently being filled
cur_field = 1    # which field the next <br> tail belongs to

FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2

doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        # A <b> starts a new record; its text is the name.
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = FIELD_STREET
    elif child.tag == 'br':
        # The text after each <br> is stored in its tail.
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()

And the results are:

records = [
           {'city': '"New York\n"', 'name': '"John"', 'street': '"123 Main st.\n"'},
           {'city': '"San Francisco\n"', 'name': '"Sally"', 'street': '"101 California St.\n"'}
          ]

Note that the text following a tag with no closing tag, e.g. <br>, is stored in that element's tail attribute, so you need to read it with tag.tail rather than tag.text.
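As a small illustration of the difference between text and tail, here is a sketch using a made-up <div> fragment (it is not part of the question's markup):

```python
from lxml import html

# <br> has no closing tag, so the text that follows it is stored as its tail.
div = html.fromstring('<div>before<br>after<b>bold</b>trailing</div>')
br, b = div[0], div[1]

print(repr(div.text))  # text before the first child element
print(repr(br.text))   # <br> has no content of its own
print(repr(br.tail))   # text between <br> and the next tag
print(repr(b.text))    # text inside <b>...</b>
print(repr(b.tail))    # text after </b>, up to the parent's closing tag
```

Every piece of text inside the <div> belongs either to an element's text or to a preceding sibling's tail, which is why the address lines in the question only show up through the <br> elements.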

Hope this helps.

Source: https://stackoverflow.com/questions/39601578/retrieving-tail-text-from-html

Tags: python, xpath, lxml