Extracting information from a table, except the table header, using bs4

Question

I am trying to extract information from a table using bs4 and Python. When I use the following code to extract information from the header of the table:

    tr_header=table.findAll("tr")[0]
    tds_in_header = [td.get_text()  for td in tr_header.findAll("td")]
    header_items= [data.encode('utf-8')  for data in tds_in_header]
    len_table_header = len (header_items)

It works, but when I use the following code to extract information from the first row to the end of the table:

    tr_all=table.findAll("tr")[1:]
    tds_all = [td.get_text()  for td in tr_all.findAll("td")]
    table_info= [data.encode('utf-8')  for data in tds_all]

There is the following error:

AttributeError: 'list' object has no attribute 'findAll'

Can anyone help me fix it?

This is table information:

    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>

This is the output for tr_all:

[<tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr>, <tr><td>active<a name="active"> </a></td><td>Active</td><td>This account is active and may be used.</td></tr>, <tr><td>inactive<a name="inactive"> </a></td><td>Inactive</td><td>This account is inactive and should not be used to track financial information.</td></tr>]

Answer 1:

For your first question:

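    import bs4

    text = """
    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>"""

    # Name the parser explicitly; omitting it triggers a warning in current bs4.
    table = bs4.BeautifulSoup(text, "html.parser")
    tr_all = table.findAll("tr")[1:]  # slice off the header row
    tds_all = []
    for tr in tr_all:
        # tr is a single Tag here, so findAll is available
        tds_all.append([td.get_text() for td in tr.findAll("td")])
    # Flatten the list of rows with a double list comprehension,
    # visiting every cell of every row.
    table_info = [cell.encode('utf-8') for row in tds_all for cell in row]
    print(table_info)

yields (shown for Python 2; under Python 3 the encoded items print with a b prefix)

['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']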

Regarding your second question (that tr_header=table.findAll("tr")[0] does not give you a list):

True: [0] is an indexing operation, which selects the first element of the list, so you get a single element. [1:] is a slicing operation, which returns a list, and a list has no findAll method (a tutorial on list slicing covers this in more detail).
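To make the difference concrete, here is a minimal sketch. The two-row table and the choice of "html.parser" are assumptions made for illustration, and find_all is simply the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, used only for illustration.
html = "<table><tr><td>Code</td></tr><tr><td>active</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")

first = rows[0]   # indexing: a single Tag
rest = rows[1:]   # slicing: a list of Tags

print(type(first).__name__)        # Tag
print(isinstance(rest, list))      # True
print(hasattr(first, "find_all"))  # True
print(hasattr(rest, "find_all"))   # False, hence the AttributeError above
```

So any search call has to be made on each element of the slice, not on the slice itself.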

Actually, you build the list twice, once for each call of table.findAll("tr"): once for the header and once for the remaining rows. That is redundant. If you want to separate the header cells from the rest, you probably want something like this:

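    tr_all = table.findAll("tr")   # call findAll only once
    header = tr_all[0]             # indexing: a single Tag
    tr_rest = tr_all[1:]           # slicing: a list of Tags
    tds_rest = []
    # Ask for the header's <td> cells explicitly; iterating over the Tag
    # itself can also yield whitespace text nodes between cells.
    header_data = [td.get_text().encode('utf-8') for td in header.findAll("td")]

    for tr in tr_rest:
        tds_rest.append([td.get_text() for td in tr.findAll("td")])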
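Once the header and the remaining rows are separated, one possible next step is to pair each row with the header names. This is only a sketch under assumptions: the shortened table, the "html.parser" choice and the name records are mine, not part of the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from the question.
html = """<table class="codes"><tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>active</td><td>Active</td></tr>
<tr><td>inactive</td><td>Inactive</td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
tr_all = soup.find_all("tr")

# Header names come from the first row only.
header = [td.get_text(strip=True) for td in tr_all[0].find_all("td")]

# One dict per data row, keyed by the header names.
records = [
    dict(zip(header, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in tr_all[1:]
]
print(records)
```

Whether a list of dicts or a flat list is more convenient depends on what is done with the data afterwards.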

Regarding your third question:

Is it possible to edit this code to add table information from the first row to the end of the table?

Given your desired output in the comments below:

    rows_all = table.findAll("tr")
    header = rows_all[0]
    rows = rows_all[1:]

    data = []
    for row in rows:
        # Ask for the <td> cells explicitly; iterating over the row itself
        # can also yield whitespace text nodes between cells.
        for td in row.findAll("td"):
            data.append(td.get_text())
    print(data)

    # the same as above, as a one-liner
    data = [td.get_text() for row in rows for td in row.findAll("td")]

yields

[u'active', u'Active', u'This account is active and may be used.', u'inactive', u'Inactive', u'This account is inactive and should not be used to track financial information.']
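Depending on the markup, get_text() can return cell text with a trailing space or an embedded newline, as in 'active ' and the multi-line definition quoted elsewhere in this thread. If that matters, the whitespace can be collapsed afterwards. A minimal sketch, where normalize is a hypothetical helper name and not part of bs4:

```python
def normalize(text):
    """Collapse every run of whitespace, including newlines, into one space."""
    return " ".join(text.split())


# Two raw cell texts as they appear in this thread.
raw = [
    "active ",
    "This account is inactive\n and should not be used to track financial information.",
]

clean = [normalize(s) for s in raw]
print(clean)
```

This only post-processes the extracted strings; the extraction code itself stays unchanged.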

Answer 2:

JustMe answered this question correctly. Another equivalent variant would be:

    import bs4

    text = """
    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>"""

    # Name the parser explicitly; omitting it triggers a warning in current bs4.
    table = bs4.BeautifulSoup(text, "html.parser")
    tr_all = table.findAll("tr")[1:]
    # critical line:
    tds_all = [td.get_text() for each_tr in tr_all for td in each_tr.findAll("td")]
    # and after that unchanged:
    table_info = [data.encode('utf-8') for data in tds_all]

    # for control:
    print(table_info)

The unusual construction in the critical line flattens what would otherwise be a list of lists. In general, lambda z: [x for y in z for x in y] flattens a list of lists z; I adapted x, y and z to this specific situation.
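The flattening idiom can be tried on plain lists, without bs4. The sample data below is made up for illustration, and itertools.chain.from_iterable is shown only as a standard-library alternative that gives the same result.

```python
from itertools import chain

# A list of lists, shaped like the per-row cell texts.
z = [["active ", "Active"], ["inactive ", "Inactive"]]

# Nested comprehension: the outer loop (y in z) is written first,
# the inner loop (x in y) second.
flat_comprehension = [x for y in z for x in y]

# Standard-library alternative with the same result.
flat_chain = list(chain.from_iterable(z))

print(flat_comprehension)
print(flat_comprehension == flat_chain)
```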

I arrived at this because my intermediate step for the critical line was:

    tds_all = [[td.get_text() for td in each_tr.findAll("td")] for each_tr in tr_all]

which generates a list of lists for tds_all:

[[u'active ', u'Active', u'This account is active and may be used.'], [u'inactive ', u'Inactive', u'This account is inactive\n and should not be used to track financial information.']]

To flatten this, one needs the [x for y in z for x in y] composition. But then I thought: why not apply this structure directly in the critical line and flatten the result right there?

Here z is the list of bs4 objects (tr_all). The first 'for ... in ...' clause takes each_tr (a bs4 object) from the list tr_all. The second clause calls each_tr.findAll("td") to get all 'td' matches in that row and loops over them one by one. The expression at the very beginning of the list comprehension, td.get_text(), states what is collected into the final list: the text of each cell. The resulting list is assigned to tds_all.
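Written out as explicit loops, the comprehension reads as follows. This is a sketch on a shortened, hypothetical table, with "html.parser" and find_all (the current name of findAll) as assumptions.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical table used only for illustration.
html = (
    "<table><tr><td>Code</td><td>Display</td></tr>"
    "<tr><td>active</td><td>Active</td></tr>"
    "<tr><td>inactive</td><td>Inactive</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
tr_all = soup.find_all("tr")[1:]  # skip the header row

# Explicit version: the clauses of the comprehension, in the same order.
tds_loop = []
for each_tr in tr_all:                  # first 'for' clause
    for td in each_tr.find_all("td"):   # second 'for' clause
        tds_loop.append(td.get_text())  # leading expression

# Comprehension version.
tds_all = [td.get_text() for each_tr in tr_all for td in each_tr.find_all("td")]

print(tds_loop == tds_all)
```

The clauses appear in the comprehension in the same order as the nested for statements, outermost first.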

The code produces this list:

['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']

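Note that the two longer elements (the definitions) are included here. A version that indexes the cells by the number of rows, as in data[i] for i in range(len(tds_all)), would miss them. I assume, Mary, that you want them included?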

Source: https://stackoverflow.com/questions/37635847/extracting-information-from-a-table-except-header-of-the-table-using-bs4

Tags

python

html

bs4