How do you get all the rows from a particular table using BeautifulSoup?

梦谈多话 2020-12-24 07:29

I am learning Python and BeautifulSoup to scrape data from the web, and I want to read an HTML table. I can open the page in Open Office, which reports the table as Table #11.

It seems

2 answers

刺人心 (OP)

2020-12-24 08:26

If the page has nested tables (common on older, table-based layouts), the above approach might fail.

As a solution, you might want to extract non-nested tables first:

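html = '''
<table>
<tr>
<td>Top level table cell</td>
<td>
    <table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table>
</td>
</tr>
</table>
'''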

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
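To see what this filter keeps, here is a minimal self-contained sketch. The sample markup, the `id` attributes and the names `sample_html`, `leaf_tables` and `leaf_rows` are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one outer table wrapping one inner table.
sample_html = """
<table id="outer">
<tr><td>Top level table cell</td>
<td>
  <table id="inner">
  <tr><td>Nested table cell</td></tr>
  <tr><td>...another nested cell</td></tr>
  </table>
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the tables that do not contain another table.
leaf_tables = [t for t in soup.find_all("table") if t.find("table") is None]

# Collect the cell text of every row, one list per row.
leaf_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for table in leaf_tables
    for row in table.find_all("tr")
]
print(leaf_rows)
```

Only the inner table survives the filter here, so the outer table's own cell is not reported. That is the trade-off of this approach.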

Alternatively, if you want the content of all the tables, including those that contain other tables, you can extract only each table's own top-level tr rows and th/td cells. To do that, turn off recursion by passing recursive=False to find_all:

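soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
for cnt, my_table in enumerate(tables, start=1):
    print('=============== TABLE {} ==============='.format(cnt))
    # recursive=False: only rows that are direct children of this table
    rows = my_table.find_all('tr', recursive=False)
    for row in rows:
        # recursive=False: only cells that are direct children of this row
        cells = row.find_all(['th', 'td'], recursive=False)
        for cell in cells:
            # cell.string is None for cells with several children,
            # such as a cell that wraps a nested table
            if cell.string:
                print(cell.string)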

Output:

=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell
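One caveat: recursive=False only sees direct children, so it finds no rows when the tr elements sit inside thead, tbody or tfoot, whether written in the markup or inserted by the parser. The sketch below is one possible way to handle that; the helper `top_level_rows` and the sample markup are hypothetical, not part of BeautifulSoup. It also shows picking one table by position, assuming the numbering you see elsewhere (such as Table #11) follows document order, which is not guaranteed.

```python
from bs4 import BeautifulSoup

def top_level_rows(table):
    """Return the <tr> elements owned by `table`, skipping nested tables.

    Hypothetical helper: looks at direct <tr> children and at <tr>
    children of direct <thead>/<tbody>/<tfoot> sections.
    """
    rows = []
    for child in table.find_all(["tr", "thead", "tbody", "tfoot"], recursive=False):
        if child.name == "tr":
            rows.append(child)
        else:
            rows.extend(child.find_all("tr", recursive=False))
    return rows

# Made-up markup with an explicit <tbody> and one nested table.
sample_html = """
<table>
<tbody>
<tr><th>Name</th><th>Detail</th></tr>
<tr><td>Outer</td><td><table><tr><td>Inner</td></tr></table></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tables = soup.find_all("table")

# Pick a table by position; for an eleventh table this would be tables[10].
outer = tables[0]

direct_rows = outer.find_all("tr", recursive=False)  # misses rows inside <tbody>
own_rows = top_level_rows(outer)                      # rows of the outer table only
first_cells = [row.find(["th", "td"]).get_text(strip=True) for row in own_rows]
print(len(direct_rows), len(own_rows), first_cells)
```

A plain outer.find_all("tr") would also return the nested table's row, so the helper sits between the two extremes: it reaches through the section wrappers but stops at nested tables.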
