How do you get all the rows from a particular table using BeautifulSoup?

梦谈多话 2020-12-24 07:29

I am learning Python and BeautifulSoup to scrape data from the web and read an HTML table. I can open the page in Open Office, which reports the table as Table #11.

It seems

2 answers
  •  刺人心 (OP)
    2020-12-24 08:26

    If the page contains nested tables (common on older, table-based layouts), the above approach might fail.

    As a solution, you might want to extract non-nested tables first:

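    html = '''
    <table>
    <tr><td>Top level table cell</td>
    <td><table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table></td></tr>
    </table>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # Keep only the tables that do not contain another table
    non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]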

    Alternatively, to extract the content of all tables, including those that contain other tables, take only each table's top-level tr rows and th/td cells. To do this, turn off recursion when calling the find_all method:

    soup = BeautifulSoup(html, 'lxml')
    tables = soup.find_all('table')
    for cnt, my_table in enumerate(tables, start=1):
        print('=============== TABLE {} ==============='.format(cnt))
        rows = my_table.find_all('tr', recursive=False)                  # <-- HERE
        for row in rows:
            cells = row.find_all(['th', 'td'], recursive=False)          # <-- HERE
            for cell in cells:
                # DO SOMETHING
                # cell.string is None for cells with several children or a nested table
                if cell.string:
                    print(cell.string)
    

    Output:

    =============== TABLE 1 ===============
    Top level table cell
    =============== TABLE 2 ===============
    Nested table cell
    ...another nested cell
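
To tie this back to the original question (all rows of one particular table), here is a minimal sketch of a helper that collects a table's own rows as lists of cell text. The name `top_level_rows` and the sample markup are hypothetical, not from the answer above. It also assumes rows may be wrapped in thead/tbody/tfoot, which happens when the page markup includes those tags or when a parser such as html5lib inserts them; in that case `find_all('tr', recursive=False)` on the table itself finds nothing.

```python
from bs4 import BeautifulSoup


def top_level_rows(table):
    """Return the table's own rows as lists of cell text (hypothetical helper)."""
    # Rows may sit directly under <table> or inside <thead>/<tbody>/<tfoot>.
    sections = table.find_all(['thead', 'tbody', 'tfoot'], recursive=False)
    rows = []
    for container in [table] + sections:
        for tr in container.find_all('tr', recursive=False):
            cells = tr.find_all(['th', 'td'], recursive=False)
            # As in the loop above, cell.string is None for a cell that
            # holds a nested table; such cells become empty strings here.
            rows.append([cell.string.strip() if cell.string else '' for cell in cells])
    return rows


# Sample markup (an assumption for illustration): the nested table uses <tbody>.
sample_html = '''<table>
<tr><td>Top level table cell</td>
<td><table>
<tbody>
<tr><td>Nested table cell</td></tr>
<tr><td>...another nested cell</td></tr>
</tbody>
</table></td></tr>
</table>'''

soup = BeautifulSoup(sample_html, 'html.parser')
tables = soup.find_all('table')      # document order: outer table first
outer_rows = top_level_rows(tables[0])
inner_rows = top_level_rows(tables[1])
print(outer_rows)
print(inner_rows)
```

If Open Office numbers tables from 1 in document order (an assumption worth checking against the page), Table #11 would be `tables[10]`. Note that Open Office may count nested tables differently from `find_all('table')`, so the index may need adjusting.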
    
