I am learning Python and BeautifulSoup to scrape data from the web, and read a HTML table. I can read it into Open Office and it says that it is Table #11.
It seems
If you ever have nested tables (as on the old-school designed websites), the above approach might fail.
As a solution, you might want to extract non-nested tables first:
html = '''
Top level table cell
Nested table cell
...another nested cell
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
Alternatively, if you want to extract content of all the tables, including those that nest other tables, you can extract only top-level tr and th/td headers. For this, you need to turn off recursion when calling the find_all method:
soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
cnt = 0
for my_table in tables:
cnt += 1
print ('=============== TABLE {} ==============='.format(cnt))
rows = my_table.find_all('tr', recursive=False) # <-- HERE
for row in rows:
cells = row.find_all(['th', 'td'], recursive=False) # <-- HERE
for cell in cells:
# DO SOMETHING
if cell.string: print (cell.string)
Output:
=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell