BeautifulSoup: Get the contents of a specific table

问题

My local airport disgracefully blocks users without IE, and looks awful. I want to write a Python scripts that would get the contents of the Arrival and Departures pages every few minutes, and show them in a more readable manner.

My tools of choice are mechanize for cheating the site to believe I use IE, and BeautifulSoup for parsing page to get the flights data table.

Quite honestly, I got lost in the BeautifulSoup documentation, and can't understand how to get the table (whose title I know) from the entire document, and how to get a list of rows from that table.

Any ideas?

回答1:

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table who's id is "Table1" and gets all of its tr elements.

html = urllib2.urlopen(url).read()
bs = BeautifulSoup(html)
table = bs.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="Table1") 
rows = table.findAll(lambda tag: tag.name=='tr')

回答2:

soup = BeautifulSoup(HTML)

# the first argument to find tells it what tag to search for
# the second you can pass a dict of attr->value pairs to filter
# results that match the first tag
table = soup.find( "table", {"title":"TheTitle"} )

rows=list()
for row in table.findAll("tr"):
   rows.append(row)

# now rows contains each tr in the table (as a BeautifulSoup object)
# and you can search them to pull out the times

回答3:

Here is working example for a generic <table>. (Though not using your page due javascript execution needed to load table data)

Extracting the table data from here GDP (Gross Domestic Product) by countries.

table = soup.find('table', { 'class' : 'table table-striped' })
# where the dictionary specify unique attributes for the 'table' tag

Bellow the main tableDataText function parses a html segment started with tag <table> followed by multiple <tr> (table rows) and inner <td> (table data) tags. It returns a list of rows with inner columns. Accepts only one <th> (table header/data) in the first row.

def rowgetDataText(tr, coltag='td'): # td (data) or th (header)
    cols = []
    for td in tr.find_all(coltag):
        cols.append(td.get_text(strip=True))
    return cols

def tableDataText(table):       
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td')) # data row
    return rows

Using it we get (first two rows).

list_table = tableDataText(htmltable)
list_table[:2]

[['Rank',
  'Name',
  "GDP (IMF '19)",
  "GDP (UN '16)",
  'GDP Per Capita',
  '2019 Population'],
 ['1',
  'United States',
  '21.41 trillion',
  '18.62 trillion',
  '$65,064',
  '329,064,917']]

That can be easily transformed in a pandas.DataFrame for more advanced manipulation.

import pandas as pd

dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)

来源：https://stackoverflow.com/questions/2935658/beautifulsoup-get-the-contents-of-a-specific-table

标签

python

web-scraping

beautifulsoup

tabular