I am trying to extract the first and third columns of this data table using BeautifulSoup. From looking at the HTML the first column has a
In addition to @jonhkr's answer, I thought I'd post an alternative solution I came up with. Unlike jonhkr's answer, which fetches the web page over the network, mine assumes that you have the page saved on your computer and pass its filename as a command-line argument, as in the invocation shown after the script.
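#!/usr/bin/env python3
from sys import argv

from bs4 import BeautifulSoup

filename = argv[1]

# Read the saved HTML file into a single string
with open(filename) as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# Take the first table; this assumes the markup has an explicit <tbody>
table = soup.find_all('table')[0].tbody

# For every row, keep text nodes 1 and 5: the first and third columns
# when whitespace separates the cells in the saved HTML
rows = [row.find_all(string=True) for row in table.find_all('tr')]
data = [(texts[1], texts[5]) for texts in rows]
print(data)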
python file.py table.html
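
The script picks text nodes 1 and 5 out of each row, which only lines up with the first and third columns if the saved HTML has whitespace between the cells. Here is a minimal sketch of why, using the standard library's html.parser as a stand-in for BeautifulSoup's text-node search. The sample rows and the names TextNodes and text_nodes are made up for illustration; the real table's markup is not shown in the post.

```python
from html.parser import HTMLParser

# Hypothetical sample rows (assumptions, not the real table from the question).
SPACED_ROW = "<tr>\n<td>Alice</td>\n<td>30</td>\n<td>Paris</td>\n</tr>"
COMPACT_ROW = "<tr><td>Alice</td><td>30</td><td>Paris</td></tr>"


class TextNodes(HTMLParser):
    """Collect every text node in order, whitespace-only ones included."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def text_nodes(html):
    """Return the list of text nodes found in an HTML fragment."""
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


nodes = text_nodes(SPACED_ROW)
print(nodes)                      # newline nodes sit between the cell texts
print((nodes[1], nodes[5]))       # first and third column of the spaced row
print(text_nodes(COMPACT_ROW))    # no whitespace nodes, so the indices shift
```

If your saved file has no whitespace between cells, the columns would sit at indices 0 and 2 instead, so it is worth printing the text nodes of one row before settling on the indices.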