Question
I'm trying to extract some data from two HTML tables in an HTML file with BeautifulSoup.
This is actually the first time I'm using it, and I've searched a lot of questions/examples, but none seem to work in my case. The HTML contains two tables: the first holds the headers of the first column (which are always text), and the second contains the data of the following columns. Moreover, the tables contain text, numbers and also symbols, which makes everything more complicated for a novice like me. Here's the layout of the HTML, copied from the browser. With my code I was able to extract the whole HTML content of the rows, but only for the first table, so in reality I am not getting any data, only the content of the first column.
The output I'm trying to obtain is a string containing the "joined" information of the two tables (Col1 = text, Col2 = number, Col3 = number, Col4 = number, Col5 = number), such as:
Canada, 6, 5, 2, 1
Here's the list of the Xpaths for each item:
"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1]
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3]
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]
I would also be happy with strings in "rough" HTML format, as long as there is one string per row, so that I can parse them further with the methods I already know. Here's the code I have so far. Thanks!
from BeautifulSoup import BeautifulSoup
html="""
my html code
"""
soup = BeautifulSoup(html)
table=soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col
Answer 1:
I'm using bs4 rather than BeautifulSoup 3, but this approach should work:
from bs4 import BeautifulSoup as bsoup

ofile = open("htmlsample.html")
soup = bsoup(ofile)
soup.prettify()

# tables[0] holds the first-column labels, tables[2] the matching value rows
# (switch to tables[1]/tables[3] for the country table, as noted below).
tables = soup.find_all("tbody")
storeTable = tables[0].find_all("tr")
storeValueRows = tables[2].find_all("tr")

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    # only every second <td> holds a value (td[1], td[3], td[5], td[7] in the question's XPaths)
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print " ".join(record)
The above will print out:
# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]
Changing storeTable to tables[1] and storeValueRows to tables[3] will print out:
Country
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]
Not the best code, and it can be improved further, but the logic applies well.
Hope this helps.
EDIT:
If you want the format South Korea, 46, 27, 2, 5 instead of South Korea 46, 27, 2, 5 (note the , after the country name), just change this:
storeRank.append(row.get_text().strip())
to this:
storeRank.append(row.get_text().strip() + ",")
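If you want the exact format from the question (Canada, 6, 5, 2, 1), another option is to skip the trailing-comma trick and comma-join the label and the values directly. A minimal sketch, assuming storeRank and storeMatrix are built exactly as in the code above (without the EDIT applied):
for rank, values in zip(storeRank, storeMatrix):
    # storeMatrix entries are already "6, 5, 2, 1"-style strings
    print ", ".join([rank, values])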
Answer 2:
Looks like you are scraping data from http://www.appannie.com.
Here is the code to get the data. I am sure some parts of it could be improved or written in a more Pythonic way, but it gets what you want. Also, I used Beautiful Soup 4 instead of 3.
from bs4 import BeautifulSoup

html_file = open('test2.html')
soup = BeautifulSoup(html_file)

countries = []
countries_table = soup.find_all('table', attrs={'class': 'data-table table-rank'})[1]
countries_body = countries_table.find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)

data = []
data_table = soup.find_all('table', attrs={'class': 'data-table table-rank'})[3]
data_body = data_table.find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:
        sublist.append(td.text)
    data.append(sublist)

for element in zip(countries, data):
    print element
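Each element printed by the last loop is a tuple such as ('Canada', ['6', '5', '2', '1']). To get the single comma-joined string asked for in the question, a small follow-up sketch (reusing the countries and data lists built above) could be:
for country, values in zip(countries, data):
    print ", ".join([country] + values)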
Hope this helps :)
Answer 3:
Just thought I would put my alternate version of this here. I don't even know why people still use BeautifulSoup for web scraping; it's much easier to use XPath directly through lxml. Here is the same problem, in what is perhaps an easier form to read and update:
from lxml import html, etree
tree = html.parse("sample.html").xpath('//body/div/div')
lxml_getValue = lambda x: etree.tostring(x, method="text", encoding='UTF-8').strip()
lxml_getData = lambda x: "{}, {}, {}, {}".format(
    lxml_getValue(x.xpath('.//td')[0]),
    lxml_getValue(x.xpath('.//td')[2]),
    lxml_getValue(x.xpath('.//td')[4]),
    lxml_getValue(x.xpath('.//td')[6]),
)
locations = tree[0].xpath('.//tbody')[1].xpath('./tr')
locations.pop(0) # Don't need first row
data = tree[1].xpath('.//tbody')[1].xpath('./tr')
data.pop(0) # Don't need first row
for f, b in zip(locations, data):
    print(lxml_getValue(f), lxml_getData(b))
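For readability, here is an equivalent sketch of the same lxml approach (my assumption: the same sample.html layout as above), replacing the hand-written format string with a [::2] slice over the <td> cells, which mirrors the td[1]/td[3]/td[5]/td[7] positions in the question's XPaths:
from lxml import html, etree

tree = html.parse("sample.html")
divs = tree.xpath('//body/div/div')

def cell_text(cell):
    # plain text content of an element, whitespace stripped;
    # encoding="unicode" makes lxml return text instead of bytes
    return etree.tostring(cell, method="text", encoding="unicode").strip()

location_rows = divs[0].xpath('.//tbody')[1].xpath('./tr')[1:]  # skip the header row
data_rows = divs[1].xpath('.//tbody')[1].xpath('./tr')[1:]      # skip the header row

for loc, row in zip(location_rows, data_rows):
    values = [cell_text(td) for td in row.xpath('.//td')[::2]]  # every second <td>
    print(", ".join([cell_text(loc)] + values))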
Source: https://stackoverflow.com/questions/22410416/beautifulsoup-extracting-data-from-multiple-tables