Question
I am trying to read in HTML pages and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years for a given company. I can fetch the page and use either BeautifulSoup or html2text to produce one huge text block. I then want to search that text -- I have been using re.search -- but can't get it to work properly. Here is the line I am trying to extract:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build url
text_soup = BeautifulSoup(urlopen(full_url).read())  # read in the page
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)  # flatten all text nodes into one block
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
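For reference, the pattern above never matches because 'EPS' is followed by ' (Basic)', not by whitespace and digits. A rough sketch that keys on every value having exactly two decimal places would split the sample line apart:

# Sketch, assuming the text looks exactly like the sample line above
m = re.search(r'EPS \(Basic\)\n((?:\d+\.\d{2})+)', text)
if m is not None:
    EPS = [float(x) for x in re.findall(r'\d+\.\d{2}', m.group(1))]
    print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]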
Answer 1:
It's not good practice to parse HTML with regular expressions. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and 'EPS (Basic)' in its text, then iterate over the following siblings that have the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
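Since the question asks for a list of floats, the cell texts can be converted in the same if block; a minimal follow-up using the names already defined in the loop above:

eps = [float(td.text) for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
# eps == [13.46, 20.62, 26.69, 30.17, 32.81]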
Hope that helps.
Answer 2:
I would take a very different approach. We use lxml for scraping HTML pages. One of the reasons we switched was that BeautifulSoup was not being maintained for a while -- or, I should say, updated.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically (a sketch follows the next snippet), but I will leave that for you.
tables = [e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]  # the second table on the page holds the EPS data
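A minimal sketch of picking the table programmatically instead, keyed on its content (this assumes the row label appears verbatim in the table's text):

# Hypothetical alternative: select the table that actually contains the label
eps_table = next(t for t in tables if 'EPS (Basic)' in t.text_content())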
Now I noticed that the first row holds the column headings, so I want to separate out all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings = [e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
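Printing them here is a quick sanity check; judging from the final output below, the year columns should be present:

print column_headings  # expect something like [..., '2008', '2009', '2010', '2011', '2012', '5-year trend']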
Finally, we can map the column headings to the row labels and cell values:
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()  # first cell holds the row label
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal).
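A hypothetical version of that check, reusing the names defined above -- every data row should have as many td cells as there are column headings:

row_widths = [len([e for e in row.iter() if e.tag == 'td']) for row in table_rows[1:]]
assert all(w == len(column_headings) for w in row_widths), 'table is not square'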
Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
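For instance, a hypothetical XPath version that jumps straight to the row (untested against the live page):

eps_rows = tree.xpath("//tr[td[contains(., 'EPS (Basic)')]]")
if eps_rows:
    print [td.text_content() for td in eps_rows[0].xpath('./td')]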
I should add that every row from the table is available, in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.
When I need a new build of lxml, I visit a page maintained by a really nice guy at UC Irvine.
I hope this helps.
Answer 3:
from bs4 import BeautifulSoup
import urllib2
import lxml  # imported here but not referenced directly below
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code gives you a DataFrame from the webpage and then uses it to create a Python dictionary.
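As a hedged follow-up (treating column 0 as the one holding the row labels, which is an assumption about how the ragged rows land in the frame), a single row could then be pulled out of the DataFrame like this:

label = 'EPS (Basic)'  # substitute a row label that actually appears in this table
row = df[df[0] == label]  # assumes the labels ended up in column 0
print row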
Source: https://stackoverflow.com/questions/17709058/parsing-html-data-into-python-list-for-manipulation