Parsing HTML data into a Python list for manipulation


Question


I am trying to read in HTML pages and extract their data. For example, I would like to read in the EPS (earnings per share) for the past five years of a company. I can fetch the page and use either BeautifulSoup or html2text to produce one big block of text, which I then want to search (I have been using re.search), but I can't get the pattern to match properly. Here is the line I am trying to extract from:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.

from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  #build url

text_soup = BeautifulSoup(urlopen(full_url).read()) #read in 

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

eps = re.search("EPS\s+(\d+)", text)  # fails: 'EPS' is followed by '(Basic)', not digits
if eps is not None:
    print eps.group(1)
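For reference, since each figure on that line has exactly two decimal places, a decimal-aware pattern can split the concatenated digits. A minimal sketch, assuming that fixed format:

import re

line = 'EPS (Basic)\n13.4620.6226.6930.1732.81\n\n'
EPS = [float(x) for x in re.findall(r'\d+\.\d{2}', line)]
print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]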

Answer 1:


It's not good practice to parse HTML with regular expressions. Use the BeautifulSoup parser instead: find the cell with the rowTitle class whose text contains EPS (Basic), then iterate over the following siblings that have the valueCell class:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

prints:

['13.46', '20.62', '26.69', '30.17', '32.81']
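Since the question asks for a list of numbers rather than strings, the matched cell text can be converted inside the same loop; a minimal sketch:

if 'EPS (Basic)' in title.text:
    values = [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
    EPS = [float(v) for v in values]
    print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]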

Hope that helps.




Answer 2:


I would take a very different approach. We use lxml for scraping HTML pages.

One of the reasons we switched is that BeautifulSoup was not being updated for a while.

In my test I ran the following:

import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content

tree = html.fromstring(page_as_string)

Now, looking at the page, I see the data is divided into two tables. Since you want EPS, note that it is in the second one. We could write some code to select the right table programmatically; a sketch follows the next snippet.

tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
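As a sketch of the programmatic selection mentioned above (assuming the row label 'EPS (Basic)' appears only in the table we want), the last-table guess could be replaced with:

# pick the table whose text contains the row label we are after
eps_table = next(t for t in tables if 'EPS (Basic)' in t.text_content())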

Now, the first row holds the column headings, so I want to separate out all of the rows:

table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']

Now let's get the column headings:

column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']

Finally, we can map the column headings to the row labels and cell values:

my_results = []
for row in table_rows[1:]:
    cell_content = [ e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()  # the first cell is the row label
        else:
            dict_key = column_headings[numb]  # remaining cells line up with the year headings
            temp_dict[dict_key] = cell

    my_results.append(temp_dict)

Now, to access the results:

for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]   


row_label :  EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend : 

There is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal). A minimal check is sketched below.
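A minimal squareness check might look like this (a sketch, assuming the header row defines the expected width):

for numb, row in enumerate(table_rows[1:], start=1):
    cells = [e for e in row.iter() if e.tag == 'td']
    if len(cells) != len(column_headings):
        print 'row %d has %d cells, expected %d' % (numb, len(cells), len(column_headings))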

Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this works and gets you everything from the table in a nicely structured manner.
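For instance, borrowing the rowTitle and valueCell class names from the first answer (an assumption; I have not re-checked them against the live page), an XPath query could fetch the EPS row directly:

eps = tree.xpath('//td[contains(@class, "rowTitle") and contains(., "EPS (Basic)")]'
                 '/following-sibling::td[contains(@class, "valueCell")]/text()')
print eps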

I should add that every row from the table is available, in the original row order. The first item in the my_results list (a dictionary) holds the data from the first row, the second item holds the data from the second row, and so on.

When I need a new build of lxml, I visit a page maintained by a really nice guy at UC Irvine.

I hope this helps.




Answer 3:


from bs4 import BeautifulSoup
import urllib2
import lxml  # parser backend used by BeautifulSoup
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'

soup = BeautifulSoup(urllib2.urlopen(url).read())

# the financials table is marked with a data-ajax-content attribute
table = soup.find('table', {'data-ajax-content' : 'true'})

data = []

for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])  # keep only non-empty cells

df = pd.DataFrame(data)

print df

dictframe = df.to_dict()

print dictframe

The above code gives you a DataFrame from the web page and then uses it to create a Python dictionary.
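As an aside, pandas can also parse HTML tables directly via pandas.read_html, which returns a list of DataFrames, one per table found. A minimal sketch (assuming lxml or html5lib is installed, and that the table of interest is the first one on the page, which you would need to verify):

import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
tables = pd.read_html(url)
df = tables[0]  # assumption: the balance-sheet table comes first
print df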



Source: https://stackoverflow.com/questions/17709058/parsing-html-data-into-python-list-for-manipulation
