Get contents of <a> tags using Python

Question

Assuming I have html read into my program like this:

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>

How do I grab the contents of the text node? What I would like to end up with is printing something similar to this line in the terminal:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT

So far I have the following code which extracts the href link fine but I'm not sure how to extract the data itself. I'm thinking of overriding handle_data(self, data) from the sgmllib.py module but so far I can't seem to think of a way to do it.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

Thanks!

Answer 1:

Simplest is probably BeautifulSoup (be sure to use 3.0.8 or a higher 3.0.* release, not 3.1.*, unless you're on Python 3).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string

BeautifulSoup produces unicode strings -- if that's a problem, encode them to get byte strings in whatever encoding you need.
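
As a minimal sketch of that encoding step (written for Python 3, where the unicode type is called str; the title value is a made-up example, not real BeautifulSoup output):

```python
# Hypothetical link text containing a non-ASCII character.
title = 'Caf\u00e9 Sales Associate'

# UTF-8 can represent every Unicode character.
utf8_bytes = title.encode('utf-8')

# ASCII cannot represent the accented e; 'replace' substitutes '?'
# instead of raising UnicodeEncodeError.
ascii_bytes = title.encode('ascii', 'replace')

print(utf8_bytes)
print(ascii_bytes)
```

Which encoding to pick depends on where the bytes are going (terminal, file, database), so treat the two calls above as options rather than a recommendation.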

Answer 2:

Personally I would use lxml. Once installed, getting what you want is simple:

from lxml import html

tree = html.fromstring(open("data.html").read())

print [e.text_content() for e in tree.xpath("//a")]
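
The snippet above prints only the link text. To pair each text with its href, lxml elements also offer e.get("href"). If installing lxml is not an option, a rough standard-library analogue is sketched below. Note that it swaps in xml.etree.ElementTree instead of lxml, so it only works when the markup happens to be well-formed XML, which real-world HTML often is not. The fragment string and the <root> wrapper are assumptions made for illustration (Python 3).

```python
import xml.etree.ElementTree as ET

# Two entries copied from the question, one with and one without a <font> tag.
fragment = (
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">'
    'TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>'
    '<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">'
    'IMMEDIATE EMPLOYMENT WANTED!</a> - </p>'
)

# Wrap the fragment in a single root element so it parses as one XML document.
root = ET.fromstring('<root>%s</root>' % fragment)

# itertext() yields the text inside each <a>, excluding the text that follows it.
pairs = [(a.get('href'), ''.join(a.itertext())) for a in root.iter('a')]

for href, text in pairs:
    print('%s - %s' % (href, text))
```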

Answer 3:

SGMLParser (the sgmllib module) was deprecated in Python 2.6 and removed in Python 3.0. You probably want to use the HTMLParser module instead. I've never used it before (I always just use BeautifulSoup for this kind of thing), so I figured I'd learn how it works. Here's a sample script I put together that should get you what you want.

#!/usr/bin/env python

from HTMLParser import HTMLParser

class URLParser(HTMLParser):
    def __init__(self):
        self.in_link = False
        self.links = []
        self.current_link = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_link = self.get_href_from_attrs(attrs)
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.links.append(self.current_link)
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.current_link = '%s - %s' % (self.current_link, data)

    def get_href_from_attrs(self, attrs):
        # attrs is a list of (name, value) tuples like:
        #  [('href', 'www.google.com'), ('class', 'some-class')]
        for prop, val in attrs:
            if prop == 'href':
                return val
        return ''

if __name__ == '__main__':
    the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
    '''
    url_parser = URLParser()
    url_parser.feed(the_html)

    print '\n'.join(url_parser.links)

Output

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T  -  P/T Sales Associate - Caliente Fashions
http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales
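
The first line of the output above reads "F/T  -  P/T" because handle_data is called once for the text on each side of &amp;, and the entity itself is never handled. A minimal sketch of one way around that, assuming Python 3 (where the module is named html.parser): collect the text chunks and join them when the tag closes. The LinkTextParser class and the sample string are illustrative names, not part of the script above.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class LinkTextParser(HTMLParser):
    """Collects (href, text) pairs for every <a> element fed to it."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp; and friends
        # and deliver the result through handle_data.
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href', '')
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._chunks)))
            self._href = None


sample = ('<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">'
          'F/T &amp; P/T Sales Associate - Caliente Fashions</a> - '
          '<font size="-1"> (North Vancouver)</font></p>')

parser = LinkTextParser()
parser.feed(sample)
parser.close()

for href, text in parser.links:
    print('%s - %s' % (href, text))
```

With this input the title comes out as "F/T & P/T Sales Associate - Caliente Fashions", with the entity decoded and no extra separator.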

Update: After going through that little exercise, the interface to this just feels gross, so I'm going to stick with the much cleaner BeautifulSoup library. See Alex's sample to see how it's done.

Answer 4:

As long as we're comparing options, this pyparsing snippet also gives you the location for each position, given in the <font> tag following the closing <a> tag:

from pyparsing import makeHTMLTags, SkipTo

a,aEnd = makeHTMLTags("A")
font,fontEnd = makeHTMLTags("FONT")
p,pEnd = makeHTMLTags("P")

patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' + 
        font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)

for tokens,_,_ in patt.scanString(the_html):
    print tokens.a.href, '-', tokens.posn, tokens.locn

Gives (the two entries without a <font> tag don't match the pattern, so they are skipped):

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T &amp; P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)

Answer 5:

# Requires the BeautifulSoup library (download it separately)
from BeautifulSoup import BeautifulSoup

fh = open('data.html')
html = fh.read()
soup = BeautifulSoup(html)

tags = soup('a')

for tag in tags:
    print tag.contents[0]

Source: https://stackoverflow.com/questions/3145178/get-contents-of-a-tags-using-python

Tags

python