Web Crawler To get Links From New Website

Submitted by 六眼飞鱼酱① on 2019-12-20 06:38:44

Question


I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python:

main.py contains :

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]

print articletext

An example of the object in tag.contents[0] : <a href="http://www.thehindu.com/business/itc-to-issue-11-bonus/article472545.ece" target="_blank">ITC to issue 1:1 bonus</a>

But on running it I am getting the following error :

File "C:\Python27\crawler\main.py", line 4, in <module>
    text = articletext.getArticle(url)
  File "C:\Python27\crawler\articletext.py", line 23, in getArticle
    return getArticleText(htmltext)
  File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
    articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects

Can someone help me sort it out? I am new to Python programming. Thanks and regards.


Answer 1:


You are using link_dictionary vaguely. If you are not using it to read the links back later, try the following code:

import re

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print re.sub('\s+', ' ', articletext, flags=re.M)

Note: re is Python's regular expression module; you need to import re before the last line will work.
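As a small, self-contained illustration of the whitespace-collapsing step (Python 3 syntax; the sample text is made up):

```python
import re

# A scraped article often contains runs of spaces, tabs and newlines.
raw = "ITC to  issue\n1:1\t bonus"
clean = re.sub(r'\s+', ' ', raw)
print(clean)  # → "ITC to issue 1:1 bonus"
```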




Answer 2:


I believe you may want to try accessing the text inside the list item like so:

for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string
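For context, the TypeError in the question happens because tag.contents[0] is the <a> element itself, a Tag object, while tag.string and get_text() return plain text. A minimal sketch of the difference (Python 3 with bs4; the markup is a made-up snippet modelled on the question):

```python
from bs4 import BeautifulSoup

# A list item shaped like the ones in the question (hypothetical markup).
html = '<li data-section="Business"><a href="/article472545.ece">ITC to issue 1:1 bonus</a></li>'
li = BeautifulSoup(html, 'html.parser').find('li')

print(type(li.contents[0]).__name__)  # Tag -- concatenating this to a str raises TypeError
print(li.get_text())                  # ITC to issue 1:1 bonus
```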

Edited: General Comments on getting links from a page

Probably the easiest data type to use to gather a bunch of links and retrieve them later is a dictionary.

To get links from a page using BeautifulSoup, you could do something like the following:

from contextlib import closing
from urllib2 import urlopen

link_dictionary = {}
with closing(urlopen(url_source)) as f:
    soup = BeautifulSoup(f)
    for link in soup.findAll('a'):
        link_dictionary[link.string] = link.get('href')

This will provide you with a dictionary named link_dictionary, where every key is the text between the <a> </a> tags and every value is the value of the href attribute.
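If BeautifulSoup is not available, the same link-gathering idea can be sketched with the standard library's html.parser (Python 3; the class name and sample HTML are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect anchor-text -> href pairs into a dictionary."""
    def __init__(self):
        super().__init__()
        self.links = {}    # text between <a> </a> -> href value
        self._href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links[data.strip()] = self._href

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None

parser = LinkCollector()
parser.feed('<li><a href="/article472545.ece">ITC to issue 1:1 bonus</a></li>')
print(parser.links)  # {'ITC to issue 1:1 bonus': '/article472545.ece'}
```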


How to combine this with your previous attempt

Now, if we combine this with the problem you were having before, we could try something like the following:

link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href') 
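Once populated, the dictionary holds headline -> URL pairs you can retrieve later; for example (Python 3, with illustrative contents rather than a real crawl):

```python
# Hypothetical contents of link_dictionary after crawling the archive page.
link_dictionary = {
    'ITC to issue 1:1 bonus': '/business/itc-to-issue-11-bonus/article472545.ece',
}

for headline, href in sorted(link_dictionary.items()):
    # Each href could now be opened with mechanize/urlopen and parsed in turn.
    print('%s => %s' % (headline, href))
```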

If this doesn't make sense, or you have a lot more questions, you will need to experiment first and try to come up with a solution before asking another new, clearer question.




Answer 3:


You might want to use the powerful XPath query language with the faster lxml module. It is as simple as this:

import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])
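If lxml is not installed, the standard library's xml.etree.ElementTree supports a useful subset of XPath, provided the markup is well-formed (Python 3; the fragment below is made up to match the archive listing's shape):

```python
import xml.etree.ElementTree as ET

# Well-formed fragment shaped like the archive listing (hypothetical).
fragment = """<ul>
  <li data-section="Business"><a href="/article472545.ece">ITC to issue 1:1 bonus</a></li>
  <li data-section="Sport"><a href="/other.ece">Other story</a></li>
</ul>"""

root = ET.fromstring(fragment)
for link in root.findall(".//li[@data-section='Business']/a"):
    print('%s (%s)' % (link.text, link.get('href')))  # ITC to issue 1:1 bonus (/article472545.ece)
```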

Update for @data-section='Chennai'

#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])


Source: https://stackoverflow.com/questions/19914498/web-crawler-to-get-links-from-new-website
