Web Crawler to Get Links from a News Website

忘了有多久 2021-01-26 10:49

I am trying to get the links from a page of a news website (one of its archive pages). I wrote the following lines of code in Python:

main.py contains:

3 answers

慢半拍i (original poster)

2021-01-26 11:27

You might want to use the XPath query language together with the fast lxml module. It is as simple as this:

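# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
# etree.HTML uses lxml's lenient HTML parser and returns the root element
html = etree.HTML(urlopen(url).read())

# Select every <a> that is a direct child of an <li> in the Business section
for link in html.xpath("//li[@data-section='Business']/a"):
    print('{} ({})'.format(link.text, link.get('href')))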

Update for @data-section='Chennai', which reads from a different URL:

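#!/usr/bin/env python3
# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
# etree.HTML uses lxml's lenient HTML parser and returns the root element
html = etree.HTML(urlopen(url).read())

# Select every <a> that is a direct child of an <li> in the Chennai section
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print('{} => {}'.format(link.text, link.get('href')))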
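
To try the XPath expression without hitting the site, here is a minimal offline sketch. It swaps lxml for the standard library's xml.etree.ElementTree, which supports only a limited XPath subset but handles this attribute predicate. The SAMPLE markup and the links_for_section helper are made up for illustration; the real archive page may be structured differently, and real-world HTML is usually not well-formed XML, which is why the answer above uses lxml's HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example;
# it only imitates the <li data-section="..."><a href="..."> pattern.
SAMPLE = """
<ul>
  <li data-section="Business"><a href="/business/a.ece">Story A</a></li>
  <li data-section="Chennai"><a href="/chennai/b.ece">Story B</a></li>
  <li data-section="Business"><a href="/business/c.ece">Story C</a></li>
</ul>
"""


def links_for_section(markup, section):
    """Return (text, href) pairs for <a> children of matching <li> elements."""
    root = ET.fromstring(markup)
    # Same predicate as in the answer, written relative to the root element
    path = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get("href")) for a in root.findall(path)]


if __name__ == "__main__":
    for text, href in links_for_section(SAMPLE, "Business"):
        print("{} ({})".format(text, href))
```

Changing the section argument to "Chennai" selects the other list item, mirroring the update above.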
