Web Crawler to Get Links from a News Website

忘了有多久 2021-01-26 10:49

I am trying to get the links from a page of a news website (one of its archive pages). I wrote the following lines of code in Python:

main.py contains:



        
3 answers
  •  慢半拍i (original poster)
    2021-01-26 11:27

    You might want to use the powerful XPath query language with the fast lxml module. It is as simple as this:

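    # Python 3: urllib2 became urllib.request
    from urllib.request import urlopen
    from lxml import etree
    
    url = 'http://www.thehindu.com/archive/web/2010/06/19/'
    # etree.HTML parses the raw bytes leniently, tolerating malformed markup
    html = etree.HTML(urlopen(url).read())
    
    # Select the <a> children of every <li> tagged as a Business item
    for link in html.xpath("//li[@data-section='Business']/a"):
        print('{} ({})'.format(link.text, link.attrib['href']))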
    

    Update for @data-section='Chennai', which is read from a different URL:

    #!/usr/bin/env python3
    # Python 3: urllib2 became urllib.request
    from urllib.request import urlopen
    from lxml import etree
    
    url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
    html = etree.HTML(urlopen(url).read())
    
    # Select the <a> children of every <li> tagged as a Chennai item
    for link in html.xpath("//li[@data-section='Chennai']/a"):
        print('{} => {}'.format(link.text, link.attrib['href']))
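
To check what the XPath predicate selects without touching the network, here is a minimal sketch that runs the same kind of query against a small inline fragment. It assumes Python 3 and uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) in place of lxml. The SAMPLE markup, its hrefs, and the helper name links_for_section are made up for illustration and are not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the archive page markup.
SAMPLE = """
<html><body><ul>
  <li data-section="Business"><a href="/business/story-1.ece">Story one</a></li>
  <li data-section="Chennai"><a href="/chennai/story-2.ece">Story two</a></li>
  <li data-section="Business"><a href="/business/story-3.ece">Story three</a></li>
</ul></body></html>
"""


def links_for_section(markup, section):
    """Return (text, href) pairs for <a> children of <li data-section=section>."""
    root = ET.fromstring(markup)
    # Same predicate as the lxml query; ElementTree needs the leading '.'.
    path = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get("href")) for a in root.findall(path)]


for text, href in links_for_section(SAMPLE, "Business"):
    print("{} ({})".format(text, href))
```

ElementTree only accepts well-formed XML, so for real pages the lxml version above, with its lenient etree.HTML parser, is likely the more practical choice. The sketch is only meant to show how the data-section predicate filters the list items.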
    
