I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python:
main.py contains:
You might want to use the powerful XPath query language with the faster lxml module. It is as simple as this:
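import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
# Fetch the page and parse it with lxml's HTML parser
html = etree.HTML(urllib2.urlopen(url).read())
# Select every <a> that is a direct child of an <li data-section="Business">
for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])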
Update for @data-section='Chennai' (note the different URL):
#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
# Fetch the page and parse it with lxml's HTML parser
html = etree.HTML(urllib2.urlopen(url).read())
# Same query as before, filtered on the 'Chennai' section
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
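To see what the XPath expression selects without hitting the network, here is a minimal offline sketch. It uses the standard library's ElementTree, which supports a subset of XPath, rather than lxml. The sample markup and the helper name links_for_section are made up for illustration, and the real archive page may be structured differently. ElementTree needs well-formed markup, so for real HTML you would still parse with lxml as above. lxml elements accept the same findall call.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real archive page may differ.
SAMPLE = """
<ul>
  <li data-section="Business"><a href="/business/a.ece">Markets rally</a></li>
  <li data-section="Chennai"><a href="/chennai/b.ece">Metro update</a></li>
  <li data-section="Business"><a href="/business/c.ece">Rupee gains</a></li>
</ul>
"""


def links_for_section(markup, section):
    """Return (text, href) pairs for <a> children of <li data-section=section>."""
    root = ET.fromstring(markup)
    # './/li' searches all descendants; the predicate filters on the attribute;
    # '/a' then takes only direct <a> children of each matching <li>.
    path = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get("href")) for a in root.findall(path)]


if __name__ == "__main__":
    for text, href in links_for_section(SAMPLE, "Business"):
        print("{} ({})".format(text, href))
```

Changing the section argument to "Chennai" mirrors the update above: only the predicate value changes, not the shape of the query.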