I\'m working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
import lxml.etree as ET
body = t.xpath("//body");
for tag in body:
h = html.fromstring( ET.tostring(tag[0]) ).xpath("//h1");
p = html.fromstring( ET.tostring(tag[1]) ).xpath("//p");
htext = h[0].text_content();
ptext = h[0].text_content();
you can also use .get('href') for a tag and .attrib for attribute ,
here tag no is hardcoded but you can also do this dynamic