lxml

How to extract links from a webpage using lxml, XPath and Python?

徘徊边缘 submitted on 2019-12-07 06:22:37
Question: I've got this XPath query:

    /html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links with the title attribute and gives the href in Firefox's XPath Checker add-on. However, I cannot seem to use it with lxml.

    from lxml import etree

    parsedPage = etree.HTML(page)  # Create parse tree from valid page.

    # XPath query
    hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
    for x in hyperlinks:
        print(x)  # Print links in <a> tags containing the title attribute
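One frequent reason a browser-derived query returns nothing in lxml is that browsers add a `<tbody>` element to tables in their DOM, while the HTML as served (and as parsed by lxml) often has none. Whether that is the cause here is an assumption, since the question's page is not shown. A minimal sketch with a hypothetical `page` string:

```python
from lxml import etree

# Hypothetical page source; note that the table contains no <tbody> element.
page = """
<html><body>
  <table>
    <tr><td><a title="First" href="/first">First</a></td></tr>
    <tr><td><a href="/untitled">No title</a></td></tr>
    <tr><td><a title="Second" href="/second">Second</a></td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(page)

# The browser-derived query requires a <tbody> step that the source lacks.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Dropping the tbody step matches the markup as it was actually served.
without_tbody = tree.xpath("/html/body//tr/td/a[@title]/@href")

print(with_tbody)
print(without_tbody)
```

If the real page does contain `<tbody>`, the original query should work unchanged and the cause lies elsewhere.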

From escaped html -> to regular html? - Python

为君一笑 submitted on 2019-12-07 02:56:49
Question: I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately I need the HTML code. How would I go about transforming the escaped HTML into proper markup? Help would be very much appreciated!

Answer 1: I think you want xml.sax.saxutils.unescape from the Python standard library. E.g.:

    >>> from xml.sax import saxutils as su
    >>> s = '&lt;foo&gt;bar&lt;/foo&gt;'
    >>> su
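A short runnable sketch of the unescaping step, using a made-up escaped string. `saxutils.unescape` handles only `&amp;`, `&lt;` and `&gt;` unless extra entities are passed in; the standard library's `html.unescape` is shown as an alternative that covers the named HTML entities.

```python
import html
from xml.sax import saxutils as su

# Hypothetical escaped payload, as it might arrive from the API.
escaped = "&lt;foo&gt;bar&lt;/foo&gt;"
markup = su.unescape(escaped)
print(markup)

# Other entities must be listed explicitly for saxutils.unescape.
with_quotes = su.unescape("&lt;a href=&quot;x&quot;&gt;", {"&quot;": '"'})
print(with_quotes)

# html.unescape resolves named HTML entities without an extra mapping.
via_html = html.unescape("&lt;a href=&quot;x&quot;&gt;")
print(via_html)
```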

how to make python request.get wait a few seconds?

荒凉一梦 submitted on 2019-12-07 02:43:29
I wanted to get some experience with HTML crawling, so I wanted to see if I could grab some values from the following site: http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx This site shows the price of Imtech shares. If you take a look at the site, you see there is one number shown in bold; this is the price of the share. As you may have seen, this price changes, and that's okay. I only want the value at the time I run my script. But if you reload the page, you may notice that it first shows "laatste koers" and after a delay of 1 second it shows "realtime". As

python setuptool how can I add dependency for libxml2-dev and libxslt1-dev?

左心房为你撑大大i submitted on 2019-12-07 00:39:38
My application needs lxml >= 2.1, but installing lxml requires libxml2-dev and libxslt1-dev; without them the lxml installation raises an error. Is there a way to declare these as dependencies in my setup.py using setuptools?

ohe: Not really. setuptools only handles dependencies on packages that already exist on PyPI. If you want this kind of dependency, I think you have to use the packaging technology provided by your distribution. But you can override your setuptools build or install command to run an extra check before installing the package. To do so
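The kind of pre-install check described above might look roughly like this. It is a minimal sketch, not the answerer's code: it assumes that finding the runtime shared libraries is an acceptable proxy, and `ctypes.util.find_library` locates shared libraries rather than development headers, so it cannot prove that the -dev packages are installed. `REQUIRED_LIBS` and `missing_libraries` are hypothetical names.

```python
from ctypes.util import find_library

# Shared-library names assumed to correspond to libxml2-dev and libxslt1-dev.
REQUIRED_LIBS = ("xml2", "xslt")


def missing_libraries(names=REQUIRED_LIBS):
    """Return the library names that find_library cannot locate."""
    return [name for name in names if find_library(name) is None]


missing = missing_libraries()
if missing:
    print("Missing system libraries:", ", ".join(missing))
else:
    print("libxml2 and libxslt were found")
```

A custom build or install command could call `missing_libraries()` first and abort with a readable message instead of a long compiler error.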

How to remove namespace value from inside lxml.html.html5parser element tag

一个人想着一个人 submitted on 2019-12-06 16:47:31
Is it possible not to add a namespace to the tag when using html5parser from the lxml.html package? Example:

    from lxml import html
    print(html.parse('http://example.com').getroot().tag)
    # You will get 'html'

    from lxml.html import html5parser
    print(html5parser.parse('http://example.com').getroot().tag)
    # You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to remove it with a regex, but maybe it's possible not to include that text at all?

There is a specific namespaceHTMLElements boolean flag that controls this behavior:

    from lxml.html import html5parser
    from
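As an alternative to the regex mentioned in the question (and separate from the parser flag, whose example is cut off above), `lxml.etree.QName` can split a namespaced tag into its parts. A small sketch on a hand-written XHTML string, so that it runs without network access or html5lib:

```python
from lxml import etree

# Tag as reported by html5parser in the example above.
namespaced_tag = "{http://www.w3.org/1999/xhtml}html"

qname = etree.QName(namespaced_tag)
print(qname.localname)   # local part of the tag
print(qname.namespace)   # namespace URI

# Hypothetical document used to show stripping the namespace from every tag.
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
)
for element in root.iter("*"):  # "*" visits elements only, not comments
    element.tag = etree.QName(element).localname

tags = [element.tag for element in root.iter("*")]
print(tags)
```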

Retrieving a subset of href's from findall() in BeautifulSoup

一曲冷凌霜 submitted on 2019-12-06 15:43:48
My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL of the Genius search query. It then retrieves all the lyrics from the links on the returned web page (this is the required subset for this problem, and every link in it will contain the artist's name). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones that I don't want in my subset. I tried to find a simple solution but failed repeatedly.

    import requests  # The Requests library.

PYTHON : How to add root node to an XML

房东的猫 submitted on 2019-12-06 15:34:18
I have an XML file that looks something like this:

    <A>
      <B>
        <C> .... </C>
      </B>
    </A>

I want to add a root on top of element 'A'. I found a way to add elements to the root, but how do I change the existing root and add a new one on top of it using Python? After adding the root, the XML should look like this:

    <ROOT>
      <A>
        <B>
          <C> .... </C>
        </B>
      </A>
    </ROOT>

    import lxml.etree as ET

    tree = ET.parse('data')
    root = tree.getroot()
    newroot = ET.Element("root")
    newroot.insert(0, root)
    print(ET.tostring(newroot, pretty_print=True))

yields

    <root>
      <A>
        <B>
          <C> .... </C>
        </B>
      </A>
    </root>

But really, unless you need to add something
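XML tag names are case-sensitive, so producing `<ROOT>` as requested, rather than `<root>`, only requires a different element name. A self-contained variant of the answer's approach, parsing from a string instead of the `data` file (an assumption made so the sketch runs anywhere) and falling back to the standard library when lxml is not installed:

```python
try:
    from lxml import etree as ET  # the library used in the answer above
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback offering the same calls

# Stand-in for ET.parse('data').getroot(); the content is hypothetical.
old_root = ET.fromstring("<A><B><C>text</C></B></A>")

# Create the new top-level element and move the old root underneath it.
new_root = ET.Element("ROOT")
new_root.append(old_root)

result = ET.tostring(new_root).decode()
print(result)
```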

How to split the tags from html tree

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 15:28:44
Question: This is my HTML tree:

    <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1"> Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a> </h3>
    Get the IndianOil Citibank <b>Card</b>. Apply Now! <br />
    <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> - <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a> <br />
    <cite>www.citibank.co.in/<b>CreditCards</b></cite>
    </li>

From this HTML I need to extract the lines before the <br> tag

Accessing values in xml file with namespaces in python 2.7 lxml

你说的曾经没有我的故事 submitted on 2019-12-06 14:55:54
I'm following this link to try to get the values of several tags: Parsing XML with namespace in Python via 'ElementTree'. In that link there is no problem accessing the root tag like this:

    import sys
    from lxml import etree as ET

    doc = ET.parse('file.xml')
    namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}  # add more as needed
    namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'}  # add more as needed
    namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}

    print doc.findall('rdf:RDF', namespaces_rdf)
    print doc.findall('dcat:Dataset', namespaces_dcat)
    print doc.findall('dct
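A self-contained sketch of namespace-aware lookups, using one merged prefix map instead of three separate dictionaries. The XML below is hypothetical, because the structure of `file.xml` is not shown; it assumes `rdf:RDF` is the document root, in which case searching for `rdf:RDF` among the root's children finds nothing. The sketch uses Python 3 syntax and falls back to the standard library when lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback offering the same calls

# Hypothetical stand-in for file.xml.
XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset>
    <dct:title>Example dataset</dct:title>
  </dcat:Dataset>
</rdf:RDF>"""

# One prefix map can serve every query.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

root = ET.fromstring(XML)

# findall() with a bare tag searches direct children, so the root itself is not matched.
root_as_child = root.findall("rdf:RDF", NS)

# Direct children and deeper descendants of the root.
datasets = root.findall("dcat:Dataset", NS)
titles = [title.text for title in root.findall(".//dct:title", NS)]

print(root.tag)
print(root_as_child, len(datasets), titles)
```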

Why can't I install lxml for python?

Deadly submitted on 2019-12-06 14:43:44
Question: I have downloaded the tarball for lxml and am using ipython setup.py install to try to install it. Unfortunately it is giving me screenfuls of error messages:

    src/lxml/lxml.etree.c:200651: error: ‘XML_XPATH_INVALID_OPERAND’ undeclared (first use in this function)
    src/lxml/lxml.etree.c:200661: error: ‘XML_XPATH_INVALID_TYPE’ undeclared (first use in this function)
    src/lxml/lxml.etree.c:200671: error: ‘XML_XPATH_INVALID_ARITY’ undeclared (first use in this function)
    src/lxml/lxml.etree.c:200681