lxml

How to find direct children of element in lxml

对着背影说爱祢 submitted on 2019-12-01 08:29:15
I found an object with a specific class:

    THREAD = TREE.find_class('thread')[0]

Now I want to get all <p> elements that are its direct children. I tried:

    THREAD.findall("p")
    THREAD.xpath("//div[@class='thread']/p")

But both of those return all <p> elements inside this <div>, no matter whether that <div> is their closest parent or not. How can I make it work?

Edit: Sample HTML:

    <div class='thread'>
      <p> <!-- 1 -->
        <!-- There can be other <p> elements inside, which should not be counted -->
      </p>
      <p><!-- 2 --></p>
    </div>
    <div class='thread'>
      <p>[...]</p>
      <p>[...]</p>
    </div>

The script should find two objects.
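One possible approach, offered as a sketch rather than a definitive answer: an XPath expression that starts with // searches the whole document even when it is called on an element, so a relative path such as ./p is what limits the match to direct children. The sample markup below is an assumption made for illustration; the nested <p> is wrapped in a hypothetical <div class="quote"> because an HTML parser may re-nest a <p> placed directly inside another <p>.

```python
from lxml import html

# Hypothetical sample, loosely modelled on the markup in the question.
SAMPLE = """
<html><body>
<div class="thread">
  <p>first</p>
  <div class="quote"><p>nested, should not be counted</p></div>
  <p>second</p>
</div>
<div class="thread">
  <p>other thread</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
thread = tree.find_class("thread")[0]

# Relative XPath: only <p> elements whose parent is `thread`.
direct_xpath = thread.xpath("./p")

# iterchildren() also walks direct children only.
direct_iter = list(thread.iterchildren("p"))

# For contrast: every <p> descendant of `thread`.
all_descendants = thread.xpath(".//p")

# For contrast: a leading // restarts the search from the document root,
# so this also picks up the <p> in the second thread.
whole_document = thread.xpath("//div[@class='thread']/p")

print([p.text for p in direct_xpath])
print(len(all_descendants), len(whole_document))
```

The leading dot is the important part: ./p is evaluated relative to the element the query is called on.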

How to solve problem with parsing html file with cyrillic symbol?

那年仲夏 submitted on 2019-12-01 08:02:41
I have an HTML file with span elements:

    <html>
    <body>
    <span class="one">Text</span>some text</br>
    <span class="two">Привет</span>Текст на русском</br>
    </body>
    </html>

To get "some text":

    # -*- coding:cp1251 -*-
    import lxml
    from lxml import html

    filename = "t.html"
    fread = open(filename, 'r')
    source = fread.read()
    tree = html.fromstring(source)
    fread.close()

    tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This works
    print "name: ",tags[0].text
    print "value: ",tags[0].tail

    tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This fails
    print "name: ",tags[0].text
    print
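A likely cause, stated as an assumption because the rest of the question is cut off: the file is handed to lxml as undecoded bytes, so the Cyrillic text in the document and the text in the XPath expression are never compared as the same characters. A minimal Python 3 sketch that decodes the bytes explicitly before parsing follows. The in-memory `raw` value stands in for the contents of t.html, and the markup uses <br> instead of </br> for simplicity.

```python
from lxml import html

# Stand-in for open("t.html", "rb").read(); the cp1251 encoding is taken
# from the coding comment in the question and is otherwise an assumption.
raw = (
    '<html><body>'
    '<span class="one">Text</span>some text<br>'
    '<span class="two">Привет</span>Текст на русском<br>'
    '</body></html>'
).encode("cp1251")

# Decode to str first so lxml never has to guess the encoding.
source = raw.decode("cp1251")
tree = html.fromstring(source)

tags = tree.xpath('//span[@class="two" and text()="Привет"]')
name = tags[0].text
value = tags[0].tail

print("name: ", name)
print("value: ", value)
```

With a real file, reading in binary mode and decoding with the known encoding keeps the same structure.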

How to properly escape single and double quotes

江枫思渺然 submitted on 2019-12-01 07:32:33
I have an lxml etree HTMLParser object that I'm trying to build XPath expressions with, in order to assert on paths, attributes, and the text of a tag. I ran into a problem when the text of the tag contains either single quotes (') or double quotes ("), and I've exhausted all my options. Here's a sample object I created:

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''), parser)

Here is the snippet of code, followed by different variations of the variable being read in:

    def getXpath(self):
        xpath += 'starts-with(., \'' + self.text + '\') and
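One way to sidestep the escaping problem, sketched here under the assumption that the expression is evaluated with lxml's own xpath() method: pass the text as an XPath variable ($text below, a name chosen for illustration) instead of concatenating it into the expression string. The value then never has to be quoted inside the expression at all.

```python
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(
    StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''),
    parser,
)

wanted = '''Here is my 'test' "string"'''

# Naive concatenation breaks: the embedded single quotes end the literal early.
naive_failed = False
try:
    tree.xpath("//p[starts-with(., '" + wanted + "')]")
except etree.XPathError:
    naive_failed = True

# Keyword arguments to xpath() become XPath variables, so no escaping is needed.
matches = tree.xpath("//p[starts-with(., $text)]", text=wanted)

print(naive_failed, [m.get("align") for m in matches])
```

If the expression has to be a single self-contained string, the XPath concat() function with alternating quote styles is the usual fallback, at the cost of more string handling.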

Parsing Large XML file with Python lxml and Iterparse

戏子无情 submitted on 2019-12-01 07:28:47
Question: I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items. My file has the following format:

    <item>
      <title>Item 1</title>
      <desc>Description 1</desc>
      <url> <item>http://www.url1.com</item> </url>
    </item>
    <item>
      <title>Item 2</title>
      <desc>Description 2</desc>
      <url> <item>http://www.url2.com</item> </url>
    </item>

and so far my solution is:

    from lxml import etree
    context = etree.iterparse( MYFILE, tag='item' )
    for event, elem in
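The format above nests an <item> inside <url>, so filtering on tag='item' alone yields both the outer and the inner elements. A minimal sketch of one way to handle that while keeping memory flat: check the parent of each element and clear processed items. The <items> wrapper element and the in-memory sample are assumptions made so the example is well-formed and self-contained.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for MYFILE, wrapped in an assumed <items> root.
XML = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc><url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc><url><item>http://www.url2.com</item></url></item>
</items>"""

records = []
for event, elem in etree.iterparse(BytesIO(XML), events=("end",), tag="item"):
    parent = elem.getparent()
    if parent is None or parent.tag != "items":
        # Inner <item> under <url>: leave it in place for the outer item to read.
        continue

    records.append(
        (elem.findtext("title"), elem.findtext("desc"), elem.findtext("url/item"))
    )

    # Free memory: empty this element and drop already-processed siblings.
    elem.clear()
    while elem.getprevious() is not None:
        del parent[0]

print(records)
```

Clearing only the top-level items matters here: clearing the inner <item> on its own end event would erase the URL before the outer item is processed.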

Unable to write extracted items properly in an excel file?

99封情书 submitted on 2019-12-01 07:27:09
Question: I've written some Python code to parse titles and links from a webpage. Initially, I parsed the links from the left sidebar, then scraped the documents from each page by following each link. That part works flawlessly. I tried to save the documents from the different pages in a single Excel file. However, the script creates several sheets, taking each sheet name from the heading variable in my script. The problem I'm facing is: when the data

Why am I getting this ImportError?

被刻印的时光 ゝ submitted on 2019-12-01 07:10:51
I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe, I get this popup window. (source: gyazo.com) I then proceed to open the Trader.exe.log file, and its contents say the following:

    Traceback (most recent call last):
      File "Trader.py", line 1, in <module>
      File "lxml\html\__init__.pyc", line 42, in <module>
      File "lxml\etree.pyc", line

How to find direct children of element in lxml

泪湿孤枕 submitted on 2019-12-01 06:32:02
Question: I found an object with a specific class:

    THREAD = TREE.find_class('thread')[0]

Now I want to get all <p> elements that are its direct children. I tried:

    THREAD.findall("p")
    THREAD.xpath("//div[@class='thread']/p")

But both of those return all <p> elements inside this <div>, no matter whether that <div> is their closest parent or not. How can I make it work?

Edit: Sample HTML:

    <div class='thread'>
      <p> <!-- 1 -->
        <!-- There can be other <p> elements inside, which should not be counted -->
      </p>
      <p><!--

How to search for content in XPath in multiline text using Python?

倖福魔咒の submitted on 2019-12-01 06:00:02
Question: When I search for data in the text() of an element using contains, it works for plain data but not when the element content includes carriage returns, newlines, or tags. How can I make //td[contains(text(), "")] work in this case? Thank you!

XML:

    <table>
    <tr>
    <td> Hello world <i> how are you? </i> Have a wonderful day. Good bye! </td>
    </tr>
    <tr>
    <td> Hello NJ <i>, how are you? Have a wonderful day.</i> </td>
    </tr>
    </table>

Python:

    >>> tdout=open('tdmultiplelines.htm', 'r')
    >>>
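A sketch of why this happens and one possible workaround: in XPath 1.0, contains(text(), ...) only looks at the first text node of the element, so anything after a child tag is ignored. Testing the element's full string value with . and collapsing whitespace with normalize-space() covers the multiline case. The line breaks in the sample and the search strings are assumptions chosen for illustration.

```python
from lxml import etree

# Sample modelled on the question, with assumed line breaks inside the cells.
XML = """<table>
<tr><td>
  Hello world
  <i> how are you? </i>
  Have a wonderful day.
  Good bye!
</td></tr>
<tr><td> Hello NJ <i>, how are you?
  Have a wonderful day.</i>
</td></tr>
</table>"""

root = etree.fromstring(XML)

# text() is reduced to the FIRST text node, which ends before the <i> tag.
first_text_only = root.xpath('//td[contains(text(), "wonderful day")]')

# "." is the concatenated text of the element and all of its descendants.
string_value = root.xpath('//td[contains(., "wonderful day")]')

# normalize-space() collapses newlines and repeated spaces, so a phrase
# that spans several lines and a child tag can be matched as one string.
normalized = root.xpath(
    '//td[contains(normalize-space(.), '
    '"Hello world how are you? Have a wonderful day.")]'
)

print(len(first_text_only), len(string_value), len(normalized))
```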

How to update XML file with lxml

南笙酒味 submitted on 2019-12-01 03:50:32
I want to update an XML file with new information using the lxml library. For example, I have this code:

    >>> from lxml import etree
    >>>
    >>> tree = etree.parse('books.xml')

where the books.xml file has this content: http://www.w3schools.com/dom/books.xml

I want to update this file with a new book:

    >>> new_entry = etree.fromstring('''<book category="web" cover="paperback">
    ... <title lang="en">Learning XML 2</title>
    ... <author>Erik Ray</author>
    ... <year>2006</year>
    ... <price>49.95</price>
    ... </book>''')

My question is: how can I update the tree element tree with the new_entry tree and save the file? Here
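A minimal sketch of one way to do this: append the new element to the root and write the tree back out. The short in-memory document below is a stand-in for books.xml (its <bookstore> root and single entry are assumptions), and the output goes to a temporary path rather than overwriting a real file.

```python
import os
import tempfile
from io import BytesIO
from lxml import etree

# Assumed stand-in for the contents of books.xml.
BOOKS = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
</bookstore>"""

tree = etree.parse(BytesIO(BOOKS))
root = tree.getroot()

new_entry = etree.fromstring('''<book category="web" cover="paperback">
<title lang="en">Learning XML 2</title>
<author>Erik Ray</author>
<year>2006</year>
<price>49.95</price>
</book>''')

# append() makes new_entry the last child of the root element.
root.append(new_entry)

# Write the updated tree; pass the original filename to update it in place.
out_path = os.path.join(tempfile.mkdtemp(), "books.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True, pretty_print=True)

# Re-read the file to confirm the new book was saved.
titles = etree.parse(out_path).xpath("//book/title/text()")
print(titles)
```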

Automatic XSD validation

浪尽此生 submitted on 2019-12-01 03:32:51
According to the lxml documentation, "The DTD is retrieved automatically based on the DOCTYPE of the parsed document. All you have to do is use a parser that has DTD validation enabled." http://lxml.de/validation.html#validation-at-parse-time

However, if you want to validate against an XML Schema, you need to reference one explicitly. I am wondering why this is, and I would like to know whether there is a library or function that can do this, or even an explanation of how to make it happen myself. The problem is that there seem to be many ways to reference an XSD, and I need to support all of them.
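A rough sketch of doing the lookup by hand, under stated assumptions: it reads only the xsi:noNamespaceSchemaLocation and xsi:schemaLocation hints on the root element, treats the location as a local path relative to the document, and uses the first schema it finds. The helper name validate_with_declared_schema and the sample files are hypothetical; other ways of referencing an XSD are not covered.

```python
import os
import tempfile
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def validate_with_declared_schema(xml_path):
    """Validate a document against the XSD hinted at on its root element."""
    doc = etree.parse(xml_path)
    root = doc.getroot()

    location = root.get("{%s}noNamespaceSchemaLocation" % XSI)
    if location is None:
        # schemaLocation holds "namespace location" pairs; take the first location.
        pairs = (root.get("{%s}schemaLocation" % XSI) or "").split()
        location = pairs[1] if len(pairs) >= 2 else None
    if location is None:
        raise ValueError("document does not declare a schema location")

    # Assumption: the hint is a local path relative to the XML file.
    base_dir = os.path.dirname(os.path.abspath(xml_path))
    schema = etree.XMLSchema(etree.parse(os.path.join(base_dir, location)))
    return schema.validate(doc)

# Hypothetical sample files, written to a temporary directory.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "note.xsd"), "w") as fh:
    fh.write('''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType><xs:sequence>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>''')

root_attrs = 'xmlns:xsi="%s" xsi:noNamespaceSchemaLocation="note.xsd"' % XSI
with open(os.path.join(workdir, "good.xml"), "w") as fh:
    fh.write("<note %s><body>hi</body></note>" % root_attrs)
with open(os.path.join(workdir, "bad.xml"), "w") as fh:
    fh.write("<note %s><title>hi</title></note>" % root_attrs)

good_ok = validate_with_declared_schema(os.path.join(workdir, "good.xml"))
bad_ok = validate_with_declared_schema(os.path.join(workdir, "bad.xml"))
print(good_ok, bad_ok)
```

Schema locations given as URLs, or several namespace and location pairs in one document, would need extra handling on top of this.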