lxml | 易学教程

How to find direct children of element in lxml

I found an object with a specific class:

    THREAD = TREE.find_class('thread')[0]

Now I want to get all <p> elements that are its direct children. I tried:

    THREAD.findall("p")
    THREAD.xpath("//div[@class='thread']/p")

But both of those return all <p> elements inside this <div>, no matter whether that <div> is their closest parent or not. How can I make it work?

Edit: Sample HTML:

    <div class='thread'>
      <p>   </p>
      <p></p>
    </div>
    <div class='thread'>
      <p>[...]</p>
      <p>[...]</p>
    </div>

The script should find two objects.
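One thing worth checking here is where the XPath expression starts from. An expression that begins with `//` is evaluated from the document root even when it is called on an element, while `./p` is relative to that element and matches direct children only. The sketch below uses made-up markup (the nested `quote` div is an assumption added to make the difference visible), not the asker's real page.

```python
from lxml import html

# Hypothetical markup: one <p> sits one level deeper than the others.
SAMPLE = """
<html><body>
<div class="thread">
  <p>first</p>
  <div class="quote"><p>nested</p></div>
  <p>second</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
thread = tree.find_class("thread")[0]

# "./p" is relative to `thread`: direct <p> children only.
direct = thread.xpath("./p")

# ".//p" is also relative, but matches <p> descendants at any depth.
descendants = thread.xpath(".//p")

direct_texts = [p.text for p in direct]
descendant_texts = [p.text for p in descendants]
print(direct_texts)
print(descendant_texts)
```

If the real page nests `thread` divs inside each other, the same relative form still applies to each one individually.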

How to solve problem with parsing html file with cyrillic symbol?

I have an HTML file with span elements:

    <html>
    <body>
    <span class="one">Text</span>some text</br>
    <span class="two">Привет</span>Текст на русском</br>
    </body>
    </html>

To get "some text":

    # -*- coding:cp1251 -*-
    import lxml
    from lxml import html

    filename = "t.html"
    fread = open(filename, 'r')
    source = fread.read()
    tree = html.fromstring(source)
    fread.close()

    tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This is OK
    print "name: ",tags[0].text
    print "value: ",tags[0].tail

    tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This fails
    print "name: ",tags[0].text
    print
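A likely cause is that the parser is not told which encoding the file uses, so the Cyrillic bytes are decoded differently from the string in the XPath expression. A minimal Python 3 sketch of one way around this, assuming the file really is encoded as windows-1251 (the same encoding as cp1251), is to read raw bytes and pass a parser with an explicit encoding. The in-memory `raw` bytes below stand in for the contents of t.html.

```python
from lxml import html

# Stand-in for the bytes of t.html; the windows-1251 encoding is an assumption.
MARKUP = (
    '<html><body>'
    '<span class="one">Text</span>some text<br>'
    '<span class="two">Привет</span>Текст на русском<br>'
    '</body></html>'
)
raw = MARKUP.encode("windows-1251")

# Tell the parser how to decode the bytes instead of letting it guess.
parser = html.HTMLParser(encoding="windows-1251")
tree = html.fromstring(raw, parser=parser)

# The XPath string is ordinary Unicode text in Python 3.
tags = tree.xpath('//span[@class="two" and text()="Привет"]')
name = tags[0].text
value = tags[0].tail
print("name:", name)
print("value:", value)
```

With a real file, the bytes would come from `open(filename, 'rb').read()` rather than from an encoded string.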

How to properly escape single and double quotes

I have an lxml etree HTMLParser object that I'm using to build XPath expressions that check a tag, its attributes, and its text. I ran into a problem when the text of the tag contains single quotes (') or double quotes ("), and I've exhausted all my options. Here's a sample object I created: parser = etree.HTMLParser() tree = etree.parse(StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''), parser) Here is the snippet of code, followed by different variations of the variable being read in: def getXpath(self): xpath += 'starts-with(., \'' + self.text + '\') and
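One way to sidestep quote escaping altogether is to avoid pasting the text into the expression and pass it as an XPath variable instead. lxml's `xpath()` accepts variables as keyword arguments and refers to them as `$name` in the expression. The sketch below is not a rewrite of the asker's `getXpath` method, and the variable name `needle` is made up for illustration.

```python
from io import StringIO
from lxml import etree

MARKUP = """<html><body><p align="center">Here is my 'test' "string"</p></body></html>"""

parser = etree.HTMLParser()
tree = etree.parse(StringIO(MARKUP), parser)

# Text containing both quote styles; no XPath-level escaping is needed.
needle = 'Here is my \'test\' "string"'

# $needle is bound from the keyword argument of the same name.
matches = tree.xpath(
    "//p[@align='center' and starts-with(., $needle)]",
    needle=needle,
)
matched_text = matches[0].text
print(matched_text)
```

Because the value never becomes part of the expression source, the same call works for text with single quotes, double quotes, or both.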

Parsing Large XML file with Python lxml and Iterparse

Question: I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items. My file is of the format: <item> <title>Item 1</title> <desc>Description 1</desc> <url> <item>http://www.url1.com</item> </url> </item> <item> <title>Item 2</title> <desc>Description 2</desc> <url> <item>http://www.url2.com</item> </url> </item> and so far my solution is: from lxml import etree context = etree.iterparse( MYFILE, tag='item' ) for event, elem in
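Note that in this format the tag name `item` is used both for the records and for the URL nested inside each record, so `tag='item'` yields both kinds of element. A minimal sketch of one way to handle that is to check each element's parent and clear finished records to keep memory flat. The `<items>` root element and the in-memory `DATA` are assumptions made so the example is self-contained.

```python
from io import BytesIO
from lxml import etree

# Hypothetical <items> root wrapping the records from the question.
DATA = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc><url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc><url><item>http://www.url2.com</item></url></item>
</items>"""

records = []
for event, elem in etree.iterparse(BytesIO(DATA), events=("end",), tag="item"):
    parent = elem.getparent()
    # Skip the nested <item> under <url>; only handle top-level records.
    if parent is None or parent.tag != "items":
        continue
    records.append((elem.findtext("title"), elem.findtext("url/item")))
    # Free the finished record and any earlier siblings still attached.
    elem.clear()
    while elem.getprevious() is not None:
        del parent[0]

print(records)
```

For a real file, the first argument to `iterparse` would be the file path or an open binary file instead of `BytesIO(DATA)`.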

Unable to write extracted items properly in an excel file?

Question: I've written some code in Python to parse the title and link from a webpage. First I parse the links from the left sidebar, then follow each link and scrape the documents from each page. That part works. I then try to save the documents from the different pages in a single Excel file. However, it creates several sheets, using the desired portion of the heading variable in my script as each sheet name. The problem I'm facing is: when the data

Why am I getting this ImportError?

I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe I get a popup window (screenshot source: gyazo.com). I then proceed to open the Trader.exe.log file, and its contents say the following: Traceback (most recent call last): File "Trader.py", line 1, in <module> File "lxml\html\__init__.pyc", line 42, in <module> File "lxml\etree.pyc", line

How to search for content in XPath in multiline text using Python?

Question: When I search for data in the text() of an element using contains(), it works for plain text but not when the element content includes carriage returns, newlines, or nested tags. How to make //td[contains(text(), "")] work in this case? Thank you! XML : <table> <tr> <td> Hello world <i> how are you? </i> Have a wonderful day. Good bye! </td> </tr> <tr> <td> Hello NJ <i>, how are you? Have a wonderful day.</i> </td> </tr> </table> Python : >>> tdout=open('tdmultiplelines.htm', 'r') >>>
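In XPath 1.0, `contains(text(), ...)` only looks at the first text node of the element, which is why content after a nested tag or line break is missed. A minimal sketch of one alternative is to match on the element's whole string value with `.` and collapse whitespace with `normalize-space()`. The line breaks inside the first cell and the search string are made up for illustration.

```python
from lxml import etree

# The table from the question, with hypothetical line breaks in the first cell.
XML = """<table>
<tr><td>
  Hello world <i> how are
  you? </i> Have a wonderful day.
  Good bye!
</td></tr>
<tr><td> Hello NJ <i>, how are you? Have a wonderful day.</i> </td></tr>
</table>"""

root = etree.fromstring(XML)

# Only the first text node ("Hello world") is tested, so nothing matches.
first_node_only = root.xpath('//td[contains(text(), "Good bye")]')

# "." is the full text of the cell, nested tags included;
# normalize-space() folds newlines and repeated spaces into single spaces.
whole_text = root.xpath(
    '//td[contains(normalize-space(.), "Have a wonderful day. Good bye!")]'
)

print(len(first_node_only), len(whole_text))
```

The second query matches phrases that span a line break or an inline tag, at the cost of no longer distinguishing text that belongs to child elements.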

How to update XML file with lxml

I want to update an XML file with new information using the lxml library. For example, I have this code: >>> from lxml import etree >>> >>> tree = etree.parse('books.xml') where the 'books.xml' file has this content: http://www.w3schools.com/dom/books.xml I want to update this file with a new book: >>> new_entry = etree.fromstring('''<book category="web" cover="paperback"> ... <title lang="en">Learning XML 2</title> ... <author>Erik Ray</author> ... <year>2006</year> ... <price>49.95</price> ... </book>''') My question is: how can I update the tree with new_entry and save the file? Here
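A minimal sketch of one approach is to append the new element to the root and write the tree back out. The shortened `BOOKS` document below is a made-up stand-in for books.xml, and the output goes to an in-memory buffer so the example runs without touching the file system.

```python
from io import BytesIO
from lxml import etree

# Hypothetical, shortened stand-in for the contents of books.xml.
BOOKS = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
</bookstore>"""

tree = etree.parse(BytesIO(BOOKS))
root = tree.getroot()

new_entry = etree.fromstring('''<book category="web" cover="paperback">
<title lang="en">Learning XML 2</title>
<author>Erik Ray</author>
<year>2006</year>
<price>49.95</price>
</book>''')

# Attach the new <book> as the last child of <bookstore>.
root.append(new_entry)

# Serialize the updated tree; a file path could be passed instead of a buffer.
out = BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True, pretty_print=True)

titles = root.xpath("book/title/text()")
print(titles)
```

To overwrite the original file, the same `tree.write(...)` call can be given `'books.xml'` as its first argument.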

Automatic XSD validation

According to the lxml documentation, "The DTD is retrieved automatically based on the DOCTYPE of the parsed document. All you have to do is use a parser that has DTD validation enabled." http://lxml.de/validation.html#validation-at-parse-time However, if you want to validate against an XML schema, you need to reference one explicitly. I am wondering why this is, and would like to know whether there is a library or function that can do this automatically, or even an explanation of how to make this happen myself. The problem is that there seem to be many ways to reference an XSD, and I need to support all of them.
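For reference, explicit schema validation in lxml can be sketched as below, together with reading one common schema hint, the `xsi:noNamespaceSchemaLocation` attribute, from a document. The tiny `note` schema and the `note.xsd` file name are made up for illustration. Resolving and loading the hinted file, and handling `xsi:schemaLocation` pairs, is left out here.

```python
from io import BytesIO
from lxml import etree

# Hypothetical schema: a <note> element containing a single <to> string.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.parse(BytesIO(XSD)))

good = etree.fromstring(b"<note><to>Tove</to></note>")
bad = etree.fromstring(b"<note><from>Jani</from></note>")

# validate() returns a bool; schema.error_log holds details after a failure.
good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# Reading the schema hint that a document may carry on its root element.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
hinted = etree.fromstring(
    b'<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xsi:noNamespaceSchemaLocation="note.xsd"><to>Tove</to></note>'
)
hint = hinted.get("{%s}noNamespaceSchemaLocation" % XSI)

print(good_ok, bad_ok, hint)
```

A helper that loads whatever schema the hint names would have to decide how to resolve relative paths and URLs, which is one place where the "many ways to reference an XSD" come in.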