lxml | 易学教程

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Question: I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

    <html>
    <table>
    <tr><td>Header</td></tr>
    <tr><td>Want This</td></tr>
    </table>
    </html>

So let's try it:

    >>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
    >>> doc
    <lxml

Generating xml in python and lxml

Question: I have this XML from SQL, and I want to produce the same output with Python 2.7 and lxml:

    <?xml version="1.0" encoding="utf-16"?>
    <results>
    <Country name="Germany" Code="DE" Storage="Basic" Status="Fresh" Type="Photo" />
    </results>

Now I have:

    from lxml import etree
    # create XML
    results= etree.Element('results')
    country= etree.Element('country')
    country.text = 'Germany'
    root.append(country)
    filename = "xmltestthing.xml"
    FILE = open(filename,"w")
    FILE.writelines(etree.tostring(root, pretty_print=True))
    FILE
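
One way to get the attribute-based layout the question asks for is to pass the attributes as keyword arguments to `etree.SubElement` instead of setting `.text`. This is a minimal sketch, assuming Python 3 and lxml; the helper name `build_results` is made up for illustration.

```python
from lxml import etree


def build_results():
    """Build <results><Country .../></results> using attributes, not text."""
    results = etree.Element("results")
    # Keyword arguments to SubElement become XML attributes.
    etree.SubElement(
        results,
        "Country",
        name="Germany",
        Code="DE",
        Storage="Basic",
        Status="Fresh",
        Type="Photo",
    )
    return results


root = build_results()

# With an explicit encoding, tostring() returns bytes;
# xml_declaration=True adds the <?xml ...?> header.
xml_bytes = etree.tostring(
    root, pretty_print=True, xml_declaration=True, encoding="utf-16"
)
print(xml_bytes.decode("utf-16"))
```

Because `tostring()` returns bytes here, a file receiving the result would need to be opened in binary mode (`"wb"`).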

Character encoding in python to replace 'u2019' with '

Question: I have tried numerous ways to encode this to reach the end result "BACK RUSHIN'", the most important character being the right apostrophe ('). I would like to get there using Python's built-in functions, in a way that does not discriminate between a normal string and a unicode string. This was the code I was using to retrieve the string:

    str(unicode(etree.tostring(root.xpath('path')[0],method='text', encoding='utf-8'),errors='ignore')).strip()

With the result
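
In Python 3, where every `str` is Unicode, the right single quotation mark (U+2019) can be swapped for an ASCII apostrophe with `str.translate`. The sketch below assumes the text has already been extracted as a `str`; the function name and the extra quote mappings are illustrative additions, not part of the question.

```python
# Map typographic quotes to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def ascii_quotes(text: str) -> str:
    """Replace curly quotes with ASCII quotes and trim surrounding whitespace."""
    return text.translate(QUOTE_MAP).strip()


print(ascii_quotes(" BACK RUSHIN\u2019 "))  # prints: BACK RUSHIN'
```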

Modify namespaces in a given xml document with lxml

Question: I have an XML document that looks like this:

    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://someurl/Oldschema" xsi:schemaLocation="http://someurl/Oldschema Oldschema.xsd" xmlns:framework="http://someurl/Oldframework">
    <framework:tag1> ... </framework:tag1>
    <framework:tag2> <tagA> ... </tagA> </framework:tag2>
    </root>

All I want to do is change http://someurl/Oldschema to http://someurl/Newschema and http://someurl/Oldframework to http://someurl/Newframework and leave
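
Because an lxml element's `nsmap` cannot be edited in place, one possible approach is to rebuild the tree with the namespace URIs swapped. The following is only a sketch under stated assumptions: the helper names `rename` and `rebuild` are hypothetical, the sample document is a shortened version of the one in the question, comments and processing instructions are dropped, and attribute values (such as the value of `xsi:schemaLocation`) are copied unchanged.

```python
from lxml import etree

RENAMES = {
    "http://someurl/Oldschema": "http://someurl/Newschema",
    "http://someurl/Oldframework": "http://someurl/Newframework",
}


def rename(name, renames):
    """Swap the namespace URI of a '{uri}local' name if it appears in renames."""
    qname = etree.QName(name)
    uri = renames.get(qname.namespace, qname.namespace)
    return f"{{{uri}}}{qname.localname}" if uri else qname.localname


def rebuild(elem, renames):
    """Return a copy of elem whose tags, attributes and nsmap use the new URIs."""
    nsmap = {prefix: renames.get(uri, uri) for prefix, uri in elem.nsmap.items()}
    attrib = {rename(key, renames): value for key, value in elem.attrib.items()}
    copy = etree.Element(rename(elem.tag, renames), attrib=attrib, nsmap=nsmap)
    copy.text, copy.tail = elem.text, elem.tail
    for child in elem:
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(child.tag, str):
            continue
        copy.append(rebuild(child, renames))
    return copy


SOURCE = b"""<root xmlns="http://someurl/Oldschema"
      xmlns:framework="http://someurl/Oldframework">
  <framework:tag1><tagA>text</tagA></framework:tag1>
</root>"""

new_root = rebuild(etree.fromstring(SOURCE), RENAMES)
print(etree.tostring(new_root, pretty_print=True).decode())
```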

The Python Language

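The Python Language | 飞熊在天. Posted on March 5, 2012 by raphaelzhang. Counting the BASIC I used on an Apple II in high school, I have used more than ten programming languages. Those I have used at work are C, C++, Pascal (OP/Delphi), Java, C#, BASIC (VB), Unix/Linux shell (awk), Perl, Python, PHP, and JavaScript; by "used at work" I mean that I have earned money from programs written in them. I have also tinkered with Haskell, F#, Scala, Go, D, Objective-C, assembly, Eiffel, and others. As for Erlang, Lisp, Prolog, Lua, Ruby, Dart, and the like, I have only seen demo programs and never written any myself. Of course, I do not count HTML, CSS, XML/XSLT, batch files, JSP/ASP, or SQL as programming languages. Personally, my favorites are Python and C. D looks good but has little future. Go is also quite nice, and I may use it more later, but for now it has not reached version 1.0 (planned for the first half of 2012) and its Windows implementation is poor, so I will wait. Python's strengths are its expressiveness and its many powerful libraries, while C's strength is that it abstracts the underlying machine neither too much nor too little. Both are concise, and unlike Perl's conciseness, Python code is easy to read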

How to include the namespaces into a xml file using lxml?

Question: I am creating a new XML file from scratch using Python and the lxml library.

    <route xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.xxxx" version="1.1" xmlns:stm="http://xxxx/1/0/0" xsi:schemaLocation="http://xxxx/1/0/0 stm_extensions.xsd">

I need to include this namespace information in the root tag, as attributes of the route tag. I can't include the information in the root declaration.

    from lxml import etree
    root = etree.Element("route", xmlns:xsi = "http://www
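
In lxml, namespace declarations are normally supplied through the `nsmap` argument rather than as keyword arguments, since `xmlns:xsi` is not a valid Python identifier. The sketch below shows that approach, assuming the placeholder URIs from the question; the constant names are chosen here for illustration.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
DEFAULT_NS = "http://www.xxxx"  # placeholder URI copied from the question
STM_NS = "http://xxxx/1/0/0"    # placeholder URI copied from the question

# The key None stands for the default (unprefixed) namespace.
NSMAP = {None: DEFAULT_NS, "xsi": XSI, "stm": STM_NS}

# Plain attributes such as version can still be keyword arguments.
root = etree.Element(f"{{{DEFAULT_NS}}}route", nsmap=NSMAP, version="1.1")

# Namespaced attributes are addressed with "{uri}name" notation.
root.set(f"{{{XSI}}}schemaLocation", f"{STM_NS} stm_extensions.xsd")

print(etree.tostring(root, pretty_print=True).decode())
```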

lxml truncates text that contains 'less than' character

Question:

    >>> s = '<div> < 20 </div>'
    >>> import lxml.html
    >>> tree = lxml.html.fromstring(s)
    >>> lxml.etree.tostring(tree)
    '<div> </div>'

Does anybody know a workaround for this?

Answer 1: Your HTML input is broken; that < left angle bracket should have been encoded as &lt; instead. From the lxml documentation on parsing broken HTML: The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the
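
When the markup is generated by your own code, the text can be escaped before it is embedded, so the parser never sees a bare `<`. This is a small sketch using the standard library's `html.escape`; the variable names are illustrative, and it does not help with broken HTML that arrives from elsewhere.

```python
import html

import lxml.html

raw_text = " < 20 "

# html.escape() turns "<" into "&lt;" (and also escapes "&", ">" and quotes).
markup = f"<div>{html.escape(raw_text)}</div>"

tree = lxml.html.fromstring(markup)
print(repr(tree.text))           # the parsed text keeps the "<"
print(lxml.html.tostring(tree))  # serialised with &lt; again
```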

XPath: select tag with empty value

Question: How can I find, in XPath 1.0, all rows with an empty col name="POW"?

    <row>
    <col name="WOJ">02</col>
    <col name="POW"/>
    <col name="GMI"/>
    <col name="RODZ"/>
    <col name="NAZWA">DOLNOŚLĄSKIE</col>
    <col name="NAZDOD">województwo</col>
    <col name="STAN_NA">2011-01-01</col>
    </row>

I tried many solutions. A few times the selection was fine in the Firefox extension XPath Checker, but lxml's xpath() says that the expression is invalid or simply returns no rows. My Python code:

    from lxml import html
    f = open('TERC.xml', 'r')
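
One XPath 1.0 expression that selects such rows tests the column's string value with `normalize-space()`. The sketch below assumes the file is parsed with lxml's XML parser (`lxml.etree`) rather than `lxml.html`; that is a guess at what may help, not a confirmed diagnosis of the asker's problem. The `<rows>` wrapper and the second row are invented sample data.

```python
from lxml import etree

# The first row mirrors the question; the wrapper and second row are made up.
XML = b"""<rows>
  <row><col name="WOJ">02</col><col name="POW"/><col name="GMI"/></row>
  <row><col name="WOJ">04</col><col name="POW">01</col><col name="GMI"/></row>
</rows>"""

root = etree.fromstring(XML)

# Rows containing a POW column whose text is empty or whitespace only.
rows = root.xpath('//row[col[@name="POW" and not(normalize-space())]]')

for row in rows:
    print(row.xpath('string(col[@name="WOJ"])'))  # prints: 02
```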

Extracting XML into data frame with parent attribute as column title

Question: I have thousands of XML files to process. They share a similar format but have different parent names and different numbers of parents. Through books, Google, tutorials, and just trying out code, I've been able to pull out all of this data. See, for example: "Parsing xml to pandas data frame throws memory error" and "Dynamic search through xml attributes using lxml and xpath in python". However, I realized that I was extracting the data poorly, with a child "Time" repeated for each