lxml

How can I parse HTML with html5lib, and query the parsed HTML with XPath?

放荡痞女 submitted on 2019-12-18 11:46:48
Question: I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table: <html> <table> <tr><td>Header</td></tr> <tr><td>Want This</td></tr> </table> </html> So let's try it: >>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml') >>> doc <lxml

Generating xml in python and lxml

点点圈 submitted on 2019-12-18 10:52:39
Question: I have this XML generated from SQL, and I want to produce the same output with Python 2.7 and lxml: <?xml version="1.0" encoding="utf-16"?> <results> <Country name="Germany" Code="DE" Storage="Basic" Status="Fresh" Type="Photo" /> </results> Now I have: from lxml import etree # create XML results= etree.Element('results') country= etree.Element('country') country.text = 'Germany' root.append(country) filename = "xmltestthing.xml" FILE = open(filename,"w") FILE.writelines(etree.tostring(root, pretty_print=True)) FILE
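
A minimal sketch of one way to produce the target document, assuming Python 3 and a current lxml release rather than the Python 2.7 setup in the question. The values are carried as attributes of a single Country element instead of as element text, which is what the target XML shows. Only documented lxml calls are used (etree.Element, etree.SubElement, etree.tostring); the variable names are illustrative.

```python
from lxml import etree

# Root element of the target document.
results = etree.Element("results")

# The target XML stores every value as an attribute of <Country>,
# so the values are passed as keyword arguments rather than set as text.
etree.SubElement(
    results,
    "Country",
    name="Germany",
    Code="DE",
    Storage="Basic",
    Status="Fresh",
    Type="Photo",
)

# tostring() returns bytes here; xml_declaration=True adds the <?xml ...?> line
# and encoding="utf-16" matches the declaration in the target document.
xml_bytes = etree.tostring(
    results, pretty_print=True, xml_declaration=True, encoding="utf-16"
)
```

Because xml_bytes is already encoded, it would need to be written to a file opened in binary mode ("wb"), not text mode.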

Character encoding in python to replace 'u2019' with '

跟風遠走 submitted on 2019-12-18 09:39:40
Question: I have tried numerous ways to encode this to reach the end result "BACK RUSHIN'", the most important character being the right apostrophe ' . I would like a way of getting to this end result using Python's built-in functions, with no distinction between a normal string and a Unicode string. This was the code I was using to retrieve the string: str(unicode(etree.tostring(root.xpath('path')[0],method='text', encoding='utf-8'),errors='ignore')).strip() With the result
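
One possible approach, sketched under the assumption of Python 3, where every str is Unicode: ask lxml for text instead of bytes, then replace U+2019 with a plain apostrophe. The sample document and the "title" path are hypothetical, since the original XML and XPath are not shown.

```python
from lxml import etree

# Hypothetical sample input; the real document is not shown in the question.
root = etree.fromstring("<song><title>BACK RUSHIN\u2019</title></song>")

# encoding="unicode" makes tostring() return str instead of encoded bytes,
# so no decode step (and no errors='ignore') is needed.
raw = etree.tostring(root.xpath("title")[0], method="text", encoding="unicode")

# Swap the right single quotation mark (U+2019) for an ASCII apostrophe.
cleaned = raw.replace("\u2019", "'").strip()
```

Decoding with errors='ignore', as in the question, silently drops the character rather than replacing it, which is why the apostrophe would go missing.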

Modify namespaces in a given xml document with lxml

廉价感情. submitted on 2019-12-18 09:33:41
Question: I have an XML document that looks like this: <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://someurl/Oldschema" xsi:schemaLocation="http://someurl/Oldschema Oldschema.xsd" xmlns:framework="http://someurl/Oldframework"> <framework:tag1> ... </framework:tag1> <framework:tag2> <tagA> ... </tagA> </framework:tag2> </root> All I want to do is change http://someurl/Oldschema to http://someurl/Newschema and http://someurl/Oldframework to http://someurl/Newframework and leave
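
A rough sketch of one pragmatic workaround, not an lxml namespace API: serialize the document, substitute the URIs as plain bytes, and parse the result again. It assumes the old URIs appear nowhere in the document except where they should change, and the sample below is a shortened, hypothetical stand-in for the real file.

```python
from lxml import etree

# URIs to rewrite, taken from the question.
RENAMES = {
    b"http://someurl/Oldschema": b"http://someurl/Newschema",
    b"http://someurl/Oldframework": b"http://someurl/Newframework",
}

# Hypothetical, shortened sample shaped like the document in the question.
source = b"""<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://someurl/Oldschema"
      xsi:schemaLocation="http://someurl/Oldschema Oldschema.xsd"
      xmlns:framework="http://someurl/Oldframework">
  <framework:tag1>text</framework:tag1>
  <framework:tag2><tagA>text</tagA></framework:tag2>
</root>"""

# Parse first so malformed input fails early, then serialize back to bytes.
serialized = etree.tostring(etree.fromstring(source))

# Plain byte substitution: this also rewrites the URI inside xsi:schemaLocation,
# while the file name "Oldschema.xsd" does not match and is left alone.
for old, new in RENAMES.items():
    serialized = serialized.replace(old, new)

# Re-parse so the elements carry the new namespace URIs.
new_root = etree.fromstring(serialized)
```

The substitution is blunt: it cannot tell a namespace declaration from ordinary text that happens to contain the same URI, so it fits only documents where that cannot occur.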

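The Python Language

爱⌒轻易说出口 submitted on 2019-12-18 08:50:57
The Python Language | 飞熊在天. Posted on March 5, 2012 by raphaelzhang. Counting the BASIC I used on an Apple II in high school, I have used more than ten programming languages. The ones I have used at work are C, C++, Pascal (OP/Delphi), Java, C#, BASIC (VB), Unix/Linux shell (awk), Perl, Python, PHP, and JavaScript; by "used at work" I mean that I have earned money from programs written in them. I have also tinkered with Haskell, F#, Scala, Go, D, Objective-C, assembly, Eiffel, and others. As for Erlang, Lisp, Prolog, Lua, Ruby, Dart, and the like, I have only seen demo programs and never written any myself. I do not count HTML, CSS, XML/XSLT, bat, JSP/ASP, or SQL as programming languages. Personally, I like Python and C best. D looks good but has little future. Go is also nice and I may use it more later, but it has not yet reached version 1.0 (planned for the first half of 2012) and its Windows implementation is poor, so I will wait. Python's strengths are its expressiveness and its many powerful libraries. C's strength is that it abstracts the underlying machine neither too much nor too little. Both are concise, and unlike Perl's conciseness, Python code is easy to read and understand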

How to include the namespaces into a xml file using lxml?

梦想的初衷 submitted on 2019-12-18 07:23:30
Question: I am creating a new XML file from scratch using Python and the lxml library. <route xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.xxxx" version="1.1" xmlns:stm="http://xxxx/1/0/0" xsi:schemaLocation="http://xxxx/1/0/0 stm_extensions.xsd"> I need to include this namespace information in the root tag as attributes of the route tag. I can't include the information in the root declaration. from lxml import etree root = etree.Element("route", xmlns:xsi = "http://www
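
A minimal sketch, assuming Python 3: in lxml, namespace declarations are supplied through the nsmap argument of etree.Element rather than as xmlns keyword arguments, and a namespaced attribute such as xsi:schemaLocation is addressed in {uri}name form. The URIs are the placeholders from the question; the constant names are illustrative.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Prefix-to-URI mapping; the None key declares the default namespace.
NSMAP = {
    None: "http://www.xxxx",
    "xsi": XSI,
    "stm": "http://xxxx/1/0/0",
}

root = etree.Element(
    "{http://www.xxxx}route",  # route lives in the default namespace
    nsmap=NSMAP,
    # Namespaced attributes use Clark notation: {namespace-uri}local-name.
    attrib={"{%s}schemaLocation" % XSI: "http://xxxx/1/0/0 stm_extensions.xsd"},
    version="1.1",
)

xml_text = etree.tostring(root, encoding="unicode")
```

The order in which the xmlns declarations are serialized is chosen by lxml and may differ from the order in the target document; the resulting XML is equivalent either way.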

lxml truncates text that contains 'less than' character

独自空忆成欢 submitted on 2019-12-18 06:21:23
Question: >>> s = '<div> < 20 </div>' >>> import lxml.html >>> tree = lxml.html.fromstring(s) >>> lxml.etree.tostring(tree) '<div> </div>' Does anybody know of a workaround for this? Answer 1: Your HTML input is broken; that < left angle bracket should have been encoded as &lt; instead. From the lxml documentation on parsing broken HTML: The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the
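
To illustrate the answer's point, here is a small sketch, assuming Python 3 and that you control how the markup is generated: escaping the text with the standard library's html.escape before it is embedded means the parser sees &lt; and keeps the content. The variable names are illustrative.

```python
import html

import lxml.html
from lxml import etree

raw_text = " < 20 "

# html.escape() turns "<" into "&lt;", so the parser reads it as text
# instead of the start of a (broken) tag.
markup = "<div>" + html.escape(raw_text) + "</div>"

tree = lxml.html.fromstring(markup)

# The parsed text holds the literal "<"; serialization escapes it again.
serialized = etree.tostring(tree, encoding="unicode")
```

This does not repair HTML that already arrives with a bare <; in that case the outcome still depends on libxml2's recovery behavior, as the quoted documentation says.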

XPath: select tag with empty value

£可爱£侵袭症+ submitted on 2019-12-18 04:44:16
Question: How can I find, in XPath 1.0, all rows with an empty col name="POW"? <row> <col name="WOJ">02</col> <col name="POW"/> <col name="GMI"/> <col name="RODZ"/> <col name="NAZWA">DOLNOŚLĄSKIE</col> <col name="NAZDOD">województwo</col> <col name="STAN_NA">2011-01-01</col> </row> I tried many solutions. A few times the selection was correct in the Firefox extension XPath Checker, but lxml.xpath() says that the expression is invalid or just returns no rows. My Python code: from lxml import html f = open('TERC.xml', 'r')
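
A sketch of one XPath 1.0 expression that matches this, assuming Python 3 and assuming the file is parsed with the XML parser (lxml.etree) rather than lxml.html as in the question. The rows wrapper element and the second row are hypothetical additions so that the filter has something to exclude.

```python
from lxml import etree

# Shortened sample; <rows> and the second <row> are hypothetical.
source = """<rows>
  <row><col name="WOJ">02</col><col name="POW"/><col name="GMI"/></row>
  <row><col name="WOJ">04</col><col name="POW">01</col><col name="GMI"/></row>
</rows>"""

root = etree.fromstring(source)

# Select every <row> that has a <col name="POW"> with no child nodes at all
# (no text, no elements), which is what <col name="POW"/> parses to.
empty_pow_rows = root.xpath('//row[col[@name="POW" and not(node())]]')

# Collect the WOJ value of each matching row for inspection.
woj_values = [row.xpath('string(col[@name="WOJ"])') for row in empty_pow_rows]
```

If a col that contains only whitespace should also count as empty, the predicate could test normalize-space(.) = "" instead of not(node()).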

Extracting XML into data frame with parent attribute as column title

主宰稳场 submitted on 2019-12-18 04:26:16
Question: I have thousands of XML files to process. They share a similar format but have different parent names and different numbers of parents. Through books, Google, tutorials, and trial and error with code, I've been able to pull out all of this data. See, for example: Parsing xml to pandas data frame throws memory error and Dynamic search through xml attributes using lxml and xpath in python However, I realized that I was extracting the data poorly, with a child "Time" repeated for each