lxml

XPath: extract current node content including all child nodes

杀马特。学长 韩版系。学妹 submitted on 2019-12-11 09:43:01
Question: I ran into a problem extracting the content of the current node including all of its child nodes. In the code below, I want to get the string abcdefg<b>b1b2b3</b> from the pre tag, but "child::*" does not return it, and with "/text()" I lose the b tag's formatting information. Please help me out.

    # -*- coding: utf-8 -*-
    from lxml import html
    import lxml.etree as le
    input = "<pre>abcdefg<b>b1b2b3</b></pre>"
    input_xpath = "//pre/child::*"
    tree = html.fromstring(input)
    result = tree.xpath(input_xpath)
    result1 =

Remove “xmlns:py…” with lxml.objectify

寵の児 submitted on 2019-12-11 09:23:15
Question: I just discovered lxml.objectify, which seems nice and easy for reading and writing simple XML files. First, is it a good idea to use lxml.objectify? For instance, is it mature, still developed, and likely to be available in the future? Second, how do I prevent objectify from adding markup like xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="str" in the output below? Input: config.xml

    <?xml version="1.0" encoding="utf-8"?>
    <Test>
    <MyElement1>sdfsdfdsfd</MyElement1>
    <
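
For the second part, a hedged sketch: lxml provides objectify.deannotate() to strip the type annotations and etree.cleanup_namespaces() to drop namespace declarations that are no longer used. The small tree built here is an assumption standing in for the truncated config.xml.

```python
from lxml import etree, objectify

# Build a small tree the way objectify would when writing a config file.
root = objectify.Element("Test")
root.MyElement1 = "sdfsdfdsfd"

# Drop the py:pytype / xsi:type annotations, then remove the namespace
# declarations that are no longer used anywhere in the tree.
objectify.deannotate(root)
etree.cleanup_namespaces(root)

xml = etree.tostring(root, pretty_print=True, encoding="unicode")
print(xml)
```

Calling both functions just before serializing keeps the annotations available while the tree is still being edited.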

Parse XHTML5 with undefined entities

岁酱吖の submitted on 2019-12-11 09:06:21
Question: Please consider this:

    import xml.etree.ElementTree as ET
    xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head><title>XHTML sample</title></head>
    <body>
    <p> Sample text</p>
    </body>
    </html>
    '''
    parser = ET.XMLParser()
    parser.entity['nbsp'] = ' '
    tree = ET.fromstring(xhtml, parser=parser)
    print(ET.tostring(tree, method='xml'))

which renders nice text

Parsing with lxml xpath

可紊 submitted on 2019-12-11 08:56:03
Question: I was trying to write lxml/XPath code to parse the HTML from this link: https://www.theice.com/productguide/ProductSpec.shtml?specId=251 Specifically, I was trying to parse the <tr class="last"> table near the end of the page. I wanted to obtain the text in that sub-table, for example "New York" and the hours listed next to it (and do the same for London and Singapore). I have the following code (which doesn't work properly):

    doc = lxml.html.fromstring(page)
    tds = doc.xpath('//table[@class

Prevent Python lxml from wrapping plain text in a <p> tag

你。 submitted on 2019-12-11 08:44:08
Question: I don't want lxml to add anything to plain text; I left it as it is on purpose. lxml wraps plain text in a <p> tag. Here the value might be HTML or plain text. I need lxml to process HTML and leave plain text alone.

    import lxml.html
    mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
    for text in mixed:
        html = lxml.html.fromstring(text)
        print(lxml.html.tostring(html))

The output:

    b'<p>plaintext</p>'
    b'<a>HTML</a>'
    b'<a>HTML</a>'

What I need is:

    b'plaintext'
    b'<a>HTML</a>'
    b'<a>HTML</a>'

So I come up

Python lxml: Ignore XML declaration (errors)

一笑奈何 submitted on 2019-12-11 08:42:05
Question: I am trying to parse the custom actions file of the Thunar file browser ( ~/.config/Thunar/uca.xml ) with the lxml Python module. For some reason, Thunar writes a malformed declaration into these files: <?xml encoding="UTF-8" version="1.0"?> The version is expected to appear as the first "attribute" in the declaration, so lxml raises an XMLSyntaxError if I try to parse the file. And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus
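
A minimal sketch of one workaround, assuming it is acceptable to read the file as bytes and drop the declaration before parsing (the parser then assumes UTF-8, which matches what this file declares). The helper name and the sample content are hypothetical.

```python
import re

from lxml import etree

# Matches an XML declaration at the very start of the data, whatever the
# order of its pseudo-attributes.
XML_DECLARATION = re.compile(rb"^\s*<\?xml\b[^>]*\?>")


def parse_without_declaration(data):
    """Parse XML bytes after dropping a possibly malformed declaration."""
    cleaned = XML_DECLARATION.sub(b"", data, count=1)
    return etree.fromstring(cleaned)


# Hypothetical sample; the real uca.xml content is not shown in the question.
sample = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b"<actions><action><name>Example</name></action></actions>"
)
root = parse_without_declaration(sample)
print(root.findtext("action/name"))
```

This leaves the file on disk untouched, so it does not matter that Thunar rewrites the declaration.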

Python lxml (objectify): Xpath troubles

╄→гoц情女王★ submitted on 2019-12-11 08:25:37
Question: I am attempting to parse an XML document, extracting data using lxml objectify and XPath. Here is a snippet of the document:

    <?xml version="1.0" encoding="UTF-8"?>
    <Assets>
     <asset name="Adham">
      <pos>
       <x>27913.769923</x>
       <y>5174.627773</y>
      </pos>
      <description>Ba bla bla</description>
      <bar>(null)</bar>
      </general>
     </asset>
     <asset name="Adrian">
      <pos>
       <x>-179.477707</x>
       <y>5286.959359</y>
      </pos>
      <commodities/>
      <description>test test test</description>
      <bar>more bla</bar>
      </general>
     </asset>
    </Assets

Convert lxml to scrapy xxs selector

对着背影说爱祢 submitted on 2019-12-11 08:14:35
Question: How can I convert this pure Python lxml code to scrapy's built-in xxs selectors? This version works, but I want to convert it to the scrapy xxs selectors.

    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield

Parsing XPath within non standard XML using lxml Python

Deadly submitted on 2019-12-11 07:45:47
Question: I’m trying to create a database of all patent information from Google Patents. Much of my work so far builds on this very good answer from MattH in "Python to parse non-standard XML file". My Python code is too large to display, so it's linked here. The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations.
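
A hedged sketch of the general idea of handling several XML documents appended into one file: split the data in front of each XML declaration, parse every chunk on its own, and run the XPath expression per document. It assumes the whole file fits in memory and Python 3.7+ (for splitting on a zero-width match); the element names and sample data are hypothetical, and this is not necessarily the approach from the linked answer.

```python
import re

from lxml import etree

# Zero-width split point in front of every XML declaration.
DECLARATION_START = re.compile(rb"(?=<\?xml\s)")


def iter_documents(data):
    """Yield one parsed root element per XML document found in data."""
    for chunk in DECLARATION_START.split(data):
        if chunk.strip():
            yield etree.fromstring(chunk)


# Hypothetical sample standing in for the concatenated patent file.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<patent><title>First</title></patent>\n"
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<patent><title>Second</title></patent>\n"
)
titles = [root.xpath("string(//title)") for root in iter_documents(sample)]
print(titles)
```

Because each chunk is parsed as its own document, an XPath expression only ever sees one root element at a time.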

Python lxml: how to get text from an element which has a child element

人盡茶涼 submitted on 2019-12-11 06:55:58
Question: I want to extract sometext from the HTML code, but the following doesn't return sometext; instead it returns "\n". So how do I get sometext?

    a = html.fromstring("""
    <p class="clearfix">
    <i class="xueli"></i>
    sometext
    </p>
    """)
    a.find(".//i").getparent().text

Answer 1: Instead of .text, use the text_content() method:

    In [5]: a.find(".//i").getparent().text_content().strip()
    Out[5]: 'sometext'

Or, you can get the text sibling that follows the i element:

    In [6]: a.xpath(".//i/following-sibling::text()")
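
The behavior comes from how lxml stores text: an element's .text only covers the text before its first child, and text that follows a child is stored on that child as .tail. A small sketch comparing the options, using a compact version of the question's markup (variable names are illustrative):

```python
from lxml import html

p = html.fromstring('<p class="clearfix"><i class="xueli"></i> sometext </p>')
i = p.find(".//i")

# .text of the parent only covers text before the first child element.
before_child = p.text

# Text that follows a child element is stored on that child as .tail.
from_tail = (i.tail or "").strip()

# text_content() concatenates all text nodes below the element.
from_text_content = p.text_content().strip()

# The XPath text() node test reaches the same text node.
from_xpath = [t.strip() for t in i.xpath("following-sibling::text()")]

print(before_child, from_tail, from_text_content, from_xpath)
```

text_content() is the simplest choice when all text under the element is wanted; .tail or the XPath form is more precise when only the text after one specific child matters.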