lxml | 易学教程

Xpath extract current node content including all child nodes

Question: I've run into a problem extracting the content of the current node including all of its child nodes. In the code below, I want to get the string abcdefgb1b2b3 inside the pre tag, but "child::*" does not return it, and with "/text()" I lose the b tag formatting information. Please help me out.

    # -*- coding: utf-8 -*-
    from lxml import html
    import lxml.etree as le
    input = "<pre>abcdefgb1b2b3</pre>"
    input_xpath = "//pre/child::*"
    tree = html.fromstring(input)
    result = tree.xpath(input_xpath)
    result1 =

Remove “xmlns:py…” with lxml.objectify

Question: I just discovered lxml.objectify, which seems nice and easy for reading and writing simple XML files. Firstly, is it a good idea to use lxml.objectify? For instance, is it mature, still developed, and likely to be available in the future? Secondly, how do I prevent objectify from adding markup like xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="str" in the output below?

Input (config.xml):

    <?xml version="1.0" encoding="utf-8"?>
    <Test>
    <MyElement1>sdfsdfdsfd</MyElement1>
    <
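For the second part of the question, a minimal sketch of one option is `objectify.deannotate()`, which strips the type annotations that objectify attaches to assigned values. The sample document and the assigned value here are hypothetical, and the `cleanup_namespaces=True` argument assumes a reasonably recent lxml release.

```python
from lxml import etree, objectify

# Hypothetical input, shaped like the config.xml fragment in the question.
root = objectify.fromstring("<Test><MyElement1>old</MyElement1></Test>")

# Assigning a Python value makes objectify annotate the new element.
root.MyElement1 = "new"
annotated = etree.tostring(root, encoding="unicode")

# Remove the py:pytype / xsi:type annotations and the unused xmlns declarations.
objectify.deannotate(root, cleanup_namespaces=True)
clean = etree.tostring(root, encoding="unicode")
```

Calling `deannotate()` just before serializing keeps the annotations available while the tree is being edited.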

Parse XHTML5 with undefined entities

Question: Please consider this:

    import xml.etree.ElementTree as ET
    xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head><title>XHTML sample</title></head>
    <body> Sample text </body>
    </html>
    '''
    parser = ET.XMLParser()
    parser.entity['nbsp'] = ' '
    tree = ET.fromstring(xhtml, parser=parser)
    print(ET.tostring(tree, method='xml'))

which renders nice text

Parsing with lxml xpath

Question: I am trying to write lxml/XPath code to parse the HTML from this link: https://www.theice.com/productguide/ProductSpec.shtml?specId=251 Specifically, I am trying to parse the <tr class="last"> table near the end of the page. I want to obtain the text in that sub-table, for example "New York" and the hours listed next to it (and do the same for London and Singapore). I have the following code, which doesn't work properly:

    doc = lxml.html.fromstring(page)
    tds = doc.xpath('//table[@class

Prevent python lxml from adding plain text a tag

Question: I don't want lxml to add anything to plain text. I left those values as they are on purpose, but lxml wraps plain text in a tag. Here a value might be HTML or plain text. I need lxml to process the HTML and leave the plain text alone.

    import lxml.html
    mixed = ['plaintext', '<a>HTML</a>', '<a>HTML</a>']
    for text in mixed:
        html = lxml.html.fromstring(text)
        print(lxml.html.tostring(html))

The output:

    b'plaintext'
    b'<a>HTML</a>'
    b'<a>HTML</a>'

What I need is:

    b'plaintext'
    b'<a>HTML</a>'
    b'<a>HTML</a>'

So I come up

Python lxml: Ignore XML declaration (errors)

Question: I am trying to parse the custom actions file of the Thunar file browser ( ~/.config/Thunar/uca.xml ) with the lxml Python module. For some reason, Thunar writes a malformed declaration into these files: <?xml encoding="UTF-8" version="1.0"?> The version is expected to appear as the first "attribute" in the declaration, so lxml raises an XMLSyntaxError when I try to parse the file. And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus
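One workaround that could apply to the situation described above is to strip the declaration from the raw bytes before handing them to lxml. The sketch below is an assumption-laden illustration, not a confirmed fix from the original thread: the helper name and the element names in the sample are hypothetical, and dropping the declaration means lxml falls back to UTF-8, which matches the encoding the file declares.

```python
import re

from lxml import etree

# Matches an XML declaration at the very start of the data, well-formed or not.
XML_DECLARATION = re.compile(rb"^\s*<\?xml\b[^>]*\?>")


def parse_without_declaration(data):
    """Parse XML bytes after removing a leading (possibly malformed) declaration."""
    cleaned = XML_DECLARATION.sub(b"", data, count=1)
    return etree.fromstring(cleaned)


# Hypothetical sample; only the declaration is taken from the question.
sample = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b"<actions><action><name>Open Terminal</name></action></actions>"
)
root = parse_without_declaration(sample)
names = [name.text for name in root.iter("name")]
```

Because the file on disk is never modified, this does not conflict with Thunar rewriting the declaration.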

Python lxml (objectify): Xpath troubles

Question: I am attempting to parse an XML document, extracting data using lxml objectify and XPath. Here is a snippet of the document:

    <?xml version="1.0" encoding="UTF-8"?>
    <Assets>
    <asset name="Adham">
    <pos>
    <x>27913.769923</x>
    <y>5174.627773</y>
    </pos>
    <description>Ba bla bla</description>
    <bar>(null)</bar>
    </general>
    </asset>
    <asset name="Adrian">
    <pos>
    <x>-179.477707</x>
    <y>5286.959359</y>
    </pos>
    <commodities/>
    <description>test test test</description>
    <bar>more bla</bar>
    </general>
    </asset>
    </Assets

convert lxml to scrapy xxs selector

Question: How can I convert this pure Python lxml code to Scrapy's built-in xxs selectors? This version works, but I want to convert it to the Scrapy xxs selectors.

    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            # first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield

Parsing XPath within non standard XML using lxml Python

Question: I’m trying to create a database of all patent information from Google Patents. Much of my work so far has used this very good answer from MattH on parsing a non-standard XML file in Python. My Python code is too large to display, so it's linked here. The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations.

python lxml: how to get text from an element which has a child element

Question: I want to extract sometext from the HTML code, but the following doesn't return sometext; instead it returns "\n". So how do I get sometext?

    a=html.fromstring(""" sometext """)
    a.find(".//i").getparent().text

Answer 1: Instead of .text, use the text_content() method:

    In [5]: a.find(".//i").getparent().text_content().strip()
    Out[5]: 'sometext'

Or, you can get the text sibling that follows the i element:

    In [6]: a.xpath(".//i/following-sibling::text()")
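The reason `.text` misses the string can be sketched as follows: in lxml, text that follows a child element is stored on that child's `tail`, not on the parent's `text`. The markup below is a hypothetical reconstruction, since the tags in the original snippet were stripped.

```python
from lxml import html

# Hypothetical markup, assumed to resemble the stripped original.
a = html.fromstring("<div>\n<i></i> sometext\n</div>")
i = a.find(".//i")

before_child = i.getparent().text        # only what precedes <i>
tail_text = i.tail                       # the text right after </i>
all_text = i.getparent().text_content()  # all text in the parent, tags dropped
sibling_text = a.xpath(".//i/following-sibling::text()")  # list of text nodes
```

`text_content()` is the simplest choice when all of the parent's text is wanted; `tail` or the `following-sibling::text()` expression is more precise when only the text after one specific child matters.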