lxml

Filtering out certain bytes in python

匆匆过客 posted on 2019-11-29 11:53:27
Question: I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters. This question, "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes", explains the issue. The solution was to filter out certain bytes, but I'm confused about how to go about doing this. Any help? Edit: Sorry if I didn't give enough info about the problem. The string data comes
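One way to do the filtering, sketched under the assumption that the data is already a decoded Python 3 str: remove every character outside the ranges that the XML 1.0 specification allows. The helper name strip_invalid_xml_chars is hypothetical and not part of lxml.

```python
import re

# Characters permitted by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# The negated class below matches everything else (NUL, other control
# characters, lone surrogates, U+FFFE and U+FFFF).
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with the characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical input containing a NUL byte and a vertical tab.
cleaned = strip_invalid_xml_chars("ok\x00 \x0bvalue\n")
print(repr(cleaned))
```

If the data arrives as bytes, it would need to be decoded with the right codec before this step; which codec applies depends on where the data comes from.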

stripping inline tags with python's lxml

梦想的初衷 posted on 2019-11-29 11:37:29
I have to deal with two types of inline tags in XML documents. The first type of tag encloses text that I want to keep. I can deal with this with lxml's etree.tostring(element, method="text", encoding='utf-8'). The second type of tag encloses text that I don't want to keep. How can I get rid of these tags and their text? I would prefer not to use regular expressions, if possible. Thanks. Mark Longair: I think that strip_tags and strip_elements are what you want in each case. For example, this script: from lxml import etree text = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here
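The script in the excerpt is cut off, so the following is only a minimal sketch of the two calls the answer names, run on a made-up input string. etree.strip_tags removes the tags but keeps their text, while etree.strip_elements removes the elements together with their text. Passing with_tail=False assumes that the text following a removed element should survive.

```python
from lxml import etree

# Made-up input: <z> marks text to keep, <y> marks text to drop.
root = etree.fromstring(
    "<x>hello, <z>keep me</z> and <y>ignore me</y>, done</x>"
)

# Drop the <z> tags but merge their text into the parent.
etree.strip_tags(root, "z")

# Drop the <y> elements and their text; keep the tail text after them.
etree.strip_elements(root, "y", with_tail=False)

result = etree.tostring(root, encoding="unicode")
print(result)
```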

import lxml fails on OSX after (seemingly) successful install

戏子无情 posted on 2019-11-29 11:28:55
I'm trying to install lxml for Python on OS X 10.6.8. I ran sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml in the terminal, based on this answer to a question about installing lxml: https://stackoverflow.com/a/6545556/216336 This was the output of that command: MYCOMPUTER:~ MYUSERNAME$ sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml Password: Searching for lxml Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.3.3 Downloading http://lxml.de/files/lxml-2.3.3.tgz Processing lxml-2.3.3.tgz Running lxml-2.3.3/setup.py -q bdist

Obtaining position info when parsing HTML in Python

依然范特西╮ posted on 2019-11-29 10:59:11
I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree. I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: 'word "foo" at line x, column y, is misspelled'). As an example, I want something like this (using ElementTree's Target API): import xml.etree.ElementTree as ET class EchoTarget:
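A possible alternative to the Target API, sketched with the standard library's html.parser instead of ElementTree or lxml: HTMLParser builds no tree, tolerates malformed markup, and exposes the current position through getpos(), which returns a 1-based line number and a 0-based column offset. The class name WordLocator and the sample input are hypothetical. convert_charrefs=False is passed on the assumption that raw source offsets matter more than decoded character references.

```python
from html.parser import HTMLParser


class WordLocator(HTMLParser):
    """Record the (line, column) of every occurrence of a word in text nodes."""

    def __init__(self, word):
        # Keep character references unconverted so that offsets inside
        # each text chunk match the original source.
        super().__init__(convert_charrefs=False)
        self.word = word
        self.hits = []

    def handle_data(self, data):
        # getpos() reports where this text chunk starts.
        line, col = self.getpos()
        start = data.find(self.word)
        while start != -1:
            before = data[:start]
            newlines = before.count("\n")
            if newlines:
                # The hit is on a later line; column counts from that line's start.
                hit = (line + newlines, len(before) - before.rfind("\n") - 1)
            else:
                hit = (line, col + start)
            self.hits.append(hit)
            start = data.find(self.word, start + 1)


html = "<p>one foo\n  <b>two foo</b>\n</p>"
locator = WordLocator("foo")
locator.feed(html)
locator.close()
for line, col in locator.hits:
    print('word "foo" at line %d, column %d' % (line, col))
```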

lxml memory usage when parsing huge xml in python

若如初见. posted on 2019-11-29 10:47:45
Question: I am a Python newbie. I am trying to parse a huge XML file in my Python module using lxml. In spite of clearing the elements at the end of each loop, my memory shoots up and crashes the application. I am sure I am missing something here. Please help me figure out what that is. The following are the main functions I am using: from lxml import etree def parseXml(context,attribList): for _, element in context: fieldMap={} rowList=[] readAttribs(element,fieldMap,attribList) readAllChildren(element
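The asker's functions are truncated, so this is not a reconstruction of them. It is a minimal sketch of a commonly used lxml pattern for this symptom: clearing an element empties it, but the parent still holds a reference to every processed sibling, so those are deleted as well. The function name iter_records, the tag name row, and the sample document are hypothetical.

```python
import io

from lxml import etree


def iter_records(source, tag):
    """Yield the attributes of each <tag> element without keeping the tree."""
    for _, elem in etree.iterparse(source, events=("end",), tag=tag):
        # Copy out what is needed before the element is emptied.
        yield dict(elem.attrib)
        # Free the element's own children and text.
        elem.clear()
        # Remove already-processed siblings that the parent still references.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


# Hypothetical small document standing in for the huge file.
xml = b"<rows><row id='1'/><row id='2'/><row id='3'/></rows>"
records = list(iter_records(io.BytesIO(xml), "row"))
print(records)
```

For a real file, a path or an open binary file object would be passed instead of the BytesIO buffer.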

requests + lxml: powerful tools for web scraping

白昼怎懂夜的黑 posted on 2019-11-29 10:15:50
requests 1. requests is a powerful third-party HTTP library for Python, built on httplib and urllib3, with a clear and easy-to-use interface. ###1. Installation: pip install requests or easy_install requests ###2. Basic usage: In IPython, use tab completion to look at some attributes of the response object returned by a requests call: In [1]: import requests In [2]: r = requests.get('https://api.github.com') In [3]: r. r.apparent_encoding r.history r.raw r.close r.is_redirect r.reason r.connection r.iter_content r.request r.content r.iter_lines r.status_code r.cookies r.json r.text r.elapsed r.links r.url r.encoding r.ok r.headers r.raise_for_status Quick start: http://requests-docs-cn.readthedocs.io/zh_CN/latest/user/quickstart.html Advanced usage: http:/

French and lxml text

久未见 posted on 2019-11-29 10:08:46
I'm trying to assign a valid French text string to an element's text using lxml: el = etree.Element("someelement") el.text = 'Disponible à partir du 1er Octobre' I get the error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters. I've also tried: el.text = etree.CDATA('Disponible à partir du 1er Octobre') However, I get the same error. How do I handle French in XML, in particular ISO-8859-1? There are ways to specify the encoding within the tostring() function in lxml, but not for assigning text values within elements. jfs: If text contains non-ASCII
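The answer is cut off, so the following is a sketch of the usual remedy rather than the answerer's own code: decode the ISO-8859-1 bytes to a Unicode string before assigning, and name the output encoding only when serializing. It assumes Python 3 and that the input really is ISO-8859-1. The import fallback lets the same lines run with the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Assumed input: the French sentence as raw ISO-8859-1 bytes.
raw = b"Disponible \xe0 partir du 1er Octobre"

el = etree.Element("someelement")
# Assign a decoded (Unicode) string, never the undecoded bytes.
el.text = raw.decode("iso-8859-1")

# The byte encoding is chosen only at serialization time.
serialized = etree.tostring(el, encoding="iso-8859-1")
print(serialized)
```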

Entity references and lxml

蹲街弑〆低调 posted on 2019-11-29 09:59:15
Here's the code I have: from cStringIO import StringIO from lxml import etree xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE root [ <!ENTITY test "This is a test"> ]> <root> <sub>&test;</sub> </root>''') d1 = etree.parse(xml) print '%r' % d1.find('/sub').text parser = etree.XMLParser(resolve_entities=False) d2 = etree.parse(xml, parser=parser) print '%r' % d2.find('/sub').text Here's the output: 'This is a test' None How do I get lxml to give me '&test;', i.e., the raw entity reference? The "unresolved" Entity is left as a child node of the element node sub: >>> print d2.find
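A Python 3 sketch of what the answer describes, using the same document as the question. The answer's own session is truncated, so the attribute access shown here is an assumption about how it continues: with resolve_entities=False the sub element has no text of its own, and the reference appears as its first child.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY test "This is a test">
]>
<root>
<sub>&test;</sub>
</root>"""

# Keep entity references in the tree instead of expanding them.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(XML, parser)
sub = root.find("sub")

# The unresolved reference is a child node of <sub>, not part of sub.text.
entity = sub[0]
print(repr(sub.text))
print(entity.text)
```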

Change text value with lxml

[亡魂溺海] posted on 2019-11-29 09:58:53
Question: I have an XML file; here is a snippet: <gmd_fileIdentifier> <gco_CharacterString>{0328cb65-b564-495a-b17e-e49e04864ab7}</gco_CharacterString> </gmd_fileIdentifier> <gmd_identifier> <gmd_RS_Identifier> <gmd_authority gco_nilReason="missing" /> <gmd_code> <gco_CharacterString>0000</gco_CharacterString> </gmd_code> <gmd_codeSpace xmlns:gml="http://www.opengis.net/gml" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> <gco_CharacterString>test</gco_CharacterString> </gmd_codeSpace> </gmd_RS
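The question is truncated before it says which value should change, so this sketch assumes the goal is to replace the 0000 inside gmd_code. The wrapping root element, the shortened document, and the new value 1234 are all hypothetical. The import fallback lets the same lines run with the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Shortened, hypothetical stand-in for the file, wrapped in a made-up <root>.
snippet = (
    "<root><gmd_identifier><gmd_RS_Identifier>"
    "<gmd_code><gco_CharacterString>0000</gco_CharacterString></gmd_code>"
    "</gmd_RS_Identifier></gmd_identifier></root>"
)

root = etree.fromstring(snippet)

# Locate the element by path, then overwrite its text in place.
node = root.find(".//gmd_code/gco_CharacterString")
node.text = "1234"

updated = etree.tostring(root, encoding="unicode")
print(updated)
```

For a file on disk, etree.parse(path) followed by tree.write(path) would take the place of fromstring and tostring.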

ubuntu 11.04 lxml import etree problem for custom python

ⅰ亾dé卋堺 posted on 2019-11-29 09:26:19
Ubuntu 11.04 ships with Python 2.7 natively. I built Python 2.5 from source into /usr/local/python2.5/bin and am trying to install lxml for my custom Python 2.5 install. I also use virtualenv. I switch to my env with Python 2.5. On importing lxml I get an error: from lxml import etree ImportError: /home/se7en/.virtualenvs/e-py25/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS2_DecodeLatin1 With the Python 2.7 env all is OK, but with Python 2.5 the import fails. Please help me fix this for Python 2.5. ldd /home/se7en/.virtualenvs/e-py25/lib/python2.5/site-packages/lxml-2.2.4