lxml

HTML encoding and lxml parsing

只愿长相守 submitted on 2019-11-28 06:05:00
I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1. <!DOCTYPE html> <html lang='en'> <head> <title>Unicode Chars: 은 —’</title> <meta charset='utf-8'> </head> <body></body> </html>

2. <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body></body> </html>

3. <?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C/

BeautifulSoup and lxml.html - what to prefer? [duplicate]

穿精又带淫゛_ submitted on 2019-11-28 05:48:57
This question already has an answer here: Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes? (7 answers) I am working on a project that will involve parsing HTML. After searching around, I found two probable options: BeautifulSoup and lxml.html. Is there any reason to prefer one over the other? I have used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common. I know I should use the one that works for me, but I was looking for personal experiences with both. The simple answer, imo
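Worth noting: the two are not mutually exclusive. BeautifulSoup can use lxml as its underlying parser, so you can keep the BeautifulSoup API with lxml's speed. A small sketch (the markup is my own example):

```python
from bs4 import BeautifulSoup
from lxml import html

markup = "<p class='intro'>Hello <b>world</b></p>"

# BeautifulSoup API, with lxml doing the parsing underneath:
soup = BeautifulSoup(markup, "lxml")
print(soup.find("p", class_="intro").get_text())   # Hello world

# The same query straight through lxml.html with XPath:
doc = html.fromstring(markup)
print(doc.xpath("string(//p[@class='intro'])"))    # Hello world
```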

Py2exe lxml woes

喜你入骨 submitted on 2019-11-28 05:22:14
I have a wxPython application that depends on lxml and works well when running it through the Python interpreter. However, when creating an exe with py2exe, I got this error: ImportError: No module named _elementpath. I then used python setup.py py2exe -p lxml and did not get the above error, but another one saying ImportError: No module named gzip. Could anyone let me know what the problem is and how I can fix it? Also, should I put any DLL files like libxml2, libxslt etc. in my dist folder? I searched the computer and did not find these files, so maybe they aren't needed? Thanks. Edit: I just
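The usual cause is that py2exe's dependency scanner misses modules lxml imports indirectly (lxml._elementpath from C code, gzip only lazily). A hypothetical setup.py sketch that lists them explicitly instead of passing -p on the command line (myapp.py is a placeholder name):

```python
# Hypothetical setup.py for py2exe; module names in "includes" are the
# ones py2exe's scanner typically fails to find on its own.
from distutils.core import setup
import py2exe  # registers the "py2exe" command with distutils

setup(
    windows=["myapp.py"],
    options={
        "py2exe": {
            "includes": ["lxml.etree", "lxml._elementpath", "gzip"],
        }
    },
)
```

As for libxml2/libxslt DLLs: the official lxml Windows builds compile them statically into the extension modules, which is why the files don't exist separately on your machine, so they normally don't need to be copied into dist.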

python's lxml and iterparse method

霸气de小男生 submitted on 2019-11-28 05:17:17
Question: Say I have this sample XML:

<result> <field k='field1'> <value h='1'><text>text_value1</text></value> </field> <field k='field2'> <value><text>text_value2</text></value> </field> <field k='field3'> <value><text>some_text</text></value> </field> </result>

Using Python's lxml, how can I get the value of each field for every result set? So basically, I want to iterate over every result set, then iterate over every field in that result set and print the text data. This is what I have so far:
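A hedged sketch of one way to do the iteration the question describes, using iterparse so that large files never need to be held in memory at once:

```python
import io
from lxml import etree

xml = b"""<result>
  <field k='field1'><value h='1'><text>text_value1</text></value></field>
  <field k='field2'><value><text>text_value2</text></value></field>
  <field k='field3'><value><text>some_text</text></value></field>
</result>"""

values = {}
# "end" events fire once a <field> subtree is fully parsed.
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="field"):
    # Field name from the k attribute, text from the nested <text> node.
    values[elem.get("k")] = elem.findtext("value/text")
    elem.clear()  # release the subtree as we go

print(values)
```

With multiple `<result>` sets, the same pattern works by listening for the end of `result` elements instead and walking their `field` children.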

Efficient way to iterate through xml elements

烈酒焚心 submitted on 2019-11-28 05:04:45
I have XML like this:

<a> <b>hello</b> <b>world</b> </a> <x> <y></y> </x> <a> <b>first</b> <b>second</b> <b>third</b> </a>

I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document, so I use XPath to handle that:

from lxml import etree
doc = etree.fromstring(xml)
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print b

It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use. I wonder, is there a more efficient way to iterate through an indefinite number of XML elements?
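A sketch of the usual cheaper alternative: Element.iter() walks the whole subtree once in C, with no XPath compilation at all (the sample below wraps the fragments in a single root, since well-formed XML needs one):

```python
from lxml import etree

xml = b"""<root>
  <a><b>hello</b><b>world</b></a>
  <x><y/></x>
  <a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)
# iter("b") yields every <b> element, however deeply nested, in one pass.
texts = [b.text for b in doc.iter("b")]
print(texts)  # ['hello', 'world', 'first', 'second', 'third']
```

For files too big to parse up front, the same idea combines with etree.iterparse (as in the previous question) so the tree is streamed rather than built whole.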

How to find recursively for a tag of XML using LXML?

别等时光非礼了梦想. submitted on 2019-11-28 04:50:38
<?xml version="1.0" ?> <data> <test> <f1 /> </test> <test2> <test3> <f1 /> </test3> </test2> <f1 /> </data>

Using lxml, is it possible to search recursively for the tag f1? I tried the findall method, but it works only for immediate children. I think I should go for BeautifulSoup for this!

You can use XPath to search recursively:

>>> from lxml import etree
>>> q = etree.fromstring('<xml><hello>a</hello><x><hello>b</hello></x></xml>')
>>> q.findall('hello')  # Tag name, first level only.
[<Element hello at 414a7c8>]
>>> q.findall('.//hello')  # XPath, recursive.
[<Element hello at 414a7c8>,
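Applied to the document from the question, a short sketch of the difference (".//f1" uses the ElementPath subset of XPath; Element.iter does the same walk without any path expression):

```python
from lxml import etree

xml = b"""<data>
  <test><f1/></test>
  <test2><test3><f1/></test3></test2>
  <f1/>
</data>"""

doc = etree.fromstring(xml)
print(len(doc.findall("f1")))     # 1 - bare tag name matches direct children only
print(len(doc.findall(".//f1")))  # 3 - ".//" searches the whole subtree
print(len(list(doc.iter("f1"))))  # 3 - recursive as well, no path needed
```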

stripping inline tags with python's lxml

♀尐吖头ヾ submitted on 2019-11-28 04:47:15
Question: I have to deal with two types of inline tags in XML documents. The first type encloses text that I want to keep; I can handle this with lxml's etree.tostring(element, method="text", encoding='utf-8'). The second type encloses text that I don't want to keep. How can I get rid of these tags and their text? I would prefer not to use regular expressions, if possible. Thanks.

Answer 1: I think that strip_tags and strip_elements are what you want in each case. For example, this
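A small sketch of the two calls side by side (the sample markup is my own): strip_tags unwraps a tag but keeps its text, while strip_elements deletes the tag together with its text.

```python
from lxml import etree

doc = etree.fromstring("<p>keep <em>this</em> but not <del>that</del></p>")

etree.strip_tags(doc, "em")                        # unwrap: "this" survives
etree.strip_elements(doc, "del", with_tail=False)  # delete tag and its text

print(etree.tostring(doc, method="text", encoding="unicode"))
```

with_tail=False preserves any text that followed the removed element; the default (True) would drop that tail text too.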

import lxml fails on OSX after (seemingly) successful install

谁说我不能喝 submitted on 2019-11-28 04:42:56
Question: I'm trying to install lxml for Python on OS X 10.6.8. I ran sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml in the terminal, based on this answer to a question about installing lxml: https://stackoverflow.com/a/6545556/216336. This was the output of that command:

MYCOMPUTER:~ MYUSERNAME$ sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml
Password:
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.3.3

lxml

强颜欢笑 submitted on 2019-11-28 04:19:24
lxml

# Install
pip3 install lxml (or: pip install lxml)

# Import
from lxml import etree

# Reference: https://www.cnblogs.com/gaochsh/p/6757475.html

XPath basic syntax:
1) // (double slash) starts from the document root and scans the whole document, returning every node that matches as a list.
2) / (single slash) steps down one level from the current path, or operates on the content of the current tag.
3) /text() gets the text content at the current path.
4) /@xxxx extracts the value of attribute xxxx from the tag at the current path.
5) | (union) selects several paths at once; e.g. //p | //div selects all matching p tags and div tags under the current path.
6) . (dot) selects the current node.
7) .. (double dot) selects the parent of the current node.

There are also two important special methods, starts-with(@attribute, shared-prefix) and string(.), which will be covered in detail later.

Source: https://www.cnblogs.com/pengyy/p/11392100.html
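The rules above can be exercised in a few lines (the sample markup is my own):

```python
from lxml import etree

markup = '<div><p id="a">one</p><span><p id="b">two</p></span></div>'
doc = etree.fromstring(markup)

print(doc.xpath("//p/text()"))      # rules 1 and 3: all <p> text, anywhere
print(doc.xpath("/div/p/@id"))      # rules 2 and 4: attribute of a direct child
print(len(doc.xpath("//p | //span")))                   # rule 5: union
print(doc.xpath("//p[starts-with(@id, 'a')]/text()"))   # starts-with()
print(doc.xpath("string(.)"))       # string(.): all text, concatenated
```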

Obtaining position info when parsing HTML in Python

僤鯓⒐⒋嵵緔 submitted on 2019-11-28 04:06:15
Question: I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree; I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: word "foo" at line x, column y, is misspelled). As an example I want
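A partial answer as a sketch: lxml exposes the source line of every parsed element via the .sourceline attribute, which covers the "line" half of the question even for malformed HTML. libxml2 does not track column offsets, so columns would need a separate pass over the original text.

```python
from lxml import etree

markup = b"""<html>
<body>
<p>foo</p>
</body>
</html>"""

# The recovering HTML parser accepts malformed input too.
doc = etree.fromstring(markup, etree.HTMLParser())
p = doc.find(".//p")
print(p.tag, p.sourceline)  # the <p> starts on line 3 of the input
```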