lxml

HTML encoding and lxml parsing

只愿长相守 submitted on 2019-11-28 06:05:00
I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1. <!DOCTYPE html> <html lang='en'> <head> <title>Unicode Chars: 은 —’</title> <meta charset='utf-8'> </head> <body></body> </html>

2. <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body></body> </html>

3. <?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C/

BeautifulSoup and lxml.html - what to prefer? [duplicate]

穿精又带淫゛_ submitted on 2019-11-28 05:48:57
This question already has an answer here: Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes? (7 answers) I am working on a project that will involve parsing HTML. After searching around, I found two probable options: BeautifulSoup and lxml.html. Is there any reason to prefer one over the other? I have used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common. I know I should use the one that works for me, but I was looking for personal experiences with both. The simple answer, imo
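Worth noting: the two are not mutually exclusive. BeautifulSoup can use lxml as its underlying parser, so you can keep the BeautifulSoup API with lxml's speed. A small sketch (the markup is my own example):

```python
from bs4 import BeautifulSoup
from lxml import html

markup = "<p class='intro'>Hello <b>world</b></p>"

# BeautifulSoup API, with lxml doing the parsing underneath:
soup = BeautifulSoup(markup, "lxml")
print(soup.find("p", class_="intro").get_text())   # Hello world

# The same query straight through lxml.html with XPath:
doc = html.fromstring(markup)
print(doc.xpath("string(//p[@class='intro'])"))    # Hello world
```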

Py2exe lxml woes

喜你入骨 submitted on 2019-11-28 05:22:14
I have a wxPython application that depends on lxml and works well when running it through the Python interpreter. However, when creating an exe with py2exe, I got this error: ImportError: No module named _elementpath. I then used python setup.py py2exe -p lxml and did not get the above error, but another one saying ImportError: No module named gzip. Could anyone let me know what the problem is and how I can fix it? Also, should I put any DLL files like libxml2, libxslt etc. in my dist folder? I searched the computer and did not find these files, so maybe they aren't needed? Thanks. Edit: I just
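The usual cause is that py2exe's dependency scanner misses modules lxml imports indirectly (lxml._elementpath from C code, gzip only lazily). A hypothetical setup.py sketch that lists them explicitly instead of passing -p on the command line (myapp.py is a placeholder name):

```python
# Hypothetical setup.py for py2exe; module names in "includes" are the
# ones py2exe's scanner typically fails to find on its own.
from distutils.core import setup
import py2exe  # registers the "py2exe" command with distutils

setup(
    windows=["myapp.py"],
    options={
        "py2exe": {
            "includes": ["lxml.etree", "lxml._elementpath", "gzip"],
        }
    },
)
```

As for libxml2/libxslt DLLs: the official lxml Windows builds compile them statically into the extension modules, which is why the files don't exist separately on your machine, so they normally don't need to be copied into dist.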

python's lxml and iterparse method

霸气de小男生 submitted on 2019-11-28 05:17:17
Question: Say I have this sample XML:

<result> <field k='field1'> <value h='1'><text>text_value1</text></value> </field> <field k='field2'> <value><text>text_value2</text></value> </field> <field k='field3'> <value><text>some_text</text></value> </field> </result>

Using Python's lxml, how can I get the value of each field for every result set? So basically, I want to iterate over every result set, then iterate over every field in that result set and print the text data. This is what I have so far:
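A hedged sketch of one way to do the iteration the question describes, using iterparse so that large files never need to be held in memory at once:

```python
import io
from lxml import etree

xml = b"""<result>
  <field k='field1'><value h='1'><text>text_value1</text></value></field>
  <field k='field2'><value><text>text_value2</text></value></field>
  <field k='field3'><value><text>some_text</text></value></field>
</result>"""

values = {}
# "end" events fire once a <field> subtree is fully parsed.
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="field"):
    # Field name from the k attribute, text from the nested <text> node.
    values[elem.get("k")] = elem.findtext("value/text")
    elem.clear()  # release the subtree as we go

print(values)
```

With multiple `<result>` sets, the same pattern works by listening for the end of `result` elements instead and walking their `field` children.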

Efficient way to iterate through xml elements

烈酒焚心 submitted on 2019-11-28 05:04:45
I have XML like this:

<a> <b>hello</b> <b>world</b> </a> <x> <y></y> </x> <a> <b>first</b> <b>second</b> <b>third</b> </a>

I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document, so I use XPath to handle that:

from lxml import etree
doc = etree.fromstring(xml)
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print b

It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use. I wonder, is there a more efficient way to iterate through an indefinite number of XML elements?
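A sketch of the usual cheaper alternative: Element.iter() walks the whole subtree once in C, with no XPath compilation at all (the sample below wraps the fragments in a single root, since well-formed XML needs one):

```python
from lxml import etree

xml = b"""<root>
  <a><b>hello</b><b>world</b></a>
  <x><y/></x>
  <a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)
# iter("b") yields every <b> element, however deeply nested, in one pass.
texts = [b.text for b in doc.iter("b")]
print(texts)  # ['hello', 'world', 'first', 'second', 'third']
```

For files too big to parse up front, the same idea combines with etree.iterparse (as in the previous question) so the tree is streamed rather than built whole.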

How to find recursively for a tag of XML using LXML?

别等时光非礼了梦想. submitted on 2019-11-28 04:50:38
<?xml version="1.0" ?> <data> <test> <f1 /> </test> <test2> <test3> <f1 /> </test3> </test2> <f1 /> </data>

Using lxml, is it possible to search recursively for the tag f1? I tried the findall method, but it works only for immediate children. I think I should go for BeautifulSoup for this!

You can use XPath to search recursively:

>>> from lxml import etree
>>> q = etree.fromstring('<xml><hello>a</hello><x><hello>b</hello></x></xml>')
>>> q.findall('hello')  # Tag name, first level only.
[<Element hello at 414a7c8>]
>>> q.findall('.//hello')  # XPath, recursive.
[<Element hello at 414a7c8>,
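Applied to the document from the question, a short sketch of the difference (".//f1" uses the ElementPath subset of XPath; Element.iter does the same walk without any path expression):

```python
from lxml import etree

xml = b"""<data>
  <test><f1/></test>
  <test2><test3><f1/></test3></test2>
  <f1/>
</data>"""

doc = etree.fromstring(xml)
print(len(doc.findall("f1")))     # 1 - bare tag name matches direct children only
print(len(doc.findall(".//f1")))  # 3 - ".//" searches the whole subtree
print(len(list(doc.iter("f1"))))  # 3 - recursive as well, no path needed
```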

stripping inline tags with python's lxml

♀尐吖头ヾ submitted on 2019-11-28 04:47:15
Question: I have to deal with two types of inline tags in XML documents. The first type encloses text that I want to keep; I can handle this with lxml's etree.tostring(element, method="text", encoding='utf-8'). The second type encloses text that I don't want to keep. How can I get rid of these tags and their text? I would prefer not to use regular expressions, if possible. Thanks.

Answer 1: I think that strip_tags and strip_elements are what you want in each case. For example, this
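A small sketch of the two calls side by side (the sample markup is my own): strip_tags unwraps a tag but keeps its text, while strip_elements deletes the tag together with its text.

```python
from lxml import etree

doc = etree.fromstring("<p>keep <em>this</em> but not <del>that</del></p>")

etree.strip_tags(doc, "em")                        # unwrap: "this" survives
etree.strip_elements(doc, "del", with_tail=False)  # delete tag and its text

print(etree.tostring(doc, method="text", encoding="unicode"))
```

with_tail=False preserves any text that followed the removed element; the default (True) would drop that tail text too.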

import lxml fails on OSX after (seemingly) successful install

谁说我不能喝 submitted on 2019-11-28 04:42:56
Question: I'm trying to install lxml for Python on OS X 10.6.8. I ran sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml in the terminal, based on this answer to a question about installing lxml: https://stackoverflow.com/a/6545556/216336. This was the output of that command:

MYCOMPUTER:~ MYUSERNAME$ sudo env ARCHFLAGS="-arch i386 -arch x86_64" easy_install lxml
Password:
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.3.3

lxml

强颜欢笑 submitted on 2019-11-28 04:19:24
lxml

# Install
pip3 install lxml (or: pip install lxml)

# Import
from lxml import etree

# Reference: https://www.cnblogs.com/gaochsh/p/6757475.html

XPath basic syntax:
1) // (double slash) starts from the document root and scans the whole document, returning every node that matches as a list.
2) / (single slash) steps down one level from the current path, or operates on the content of the current tag.
3) /text() gets the text content at the current path.
4) /@xxxx extracts the value of attribute xxxx from the tag at the current path.
5) | (union) selects several paths at once; e.g. //p | //div selects all matching p tags and div tags under the current path.
6) . (dot) selects the current node.
7) .. (double dot) selects the parent of the current node.

There are also two important special methods, starts-with(@attribute, shared-prefix) and string(.), which will be covered in detail later.

Source: https://www.cnblogs.com/pengyy/p/11392100.html
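The rules above can be exercised in a few lines (the sample markup is my own):

```python
from lxml import etree

markup = '<div><p id="a">one</p><span><p id="b">two</p></span></div>'
doc = etree.fromstring(markup)

print(doc.xpath("//p/text()"))      # rules 1 and 3: all <p> text, anywhere
print(doc.xpath("/div/p/@id"))      # rules 2 and 4: attribute of a direct child
print(len(doc.xpath("//p | //span")))                   # rule 5: union
print(doc.xpath("//p[starts-with(@id, 'a')]/text()"))   # starts-with()
print(doc.xpath("string(.)"))       # string(.): all text, concatenated
```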

Obtaining position info when parsing HTML in Python

僤鯓⒐⒋嵵緔 submitted on 2019-11-28 04:06:15
Question: I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree; I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: word "foo" at line x, column y, is misspelled). As an example I want
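A partial answer as a sketch: lxml exposes the source line of every parsed element via the .sourceline attribute, which covers the "line" half of the question even for malformed HTML. libxml2 does not track column offsets, so columns would need a separate pass over the original text.

```python
from lxml import etree

markup = b"""<html>
<body>
<p>foo</p>
</body>
</html>"""

# The recovering HTML parser accepts malformed input too.
doc = etree.fromstring(markup, etree.HTMLParser())
p = doc.find(".//p")
print(p.tag, p.sourceline)  # the <p> starts on line 3 of the input
```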