lxml | 易学教程

python install lxml on mac os 10.10.1

I bought a new MacBook and I am new to Mac OS. I read a lot online about how to install Scrapy and did everything suggested, but I have a problem installing lxml. I ran this in the terminal:

    pip install lxml

A lot of downloads started and a lot of text was printed to the terminal, but I got this error message in red:

    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1
    ----------------------------------------
    Cleaning up...
    Command /Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import

python [lxml] - cleaning out html tags

    import sys
    from lxml.html.clean import clean_html, Cleaner

    def clean(text):
        try:
            # Drop scripts, embedded objects, meta tags, page structure, links and styles
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            # Report how many characters the cleaning removed (negative number)
            print(len(cleaner.clean_html(text)) - len(text))
            return cleaner.clean_html(text)
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean a couple of HTML pages so that in the end I am left with just the text and nothing else, but try as I might, the above doesn't
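The goal stated above (keep only the text) can also be approached without Cleaner. The following is a minimal sketch, not the asker's method: it assumes lxml is installed, and the helper name extract_text and the sample markup are made up for illustration. It drops script and style elements first, because their contents would otherwise show up in the extracted text.

```python
import lxml.html


def extract_text(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    doc = lxml.html.fromstring(markup)
    # Remove <script> and <style> elements together with their contents.
    for element in doc.xpath("//script | //style"):
        element.drop_tree()
    # text_content() concatenates all remaining text nodes;
    # split/join collapses runs of whitespace.
    return " ".join(doc.text_content().split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <a href='#'>world</a></p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

Whether this is preferable to Cleaner depends on how much of the markup needs to survive; here nothing but text is kept.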

Web Crawler 05: A Detailed Guide to the BeautifulSoup Library

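BeautifulSoup

1. What is BeautifulSoup?

A flexible, convenient, and efficient web page parsing library that supports multiple parsers. With it you can extract information from web pages without writing regular expressions.

2. Installing BeautifulSoup

    pip3 install lxml
    pip3 install BeautifulSoup4

3. Parsers

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages: built into Python; moderate speed; tolerant of malformed documents
Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages: fast; tolerant of malformed documents
Disadvantages: requires a C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, "xml")
Advantages: fast; the only parser that supports XML
Disadvantages: requires a C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages: best error tolerance; parses documents the way a browser does; produces HTML5-format documents
Disadvantages: slow; does not depend on external extensions

Basic usage

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name=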

In lxml, how do I remove a tag but retain all contents?

The problem is this: I have an XML fragment like so:

    <fragment>text1 <a>inner1 </a>text2 inner2 <c>t</c>ext3</fragment>

For the result, I want to remove all <a> and <c> tags but retain their text contents and child nodes just as they are. Also, the -Element should be left untouched. The result should then look like this:

    <fragment>text1 inner<d>1</d> text2 inner2 text3</fragment>

For the time being, I'll revert to a very dirty trick: I'll etree.tostring the fragment, remove the offending tags via regex, and replace the original fragment with the etree.fromstring result of
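One way to do this without the regex round trip is lxml's etree.strip_tags, which removes the named tags while merging their text and children into the parent. The sketch below is illustrative only: it assumes lxml is installed, and the input fragment is a simplified stand-in written for this example, not the asker's exact data.

```python
from lxml import etree

# Simplified stand-in fragment (an assumption for illustration):
# <a> contains text plus a child <d>, and <c> contains only text.
fragment = etree.fromstring(
    "<fragment>text1 <a>inner<d>1</d> </a>text2 <c>t</c>ext3</fragment>"
)

# Remove every <a> and <c> element but keep their text, tails and children.
etree.strip_tags(fragment, "a", "c")

result = etree.tostring(fragment, encoding="unicode")
print(result)
# <fragment>text1 inner<d>1</d> text2 text3</fragment>
```

Elements not named in the call, such as <d> here, are left in place.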

How to install lxml for python without administrative rights on linux?

Question

I just need some packages that aren't present on the host machine (and Linux and I... well... we haven't spent much time together). I used to install them like this:

    # from the source
    python setup.py install --user

or

    # with easy_install
    easy_install prefix=~/.local package

But it doesn't work with lxml. I get a lot of errors during the build:

    x:~/lxml-2.3$ python setup.py build
    Building lxml version 2.3.
    Building without Cython.
    ERROR: /bin/sh: xslt-config: command not found
    ** make sure the

Python sax to lxml for 80+GB XML

How would you read an XML file using SAX and convert it to an lxml etree.iterparse element? To provide an overview of the problem: I have built an XML ingestion tool using lxml for an XML feed that ranges in size from 25 to 500 MB and needs ingestion on a bi-daily basis, but the tool also needs to perform a one-time ingestion of a file that is 60 to 100 GB. I chose lxml based on the specifications, which stated that a node would not exceed 4 to 8 GB in size, which I thought would allow the node to be read into memory and cleared when finished. An overview of the code is below:

    elements = etree
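The read-then-clear approach described above is commonly written with etree.iterparse. The following is a rough sketch, not the asker's code: it assumes lxml is installed, the helper name iter_records and the tag name "item" are invented for the example, and a small in-memory document stands in for the real feed.

```python
import io

from lxml import etree


def iter_records(source, tag):
    """Yield each completed <tag> element, then free it (hypothetical helper)."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem
        # Release the element's own children, text and attributes.
        elem.clear()
        # Also drop already-processed siblings still referenced by the parent,
        # otherwise empty elements accumulate over a very large file.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


sample = io.BytesIO(b"<feed><item id='1'>a</item><item id='2'>b</item></feed>")
ids = [item.get("id") for item in iter_records(sample, "item")]
print(ids)
# ['1', '2']
```

Each element must be fully consumed inside the loop body, because it is emptied as soon as the generator resumes. A single node of several gigabytes would still have to fit in memory with this pattern.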

Escape unescaped characters in XML with Python

Question

I need to escape special characters in an invalid XML file which is about 5000 lines long. Here's an example of the XML that I have to deal with:

    <root>
      <element>
        <name>name & surname</name>
        <mail>name@name.org</mail>
      </element>
    </root>

Here the problem is the character "&" in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.

Answer 1:

If you don't care about invalid characters in the xml you could use XML parser's
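A different technique from the parser-based answer excerpted above is to pre-process the text with a regular expression that escapes only bare ampersands. This sketch uses only the standard library. The function name escape_bare_ampersands is invented for the example, and it assumes that anything shaped like "&name;" or "&#123;" is an intentional entity reference and should be left alone.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts a named, decimal or hexadecimal reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text):
    """Replace unescaped ampersands with &amp; (hypothetical helper)."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


broken = (
    "<root><element><name>name & surname</name>"
    "<mail>name@name.org</mail></element></root>"
)
fixed = escape_bare_ampersands(broken)

# The repaired text now parses, and the parser decodes &amp; back to "&".
name = ET.fromstring(fixed).find("element/name").text
print(name)
# name & surname
```

This only handles ampersands. A stray "<" inside text content, or a reference to an undefined named entity, would still make the document fail to parse.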

beautifulsoup won't recognize lxml

Question

I'm attempting to use lxml as the parser for BeautifulSoup because the default one is MUCH slower, however I'm getting this error:

    soup = BeautifulSoup(html, "lxml")
    File "/home/rob/python/stock/local/lib/python2.7/site-packages/bs4/__init__.py", line 152, in __init__
    % ",".join(features))
    bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I have uninstalled and reinstalled lxml as well as beautifulsoup many times,
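One possible cause of this error, though not necessarily the one in this question, is that lxml was installed for a different interpreter or virtualenv than the one running the script. The sketch below is a diagnostic using only the standard library; the helper name parser_available is made up for the example.

```python
import importlib.util
import sys


def parser_available(module_name="lxml"):
    """Return True if the module can be found by this interpreter (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running, then whether it can see lxml.
print("interpreter:", sys.executable)
print("lxml importable:", parser_available())
```

If the second line prints False, installing lxml with that same interpreter (for example via its own pip) is the usual next step to try.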

How to not load the comments while parsing XML in lxml

Question

I am trying to parse an XML file in Python using lxml like this:

    objectify.parse(xmlPath, parserWithSchema)

but the XML file may contain comments in strange places:

    <root> <text>Sample text</text>  <float>1.23456</float> </root>

Is there a way to not load the comments, or to delete them before parsing?

Answer 1:

Set remove_comments=True on the parser (documentation):

    from lxml import etree, objectify
    parser = etree.XMLParser(remove_comments=True)
    tree = objectify.parse(xmlPath,
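A self-contained sketch of the remove_comments option mentioned in the answer above. It assumes lxml is installed, uses plain etree rather than objectify to stay minimal, and parses an invented in-memory sample containing a stray comment instead of the asker's file.

```python
import io

from lxml import etree

# Invented sample: a comment sits between two sibling elements.
xml_bytes = (
    b"<root><text>Sample text</text>"
    b"<!-- stray comment --><float>1.23456</float></root>"
)

# Comments are discarded while parsing, so they never enter the tree.
parser = etree.XMLParser(remove_comments=True)
tree = etree.parse(io.BytesIO(xml_bytes), parser)

tags = [child.tag for child in tree.getroot()]
print(tags)
# ['text', 'float']
```

Without the option, iterating over the root would also yield the comment node.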

lxml runtime error: Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0

I have a perplexing problem. I am using Mac OS 10.9, anaconda 3.4.1 and python 2.7.6, developing a web application with python-amazon-product-api. I overcame an obstacle installing lxml by referring to "clang error: unknown argument: '-mno-fused-madd' (python package installation failure)", but then another runtime error happened. Here is the output from the web browser:

    Exception Type: ImportError
    Exception Value: dlopen(/Users/User_Name/Documents/App_Name/lib/python2.7/site-packages/lxml/etree.so, 2): Library not loaded: libxml2.2.dylib
    Referenced from: /Users/User_Name/Documents/App_Name/lib