lxml

python install lxml on mac os 10.10.1

こ雲淡風輕ζ submitted on 2019-11-28 00:01:48
I bought a new MacBook and I am new to Mac OS. I read a lot on the internet about how to install scrap and did everything, but I have a problem installing lxml. I ran this in the terminal:
    pip install lxml
A lot of things started downloading and a lot of text was printed to the terminal, but then I got this error message in red:
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1
    ----------------------------------------
    Cleaning up...
    Command /Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import

python [lxml] - cleaning out html tags

别来无恙 submitted on 2019-11-27 22:59:31
    import sys
    from lxml.html.clean import clean_html, Cleaner
    def clean(text):
        try:
            # Strip scripts, embedded content, meta tags, page structure, links and styles
            cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                              links=True, style=True, remove_tags=['a', 'li', 'td'])
            print(len(cleaner.clean_html(text)) - len(text))
            return cleaner.clean_html(text)
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text
I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean a couple of HTML pages so that in the end I am left with just the text and nothing else, but try as I might, the above doesn't
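For the stated goal of keeping only the text, a minimal sketch that avoids the Cleaner altogether is shown here. It assumes that dropping script and style elements and then flattening the tree is acceptable; `html_to_text` is a hypothetical helper name, not something from the question.

```python
import lxml.html
from lxml import etree


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    doc = lxml.html.fromstring(markup)
    # Remove <script> and <style> together with their contents,
    # but keep any text that follows them (with_tail=False).
    etree.strip_elements(doc, "script", "style", with_tail=False)
    # text_content() concatenates every remaining text node in the tree.
    return " ".join(doc.text_content().split())


page = (
    "<html><body><p>Hello <a href='#'>world</a></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

Unlike `remove_tags`, which unwraps a tag and keeps its children, this discards all markup, so it only fits cases where no tags need to survive.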

Web Crawler 05: A Detailed Guide to the BeautifulSoup Library

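 ̄綄美尐妖づ submitted on 2019-11-27 21:01:22
BeautifulSoup
1. What is BeautifulSoup? A flexible and convenient web page parsing library that is efficient and supports multiple parsers. With it you can extract information from web pages without writing regular expressions.
2. Installing BeautifulSoup
    pip3 install lxml
    pip3 install BeautifulSoup4
3. Parsers (parser; usage; advantages; disadvantages)
    Python standard library; BeautifulSoup(markup, "html.parser"); built into Python, moderate speed, tolerant of malformed documents; poor tolerance in versions before Python 2.7.3 or 3.2.2
    lxml HTML parser; BeautifulSoup(markup, "lxml"); fast, tolerant of malformed documents; requires the C library to be installed
    lxml XML parser; BeautifulSoup(markup, "xml"); fast, the only parser that supports XML; requires the C library to be installed
    html5lib; BeautifulSoup(markup, "html5lib"); best tolerance, parses documents the way a browser does, produces HTML5 documents; slow, does not depend on external extensions
Basic usage
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name=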

In lxml, how do I remove a tag but retain all contents?

∥☆過路亽.° submitted on 2019-11-27 19:48:40
The problem is this: I have an XML fragment like so:
    <fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>
For the result, I want to remove all <a> and <c> tags but retain their text contents and child nodes just as they are. The <b> element should be left untouched. The result should then look like this:
    <fragment>text1 inner<d>1</d> text2 <b>inner2</b> text3</fragment>
For the time being, I'll revert to a very dirty trick: I'll etree.tostring the fragment, remove the offending tags via regex, and replace the original fragment with the etree.fromstring result of
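As a sketch of how this can be done without the regex round trip, lxml provides `etree.strip_tags`, which removes the named elements while merging their text, tail and children into the parent. The example applies it to the input fragment exactly as quoted in the question.

```python
from lxml import etree

fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)

# Drop the <a> and <c> wrappers in place; their text and children stay behind.
etree.strip_tags(fragment, "a", "c")

result = etree.tostring(fragment, encoding="unicode")
print(result)
```

Elements that are not named, such as `<b>`, are left untouched, and any children of a stripped element are re-attached to its parent.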

How to install lxml for python without administrative rights on linux?

旧城冷巷雨未停 submitted on 2019-11-27 18:58:30
Question: I just need some packages that aren't present on the host machine (and Linux and I... we... we haven't spent much time together...). I used to install them like this:
    # from the source
    python setup.py install --user
or
    # with easy_install
    easy_install prefix=~/.local package
But it doesn't work with lxml. I get a lot of errors during the build:
    x:~/lxml-2.3$ python setup.py build
    Building lxml version 2.3.
    Building without Cython.
    ERROR: /bin/sh: xslt-config: command not found
    ** make sure the

Python sax to lxml for 80+GB XML

谁说胖子不能爱 submitted on 2019-11-27 18:37:54
How would you read an XML file using SAX and convert it to an lxml etree.iterparse element? To give an overview of the problem: I have built an XML ingestion tool using lxml for an XML feed that ranges from 25 to 500 MB and needs ingestion on a bi-daily basis, but it also needs to perform a one-time ingestion of a file that is 60 to 100 GB. I chose lxml based on the specifications, which stated that a node would not exceed 4 to 8 GB in size, which I thought would allow each node to be read into memory and cleared when finished. An overview of the code is below: elements = etree
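A common way to keep memory bounded with `etree.iterparse` is to clear each element once it has been handled and to delete the siblings that came before it. The sketch below illustrates that pattern only; the `record` tag, the `feed` root and the `iter_records` helper are hypothetical, and it is not a reconstruction of the code in the question.

```python
import io

from lxml import etree


def iter_records(source, tag):
    """Yield each <tag> element, then release it and its processed siblings."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem
        # Free the element's own content once the caller is done with it.
        elem.clear()
        # Remove earlier siblings so the root does not keep growing.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


sample = io.BytesIO(b"<feed><record id='1'/><record id='2'/><record id='3'/></feed>")
ids = [record.get("id") for record in iter_records(sample, "record")]
print(ids)
```

Each yielded element is still built completely before it is handed over, so a single node of several gigabytes has to fit in memory regardless.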

Escape unescaped characters in XML with Python

蹲街弑〆低调 submitted on 2019-11-27 18:27:32
Question: I need to escape special characters in an invalid XML file that is about 5000 lines long. Here's an example of the XML I have to deal with:
    <root>
      <element>
        <name>name & surname</name>
        <mail>name@name.org</mail>
      </element>
    </root>
Here the problem is the "&" character in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.
Answer 1: If you don't care about invalid characters in the xml you could use XML parser's
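If only stray ampersands are the problem, one alternative is to pre-process the text before parsing. The following is a minimal sketch using only the standard library; the `escape_bare_ampersands` helper and its regular expression are assumptions for illustration, and they do not handle CDATA sections or stray "<" characters.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts a named, decimal or hexadecimal reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace unescaped ampersands with &amp; and leave existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


broken = (
    "<root><element><name>name & surname</name>"
    "<mail>name@name.org</mail></element></root>"
)
fixed = escape_bare_ampersands(broken)
root = ET.fromstring(fixed)  # parses now that the ampersand is escaped
print(root.findtext("element/name"))
```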

beautifulsoup won't recognize lxml

一世执手 submitted on 2019-11-27 17:57:31
Question: I'm attempting to use lxml as the parser for BeautifulSoup because the default one is MUCH slower, but I'm getting this error:
    soup = BeautifulSoup(html, "lxml")
    File "/home/rob/python/stock/local/lib/python2.7/site-packages/bs4/__init__.py", line 152, in __init__
      % ",".join(features))
    bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
I have uninstalled and reinstalled lxml as well as beautifulsoup many times,
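One thing worth checking with this error is whether lxml is importable from the same interpreter that runs BeautifulSoup, since a virtualenv and the system Python can have different packages installed. A small sketch using only the standard library follows; `pick_parser` is a hypothetical helper, not part of bs4.

```python
import importlib.util
import sys


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return the preferred parser name if its module is importable here."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Shows which interpreter is running and which parser it could use.
print(sys.executable)
print(pick_parser())
```

The returned name can then be passed as the second argument to `BeautifulSoup(html, ...)`, so the code falls back to the built-in parser rather than raising `FeatureNotFound`.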

How to not load the comments while parsing XML in lxml

*爱你&永不变心* submitted on 2019-11-27 17:51:37
Question: I am trying to parse an XML file in Python using lxml like this:
    objectify.parse(xmlPath, parserWithSchema)
but the XML file may contain comments in strange places:
    <root>
      <text>Sam<!--comment-->ple text</text>
      <!--comment-->
      <float>1.2<!--comment-->3456</float>
    </root>
Is there a way to skip loading the comments, or to delete them before parsing?
Answer 1: Set remove_comments=True on the parser (documentation):
    from lxml import etree, objectify
    parser = etree.XMLParser(remove_comments=True)
    tree = objectify.parse(xmlPath,
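A self-contained sketch of the `remove_comments=True` option, run against the sample document from the question. It uses plain `etree` rather than `objectify` and parses from an in-memory buffer; both choices are assumptions made to keep the example short.

```python
import io

from lxml import etree

xml = (
    b"<root><text>Sam<!--comment-->ple text</text><!--comment-->"
    b"<float>1.2<!--comment-->3456</float></root>"
)

# Comments are dropped while parsing, so they never reach the tree.
parser = etree.XMLParser(remove_comments=True)
root = etree.parse(io.BytesIO(xml), parser).getroot()

text_value = root.findtext("text")
float_value = float(root.findtext("float"))
print(text_value, float_value)
```

With the comments gone, the text on either side of each comment reads as one continuous value.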

lxml runtime error: Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0

你。 submitted on 2019-11-27 17:50:14
I have a perplexing problem. I am using Mac OS X 10.9, Anaconda 3.4.1 and Python 2.7.6, and I am developing a web application with python-amazon-product-api. I overcame an obstacle with installing lxml by referring to "clang error: unknown argument: '-mno-fused-madd' (python package installation failure)", but then another runtime error occurred. Here is the output from the web browser:
    Exception Type: ImportError
    Exception Value: dlopen(/Users/User_Name/Documents/App_Name/lib/python2.7/site-packages/lxml/etree.so, 2): Library not loaded: libxml2.2.dylib Referenced from: /Users/User_Name/Documents/App_Name/lib