lxml

French and lxml text

前提是你 submitted on 2019-11-28 03:48:36
Question: I'm trying to assign a French text string to an element's text using lxml: el = etree.Element("someelement") el.text = 'Disponible à partir du 1er Octobre' I get the error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters I've also tried: el.text = etree.CDATA('Disponible à partir du 1er Octobre') but I get the same error. How do I handle French in XML, in particular ISO-8859-1? There are ways to specify encoding within the tostring()
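lxml only accepts element text as unicode or plain ASCII, so this error appears when a byte string containing non-ASCII characters (for example ISO-8859-1 bytes under Python 2) is assigned to .text. A minimal sketch of the usual fix, written here in Python 3 syntax (under Python 2 the same idea means using a u'' literal or decoding the bytes first):

    from lxml import etree

    el = etree.Element("someelement")

    # A real (unicode) string works directly.
    el.text = 'Disponible à partir du 1er Octobre'

    # If the text arrives as ISO-8859-1 bytes, decode it before assigning.
    raw = b'Disponible \xe0 partir du 1er Octobre'
    el.text = raw.decode('iso-8859-1')

    # The output encoding is chosen at serialization time, not at assignment time.
    print(etree.tostring(el, encoding='ISO-8859-1', xml_declaration=True).decode('iso-8859-1'))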

How to Pretty Print HTML to a file, with indentation

喜夏-厌秋 submitted on 2019-11-28 03:22:56
I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an HTML file. How do I do that? This is what I have so far (I am relatively new to Python and lxml): import lxml.html as lh from lxml.html import builder as E sliderRoot=lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;") scrollContainer=lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;") sliderRoot.append(scrollContainer) print lh.tostring(sliderRoot, pretty_print = True, method="html") As you can see I am using the
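A common approach is simply to write the pretty-printed output of tostring() to a file; note that pretty_print only indents elements that carry no surrounding text, so a tree with interleaved text may not change visibly. A rough sketch under those assumptions (the output file name is made up):

    import lxml.html as lh
    from lxml.html import builder as E

    sliderRoot = lh.Element("div", E.CLASS("scroll"),
                            style="overflow-x: hidden; overflow-y: hidden;")
    scrollContainer = lh.Element("div", E.CLASS("scrollContainer"),
                                 style="width: 4340px;")
    sliderRoot.append(scrollContainer)

    # tostring() returns bytes, so open the target file in binary mode.
    with open("slider.html", "wb") as f:
        f.write(lh.tostring(sliderRoot, pretty_print=True, method="html"))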

ubuntu 11.04 lxml import etree problem for custom python

百般思念 submitted on 2019-11-28 02:51:21
Question: Ubuntu 11.04 ships with Python 2.7 as its native interpreter. I built Python 2.5 from source into /usr/local/python2.5/bin and am trying to install lxml for that custom Python 2.5 install. I also use virtualenv and switch to my Python 2.5 env. On importing lxml I get an error: from lxml import etree ImportError: /home/se7en/.virtualenvs/e-py25/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS2_DecodeLatin1 With the Python 2.7 env everything is fine, but with Python 2.5 the import fails. Please
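Undefined PyUnicodeUCS2_* / PyUnicodeUCS4_* symbols almost always mean the extension was compiled against a Python built with a different internal unicode width (narrow UCS2 vs wide UCS4) than the interpreter importing it; the usual fix is to rebuild lxml from source with the custom Python 2.5 interpreter instead of reusing a prebuilt egg. A small diagnostic sketch (not the original poster's code) to check which build each interpreter is:

    import sys

    # 65535 means a narrow (UCS2) build, 1114111 means a wide (UCS4) build.
    # The compiled lxml egg and the interpreter importing it must agree on this.
    print(sys.maxunicode)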

pip error: unrecognized command line option ‘-fstack-protector-strong’

戏子无情 submitted on 2019-11-28 02:47:33
Question: When I run sudo pip install pyquery, sudo pip install lxml, or sudo pip install cython, I get very similar output with the same error: x86_64-linux-gnu-gcc: error: unrecognized command line option ‘-fstack-protector-strong’ Here's the complete pip output for sudo pip install pyquery: Requirement already satisfied (use --upgrade to upgrade): pyquery in /usr/local/lib/python2.7/dist-packages Downloading/unpacking lxml>=2.1 (from pyquery) Running setup.py egg_info for package lxml
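The offending flag typically comes from the CFLAGS the system Python itself was built with: distutils replays those flags when compiling C extensions such as lxml and cython, so a locally installed gcc that is older than the one used to build Python rejects -fstack-protector-strong. The usual remedies are to upgrade gcc or to override CFLAGS when invoking pip. A diagnostic sketch to see which flags and compiler distutils will use:

    import sysconfig  # on very old Pythons this lives in distutils.sysconfig

    # These are the compiler settings pip/distutils pass to gcc when
    # building C extensions.
    print(sysconfig.get_config_var('CC'))
    print(sysconfig.get_config_var('CFLAGS'))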

Installing Scrapy on Windows, Linux, macOS, and other platforms

不羁岁月 submitted on 2019-11-28 01:29:06
Installing Scrapy: Scrapy can be installed in several ways, and it supports Python 2.7+ or Python 3.3+. The following describes installing Scrapy under Python 3. Scrapy has quite a few dependencies, requiring at least Twisted 14.0, lxml 3.4 and pyOpenSSL 0.14, and the details differ from platform to platform, so make sure these basic libraries are in place before installing, especially on Windows.
1. Anaconda: This is a comparatively easy way to install Scrapy (especially on Windows); you can use it, or pick one of the platform-specific methods below. Anaconda is a Python distribution that bundles the common data-science libraries. If it is not installed yet, download the package for your platform from the official site https://www.continuum.io/downloads. If it is already installed, Scrapy can be installed with the conda command: open the Anaconda Prompt and run conda install Scrapy. When the command finishes without errors, the installation succeeded.
2. Windows: 1. Install lxml. The best way is to install it from a wheel file from http://www.lfd.uci.edu/~gohlke/pythonlibs/. This site is a real blessing for Windows users: pre-built packages are available for essentially every Python library, and it is often called the universal Python library site
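Whichever route you take, a quick way to confirm that Scrapy and its core dependencies are importable and recent enough is a short version check (a sketch; the minimum versions are the ones named in the text above):

    import scrapy
    import twisted
    import lxml.etree
    import OpenSSL

    # Minimum versions mentioned above: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.
    print("Scrapy   :", scrapy.__version__)
    print("Twisted  :", twisted.__version__)
    print("lxml     :", lxml.etree.LXML_VERSION)
    print("pyOpenSSL:", OpenSSL.__version__)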

Can CDATA sections be preserved by BeautifulSoup?

本小妞迷上赌 submitted on 2019-11-28 01:27:23
Question: I'm using BeautifulSoup to read, modify, and write an XML file, and I'm having trouble with CDATA sections being stripped out. Here's a simplified example. The culprit XML file: <?xml version="1.0" ?> <foo> <bar><![CDATA[ !@#$%^&*()_+{}|:"<>?,./;'[]\-= ]]></bar> </foo> And here's the Python script: from bs4 import BeautifulSoup xmlfile = open("cdata.xml", "r") soup = BeautifulSoup( xmlfile, "xml" ) print(soup) Here's the output. Note that the CDATA section tags are missing. <?xml version="1.0"
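BeautifulSoup's "xml" mode is built on lxml, which strips CDATA sections by default and keeps only their text content. If the CDATA markers themselves have to survive a read/modify/write round trip, one option is to drop down to lxml directly and disable the stripping; a sketch, assuming the cdata.xml file from the question:

    from lxml import etree

    # strip_cdata=False keeps CDATA sections intact instead of
    # merging them into ordinary text nodes.
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse("cdata.xml", parser)

    print(etree.tostring(tree, encoding="unicode"))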

Python web crawler + building a simple word cloud

梦想与她 submitted on 2019-11-28 01:00:07
Python web crawler + building a simple word cloud. Contents: Foreword; Getting reacquainted with Python (introduction, use cases, running Python from the command line, basic syntax, connecting to a database); Python crawlers (main steps; first crawler: the urllib standard library + Beautiful Soup; second crawler: Scrapy + XPath); a simple word cloud. Foreword: I wrote this post while attending a training course in Dalian. Most of it is notes from class, plus problems I ran into myself and how I solved them; I'm writing it down so I can look things up later and so we can learn together ヽ(゚∀゚)メ(゚∀゚)ノ. Python version used: python-3.7.4. IDE used: PyCharm. Getting reacquainted with Python. Introduction: Python was written by the Dutch programmer Guido van Rossum over the 1989 Christmas holidays as a way to pass a boring Christmas (TAT). Work on it began in 1989, before Java appeared, but it did not take off at the time because it is a scripting language, interpreted and therefore slow to run. Its strengths are that it needs very little code, ships many algorithms as libraries, and is cross-platform; it is often used to solve algorithmic problems. It is an object-oriented, interpreted language whose stated design goals are "elegant", "explicit" and "simple". P.S.: AWS (Amazon Web Services) is the leading cloud service; Python can run on cloud servers, where the hardware makes up for its slow execution. Other drawbacks: the GIL (Global Interpreter Lock), a mutex that prevents multiple threads from executing machine code concurrently
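As a taste of the first crawler variant listed above (urllib + Beautiful Soup), here is a minimal sketch that fetches a page and pulls out its title and links; the URL is only a placeholder, not one from the original post:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Placeholder URL; replace with the page you actually want to crawl.
    html = urlopen("https://example.com").read()
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.get_text())
    for link in soup.find_all("a"):
        print(link.get("href"))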

how to cast a variable in xpath python

大兔子大兔子 submitted on 2019-11-28 00:52:54
Question: from lxml import html import requests pagina = 'http://www.beleggen.nl/amx' page = requests.get(pagina) tree = html.fromstring(page.text) aandeel = tree.xpath('//a[@title="Imtech"]/text()') print aandeel This part works, but I want to read multiple lines with different titles. Is it possible to replace the "Imtech" part with a variable? Something like this? It obviously doesn't work, but where did I go wrong, or is it not quite this easy? FondsName = "Imtech" aandeel = tree.xpath('//a[@title="%s"
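Two common ways to parameterise the title are plain Python string formatting and lxml's built-in XPath variables, which avoid quoting problems entirely. A sketch of both, reusing the question's own URL and title:

    from lxml import html
    import requests

    page = requests.get('http://www.beleggen.nl/amx')
    tree = html.fromstring(page.text)

    FondsName = "Imtech"

    # Option 1: build the expression with string formatting.
    aandeel = tree.xpath('//a[@title="%s"]/text()' % FondsName)

    # Option 2: let lxml substitute an XPath variable (no quoting worries).
    aandeel = tree.xpath('//a[@title=$name]/text()', name=FondsName)

    print(aandeel)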

Should I use .text or .content when parsing a Requests response?

守給你的承諾、 submitted on 2019-11-28 00:34:00
Question: I occasionally use res.content or res.text to parse a response from Requests. In the use cases I have had, it didn't seem to matter which option I used. What is the main difference between parsing HTML with .content and with .text? For example: import requests from lxml import html res = requests.get(...) node = html.fromstring(res.content) In the above situation, should I be using res.content or res.text? What is a good rule of thumb for when to use each? Answer 1: From the documentation: When you make a
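In short: res.content is the raw bytes of the response body, while res.text is those bytes decoded to str using the encoding Requests inferred from the HTTP headers (with a fallback guess). When handing HTML to lxml it is often safer to pass the bytes and let lxml honour any encoding declared in the document itself. A small sketch illustrating the difference (the URL is a placeholder):

    import requests
    from lxml import html

    res = requests.get("https://example.com")

    print(type(res.content))   # <class 'bytes'> - raw body
    print(type(res.text))      # <class 'str'>   - decoded with res.encoding
    print(res.encoding)        # encoding guessed from the HTTP headers

    # Passing bytes lets lxml apply the document's own encoding declaration.
    node = html.fromstring(res.content)
    print(node.tag)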

How to tell lxml.etree.tostring(element) not to write namespaces in python?

谁都会走 submitted on 2019-11-28 00:23:17
Question: I have a huge XML file (1 GB). I want to move some of the elements (entries) to another file with the same header and specifications. Let's say the original file contains this entry with the tag <to_move> : <?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE some SYSTEM "some.dtd"> <some> ... <to_move date="somedate"> <child>some text</child> ... ... </to_move> ... </some> I use lxml.etree.iterparse to iterate through the file. Works fine. When I find the element with tag <to_move> , let's
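A rough sketch of the iterparse pattern described here: stream through the large file, serialize each <to_move> element into a second file wrapped in the same header and root, and clear processed elements so memory stays flat. The file names and the hand-written header are assumptions for illustration, not the original poster's code:

    from lxml import etree

    with open("moved.xml", "wb") as out:
        out.write(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n')
        out.write(b'<!DOCTYPE some SYSTEM "some.dtd">\n')
        out.write(b'<some>\n')

        # 'end' events fire once an element and all its children are parsed.
        for event, elem in etree.iterparse("huge.xml", tag="to_move"):
            out.write(etree.tostring(elem))
            # Free the element and its already-processed siblings
            # to keep memory usage low on a 1 GB input.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

        out.write(b'</some>\n')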