lxml

French and lxml text

前提是你 submitted on 2019-11-28 03:48:36
Question: I'm trying to assign a French text string to an element's text using lxml: el = etree.Element("someelement") el.text = 'Disponible à partir du 1er Octobre' I get the error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters I've also tried: el.text = etree.CDATA('Disponible à partir du 1er Octobre') but I get the same error. How do I handle French in XML, in particular ISO-8859-1? There are ways to specify encoding within the tostring()
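lxml only accepts element text as unicode or plain ASCII, so this error appears when a byte string containing non-ASCII characters (for example ISO-8859-1 bytes under Python 2) is assigned to .text. A minimal sketch of the usual fix, written here in Python 3 syntax (under Python 2 the same idea means using a u'' literal or decoding the bytes first):

    from lxml import etree

    el = etree.Element("someelement")

    # A real (unicode) string works directly.
    el.text = 'Disponible à partir du 1er Octobre'

    # If the text arrives as ISO-8859-1 bytes, decode it before assigning.
    raw = b'Disponible \xe0 partir du 1er Octobre'
    el.text = raw.decode('iso-8859-1')

    # The output encoding is chosen at serialization time, not at assignment time.
    print(etree.tostring(el, encoding='ISO-8859-1', xml_declaration=True).decode('iso-8859-1'))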

How to Pretty Print HTML to a file, with indentation

喜夏-厌秋 submitted on 2019-11-28 03:22:56
I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an HTML file. How do I do that? This is what I have so far (I am relatively new to Python and lxml): import lxml.html as lh from lxml.html import builder as E sliderRoot=lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;") scrollContainer=lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;") sliderRoot.append(scrollContainer) print lh.tostring(sliderRoot, pretty_print = True, method="html") As you can see I am using the
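A common approach is simply to write the pretty-printed output of tostring() to a file; note that pretty_print only indents elements that carry no surrounding text, so a tree with interleaved text may not change visibly. A rough sketch under those assumptions (the output file name is made up):

    import lxml.html as lh
    from lxml.html import builder as E

    sliderRoot = lh.Element("div", E.CLASS("scroll"),
                            style="overflow-x: hidden; overflow-y: hidden;")
    scrollContainer = lh.Element("div", E.CLASS("scrollContainer"),
                                 style="width: 4340px;")
    sliderRoot.append(scrollContainer)

    # tostring() returns bytes, so open the target file in binary mode.
    with open("slider.html", "wb") as f:
        f.write(lh.tostring(sliderRoot, pretty_print=True, method="html"))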

ubuntu 11.04 lxml import etree problem for custom python

百般思念 submitted on 2019-11-28 02:51:21
Question: Ubuntu 11.04 ships with Python 2.7 as its native interpreter. I built Python 2.5 from source into /usr/local/python2.5/bin and am trying to install lxml for that custom Python 2.5 install. I also use virtualenv and switch to my Python 2.5 env. On importing lxml I get an error: from lxml import etree ImportError: /home/se7en/.virtualenvs/e-py25/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS2_DecodeLatin1 With the Python 2.7 env everything is fine, but with Python 2.5 the import fails. Please
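Undefined PyUnicodeUCS2_* / PyUnicodeUCS4_* symbols almost always mean the extension was compiled against a Python built with a different internal unicode width (narrow UCS2 vs wide UCS4) than the interpreter importing it; the usual fix is to rebuild lxml from source with the custom Python 2.5 interpreter instead of reusing a prebuilt egg. A small diagnostic sketch (not the original poster's code) to check which build each interpreter is:

    import sys

    # 65535 means a narrow (UCS2) build, 1114111 means a wide (UCS4) build.
    # The compiled lxml egg and the interpreter importing it must agree on this.
    print(sys.maxunicode)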

pip error: unrecognized command line option ‘-fstack-protector-strong’

戏子无情 submitted on 2019-11-28 02:47:33
Question: When I run sudo pip install pyquery, sudo pip install lxml, or sudo pip install cython, I get very similar output with the same error: x86_64-linux-gnu-gcc: error: unrecognized command line option ‘-fstack-protector-strong’ Here's the complete pip output for sudo pip install pyquery: Requirement already satisfied (use --upgrade to upgrade): pyquery in /usr/local/lib/python2.7/dist-packages Downloading/unpacking lxml>=2.1 (from pyquery) Running setup.py egg_info for package lxml
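The offending flag typically comes from the CFLAGS the system Python itself was built with: distutils replays those flags when compiling C extensions such as lxml and cython, so a locally installed gcc that is older than the one used to build Python rejects -fstack-protector-strong. The usual remedies are to upgrade gcc or to override CFLAGS when invoking pip. A diagnostic sketch to see which flags and compiler distutils will use:

    import sysconfig  # on very old Pythons this lives in distutils.sysconfig

    # These are the compiler settings pip/distutils pass to gcc when
    # building C extensions.
    print(sysconfig.get_config_var('CC'))
    print(sysconfig.get_config_var('CFLAGS'))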

Installing Scrapy on Windows, Linux, macOS, and other platforms

不羁岁月 submitted on 2019-11-28 01:29:06
Installing Scrapy: Scrapy can be installed in several ways, and it supports Python 2.7+ or Python 3.3+. The following describes installing Scrapy under Python 3. Scrapy has quite a few dependencies, requiring at least Twisted 14.0, lxml 3.4 and pyOpenSSL 0.14, and the details differ from platform to platform, so make sure these basic libraries are in place before installing, especially on Windows.
1. Anaconda: This is a comparatively easy way to install Scrapy (especially on Windows); you can use it, or pick one of the platform-specific methods below. Anaconda is a Python distribution that bundles the common data-science libraries. If it is not installed yet, download the package for your platform from the official site https://www.continuum.io/downloads. If it is already installed, Scrapy can be installed with the conda command: open the Anaconda Prompt and run conda install Scrapy. When the command finishes without errors, the installation succeeded.
2. Windows: 1. Install lxml. The best way is to install it from a wheel file from http://www.lfd.uci.edu/~gohlke/pythonlibs/. This site is a real blessing for Windows users: pre-built packages are available for essentially every Python library, and it is often called the universal Python library site
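Whichever route you take, a quick way to confirm that Scrapy and its core dependencies are importable and recent enough is a short version check (a sketch; the minimum versions are the ones named in the text above):

    import scrapy
    import twisted
    import lxml.etree
    import OpenSSL

    # Minimum versions mentioned above: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.
    print("Scrapy   :", scrapy.__version__)
    print("Twisted  :", twisted.__version__)
    print("lxml     :", lxml.etree.LXML_VERSION)
    print("pyOpenSSL:", OpenSSL.__version__)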

Can CDATA sections be preserved by BeautifulSoup?

本小妞迷上赌 submitted on 2019-11-28 01:27:23
Question: I'm using BeautifulSoup to read, modify, and write an XML file, and I'm having trouble with CDATA sections being stripped out. Here's a simplified example. The culprit XML file: <?xml version="1.0" ?> <foo> <bar><![CDATA[ !@#$%^&*()_+{}|:"<>?,./;'[]\-= ]]></bar> </foo> And here's the Python script: from bs4 import BeautifulSoup xmlfile = open("cdata.xml", "r") soup = BeautifulSoup( xmlfile, "xml" ) print(soup) Here's the output. Note that the CDATA section tags are missing. <?xml version="1.0"
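BeautifulSoup's "xml" mode is built on lxml, which strips CDATA sections by default and keeps only their text content. If the CDATA markers themselves have to survive a read/modify/write round trip, one option is to drop down to lxml directly and disable the stripping; a sketch, assuming the cdata.xml file from the question:

    from lxml import etree

    # strip_cdata=False keeps CDATA sections intact instead of
    # merging them into ordinary text nodes.
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse("cdata.xml", parser)

    print(etree.tostring(tree, encoding="unicode"))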

Python web crawler + building a simple word cloud

梦想与她 submitted on 2019-11-28 01:00:07
Python web crawler + building a simple word cloud. Contents: Foreword; Getting reacquainted with Python (introduction, use cases, running Python from the command line, basic syntax, connecting to a database); Python crawlers (main steps; first crawler: the urllib standard library + Beautiful Soup; second crawler: Scrapy + XPath); a simple word cloud. Foreword: I wrote this post while attending a training course in Dalian. Most of it is notes from class, plus problems I ran into myself and how I solved them; I'm writing it down so I can look things up later and so we can learn together ヽ(゚∀゚)メ(゚∀゚)ノ. Python version used: python-3.7.4. IDE used: PyCharm. Getting reacquainted with Python. Introduction: Python was written by the Dutch programmer Guido van Rossum over the 1989 Christmas holidays as a way to pass a boring Christmas (TAT). Work on it began in 1989, before Java appeared, but it did not take off at the time because it is a scripting language, interpreted and therefore slow to run. Its strengths are that it needs very little code, ships many algorithms as libraries, and is cross-platform; it is often used to solve algorithmic problems. It is an object-oriented, interpreted language whose stated design goals are "elegant", "explicit" and "simple". P.S.: AWS (Amazon Web Services) is the leading cloud service; Python can run on cloud servers, where the hardware makes up for its slow execution. Other drawbacks: the GIL (Global Interpreter Lock), a mutex that prevents multiple threads from executing machine code concurrently
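As a taste of the first crawler variant listed above (urllib + Beautiful Soup), here is a minimal sketch that fetches a page and pulls out its title and links; the URL is only a placeholder, not one from the original post:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Placeholder URL; replace with the page you actually want to crawl.
    html = urlopen("https://example.com").read()
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.get_text())
    for link in soup.find_all("a"):
        print(link.get("href"))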

how to cast a variable in xpath python

大兔子大兔子 submitted on 2019-11-28 00:52:54
Question: from lxml import html import requests pagina = 'http://www.beleggen.nl/amx' page = requests.get(pagina) tree = html.fromstring(page.text) aandeel = tree.xpath('//a[@title="Imtech"]/text()') print aandeel This part works, but I want to read multiple lines with different titles. Is it possible to replace the "Imtech" part with a variable? Something like this? It obviously doesn't work, but where did I go wrong, or is it not quite this easy? FondsName = "Imtech" aandeel = tree.xpath('//a[@title="%s"
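Two common ways to parameterise the title are plain Python string formatting and lxml's built-in XPath variables, which avoid quoting problems entirely. A sketch of both, reusing the question's own URL and title:

    from lxml import html
    import requests

    page = requests.get('http://www.beleggen.nl/amx')
    tree = html.fromstring(page.text)

    FondsName = "Imtech"

    # Option 1: build the expression with string formatting.
    aandeel = tree.xpath('//a[@title="%s"]/text()' % FondsName)

    # Option 2: let lxml substitute an XPath variable (no quoting worries).
    aandeel = tree.xpath('//a[@title=$name]/text()', name=FondsName)

    print(aandeel)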

Should I use .text or .content when parsing a Requests response?

守給你的承諾、 submitted on 2019-11-28 00:34:00
Question: I occasionally use res.content or res.text to parse a response from Requests. In the use cases I have had, it didn't seem to matter which option I used. What is the main difference between parsing HTML with .content and with .text? For example: import requests from lxml import html res = requests.get(...) node = html.fromstring(res.content) In the above situation, should I be using res.content or res.text? What is a good rule of thumb for when to use each? Answer 1: From the documentation: When you make a
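In short: res.content is the raw bytes of the response body, while res.text is those bytes decoded to str using the encoding Requests inferred from the HTTP headers (with a fallback guess). When handing HTML to lxml it is often safer to pass the bytes and let lxml honour any encoding declared in the document itself. A small sketch illustrating the difference (the URL is a placeholder):

    import requests
    from lxml import html

    res = requests.get("https://example.com")

    print(type(res.content))   # <class 'bytes'> - raw body
    print(type(res.text))      # <class 'str'>   - decoded with res.encoding
    print(res.encoding)        # encoding guessed from the HTTP headers

    # Passing bytes lets lxml apply the document's own encoding declaration.
    node = html.fromstring(res.content)
    print(node.tag)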

How to tell lxml.etree.tostring(element) not to write namespaces in python?

谁都会走 submitted on 2019-11-28 00:23:17
Question: I have a huge XML file (1 GB). I want to move some of the elements (entries) to another file with the same header and specifications. Let's say the original file contains this entry with the tag <to_move> : <?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE some SYSTEM "some.dtd"> <some> ... <to_move date="somedate"> <child>some text</child> ... ... </to_move> ... </some> I use lxml.etree.iterparse to iterate through the file. Works fine. When I find the element with tag <to_move> , let's
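A rough sketch of the iterparse pattern described here: stream through the large file, serialize each <to_move> element into a second file wrapped in the same header and root, and clear processed elements so memory stays flat. The file names and the hand-written header are assumptions for illustration, not the original poster's code:

    from lxml import etree

    with open("moved.xml", "wb") as out:
        out.write(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n')
        out.write(b'<!DOCTYPE some SYSTEM "some.dtd">\n')
        out.write(b'<some>\n')

        # 'end' events fire once an element and all its children are parsed.
        for event, elem in etree.iterparse("huge.xml", tag="to_move"):
            out.write(etree.tostring(elem))
            # Free the element and its already-processed siblings
            # to keep memory usage low on a 1 GB input.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

        out.write(b'</some>\n')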