lxml

Python: lxml xpath get two different classes

余生颓废 submitted on 2019-12-25 02:54:14
Question: Here is my sample Python code:

    import requests
    import lxml.html

    page = '<div class="aaaa12"><span class="test">22</span><span class="number">33</span></div><div class="dddd13"><span>Kevin</span></div>'
    tree = lxml.html.fromstring(page)
    number = tree.xpath('//span[@class="number"]/text()')
    price = tree.xpath('.//div[@class="dddd13"]/span/text()')
    print number
    print price

When I run it I get:

    ['33']
    ['Kevin']

However, I would like to get both at once, like ['33', 'Kevin']. I tried
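A minimal sketch of one way to get both values in a single query, using an XPath union (the | operator) over the same sample markup; the union returns matches in document order, which here yields ['33', 'Kevin']:

    import lxml.html

    page = ('<div class="aaaa12"><span class="test">22</span>'
            '<span class="number">33</span></div>'
            '<div class="dddd13"><span>Kevin</span></div>')
    tree = lxml.html.fromstring(page)

    # One XPath expression: | unions both node sets, in document order.
    both = tree.xpath('//span[@class="number"]/text()'
                      ' | //div[@class="dddd13"]/span/text()')
    print(both)  # ['33', 'Kevin']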

Get attribute of complex element using lxml

£可爱£侵袭症+ submitted on 2019-12-25 02:42:31
Question: I have a simple XML file like the one below:

    <brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW">BMW</brandName>
    <maxspeed>
      <value>250</value>
      <unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
    </maxspeed>

I want to parse it with lxml and get the values out of it. For brandName, this is enough:

    'brand_name': m.findtext(NS + 'brandName')

But if I want to get at its abbrev attribute:

    'brand_name': m.findtext(NS + 'brandName').attrib['abbrev']

With
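The usual explanation, sketched with a hypothetical <car> wrapper root so the fragment parses, and with the namespace prefix dropped for simplicity (both are assumptions of this sketch): findtext() returns only the element's text, which is a plain string with no .attrib; to read an attribute you want find(), which returns the Element itself.

    from lxml import etree

    # Hypothetical wrapper root; the original fragment has none.
    xml = b'''<car>
      <brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW">BMW</brandName>
      <maxspeed>
        <value>250</value>
        <unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
      </maxspeed>
    </car>'''

    m = etree.fromstring(xml)

    brand_text = m.findtext('brandName')                  # 'BMW' (text only, a plain string)
    brand_abbrev = m.find('brandName').get('abbrev')      # 'BMW' (attribute read off the Element)
    unit_abbrev = m.find('maxspeed/unit').get('abbrev')   # 'mph'
    print(brand_text, brand_abbrev, unit_abbrev)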

lxml: Force to convert newlines to entities

早过忘川 submitted on 2019-12-25 01:54:23
Question: Is there a way to output newlines inside text elements as entities? Currently, newlines are inserted into the output as-is:

    from lxml import etree
    from lxml.builder import E

    etree.tostring(E.a('one\ntwo'), pretty_print=True)
    # b'<a>one\ntwo</a>\n'

Desired output:

    b'<a>one&#10;two</a>\n'

Answer 1: After looking through the lxml docs, it looks like there is no way to force certain characters to be printed as escaped entities. It also looks like the list of characters that gets escaped varies by the output
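Since lxml itself exposes no switch for this, one workaround is to post-process the serialized bytes; a minimal sketch, assuming newlines occur only inside text content (pretty_print is skipped so the only \n bytes come from the text itself):

    from lxml import etree
    from lxml.builder import E

    raw = etree.tostring(E.a('one\ntwo'))     # b'<a>one\ntwo</a>', no pretty_print, no extra newlines
    escaped = raw.replace(b'\n', b'&#10;')    # &#10; is the character reference for a newline
    print(escaped)                            # b'<a>one&#10;two</a>'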

lxml encoding errors on production

我的梦境 submitted on 2019-12-24 23:34:35
Question: I am trying to process some data with lxml. It works fine on my development server, but on production the following code:

    parser = etree.XMLParser(encoding='cp1251')

throws:

    File "parser.pxi", line 1288, in lxml.etree.XMLParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:77726)
    File "parser.pxi", line 738, in lxml.etree._BaseParser.__init__ (third_party/apphosting/python/lxml/src/lxml/lxml.etree.c:73404)
    LookupError: unknown encoding: 'cp1251'

I am using lxml 2.3. The
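A common workaround sketch when the production libxml2 build does not ship the cp1251 codec: decode the bytes with Python's own codec, which does know cp1251, and hand lxml already-decoded text. The sample bytes and the assumption that the document carries no conflicting encoding declaration are both hypothetical.

    from lxml import etree

    raw = b'<root>\xcf\xf0\xe8\xe2\xe5\xf2</root>'   # hypothetical cp1251 input ("Привет")
    text = raw.decode('cp1251')                      # Python's codec handles cp1251 even if libxml2 cannot
    root = etree.fromstring(text)                    # fine as long as the XML has no encoding declaration
    print(root.text)                                 # 'Привет'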

python3 BeautifulSoup module

▼魔方 西西 submitted on 2019-12-24 21:01:37
1. Installation

(1) Install: pip install beautifulsoup4

(2) Optionally install a parser: pip install lxml, pip install html5lib

(3) Parser comparison (parser / usage / advantages / disadvantages):

Python standard library: BeautifulSoup(markup, "html.parser"). Advantages: built into Python, moderate speed, good tolerance of broken documents. Disadvantages: versions before Python 2.7.3 or 3.2.2 handle broken documents poorly.
lxml HTML parser: BeautifulSoup(markup, "lxml"). Advantages: fast, good tolerance of broken documents. Disadvantages: requires a C library.
lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml"). Advantages: fast, the only parser that supports XML. Disadvantages: requires a C library.
html5lib: BeautifulSoup(markup, "html5lib"). Advantages: best fault tolerance, parses documents the way a browser does, produces HTML5-format documents. Disadvantages: slow, does not rely on external extensions.

2. Using BeautifulSoup:

    from bs4 import BeautifulSoup
    import requests, re

    req_obj = requests.get('https://www.baidu.com')
    soup =
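The usage example above is cut off; here is a minimal, self-contained sketch of the same pattern, assuming the lxml parser is installed (the URL, the forced encoding, and the tags selected are only illustrative):

    from bs4 import BeautifulSoup
    import requests

    req_obj = requests.get('https://www.baidu.com')
    req_obj.encoding = 'utf-8'                      # assumption: force utf-8 to avoid mojibake
    soup = BeautifulSoup(req_obj.text, 'lxml')      # name the parser explicitly

    print(soup.title.string)                        # text of the <title> tag
    for a in soup.find_all('a', limit=5):           # first few links on the page
        print(a.get('href'), a.get_text(strip=True))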

Parsing HTML documents using lxml in python

拈花ヽ惹草 submitted on 2019-12-24 20:39:03
Question: I just downloaded lxml to parse broken HTML documents. I was reading through the lxml documentation but could not find how, given an HTML document, to just retrieve the text of the document using lxml. I would be obliged if someone could help me with this.

Answer 1: It's very simple:

    from lxml import html

    html_document = ...  # get your document contents here from a file or whatever
    tree = html.fromstring(html_document)
    text_document = tree.text_content()

If you only want the content from
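A quick runnable illustration of that answer on a hypothetical broken-HTML string; text_content() simply concatenates all text nodes after the parser has repaired the markup:

    import lxml.html

    broken = '<html><body><p>Hello <b>world</p><p>second paragraph</body>'
    tree = lxml.html.fromstring(broken)     # the HTML parser closes the unclosed tags
    print(tree.text_content())              # all text nodes concatenated, markup stripped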

rewrite ElementTree code in lxml

跟風遠走 submitted on 2019-12-24 20:27:21
Question: I am writing code to extract text from an XML file using ElementTree, but I found that lxml provides XPath features, which is more convenient. So I want to know how this line could be rewritten with lxml:

    if x.nodeName == 'a:pPr' and x.getAttribute('lvl') == '2' and x.hasAttribute('marL') == False:

Currently I am suggested to use this:

    '/p:sld/p:cSld/p:spTree/p:sp/p:nvSpPr/p:nvPr/x[@type="body" and @sz="quarter" and @marL]'

Hope my question is clear!

Answer 1: I'm assuming you are already at a
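A minimal sketch of one way to express that DOM-style test as an lxml XPath; the a: prefix is mapped to the standard DrawingML namespace below, and the file name is hypothetical:

    from lxml import etree

    NSMAP = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

    tree = etree.parse('slide1.xml')   # hypothetical slide XML
    # name a:pPr, lvl="2", and no marL attribute: the same condition as the DOM version
    for ppr in tree.xpath('//a:pPr[@lvl="2" and not(@marL)]', namespaces=NSMAP):
        print(ppr.getparent().tag, ppr.attrib)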

lxml not working with django, scraperwiki

拈花ヽ惹草 submitted on 2019-12-24 19:01:32
Question: I'm working on a Django app that goes through the Illinois General Assembly website to scrape some PDFs. Deployed on my desktop it works fine until urllib2 times out. When I try to deploy it on my Bluehost server, the lxml part of the code throws an error. Any help would be appreciated.

    import scraperwiki
    from bs4 import BeautifulSoup
    import urllib2
    import lxml.etree
    import re
    from django.core.management.base import BaseCommand
    from legi.models import Votes

    class Command(BaseCommand):
        def

Parsing libraries for web scraping: re, beautifulsoup, pyquery

旧街凉风 submitted on 2019-12-24 17:56:28
1. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify a document in an idiomatic way through the parser of your choice, and it can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer under development, and the official site recommends using Beautiful Soup 4 in current projects and porting code to BS4.

Installation (Beautifulsoup4): pip3 install beautifulsoup4

Installing a parser: Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed in different ways. The table below lists the main parsers with their advantages and disadvantages; the official documentation recommends lxml as the parser because it is more efficient. For Python versions before 2.7.3, and before 3.2.2 on Python 3, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not stable enough.

Parser comparison (parser / usage / advantages / disadvantages):

Python standard library: BeautifulSoup(markup, "html.parser"). Advantages: built into Python, moderate speed, good tolerance of broken documents. Disadvantages: versions before Python 2.7.3 or 3.2.2
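A minimal sketch of how the parser choice from the table above is passed to BeautifulSoup; the markup is only illustrative, and lxml and html5lib must be installed separately as described:

    from bs4 import BeautifulSoup

    markup = '<p>Hello <b>world</p>'   # deliberately missing the </b> close tag

    print(BeautifulSoup(markup, 'html.parser').prettify())   # stdlib parser, always available
    print(BeautifulSoup(markup, 'lxml').prettify())          # fast, needs the lxml C library
    print(BeautifulSoup(markup, 'html5lib').prettify())      # browser-style repair, slower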

How to obtain Element values from a KML by using lxml

做~自己de王妃 submitted on 2019-12-24 15:42:17
Question: My problem is very similar to the one found here: How to pull data from KML/XML? The answer to that question is to use Nokogiri to fix the format. I wonder if there is a way to solve a similar problem without fixing the format first. How can I get the values of the dict, so that I can get 'FM2' and 'FM3' from the SimpleData elements below? Here is my KML:

    <?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns
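The KML in the question is cut off, so here is a hedged sketch against a made-up fragment with the same namespaces, showing the usual lxml approach: give the default KML namespace a prefix and select the SimpleData text.

    from lxml import etree

    # Hypothetical fragment standing in for the truncated KML above
    kml = b'''<?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
      <Document>
        <Placemark>
          <ExtendedData>
            <SchemaData schemaUrl="#S_ZONES">
              <SimpleData name="ZONE">FM2</SimpleData>
              <SimpleData name="ZONE">FM3</SimpleData>
            </SchemaData>
          </ExtendedData>
        </Placemark>
      </Document>
    </kml>'''

    root = etree.fromstring(kml)
    ns = {'k': 'http://www.opengis.net/kml/2.2'}   # the default namespace needs a prefix in XPath
    values = root.xpath('//k:SimpleData/text()', namespaces=ns)
    print(values)   # ['FM2', 'FM3']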