lxml

pip error: unrecognized command line option ‘-fstack-protector-strong’

寵の児 submitted on 2019-11-29 09:21:12
When I run `sudo pip install pyquery`, `sudo pip install lxml`, or `sudo pip install cython`, I get very similar output with the same error:

```
x86_64-linux-gnu-gcc: error: unrecognized command line option ‘-fstack-protector-strong’
```

Here's the complete pip output for `sudo pip install pyquery`:

```
Requirement already satisfied (use --upgrade to upgrade): pyquery in /usr/local/lib/python2.7/dist-packages
Downloading/unpacking lxml>=2.1 (from pyquery)
  Running setup.py egg_info for package lxml
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
```

Extracting XML into data frame with parent attribute as column title

与世无争的帅哥 submitted on 2019-11-29 08:09:19
I have thousands of XML files that I will be processing; they share a similar format, but have different parent names and different numbers of parents. Through books, Google, tutorials, and just trying out code, I've been able to pull out all of this data. See, for example: "Parsing xml to pandas data frame throws memory error" and "Dynamic search through xml attributes using lxml and xpath in python". However, I realized that I was extracting the data poorly, with a child "Time" repeated for each parent. Here is what I am trying to get:

```
Time  blah  abc
1200   100    2
1300    30    4
1400    70    2
```

Here is what I
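A minimal sketch of one way to pivot parent names into columns. The XML layout below is an assumption, since the original files are not shown: each parent element holds samples carrying a `Time` attribute, and the parent's tag becomes the column name.

```python
from lxml import etree

# Hypothetical input mirroring the desired table above.
xml = b"""<root>
  <blah><sample Time="1200" value="100"/><sample Time="1300" value="30"/></blah>
  <abc><sample Time="1200" value="2"/><sample Time="1300" value="4"/></abc>
</root>"""

root = etree.fromstring(xml)
rows = {}  # Time -> {parent tag: value}
for sample in root.xpath("//sample"):
    parent = sample.getparent().tag          # e.g. "blah" or "abc"
    row = rows.setdefault(sample.get("Time"), {})
    row[parent] = int(sample.get("value"))

print(rows)  # {'1200': {'blah': 100, 'abc': 2}, '1300': {'blah': 30, 'abc': 4}}
```

The resulting dict of rows can be handed straight to pandas, e.g. `pd.DataFrame.from_dict(rows, orient="index")`, giving one row per `Time` and one column per parent.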

Can CDATA sections be preserved by BeautifulSoup?

ぐ巨炮叔叔 submitted on 2019-11-29 07:49:32
I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example. The culprit XML file:

```xml
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[ !@#$%^&*()_+{}|:"<>?,./;'[]\-= ]]></bar>
</foo>
```

And here's the Python script:

```python
from bs4 import BeautifulSoup

xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup(xmlfile, "xml")
print(soup)
```

Here's the output. Note the CDATA section tags are missing:

```xml
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar> !@#$%^&*()_+{}|:"<>?,./;'[]\-= </bar>
</foo>
```

I also tried printing soup
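One workaround, sketched with lxml rather than BeautifulSoup: lxml's `XMLParser` accepts `strip_cdata=False`, which keeps CDATA sections intact through a parse/serialize round trip.

```python
from lxml import etree

xml = b'<foo><bar><![CDATA[ some <raw> text ]]></bar></foo>'

# By default lxml replaces CDATA sections with plain text nodes;
# strip_cdata=False tells the parser to keep them.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, parser)

out = etree.tostring(root)
print(out)  # b'<foo><bar><![CDATA[ some <raw> text ]]></bar></foo>'
```

Whether this helps depends on the rest of the pipeline: it only applies if the read/modify/write steps can be moved from BeautifulSoup to lxml's API.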

Scrapy crawl with next page

早过忘川 submitted on 2019-11-29 07:22:26
I have this code for the scrapy framework:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html

class Scrapy1Spider(scrapy.Spider):
    name = "scrapy1"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ('http://sfbay.craigslist.org/search/npo',)
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse", follow=True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//div[@class="content"]/p[@class="row"
```
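Independent of Scrapy, the pagination link itself can be extracted with lxml. A minimal sketch against a made-up HTML fragment (the real Craigslist markup may differ); the class name mirrors the XPath used in the spider above:

```python
from lxml import html

# Hypothetical page fragment containing the "next" button.
page = html.fromstring(
    '<div><a class="button next" href="/search/npo?s=100">next</a></div>'
)
next_urls = page.xpath('//a[@class="button next"]/@href')
print(next_urls)  # ['/search/npo?s=100']
```

Note also that `rules` are only honored by `CrawlSpider`, not by the plain `scrapy.Spider` base class used here, and a `CrawlSpider` must not name its callback `parse`, since `CrawlSpider` uses that method internally.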

How to substitute a variable into an XPath expression in Python

狂风中的少年 submitted on 2019-11-29 07:12:42
```python
from lxml import html
import requests

pagina = 'http://www.beleggen.nl/amx'
page = requests.get(pagina)
tree = html.fromstring(page.text)
aandeel = tree.xpath('//a[@title="Imtech"]/text()')
print aandeel
```

This part works, but I want to read multiple lines with different titles. Is it possible to change the "Imtech" part to a variable? Something like this? It obviously doesn't work, but where did I go wrong? Or is it not quite this easy?

```python
FondsName = "Imtech"
aandeel = tree.xpath('//a[@title="%s"]/text()')%(FondsName)
print aandeel
```

You were almost right:

```python
variabelen = [var1,var2,var3]
for var in
```
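The bug in the attempt above is that the `%` formatting is applied to the *result* of `xpath()` instead of to the expression string. Besides string formatting, lxml also supports proper XPath variables via keyword arguments, which avoids quoting issues entirely. A sketch against an inline snippet (the live site's markup is assumed):

```python
from lxml import html

tree = html.fromstring(
    '<p><a title="Imtech">3.21</a> <a title="Heineken">57.80</a></p>'
)

koersen = {}
for fonds in ["Imtech", "Heineken"]:
    # $name is an XPath variable; lxml substitutes the value safely.
    koersen[fonds] = tree.xpath('//a[@title=$name]/text()', name=fonds)

print(koersen)  # {'Imtech': ['3.21'], 'Heineken': ['57.80']}
```

The string-formatting version would also work as `tree.xpath('//a[@title="%s"]/text()' % fonds)`, but the variable form is safer when titles may contain quotes.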

Should I use .text or .content when parsing a Requests response?

感情迁移 submitted on 2019-11-29 06:59:30
I occasionally use `res.content` or `res.text` to parse a response from Requests. In the use cases I have had, it didn't seem to matter which option I used. What is the main difference when parsing HTML with `.content` or `.text`? For example:

```python
import requests
from lxml import html

res = requests.get(...)
node = html.fromstring(res.content)
```

In the above situation, should I be using `res.content` or `res.text`? What is a good rule of thumb for when to use each? From the documentation: When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The
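The short version: `.content` is the raw response bytes, while `.text` is those bytes decoded with the encoding Requests guessed from the headers. A sketch of the distinction without a live request (the payload and declared encoding are made up):

```python
raw = "prijs: caf\u00e9".encode("iso-8859-1")  # what res.content would hold (bytes)
declared = "iso-8859-1"                         # what the server declared

decoded = raw.decode(declared)                  # what res.text would hold (str)
print(type(raw), type(decoded))                 # <class 'bytes'> <class 'str'>
print(decoded)                                  # prijs: café
```

Passing the bytes (`.content`) to `html.fromstring` lets lxml apply the encoding declared inside the document itself, which is why bytes are often the safer choice when the HTTP headers are wrong or missing.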

How to not load the comments while parsing XML in lxml

穿精又带淫゛_ submitted on 2019-11-29 06:35:02
I try to parse an XML file in Python using lxml like this:

```python
objectify.parse(xmlPath, parserWithSchema)
```

but the XML file may contain comments in strange places:

```xml
<root>
<text>Sam<!--comment-->ple text</text>
<!--comment-->
<float>1.2<!--comment-->3456</float>
</root>
```

Is there a way to not load the comments, or to delete them before parsing?

Set `remove_comments=True` on the parser (documentation):

```python
from lxml import etree, objectify

parser = etree.XMLParser(remove_comments=True)
tree = objectify.parse(xmlPath, parser=parser)
```

Or, using the `makeparser()` method:

```python
parser = objectify.makeparser(remove_comments=True)
tree = objectify.parse(xmlPath, parser=parser)
```
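A self-contained check of the behavior, using an inline XML string instead of a file:

```python
from lxml import etree

xml = b"<root><float>1.2<!--comment-->3456</float></root>"

# remove_comments=True drops comment nodes at parse time, so the tree
# never contains them.
parser = etree.XMLParser(remove_comments=True)
root = etree.fromstring(xml, parser)
out = etree.tostring(root)

print(out)  # the serialized tree no longer contains the comment
```

Removing comments at parse time is the cleaner option here: deleting comment nodes after parsing would leave their tail text dangling, which matters for comments embedded inside text like `1.2<!--comment-->3456`.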

python lxml - modify attributes

爷,独闯天下 submitted on 2019-11-29 05:33:40
```python
from lxml import objectify, etree

root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1" ?>
<scenario>
  <init>
    <send channel="channel-Gy">
      <command name="CER">
        <avp name="Origin-Host" value="router1dev"></avp>
        <avp name="Origin-Realm" value="realm.dev"></avp>
        <avp name="Host-IP-Address" value="0x00010a248921"></avp>
        <avp name="Vendor-Id" value="11"></avp>
        <avp name="Product-Name" value="HP Ro Interface"></avp>
        <avp name="Origin-State-Id" value="1094807040"></avp>
        <avp name="Supported-Vendor-Id" value="10415"></avp>
        <avp name="Auth-Application-Id" value="4"></avp>
        <avp name="Acct
```
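Attributes on elements like these can be changed with `Element.set()`. A sketch against a trimmed-down version of the document above (the replacement value is made up):

```python
from lxml import etree

root = etree.fromstring(
    '<scenario><init><send channel="channel-Gy">'
    '<command name="CER">'
    '<avp name="Origin-Host" value="router1dev"/>'
    '<avp name="Vendor-Id" value="11"/>'
    '</command></send></init></scenario>'
)

# Locate each avp by its name attribute and overwrite its value.
for avp in root.xpath('//avp[@name="Origin-Host"]'):
    avp.set("value", "router2dev")

print(root.xpath('//avp[@name="Origin-Host"]/@value'))  # ['router2dev']
```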

How to install lxml for python without administrative rights on linux?

笑着哭i submitted on 2019-11-29 04:51:49
I just need some packages which aren't present on the host machine (and I and Linux... we... we didn't spend much time together...). I used to install them like this:

```
# from the source
python setup.py install --user
```

or

```
# with easy_install
easy_install prefix=~/.local package
```

But it doesn't work with lxml. I get a lot of errors during the build:

```
x:~/lxml-2.3$ python setup.py build
Building lxml version 2.3.
Building without Cython.
ERROR: /bin/sh: xslt-config: command not found
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running
```

How can one replace an element with text in lxml?

老子叫甜甜 submitted on 2019-11-29 03:53:32
It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

```python
input = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
```

... you could easily remove every `<r>` element with:

```python
from lxml import etree

f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
```
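One way to do the replacement (a sketch; the helper name is made up) is to fold the replacement text into the parent's text or the previous sibling's tail before removing the element, so the surrounding text and the element's own tail survive:

```python
from lxml import etree

def replace_with_text(el, text):
    """Replace element el with text, preserving surrounding text and tail."""
    parent = el.getparent()
    text = text + (el.tail or "")
    prev = el.getprevious()
    if prev is not None:
        prev.tail = (prev.tail or "") + text
    else:
        parent.text = (parent.text or "") + text
    parent.remove(el)

root = etree.fromstring("<m>Text before <r/> and after</m>")
for r in root.xpath("//r"):
    replace_with_text(r, "REPLACED")

print(etree.tostring(root))  # b'<m>Text before REPLACED and after</m>'
```

The branch on `getprevious()` is what makes this consistent across all five `<m>` cases in the input: the text lands in `parent.text` only when the element is the first child, and in the previous sibling's `tail` otherwise.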