lxml

pip error: unrecognized command line option ‘-fstack-protector-strong’

寵の児 submitted on 2019-11-29 09:21:12
When I run `sudo pip install pyquery`, `sudo pip install lxml`, or `sudo pip install cython`, I get very similar output with the same error:

```
x86_64-linux-gnu-gcc: error: unrecognized command line option ‘-fstack-protector-strong’
```

Here's the complete pip output for `sudo pip install pyquery`:

```
Requirement already satisfied (use --upgrade to upgrade): pyquery in /usr/local/lib/python2.7/dist-packages
Downloading/unpacking lxml>=2.1 (from pyquery)
  Running setup.py egg_info for package lxml
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
```

Extracting XML into data frame with parent attribute as column title

与世无争的帅哥 submitted on 2019-11-29 08:09:19
I have thousands of XML files that I will be processing; they share a similar format, but have different parent names and different numbers of parents. Through books, Google, tutorials, and just trying out code, I've been able to pull out all of this data. See, for example: "Parsing xml to pandas data frame throws memory error" and "Dynamic search through xml attributes using lxml and xpath in python". However, I realized that I was extracting the data poorly, with a child "Time" repeated for each parent. Here is what I am trying to get:

```
Time  blah  abc
1200   100    2
1300    30    4
1400    70    2
```

Here is what I
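A minimal sketch of one way to pivot parent names into columns. The XML layout below is an assumption, since the original files are not shown: each parent element holds samples carrying a `Time` attribute, and the parent's tag becomes the column name.

```python
from lxml import etree

# Hypothetical input mirroring the desired table above.
xml = b"""<root>
  <blah><sample Time="1200" value="100"/><sample Time="1300" value="30"/></blah>
  <abc><sample Time="1200" value="2"/><sample Time="1300" value="4"/></abc>
</root>"""

root = etree.fromstring(xml)
rows = {}  # Time -> {parent tag: value}
for sample in root.xpath("//sample"):
    parent = sample.getparent().tag          # e.g. "blah" or "abc"
    row = rows.setdefault(sample.get("Time"), {})
    row[parent] = int(sample.get("value"))

print(rows)  # {'1200': {'blah': 100, 'abc': 2}, '1300': {'blah': 30, 'abc': 4}}
```

The resulting dict of rows can be handed straight to pandas, e.g. `pd.DataFrame.from_dict(rows, orient="index")`, giving one row per `Time` and one column per parent.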

Can CDATA sections be preserved by BeautifulSoup?

ぐ巨炮叔叔 submitted on 2019-11-29 07:49:32
I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example. The culprit XML file:

```xml
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[ !@#$%^&*()_+{}|:"<>?,./;'[]\-= ]]></bar>
</foo>
```

And here's the Python script:

```python
from bs4 import BeautifulSoup

xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup(xmlfile, "xml")
print(soup)
```

Here's the output. Note the CDATA section tags are missing:

```xml
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar> !@#$%^&*()_+{}|:"<>?,./;'[]\-= </bar>
</foo>
```

I also tried printing soup
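One workaround, sketched with lxml rather than BeautifulSoup: lxml's `XMLParser` accepts `strip_cdata=False`, which keeps CDATA sections intact through a parse/serialize round trip.

```python
from lxml import etree

xml = b'<foo><bar><![CDATA[ some <raw> text ]]></bar></foo>'

# By default lxml replaces CDATA sections with plain text nodes;
# strip_cdata=False tells the parser to keep them.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, parser)

out = etree.tostring(root)
print(out)  # b'<foo><bar><![CDATA[ some <raw> text ]]></bar></foo>'
```

Whether this helps depends on the rest of the pipeline: it only applies if the read/modify/write steps can be moved from BeautifulSoup to lxml's API.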

Scrapy crawl with next page

早过忘川 submitted on 2019-11-29 07:22:26
I have this code for the scrapy framework:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html

class Scrapy1Spider(scrapy.Spider):
    name = "scrapy1"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ('http://sfbay.craigslist.org/search/npo',)
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse", follow=True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//div[@class="content"]/p[@class="row"
```
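Independent of Scrapy, the pagination link itself can be extracted with lxml. A minimal sketch against a made-up HTML fragment (the real Craigslist markup may differ); the class name mirrors the XPath used in the spider above:

```python
from lxml import html

# Hypothetical page fragment containing the "next" button.
page = html.fromstring(
    '<div><a class="button next" href="/search/npo?s=100">next</a></div>'
)
next_urls = page.xpath('//a[@class="button next"]/@href')
print(next_urls)  # ['/search/npo?s=100']
```

Note also that `rules` are only honored by `CrawlSpider`, not by the plain `scrapy.Spider` base class used here, and a `CrawlSpider` must not name its callback `parse`, since `CrawlSpider` uses that method internally.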

How to substitute a variable into an XPath expression in Python

狂风中的少年 submitted on 2019-11-29 07:12:42
```python
from lxml import html
import requests

pagina = 'http://www.beleggen.nl/amx'
page = requests.get(pagina)
tree = html.fromstring(page.text)
aandeel = tree.xpath('//a[@title="Imtech"]/text()')
print aandeel
```

This part works, but I want to read multiple lines with different titles. Is it possible to change the "Imtech" part to a variable? Something like this? It obviously doesn't work, but where did I go wrong? Or is it not quite this easy?

```python
FondsName = "Imtech"
aandeel = tree.xpath('//a[@title="%s"]/text()')%(FondsName)
print aandeel
```

You were almost right:

```python
variabelen = [var1,var2,var3]
for var in
```
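The bug in the attempt above is that the `%` formatting is applied to the *result* of `xpath()` instead of to the expression string. Besides string formatting, lxml also supports proper XPath variables via keyword arguments, which avoids quoting issues entirely. A sketch against an inline snippet (the live site's markup is assumed):

```python
from lxml import html

tree = html.fromstring(
    '<p><a title="Imtech">3.21</a> <a title="Heineken">57.80</a></p>'
)

koersen = {}
for fonds in ["Imtech", "Heineken"]:
    # $name is an XPath variable; lxml substitutes the value safely.
    koersen[fonds] = tree.xpath('//a[@title=$name]/text()', name=fonds)

print(koersen)  # {'Imtech': ['3.21'], 'Heineken': ['57.80']}
```

The string-formatting version would also work as `tree.xpath('//a[@title="%s"]/text()' % fonds)`, but the variable form is safer when titles may contain quotes.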

Should I use .text or .content when parsing a Requests response?

感情迁移 submitted on 2019-11-29 06:59:30
I occasionally use `res.content` or `res.text` to parse a response from Requests. In the use cases I have had, it didn't seem to matter which option I used. What is the main difference when parsing HTML with `.content` or `.text`? For example:

```python
import requests
from lxml import html

res = requests.get(...)
node = html.fromstring(res.content)
```

In the above situation, should I be using `res.content` or `res.text`? What is a good rule of thumb for when to use each? From the documentation: When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The
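The short version: `.content` is the raw response bytes, while `.text` is those bytes decoded with the encoding Requests guessed from the headers. A sketch of the distinction without a live request (the payload and declared encoding are made up):

```python
raw = "prijs: caf\u00e9".encode("iso-8859-1")  # what res.content would hold (bytes)
declared = "iso-8859-1"                         # what the server declared

decoded = raw.decode(declared)                  # what res.text would hold (str)
print(type(raw), type(decoded))                 # <class 'bytes'> <class 'str'>
print(decoded)                                  # prijs: café
```

Passing the bytes (`.content`) to `html.fromstring` lets lxml apply the encoding declared inside the document itself, which is why bytes are often the safer choice when the HTTP headers are wrong or missing.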

How to not load the comments while parsing XML in lxml

穿精又带淫゛_ submitted on 2019-11-29 06:35:02
I try to parse an XML file in Python using lxml like this:

```python
objectify.parse(xmlPath, parserWithSchema)
```

but the XML file may contain comments in strange places:

```xml
<root>
<text>Sam<!--comment-->ple text</text>
<!--comment-->
<float>1.2<!--comment-->3456</float>
</root>
```

Is there a way to not load the comments, or to delete them before parsing?

Set `remove_comments=True` on the parser (documentation):

```python
from lxml import etree, objectify

parser = etree.XMLParser(remove_comments=True)
tree = objectify.parse(xmlPath, parser=parser)
```

Or, using the `makeparser()` method:

```python
parser = objectify.makeparser(remove_comments=True)
tree = objectify.parse(xmlPath, parser=parser)
```
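A self-contained check of the behavior, using an inline XML string instead of a file:

```python
from lxml import etree

xml = b"<root><float>1.2<!--comment-->3456</float></root>"

# remove_comments=True drops comment nodes at parse time, so the tree
# never contains them.
parser = etree.XMLParser(remove_comments=True)
root = etree.fromstring(xml, parser)
out = etree.tostring(root)

print(out)  # the serialized tree no longer contains the comment
```

Removing comments at parse time is the cleaner option here: deleting comment nodes after parsing would leave their tail text dangling, which matters for comments embedded inside text like `1.2<!--comment-->3456`.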

python lxml - modify attributes

爷,独闯天下 submitted on 2019-11-29 05:33:40
```python
from lxml import objectify, etree

root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1" ?>
<scenario>
  <init>
    <send channel="channel-Gy">
      <command name="CER">
        <avp name="Origin-Host" value="router1dev"></avp>
        <avp name="Origin-Realm" value="realm.dev"></avp>
        <avp name="Host-IP-Address" value="0x00010a248921"></avp>
        <avp name="Vendor-Id" value="11"></avp>
        <avp name="Product-Name" value="HP Ro Interface"></avp>
        <avp name="Origin-State-Id" value="1094807040"></avp>
        <avp name="Supported-Vendor-Id" value="10415"></avp>
        <avp name="Auth-Application-Id" value="4"></avp>
        <avp name="Acct
```
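Attributes on elements like these can be changed with `Element.set()`. A sketch against a trimmed-down version of the document above (the replacement value is made up):

```python
from lxml import etree

root = etree.fromstring(
    '<scenario><init><send channel="channel-Gy">'
    '<command name="CER">'
    '<avp name="Origin-Host" value="router1dev"/>'
    '<avp name="Vendor-Id" value="11"/>'
    '</command></send></init></scenario>'
)

# Locate each avp by its name attribute and overwrite its value.
for avp in root.xpath('//avp[@name="Origin-Host"]'):
    avp.set("value", "router2dev")

print(root.xpath('//avp[@name="Origin-Host"]/@value'))  # ['router2dev']
```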

How to install lxml for python without administrative rights on linux?

笑着哭i submitted on 2019-11-29 04:51:49
I just need some packages which aren't present on the host machine (and I and Linux... we... we didn't spend much time together...). I used to install them like this:

```
# from the source
python setup.py install --user
```

or

```
# with easy_install
easy_install prefix=~/.local package
```

But it doesn't work with lxml. I get a lot of errors during the build:

```
x:~/lxml-2.3$ python setup.py build
Building lxml version 2.3.
Building without Cython.
ERROR: /bin/sh: xslt-config: command not found
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running
```

How can one replace an element with text in lxml?

老子叫甜甜 submitted on 2019-11-29 03:53:32
It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

```python
input = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
```

... you could easily remove every `<r>` element with:

```python
from lxml import etree

f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
```
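One way to do the replacement (a sketch; the helper name is made up) is to fold the replacement text into the parent's text or the previous sibling's tail before removing the element, so the surrounding text and the element's own tail survive:

```python
from lxml import etree

def replace_with_text(el, text):
    """Replace element el with text, preserving surrounding text and tail."""
    parent = el.getparent()
    text = text + (el.tail or "")
    prev = el.getprevious()
    if prev is not None:
        prev.tail = (prev.tail or "") + text
    else:
        parent.text = (parent.text or "") + text
    parent.remove(el)

root = etree.fromstring("<m>Text before <r/> and after</m>")
for r in root.xpath("//r"):
    replace_with_text(r, "REPLACED")

print(etree.tostring(root))  # b'<m>Text before REPLACED and after</m>'
```

The branch on `getprevious()` is what makes this consistent across all five `<m>` cases in the input: the text lands in `parent.text` only when the element is the first child, and in the previous sibling's `tail` otherwise.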