lxml

Stripping python namespace attributes from an lxml.objectify.ObjectifiedElement [duplicate]

蓝咒 submitted on 2019-12-04 22:29:37
Question: This question already has answers here: Closed 7 years ago. Possible duplicate of: When using lxml, can the XML be rendered without namespace attributes? How can I strip the Python attributes from an lxml.objectify.ObjectifiedElement? Example: In [1]: from lxml import etree, objectify In [2]: foo = objectify.Element("foo") In [3]: foo.bar = "hi" In [4]: foo.baz = 1 In [5]: foo.fritz = None In [6]: print etree.tostring(foo, pretty_print=True) <foo xmlns:py="http://codespeak.net/lxml/objectify
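The excerpt above is cut off mid-output. As a minimal sketch only (adapted to Python 3 print), the usual way to strip these annotations is lxml's own objectify.deannotate followed by etree.cleanup_namespaces; the element names mirror the session in the question:

```python
from lxml import etree, objectify

foo = objectify.Element("foo")
foo.bar = "hi"
foo.baz = 1
foo.fritz = None

# Remove the pytype / xsi annotations that objectify added,
# including the xsi:nil marker created by assigning None.
objectify.deannotate(foo, xsi_nil=True)
# Drop the namespace declarations that are no longer referenced.
etree.cleanup_namespaces(foo)

print(etree.tostring(foo, pretty_print=True).decode())
```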

Python BeautifulSoup equivalent to lxml make_links_absolute

喜欢而已 submitted on 2019-12-04 21:55:59
Question: So lxml has a very handy feature, make_links_absolute: doc = lxml.html.fromstring(some_html_page) doc.make_links_absolute(url_for_some_html_page) and all the links in doc are now absolute. Is there an easy equivalent in BeautifulSoup, or do I simply need to pass each link through urlparse and normalize it myself: soup = BeautifulSoup(some_html_page) for tag in soup.findAll('a', href=True): url_data = urlparse(tag['href']) if url_data[0] == "": full_url = url_for_some_html_page + test_url Answer 1: In my answer to
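The quoted answer is truncated. A common pattern (not necessarily what the linked answer goes on to say) is to rewrite each href with urllib.parse.urljoin, which resolves relative paths correctly and leaves absolute URLs alone; the sample markup and variable names below are placeholders echoing the question:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

some_html_page = '<a href="/about">About</a> <a href="http://other.example/x">X</a>'
url_for_some_html_page = 'http://example.com/docs/index.html'

soup = BeautifulSoup(some_html_page, 'html.parser')
for tag in soup.find_all('a', href=True):
    # urljoin leaves already-absolute URLs untouched and resolves
    # relative ones against the page URL.
    tag['href'] = urljoin(url_for_some_html_page, tag['href'])

print(soup)
```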

How to split the tags from html tree

北城以北 submitted on 2019-12-04 21:48:47
This is my HTML tree: <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1"> Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a> </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! <br /> <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> - <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a> <br /> <cite>www.citibank.co.in/<b>CreditCards</b></cite> </li> From this HTML I need to extract the lines that appear before each <br> tag. line1: Get the IndianOil Citibank Card. Apply Now! line2: Get 10X Rewards On Shopping - Save Over 5%
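The question is cut off above. As one possible approach only, the <li> can be walked child by child, flushing the accumulated text each time a <br> is reached; the string below is a trimmed stand-in for the markup in the question:

```python
import lxml.html as lh

# A trimmed stand-in for the <li> markup in the question.
snippet = ('<li class="taf"><h3><a href="#">Citibank <b>Credit Card</b></a></h3>'
           'Get the IndianOil Citibank <b>Card</b>. Apply Now! <br />'
           '<a href="#">Get 10X Rewards On Shopping</a> - '
           '<a href="#">Save Over 5% On Fuel</a> <br />'
           '<cite>www.citibank.co.in/<b>CreditCards</b></cite></li>')

li = lh.fromstring(snippet)
lines, current = [], []

def flush():
    text = ''.join(current).strip()
    if text:
        lines.append(text)
    current.clear()

for child in li:
    if child.tag == 'br':
        flush()                               # a <br> ends the current line
    elif child.tag != 'h3':
        current.append(child.text_content())  # keep inline element text
    current.append(child.tail or '')          # plus the text following it
flush()

print(lines)
# ['Get the IndianOil Citibank Card. Apply Now!',
#  'Get 10X Rewards On Shopping - Save Over 5% On Fuel',
#  'www.citibank.co.in/CreditCards']
```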

How to get css attribute of a lxml element?

喜你入骨 submitted on 2019-12-04 19:45:20
I want a fast function that returns all the style properties of an lxml element, taking into account the CSS stylesheet, the element's style attribute, and inheritance. For example: html: <body> <p>A</p> <p id='b'>B</p> <p style='color:blue'>B</p> </body> css: body {color:red;font-size:12px} p.b {color:pink;} python: elements = document.xpath('//p') print get_style(element[0]) >{color:red,font-size:12px} print get_style(element[1]) >{color:pink,font-size:12px} print get_style(element[2]) >{color:blue,font-size:12px} Thanks. You can do this with a combination of lxml and cssutils.
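The answer is cut off above. Below is a deliberately rough sketch of the lxml + cssutils idea (it requires the cssutils and cssselect packages, ignores selector specificity, treats every matched property as inheritable, and uses p#b instead of the question's p.b so the selector actually matches the id in the sample markup):

```python
import cssutils
import lxml.html
from lxml.cssselect import CSSSelector

HTML = "<body><p>A</p><p id='b'>B</p><p style='color:blue'>C</p></body>"
CSS = "body {color:red;font-size:12px} p#b {color:pink;}"

document = lxml.html.fromstring(HTML)
sheet = cssutils.parseString(CSS)

# Record, per element, the declarations from every stylesheet rule that
# matches it (no specificity handling in this sketch).
matched = {}
for rule in sheet:
    if rule.type != rule.STYLE_RULE:
        continue
    for element in CSSSelector(rule.selectorText)(document):
        matched.setdefault(element, {}).update(
            {prop.name: prop.value for prop in rule.style})

def get_style(element):
    # Naive cascade: ancestors first (a stand-in for inheritance),
    # then the element's own matched rules, then its inline style.
    computed = {}
    for node in list(element.iterancestors())[::-1] + [element]:
        computed.update(matched.get(node, {}))
    if element.get('style'):
        computed.update({p.name: p.value
                         for p in cssutils.parseStyle(element.get('style'))})
    return computed

for p in document.xpath('//p'):
    print(get_style(p))
```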

Write xml with a path and value

笑着哭i submitted on 2019-12-04 19:30:15
I have a list of paths and values, something like this: [ {'Path': 'Item/Info/Name', 'Value': 'Body HD'}, {'Path': 'Item/Genres/Genre', 'Value': 'Action'}, ] And I want to build out the full XML structure, which would be: <Item> <Info> <Name>Body HD</Name> </Info> <Genres> <Genre>Action</Genre> </Genres> </Item> Is there a way to do this with lxml? Or how could I build a function to fill in the inferred paths? Padraic Cunningham: You could do something like: l = [ {'Path': 'Item/Info/Name', 'Value': 'Body HD'}, {'Path': 'Item/Genres/Genre', 'Value': 'Action'}, ] import lxml.etree as et root
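The answer code is truncated above. A small self-contained sketch of the same general idea (walking each path segment and creating missing children with lxml.etree.SubElement, and assuming every path shares the same root element) might look like this:

```python
import lxml.etree as et

data = [
    {'Path': 'Item/Info/Name', 'Value': 'Body HD'},
    {'Path': 'Item/Genres/Genre', 'Value': 'Action'},
]

root = None
for entry in data:
    parts = entry['Path'].split('/')
    if root is None:
        root = et.Element(parts[0])
    node = root
    # Walk the remaining segments, reusing existing children where possible.
    for part in parts[1:]:
        child = node.find(part)
        if child is None:
            child = et.SubElement(node, part)
        node = child
    node.text = entry['Value']

print(et.tostring(root, pretty_print=True).decode())
```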

How to parse HTML table against a list of variables using lxml?

风格不统一 submitted on 2019-12-04 19:19:05
I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches the results, I want to extract a column's contents only when the cell starts with one of the labels in my config file. For instance, if a <td> starts with 'Street 1', I then want to grab the <span> contents of that <td> tag. This way, I can have a tuple of tuples (which takes care of the None values) which I can then store in the database. lxml_parse.py import lxml.html as lh doc=open('test.htm', 'r') outhtml=lh.parse(doc) doc.close() rows = outhtml.xpath('//tr/td/span[@class=
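As an illustration only (the question's test.htm is not shown, so the markup, labels, and variable names below are hypothetical), one way to keep a <span> value only when its parent <td> text starts with a configured label:

```python
import lxml.html as lh

# Hypothetical markup standing in for test.htm; the real file is not shown.
page = '''<table>
  <tr><td>Street 1: <span class="boldred">12 Main St</span></td></tr>
  <tr><td>City: <span class="boldred">Springfield</span></td></tr>
  <tr><td>Phone: <span class="boldred">555-0100</span></td></tr>
</table>'''

labels = ('Street 1', 'City', 'Country')   # normally read from the config file
doc = lh.fromstring(page)

found = {}
for td in doc.xpath('//tr/td[span[@class="boldred"]]'):
    leading = (td.text or '').strip()
    for label in labels:
        if leading.startswith(label):
            found[label] = td.findtext('span')

# Labels with no matching cell simply come back as None.
row = tuple(found.get(label) for label in labels)
print(row)   # ('12 Main St', 'Springfield', None)
```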

Multiple XML Namespaces in tag with LXML

帅比萌擦擦* submitted on 2019-12-04 16:41:05
Question: I am trying to use Python's lxml library to create a GPX file that can be read by Garmin's MapSource product. The header on their GPX files looks like this: <?xml version="1.0" encoding="UTF-8" standalone="no" ?> <gpx xmlns="http://www.topografix.com/GPX/1/1" creator="MapSource 6.15.5" version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"> When I use the following code: xmlns = "http:/
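The question's code is cut off at the xmlns string. A minimal sketch of producing a header along these lines with an nsmap and a namespaced xsi:schemaLocation attribute (creator and version values copied from the header above; the standalone="no" part of the declaration is left out of this sketch) could be:

```python
from lxml import etree

GPX_NS = "http://www.topografix.com/GPX/1/1"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_LOCATION = ("http://www.topografix.com/GPX/1/1 "
                   "http://www.topografix.com/GPX/1/1/gpx.xsd")

# Default namespace for GPX plus an explicit xsi prefix, and the
# schemaLocation attribute created in the xsi namespace.
gpx = etree.Element(
    "{%s}gpx" % GPX_NS,
    nsmap={None: GPX_NS, "xsi": XSI_NS},
    attrib={"{%s}schemaLocation" % XSI_NS: SCHEMA_LOCATION},
)
gpx.set("creator", "MapSource 6.15.5")
gpx.set("version", "1.1")

print(etree.tostring(gpx, pretty_print=True,
                     xml_declaration=True, encoding="UTF-8").decode())
```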

Set lxml as default BeautifulSoup parser

天涯浪子 submitted on 2019-12-04 16:00:06
Question: I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this: soup = bs4.BeautifulSoup(html, 'lxml') but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program? Answer 1: According to the Specifying the parser to use documentation page: The first argument to the
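The quoted answer is truncated. One common workaround (not necessarily what that answer goes on to recommend) is to wrap the constructor once, for example with functools.partial, so that 'lxml' is supplied automatically:

```python
import functools
import bs4

# A small wrapper so every call uses the lxml parser by default.
Soup = functools.partial(bs4.BeautifulSoup, features="lxml")

soup = Soup("<p>hello</p>")
print(soup.p.text)  # hello
```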

Multithreading for faster downloading

青春壹個敷衍的年華 submitted on 2019-12-04 15:57:45
Question: How can I download multiple links simultaneously? My script below works but only downloads one file at a time, and it is extremely slow. I can't figure out how to incorporate multithreading into my script. The Python script: from BeautifulSoup import BeautifulSoup import lxml.html as html import urlparse import os, sys import urllib2 import re print ("downloading and parsing Bibles...") root = html.parse(open('links.html')) for link in root.findall('//a'): url = link.get('href') name = urlparse
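The script above is Python 2 and is cut off. As a sketch only, adapted to Python 3, the downloads could be fanned out over a thread pool with concurrent.futures; the links.html file and the link extraction mirror the original, while the filename logic is a simplifying assumption:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit
from urllib.request import urlretrieve

import lxml.html as html

def download(url):
    # Name the local file after the last path segment of the URL.
    name = os.path.basename(urlsplit(url).path) or "index.html"
    urlretrieve(url, name)
    return name

root = html.parse("links.html")
urls = [a.get("href") for a in root.xpath("//a") if a.get("href")]

# Fan the downloads out over a small pool of worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    for name in pool.map(download, urls):
        print("downloaded", name)
```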

Python error bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml.

泄露秘密 submitted on 2019-12-04 15:44:56
Running under QPython, the original code soup = BeautifulSoup(r.text,'lxml') raised the error bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? Changing it to soup = BeautifulSoup(r.text,'html.parser') fixed it. Source: https://www.cnblogs.com/jiangsonglin/p/11871999.html
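A hedged note on the cause: this exception usually just means the lxml package is not installed in that environment (pip install lxml would also fix it). If lxml may or may not be present, a fallback like the sketch below keeps the script running either way:

```python
import bs4

def make_soup(markup):
    # Prefer lxml when it is available, otherwise fall back to the
    # pure-Python html.parser that ships with the standard library.
    try:
        return bs4.BeautifulSoup(markup, "lxml")
    except bs4.FeatureNotFound:
        return bs4.BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>hello</p>")
print(soup.p.text)
```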