lxml | 易学教程

Retrieving a subset of href's from findall() in BeautifulSoup


Question: My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL of the Genius search query. It then retrieves all the lyrics from the links on the returned web page (the required subset for this problem, in which every link contains the artist's name). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones that I don't want in my

lxml: cssselect(): AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'


Question: Can someone explain why the first call to root.cssselect() works, while the second fails?
    from lxml.html import fromstring
    from lxml import etree
    html='<html><a href="http://example.com">example</a></html>'
    root = fromstring(html)
    print 'via fromstring', repr(root) # via fromstring <Element html at 0x...>
    print root.cssselect("a")
    root2 = etree.HTML(html)
    print 'via etree.HTML()', repr(root2) # via etree.HTML() <Element html at 0x...>
    root2.cssselect("a") # --> Exception
I get: Traceback (most

lxml.etree and xml.etree.ElementTree adding namespaces without prefixes(ns0, ns1, etc.)


Question: Is there a solution for adding namespaces without prefixes (I mean these ns0, ns1) that works on all the etree implementations, or are there working solutions for each one? For now I have solutions for:
- lxml: the nsmap argument of Element
- (c)ElementTree (Python 2.6+): the register_namespace method with an empty string as the prefix
The problem is (c)ElementTree in Python 2.5. I know there is the _namespace_map attribute, but setting it to an empty string creates invalid XML, and setting it to None adds the default ns0
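The register_namespace approach mentioned in the excerpt can be sketched with the standard-library ElementTree in current Python 3. This is a minimal illustration, not the asker's code: the namespace URI and element names are hypothetical, and it does not address the Python 2.5 case the question is about.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for illustration.
NS = "http://example.com/ns"

# Registering the empty prefix makes the serializer declare a default
# namespace (xmlns="...") instead of generating ns0, ns1, ...
# Note: this registry is global to the ElementTree module.
ET.register_namespace("", NS)

root = ET.Element("{%s}root" % NS)
child = ET.SubElement(root, "{%s}child" % NS)
child.text = "value"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

With lxml, the nsmap argument named in the excerpt plays the same role, for example by mapping the None key to the namespace URI when creating the element.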

Using Python and lxml to strip only the tags that have certain attributes/values


Question: I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping only those tags (leaving their contents) that have particular attributes/values. For instance: I'd like to strip all span or div tags (or other elements) from a tree (XHTML) that have a class='myclass' attribute/value, preserving the element's contents like strip_tags would do. Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Get data between two tags in Python


Question:
    <h3> <a href="article.jsp?tp=&arnumber=16"> Granular computing based <span class="snippet">data</span> <span class="snippet">mining</span> in the views of rough set and fuzzy set </a> </h3>
Using Python, I want to get the text of the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set. I tried using lxml:
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath1 = "//h3/a/child::text() | //h3/a/span/child::text(
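One way to collect the full anchor text, including the text inside the nested span elements, is sketched below. It deliberately swaps lxml for the standard-library html.parser so that it runs without extra packages; the class name HeadingLinkText is hypothetical, and the whitespace normalization at the end is an assumption about the desired output.

```python
from html.parser import HTMLParser


class HeadingLinkText(HTMLParser):
    """Collect all character data found inside <a> elements within <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        # Text of nested elements such as <span> arrives here as well.
        if self.in_link:
            self.parts.append(data)


html = (
    '<h3> <a href="article.jsp?tp=&arnumber=16"> Granular computing based '
    '<span class="snippet">data</span> <span class="snippet">mining</span> '
    'in the views of rough set and fuzzy set </a> </h3>'
)

collector = HeadingLinkText()
collector.feed(html)
collector.close()

# Collapse runs of whitespace into single spaces.
text = " ".join("".join(collector.parts).split())
print(text)
```

The idea is the same regardless of parser: gather every text node below the anchor, not just its direct children, then normalize the whitespace.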

remove certain attributes from HTML tags


Question: How can I remove certain attributes such as id, style, class, etc. from HTML code? I thought I could use the lxml.html.clean module, but as it turned out, I can only remove style attributes with Cleaner(style=True).clean_html(code). I'd prefer not to use regular expressions for this task (the attributes could change). What I would like to have:
    from lxml.html.clean import Cleaner
    code = '<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" class="Extended" style="display: none;">'
    cleaner

Python how to strip white-spaces from xml text nodes


Question: I have an XML file as follows:
    <Person> <name> My Name </name> <Address>My Address</Address> </Person>
The <name> tag has extra new lines. Is there a quick Pythonic way to trim these and generate a new XML document? I found this, but it trims only the whitespace between tags, not the values: https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle the following XML, which has trailing spaces in the <name> tag:
    <Person> <name> My Name<shortname>My</short> </name> <Address
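A minimal sketch of one possible approach, using the standard-library ElementTree: walk every element and strip both its text and its tail. The helper name strip_whitespace is hypothetical, and the sample below is a well-formed variant of the XML in the excerpt (the closing tag is written as shortname, which is an assumption).

```python
import xml.etree.ElementTree as ET


def strip_whitespace(root):
    """Strip leading/trailing whitespace from every text and tail, in place."""
    for element in root.iter():
        if element.text is not None:
            element.text = element.text.strip()
        if element.tail is not None:
            element.tail = element.tail.strip()
    return root


# Well-formed variant of the sample from the question (assumption).
source = """<Person>
  <name>
    My Name<shortname>My</shortname>
  </name>
  <Address>My Address</Address>
</Person>"""

root = strip_whitespace(ET.fromstring(source))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Stripping tails is only safe when the whitespace around child elements carries no meaning; in mixed-content documents it can join words that were separated by a space.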

Install lxml on Centos 7 - error: command 'gcc' failed with exit status 4


Question: I'm using Python 3.4 in a virtual environment:
    (af)[root@domain backend]# pip --version
    pip 7.1.0 from /home/af/af-stage/backend/.ves/af/lib/python3.4/site-packages (python 3.4)
Installation of lxml failed with "error: command 'gcc' failed with exit status 4":
    (af)[root@domain backend]# pip install lxml
    You are using pip version 7.1.0, however version 7.1.2 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.
    Collecting lxml
      Using cached lxml-3.5.0.tar.gz

getting attribute of an element with its corresponding Id


Question: Suppose that I have this XML file:
    <article-set xmlns:ns0="http://casfwcewf.xsd" format-version="5">
    <article>
    <article id="11234">
    <source> <hostname>some hostname for 11234</hostname> </source>
    <feed> <type weight=0.32>RSS</type> </feed>
    <uri>some uri for 11234</uri>
    </article>
    <article id="63563">
    <source> <hostname>some hostname for 63563 </hostname> </source>
    <feed> <type weight=0.86>RSS</type> </feed>
    <uri>some uri for 63563</uri>
    </article>
    . . .
    </article></article-set>
What I want,
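The excerpt is cut off before the goal is stated, so the sketch below only assumes one plausible reading of the title: look up an article by its id and read an attribute from one of its descendants. It uses the standard-library ElementTree, a simplified well-formed variant of the sample (the weight values are quoted and the extra article wrapper is dropped), and a hypothetical helper named weight_for.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed variant of the sample (assumption): attribute
# values are quoted, because unquoted values are not valid XML.
source = """<article-set format-version="5">
  <article id="11234">
    <source><hostname>some hostname for 11234</hostname></source>
    <feed><type weight="0.32">RSS</type></feed>
    <uri>some uri for 11234</uri>
  </article>
  <article id="63563">
    <source><hostname>some hostname for 63563</hostname></source>
    <feed><type weight="0.86">RSS</type></feed>
    <uri>some uri for 63563</uri>
  </article>
</article-set>"""


def weight_for(root, article_id):
    """Return the weight attribute of feed/type for the given article id."""
    article = root.find(".//article[@id='{}']".format(article_id))
    if article is None:
        return None
    type_element = article.find("feed/type")
    if type_element is None:
        return None
    return type_element.get("weight")


root = ET.fromstring(source)
print(weight_for(root, "11234"))
print(weight_for(root, "63563"))
```

The attribute predicate in the path selects the element by id, and get() returns the attribute value as a string, or None when the attribute is absent.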

Issue in reading text in XML using python


Question: I am trying to read an XML file with the following content:
    <tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
    <prop type="x-source-tags">1=A,2=B</prop>
    <prop type="x-target-tags">1=A,2=B</prop>
    <tuv xml:lang="EN"> <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg> </tuv>
    <tuv xml:lang="DE"> <seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg> </tuv>
    </tu>
using the following code:
    tree = ET.parse(tmx)
    root = tree.getroot()
    seg = root.findall('.//seg
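The excerpt ends before the actual problem is described. A common stumbling block with content like this is that the text attribute of seg holds only the text before the first child element, while the rest sits in the tail of each ut element. Assuming that is the issue, a minimal sketch with the standard-library ElementTree joins all text pieces with itertext():

```python
import xml.etree.ElementTree as ET

source = """<tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
  <prop type="x-source-tags">1=A,2=B</prop>
  <prop type="x-target-tags">1=A,2=B</prop>
  <tuv xml:lang="EN">
    <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
  </tuv>
  <tuv xml:lang="DE">
    <seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg>
  </tuv>
</tu>"""

root = ET.fromstring(source)

# seg.text alone stops at the first <ut/>; itertext() also yields the
# tail text that follows each child element.
first_text_only = root.find(".//seg").text
segments = ["".join(seg.itertext()) for seg in root.findall(".//seg")]

print(repr(first_text_only))
print(segments)
```

Here the XML is parsed from a string for self-containment; with a file, ET.parse(path).getroot() as in the excerpt would provide the same root element.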