lxml

Retrieving a subset of href's from findall() in BeautifulSoup

依然范特西╮ posted on 2019-12-12 10:15:01
Question: My goal is to write a Python script that takes an artist's name as a string input and appends it to the base URL of the Genius search query. It should then retrieve all the lyrics from the links on the returned web page (the required subset for this problem, in which every link specifically contains the artist name). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones that I don't want in my
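A minimal sketch of the filtering step, assuming the wanted links carry the artist name in hyphenated form somewhere in their URL. It swaps in the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages; the sample HTML, the `example.com` URLs, and the names `LinkCollector` and `artist_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def artist_links(page_html, artist):
    """Return only the links whose URL contains the artist name."""
    # Assumed URL form: words joined by hyphens, e.g. "daft-punk".
    slug = "-".join(artist.lower().split())
    collector = LinkCollector()
    collector.feed(page_html)
    return [href for href in collector.hrefs if slug in href.lower()]


# Hypothetical page content standing in for the search results.
sample_html = """
<a href="https://example.com/Daft-punk-one-more-time-lyrics">One More Time</a>
<a href="https://example.com/about">About</a>
<a href="https://example.com/Daft-punk-around-the-world-lyrics">Around the World</a>
<a>no href here</a>
"""

links = artist_links(sample_html, "Daft Punk")
print(links)
```

The same list comprehension could be applied to the `href` values of the tags that BeautifulSoup's `find_all("a", href=True)` returns.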

lxml: cssselect(): AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'

南楼画角 posted on 2019-12-12 09:58:00
Question: Can someone explain why the first call to root.cssselect() works, while the second fails?
    from lxml.html import fromstring
    from lxml import etree
    html='<html><a href="http://example.com">example</a></html>'
    root = fromstring(html)
    print 'via fromstring', repr(root) # via fromstring <Element html at 0x...>
    print root.cssselect("a")
    root2 = etree.HTML(html)
    print 'via etree.HTML()', repr(root2) # via etree.HTML() <Element html at 0x...>
    root2.cssselect("a") # --> Exception
I get: Traceback (most

lxml.etree and xml.etree.ElementTree adding namespaces without prefixes(ns0, ns1, etc.)

孤街醉人 posted on 2019-12-12 09:53:33
Question: Is there a solution for adding namespaces without prefixes (I mean ns0, ns1, and so on) that works on all the etree implementations, or are there working solutions for each one? For now I have solutions for:
    lxml: the nsmap argument of Element
    (c)ElementTree (Python 2.6+): the register namespace method with an empty string as the prefix
The problem is (c)ElementTree in Python 2.5. I know there is the _namespace_map attribute, but setting it to an empty string creates invalid XML, and setting it to None adds the default ns0
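As a rough illustration of the "register namespace with an empty prefix" approach mentioned above, here is a sketch using the standard-library `xml.etree.ElementTree` on Python 3. The namespace URI and element names are made up for the example, and it does not address the Python 2.5 case.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this example.
NS = "http://example.com/ns"

# Register the URI with an empty prefix so the serializer writes it as the
# default namespace instead of generating ns0, ns1, ...
# Note: this registry is global to the ElementTree module.
ET.register_namespace("", NS)

root = ET.Element("{%s}root" % NS)
child = ET.SubElement(root, "{%s}child" % NS)
child.text = "value"

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

With lxml, the corresponding idea is to pass `nsmap={None: NS}` when creating the root element.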

Using Python and lxml to strip only the tags that have certain attributes/values

我怕爱的太早我们不能终老 posted on 2019-12-12 08:25:55
Question: I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping only those tags (while leaving their contents) that have particular attributes/values. For instance: I'd like to strip all span or div tags (or other elements) from a tree (XHTML) that have a class='myclass' attribute/value, preserving the element's contents as strip_tags would. Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Get data between two tags in Python

好久不见. posted on 2019-12-12 08:17:21
Question:
    <h3> <a href="article.jsp?tp=&arnumber=16"> Granular computing based <span class="snippet">data</span> <span class="snippet">mining</span> in the views of rough set and fuzzy set </a> </h3>
Using Python, I want to get the text from the anchor tag, which should be:
    Granular computing based data mining in the views of rough set and fuzzy set
I tried using lxml:
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath1 = "//h3/a/child::text() | //h3/a/span/child::text(
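One possible way to collect the anchor's full text, sketched with the standard-library `xml.etree.ElementTree` rather than the `HTMLParser` setup from the question. It assumes the fragment is well-formed, so the `&` in the `href` is written as `&amp;` here; `itertext()` walks the element's own text and the text of its descendants in document order.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, with "&" escaped so it parses as XML.
snippet = (
    '<h3> <a href="article.jsp?tp=&amp;arnumber=16"> Granular computing based '
    '<span class="snippet">data</span> <span class="snippet">mining</span> '
    'in the views of rough set and fuzzy set </a> </h3>'
)

h3 = ET.fromstring(snippet)
anchor = h3.find("a")

# Join every text piece under <a>, then collapse runs of whitespace.
title = " ".join("".join(anchor.itertext()).split())
print(title)
```

lxml elements provide `itertext()` as well, so the same join should carry over to a tree parsed with lxml.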

remove certain attributes from HTML tags

天涯浪子 posted on 2019-12-12 08:08:06
Question: How can I remove certain attributes such as id, style, class, etc. from HTML code? I thought I could use the lxml.html.clean module, but as it turned out I can only remove style attributes with Cleaner(style=True).clean_html(code). I'd prefer not to use regular expressions for this task (attributes could change). What I would like to have:
    from lxml.html.clean import Cleaner
    code = '<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" class="Extended" style="display: none;">'
    cleaner
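A sketch of attribute removal without regular expressions, assuming the markup can be parsed as well-formed XML. It uses the standard-library `xml.etree.ElementTree` rather than `lxml.html.clean`; the helper name `strip_attributes` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute names to drop; adjust as needed.
UNWANTED = {"id", "class", "style"}


def strip_attributes(root, names=UNWANTED):
    """Remove the named attributes from root and all of its descendants."""
    for element in root.iter():
        for name in names:
            # pop() with a default avoids a KeyError when the attribute is absent.
            element.attrib.pop(name, None)
    return root


# Hypothetical, well-formed sample modelled on the <tr> from the question.
markup = (
    '<table><tr id="row1" class="Extended" style="display: none;">'
    '<td colspan="2">cell</td></tr></table>'
)

root = strip_attributes(ET.fromstring(markup))
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

Attributes that are not listed, such as `colspan` here, are left in place.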

Python how to strip white-spaces from xml text nodes

邮差的信 posted on 2019-12-12 07:18:58
Question: I have an XML file as follows:
    <Person> <name> My Name </name> <Address>My Address</Address> </Person>
The tag has extra newlines. Is there any quick Pythonic way to trim this and generate a new XML? I found this, but it trims only the whitespace between tags, not the values: https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/
Update 1 - Handle the following XML, which has trailing spaces in the <name> tag:
    <Person> <name> My Name<shortname>My</short> </name> <Address
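A minimal sketch of trimming both the text inside elements and the whitespace between them, using the standard-library `xml.etree.ElementTree`. The input mirrors the first `<Person>` example with explicit newlines added as an assumption; the helper name `strip_text_nodes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_text_nodes(root):
    """Strip leading/trailing whitespace from every text and tail node."""
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()
        if element.tail:
            element.tail = element.tail.strip()
    return root


source = (
    "<Person>\n"
    "  <name>\n    My Name\n  </name>\n"
    "  <Address>My Address</Address>\n"
    "</Person>"
)

root = strip_text_nodes(ET.fromstring(source))
result = ET.tostring(root, encoding="unicode")
print(result)
```

One caveat: in mixed content, stripping `tail` also removes the spaces that separate inline text from a neighbouring child element, so words on either side of a child tag may end up joined.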

Install lxml on Centos 7 - error: command 'gcc' failed with exit status 4

纵饮孤独 posted on 2019-12-12 07:13:35
Question: I'm using Python 3.4 in a virtual environment:
    (af)[root@domain backend]# pip --version
    pip 7.1.0 from /home/af/af-stage/backend/.ves/af/lib/python3.4/site-packages (python 3.4)
Installation of lxml failed with "error: command 'gcc' failed with exit status 4":
    (af)[root@domain backend]# pip install lxml
    You are using pip version 7.1.0, however version 7.1.2 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.
    Collecting lxml
    Using cached lxml-3.5.0.tar.gz

getting attribute of an element with its corresponding Id

喜欢而已 posted on 2019-12-12 05:19:49
Question: Suppose that I have this XML file:
    <article-set xmlns:ns0="http://casfwcewf.xsd" format-version="5">
    <article>
      <article id="11234">
        <source> <hostname>some hostname for 11234</hostname> </source>
        <feed> <type weight=0.32>RSS</type> </feed>
        <uri>some uri for 11234</uri>
      </article>
      <article id="63563">
        <source> <hostname>some hostname for 63563 </hostname> </source>
        <feed> <type weight=0.86>RSS</type> </feed>
        <uri>some uri for 63563</uri>
      </article>
      . . .
    </article></article-set>
what I want,
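Going by the title, one reading is that each article's id should be paired with an attribute of a nested element. The sketch below does that with the standard-library `xml.etree.ElementTree`, under two assumptions that are not in the original: the `weight` values are quoted (unquoted attribute values are not well-formed XML), and the attribute of interest is `weight`.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the document from the question, with quoted weights.
xml_text = """
<article-set format-version="5">
  <article>
    <article id="11234">
      <source><hostname>some hostname for 11234</hostname></source>
      <feed><type weight="0.32">RSS</type></feed>
      <uri>some uri for 11234</uri>
    </article>
    <article id="63563">
      <source><hostname>some hostname for 63563</hostname></source>
      <feed><type weight="0.86">RSS</type></feed>
      <uri>some uri for 63563</uri>
    </article>
  </article>
</article-set>
""".strip()

root = ET.fromstring(xml_text)

weights = {}
for article in root.iter("article"):
    article_id = article.get("id")
    feed_type = article.find("feed/type")
    # The outer wrapper <article> has no id and no direct <feed>; skip it.
    if article_id is None or feed_type is None:
        continue
    weights[article_id] = float(feed_type.get("weight"))

print(weights)
```

A single article can then be looked up by its id, for example `weights["11234"]`.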

Issue in reading text in XML using python

a 夏天 posted on 2019-12-12 05:03:47
Question: I am trying to read an XML file with the following content:
    <tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
      <prop type="x-source-tags">1=A,2=B</prop>
      <prop type="x-target-tags">1=A,2=B</prop>
      <tuv xml:lang="EN"> <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg> </tuv>
      <tuv xml:lang="DE"> <seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg> </tuv>
    </tu>
using the following code:
    tree = ET.parse(tmx)
    root = tree.getroot()
    seg = root.findall('.//seg
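The question is cut off before the actual problem is stated. A common stumbling block with markup like this is that `seg.text` holds only the text before the first child element, so the sketch below assumes the goal is the full segment text and uses `itertext()` from the standard-library `xml.etree.ElementTree` to gather it. The XML is parsed from a string here instead of the `tmx` file.

```python
import xml.etree.ElementTree as ET

tmx_fragment = """<tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
  <prop type="x-source-tags">1=A,2=B</prop>
  <prop type="x-target-tags">1=A,2=B</prop>
  <tuv xml:lang="EN">
    <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
  </tuv>
  <tuv xml:lang="DE">
    <seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg>
  </tuv>
</tu>"""

root = ET.fromstring(tmx_fragment)

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# .text stops at the first child element (<ut/>).
first_only = root.find(".//seg").text

# itertext() also yields the text that follows each <ut/> (its tail).
segments = {}
for tuv in root.findall(".//tuv"):
    seg = tuv.find("seg")
    segments[tuv.get(XML_LANG)] = "".join(seg.itertext())

print(repr(first_only))
print(segments)
```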