lxml | 易学教程

How to find all guide IDs and pages with IMG tags in XML export with lxml/xpath?

Question: How can I parse the XML below to find, for each GUIDE, its ID and URL, and then, for each PAGE inside the GUIDE, the page ID and any images that appear inside BOXES / BOX / ASSETS / DESCRIPTION? The images are in HTML format, so I need to grab the source from each image. <guide> <id></id> <url></url> <group> <id></id> <type></type> <name></name> </group> <pages> <page> <id></id> <name></name> <description></description> <boxes> <box> <id></id> <name></name> <type></type> <map_id></map_id>
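
One possible approach, as a minimal sketch: walk guide, then page, then description, and parse each description's text as HTML to pull out the image sources. The excerpt above is cut off before the ASSETS level, so the `<guides>` wrapper, the `assets/asset/description` nesting, the escaped HTML payload, and every value in `SAMPLE` are assumptions made for illustration.

```python
from lxml import etree, html

# Hypothetical export: wrapper element, nesting below <box>, and all values are assumed.
SAMPLE = b"""<guides>
  <guide>
    <id>1</id>
    <url>http://example.com/guide/1</url>
    <pages>
      <page>
        <id>10</id>
        <boxes>
          <box>
            <assets>
              <asset>
                <description>&lt;p&gt;&lt;img src="http://example.com/a.png"&gt;&lt;/p&gt;</description>
              </asset>
            </assets>
          </box>
        </boxes>
      </page>
    </pages>
  </guide>
</guides>"""

root = etree.fromstring(SAMPLE)
results = []
for guide in root.iter("guide"):
    guide_id = guide.findtext("id")
    guide_url = guide.findtext("url")
    for page in guide.xpath("pages/page"):
        page_id = page.findtext("id")
        sources = []
        for desc in page.xpath("boxes/box/assets//description"):
            if desc.text and desc.text.strip():
                # The description holds HTML as text, so parse it separately.
                fragment = html.fromstring(desc.text)
                sources.extend(fragment.xpath("descendant-or-self::img/@src"))
        results.append((guide_id, guide_url, page_id, sources))

print(results)
```

If the real export stores the HTML as child elements rather than escaped text, the inner `html.fromstring` step would not be needed and `desc.xpath(".//img/@src")` would be the place to start.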

lxml::etree::_ElementStringResult.getparent() works incorrectly

Question: I could not find anyone explaining this error. I'm using lxml 3.1.0. Given HTML/XML like this: <h1 class="fn"><strong class="brand">Lange</strong> XT 100 LV Ski Boots 2014</h1> the _ElementStringResult for the string " XT 100 LV Ski Boots 2014" is returned when we run: >> elemstr = tree.xpath('//body//h1/text()')[0] However, when we then run the following, we get: >> parent = elemstr.getparent() >> tree.getpath(parent) /html/body/therestofthepath/h1/strong Did anyone have a problem
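
A short sketch of what is likely going on: in lxml's tree model the string after `</strong>` is stored as the tail of the `<strong>` element, and for tail text `getparent()` returns the element that carries the tail. The smart-string flags `is_text` and `is_tail` distinguish the two cases. The `<html><body>` wrapper and the variable `owner` are additions for illustration only.

```python
from lxml import html

# The <h1> markup is from the question; the html/body wrapper is assumed.
doc = html.fromstring(
    '<html><body><h1 class="fn"><strong class="brand">Lange</strong>'
    ' XT 100 LV Ski Boots 2014</h1></body></html>'
)

elemstr = doc.xpath('//body//h1/text()')[0]
parent = elemstr.getparent()

# For tail text, getparent() is the preceding sibling element, not the container.
print(parent.tag, elemstr.is_text, elemstr.is_tail)

# Step up one more level when the string is a tail to reach the enclosing element.
owner = parent.getparent() if elemstr.is_tail else parent
print(owner.tag)
```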

parse html content by passing custom date input

Question: I am parsing data from here. On the webpage I can get the data for, say, yesterday by selecting the desired date. How can I pass a custom date and parse the data for that date? Answer 1: You can either use Selenium or use the site's AJAX API. Here is an example of the latter: def get_by_date(date): url = 'https://markets.ft.com/data/world/ajax/getnextecoevents?startDate=' + date r = requests.get(url) return r.json()['html'] date should be formatted as yyyy

Accessing !ENTITY statement and reference

Question: I have some XML files with !ENTITY definitions and &file_reference; references, and I can process these successfully. However, I would like to preprocess the files and access the !ENTITY definitions to extract the file names, and also find the &file_reference; references and the section of the XML each one is in. An example XML file looks like <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE gdml [ <!ENTITY materials SYSTEM "materialsOptical.xml"> <!ENTITY solids_Mainz_v2 SYSTEM "solids_Mainz_v2.xml"> <!ENTITY matrices_Mainz_v2
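
A rough sketch of one way to inspect both parts without expanding the entities. It parses with `resolve_entities=False` so that each `&name;` stays in the tree as an entity node, and it reads the declarations with a simple regular expression, which is a shortcut chosen here rather than a full DTD parse. The document body (`<materials>`, `<solids>`) is an assumed stand-in, since the original example is truncated.

```python
import re
from lxml import etree

# DOCTYPE lines are from the question; the element structure below them is assumed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gdml [
<!ENTITY materials SYSTEM "materialsOptical.xml">
<!ENTITY solids_Mainz_v2 SYSTEM "solids_Mainz_v2.xml">
]>
<gdml>
  <materials>&materials;</materials>
  <solids>&solids_Mainz_v2;</solids>
</gdml>
"""

# Entity name -> file name, taken from the internal DTD subset text.
declared = dict(
    re.findall(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"', SAMPLE.decode("utf-8"))
)

# Keep entity references as nodes instead of replacing them with file content.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(SAMPLE, parser)

# (entity name, tag of the element that contains the reference)
references = [(ent.name, ent.getparent().tag) for ent in root.iter(etree.Entity)]

for name, section in references:
    print(name, "->", declared.get(name), "in <%s>" % section)
```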

Cx_freeze with lxml.html TypeError

Question: import lxml.html gives me an error when I try to compile with cx_freeze: Traceback (most recent call last): File "C:\Python27\Scripts\cxfreeze", line 5, in <module> main() File "C:\Python27\lib\site-packages\cx_Freeze\main.py", line 188, in main freezer.Freeze() File "C:\Python27\lib\site-packages\cx_Freeze\freezer.py", line 572, in Freeze self._FreezeExecutable(executable) File "C:\Python27\lib\site-packages\cx_Freeze\freezer.py", line 186, in _FreezeExecutable exe.copyDependentFiles,

No nested nodes. How to get one piece of information and then to get additional info respectively?

Question: For the code below, I need to get each date together with its times, hrefs, formats, and so on (not shown). <div class="showtimes"> <h2>The Little Prince</h2> <div class="poster" data-poster-url="http://www.test.com"> <img src="http://www.test.com"> </div> <div class="showstimes"> <div class="date">9 December, Wednesday</div> <span class="show-time techno-3d"> <a href="http://www.test.com" class="link">12:30</a> <span class="show-format">3D</span> </span> <span class="show-time techno-3d"> <a href=
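
Because the dates and show times are siblings rather than parent and child, one workable pattern is to walk the children in document order and remember the most recent date. The sketch below assumes that layout continues as in the excerpt; the second show time, the second date, and all values after the first show are made up to complete the truncated sample.

```python
from lxml import html

# First date and first show come from the question; the rest is hypothetical.
SNIPPET = """<div class="showstimes">
  <div class="date">9 December, Wednesday</div>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">12:30</a>
    <span class="show-format">3D</span>
  </span>
  <span class="show-time techno-2d">
    <a href="http://www.test.com" class="link">15:00</a>
    <span class="show-format">2D</span>
  </span>
  <div class="date">10 December, Thursday</div>
  <span class="show-time techno-3d">
    <a href="http://www.test.com" class="link">18:45</a>
    <span class="show-format">3D</span>
  </span>
</div>"""

container = html.fromstring(SNIPPET)
shows = {}
current_date = None

for child in container:
    classes = child.get("class", "").split()
    if "date" in classes:
        # A new date starts a new group; later show times belong to it.
        current_date = child.text_content().strip()
        shows[current_date] = []
    elif "show-time" in classes and current_date is not None:
        link = child.find("a")
        fmt = child.find("span")
        shows[current_date].append((link.text, link.get("href"), fmt.text))

print(shows)
```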

How to add an attribute to a tag found using xpath in lxml in Python?

Question: I have the following XML: <draw:image></draw:image> I want to add multiple xlink attributes to it to make it: <draw:image xlink:href="image" xlink:show="embed"></draw:image> I tried the following code but got the error "ValueError: Invalid attribute name u'xlink:href'": root.xpath("//draw:image", namespaces={"draw":"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"})[0].attrib['xlink:href'] = 'image' What am I doing wrong? There seems to be something related to namespaces, but I
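
A minimal sketch of the usual lxml idiom for namespaced attributes: the attribute name is given in `{namespace-uri}localname` form rather than with a prefix. The XLink namespace URI is `http://www.w3.org/1999/xlink`; the `<root>` wrapper with its two namespace declarations is assumed here, since the question shows only the `<draw:image>` element.

```python
from lxml import etree

DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical wrapper document that declares both prefixes.
root = etree.fromstring(
    '<root xmlns:draw="%s" xmlns:xlink="%s"><draw:image></draw:image></root>'
    % (DRAW, XLINK)
)

image = root.xpath("//draw:image", namespaces={"draw": DRAW})[0]

# Qualified attribute names use {uri}name, not prefix:name.
image.set("{%s}href" % XLINK, "image")
image.set("{%s}show" % XLINK, "embed")

print(etree.tostring(root).decode())
```

Because the `xlink` prefix is already declared on the wrapper, the serializer reuses it when writing the new attributes.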

parsing xml by python lxml tree.xpath

Question: I am trying to parse a huge file; a sample is below. I am trying to get <Name>, but I can't. It only works without this line: <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"> xml2 = '''<?xml version="1.0" encoding="UTF-8"?> <PackageLevelLayout> <LevelLayouts> <LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"> <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns
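
The symptom described is typical of a default namespace: elements under the inner `<LevelLayout xmlns="...">` live in that namespace, and XPath 1.0 has no notion of a default namespace, so an unprefixed `Name` does not match them. A sketch of the common workaround is to bind the URI to a prefix of your choosing in the `namespaces` argument. The position and content of `<Name>` and the prefix `d` are assumptions, since the sample is cut off.

```python
from lxml import etree

# Outer structure is from the question; <Name> and its value are assumed.
xml2 = b"""<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
  <LevelLayouts>
    <LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
      <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain"
                   xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
        <Name>Example</Name>
      </LevelLayout>
    </LevelLayout>
  </LevelLayouts>
</PackageLevelLayout>"""

tree = etree.fromstring(xml2)

# Any prefix works as long as it maps to the same namespace URI.
ns = {"d": "http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain"}

unprefixed = tree.xpath("//Name/text()")
names = tree.xpath("//d:LevelLayout/d:Name/text()", namespaces=ns)

print(unprefixed, names)
```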

Python - Same xpath in selenium and lxml different results

Question: I have this site, http://www.google-proxy.net/, and I need to get the first proxy's ip:port. br = webdriver.Firefox() br.get("http://www.google-proxy.net/") ip = br.find_element_by_xpath("//tr[@class='odd']/td[1]").text; time.sleep(random.uniform(1, 1)) port = br.find_element_by_xpath("//tr[@class='odd']/td[2]").text; time.sleep(random.uniform(1, 1)) and it works fine. But now I want to do the same with lxml: page = requests.get(proxy_server) root = lxml.html.fromstring(page.text) ip = root.xpath("/

XSLT 1.0: max value of a date node

Question: Given the following XML: <Ergebnisse> <Spiel> <Datum>2013-10-02</Datum> </Spiel> <Spiel> <Datum>2013-10-03</Datum> </Spiel> <Spiel> <Datum>2013-10-03</Datum> </Spiel> <Spiel> <Datum>2013-10-03</Datum> </Spiel> <Spiel> <Datum>2013-10-06</Datum> </Spiel> <Spiel> <Datum>2013-10-06</Datum> </Spiel> <Spiel> <Datum>2013-10-06</Datum> </Spiel> <Spiel> <Datum>2013-10-06</Datum> </Spiel> <Spiel> <Datum>2014-05-01</Datum> </Spiel> <Spiel> <Datum>2014-05-01</Datum> </Spiel> <Spiel> <Datum>2014-04-27</Datum>
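
XSLT 1.0 has no `max()` for dates, so a common idiom is to sort descending and keep the first item; ISO `yyyy-mm-dd` dates order correctly as plain text. The sketch below runs such a stylesheet through `lxml.etree.XSLT`. The input is a shortened subset of the dates in the question, and the stylesheet is one possible formulation, not necessarily what the asker ended up with.

```python
from lxml import etree

# Shortened subset of the question's data.
SOURCE = b"""<Ergebnisse>
  <Spiel><Datum>2013-10-02</Datum></Spiel>
  <Spiel><Datum>2013-10-06</Datum></Spiel>
  <Spiel><Datum>2014-05-01</Datum></Spiel>
  <Spiel><Datum>2014-04-27</Datum></Spiel>
</Ergebnisse>"""

# Sort all Datum nodes descending (as text) and output only the first one.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="Ergebnisse/Spiel/Datum">
      <xsl:sort select="." order="descending"/>
      <xsl:if test="position() = 1">
        <xsl:value-of select="."/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring(SOURCE))
max_date = str(result).strip()

print(max_date)
```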