Parsing text from XML node in Python

帅比萌擦擦* 提交于 2019-11-26 14:57:41

问题


I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and print it out into a new flat file. Where am I going wrong?


回答1:


We can iterate through the URLs, toss them into a list and write them to a file as such:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    for block in child.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)
  • note we need to append the name space from the open urlset definition to properly parse the xml



回答2:


You were close in your attempt but like mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

Here's an example of how to account for the namespace:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)

output:

https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647

Note that I used the namespace prefix sm, but you could use any NCName.

See here for more information on parsing XML with namespaces in ElementTree.



来源:https://stackoverflow.com/questions/52847343/parsing-text-from-xml-node-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!