Parsing text from XML node in Python


傲寒 2020-12-02 02:36

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The struc

3 answers

渐次进展 (original poster)

2020-12-02 03:29

We can iterate over the URLs, collect them in a list, and write them to a file like this:

from xml.etree import ElementTree as ET

# Parse the unzipped sitemap and get its root <urlset> element
tree = ET.parse('test.xml')
root = tree.getroot()

# Namespace declared on <urlset>; ElementTree expects it as a '{uri}' tag prefix
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Collect the text of every <loc> inside each <url> block
urls = []
for block in root.iter('{}url'.format(name_space)):
    for url in block.findall('{}loc'.format(name_space)):
        urls.append('{}\n'.format(url.text))

# Write one URL per line
with open('sample_urls.txt', 'w') as f:
    f.writelines(urls)

Note that we need to prefix each tag name with the namespace from the opening urlset definition; without it, findall() matches nothing, because ElementTree stores every tag as '{namespace}tag'.
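
As an alternative to formatting the namespace into every tag name, ElementTree's findall() also accepts a prefix-to-URI mapping. The sketch below is only an illustration of that option: the inline SAMPLE document, the example.com URLs, the 'sm' prefix, and the extract_locs helper are all made up here and are not part of the original answer.

```python
from xml.etree import ElementTree as ET

# Hypothetical two-entry sitemap, inlined so the example runs on its own
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

# 'sm' is an arbitrary prefix chosen for this sketch;
# only the URI has to match the one declared on <urlset>
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def extract_locs(xml_text):
    """Return the text of every <loc> that sits directly inside a <url>."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall('sm:url/sm:loc', NS)]


print(extract_locs(SAMPLE))
```

For a file on disk, ET.parse('test.xml').getroot() could take the place of ET.fromstring(...), and the same path expression and mapping would apply.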
