Parsing text from XML node in Python


傲寒 2020-12-02 02:36

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The struc

3 answers

执笔经年 (original poster)

2020-12-02 03:36
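You were close, but as mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"). Because of that declaration, every unprefixed element in the file, including loc, belongs to that namespace, so a search for a bare loc finds nothing.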

Here's an example of how to account for the namespace:
```
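import xml.etree.ElementTree as ET

# Path to the unzipped sitemap file
tree = ET.parse('my_local_filepath')

# Map a prefix of our choosing to the sitemap's default namespace URI
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Find every <loc> element in that namespace, at any depth
for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)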
```
Output:
```
https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647
```
Note that I used the namespace prefix sm, but you could use any NCName; the prefix is only a key in the ns dictionary and doesn't need to appear in the XML file.
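To illustrate that the prefix is only a local alias, here's a small sketch using an inline sample document. The `SAMPLE` string, the example.com URLs, and the variable names are made up for illustration; only the namespace URI comes from the sitemap. It runs the same search with two different prefixes and with the `{uri}tag` form that ElementTree also accepts, then shows that the bare tag name matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real sitemap has the same shape but many more entries.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/a</loc></url>
</urlset>"""

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

root = ET.fromstring(SAMPLE)

# Same query with two different prefixes: only the URI they map to matters.
with_sm = [e.text for e in root.findall(".//sm:loc", {"sm": SITEMAP_NS})]
with_x = [e.text for e in root.findall(".//x:loc", {"x": SITEMAP_NS})]

# No prefix mapping at all: spell out the namespace as {uri}tag.
with_uri = [e.text for e in root.findall(f".//{{{SITEMAP_NS}}}loc")]

# Ignoring the namespace: the bare tag name does not match any element.
without_ns = root.findall(".//loc")

print(with_sm)
print(with_sm == with_x == with_uri)
print(len(without_ns))
```

The three namespaced searches return the same elements, so which spelling to use is mostly a matter of readability.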

See the "Parsing XML with Namespaces" section of the ElementTree documentation for more information.
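As a side note, the manual unzip step from the question isn't strictly necessary. Here's a minimal sketch, assuming the downloaded file is still gzip-compressed on disk; the function name `sitemap_urls` and its `gz_path` parameter are my own, not part of any library.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(gz_path):
    """Return the text of every <loc> element in a gzipped sitemap file.

    gz_path is a hypothetical local path to a .xml.gz file.
    """
    # gzip.open returns a file-like object, which ET.parse can read directly.
    with gzip.open(gz_path, "rb") as f:
        tree = ET.parse(f)
    return [elem.text for elem in tree.findall(".//sm:loc", NS)]
```

Calling `sitemap_urls("sitemap_c_0.xml.gz")` on a local copy should give the same list of URLs as parsing the unzipped file.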