Question
I'm trying to write a function that counts the words in a pptx document. The problem is that I can't figure out how to find only tags of this kind:
<a:t>Some Text</a:t>
When I try print xmlTree.findall('.//a:t'), it raises:
SyntaxError: prefix 'a' not found in prefix map
Do you know what to do to make it work?
This is the function:
def get_pptx_word_count(filename):
    import xml.etree.ElementTree as ET
    import zipfile
    z = zipfile.ZipFile(filename)
    i = 0
    wordcount = 0
    while True:
        i += 1
        slidename = 'slide{}.xml'.format(i)
        try:
            slide = z.read("ppt/slides/{}".format(slidename))
        except KeyError:
            break
        xmlTree = ET.fromstring(slide)
        for elem in xmlTree.iter():
            if elem.tag == 'a:t':
                #text = elem.getText
                #num = len(text.split(' '))
                #wordcount+=num
                pass
Answer 1:
You need to tell ElementTree about your XML namespaces.
References:
- Official Documentation (Python 2.7): 19.7.1.6. Parsing XML with Namespaces
- Existing answer on StackOverflow: Parsing XML with namespace in Python via 'ElementTree'
- Article by the author of ElementTree: ElementTree: Working with Namespaces and Qualified Names
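As a minimal sketch of what "telling ElementTree about namespaces" looks like: ElementTree stores every tag as {namespace-uri}localname, so a prefix such as a only resolves if you pass a mapping from prefix to URI. The XML fragment below is made up for illustration, and the names SAMPLE and NAMESPACES are hypothetical; the URI is the DrawingML namespace that PowerPoint slides normally bind to the a prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down slide fragment used only for illustration.
SAMPLE = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    '<a:p><a:r><a:t>Some Text</a:t></a:r><a:r><a:t>More</a:t></a:r></a:p>'
    '</p:sld>'
)

# Prefix -> URI map handed to findall(); the prefix name here is our choice
# and does not have to match the one used inside the document.
NAMESPACES = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

root = ET.fromstring(SAMPLE)

# With the map supplied, the prefixed path resolves without a SyntaxError.
texts = [elem.text for elem in root.findall(".//a:t", NAMESPACES)]

# The parsed tag is the fully qualified form, never the literal string 'a:t'.
first_tag = root.find(".//a:t", NAMESPACES).tag
```

This also explains why comparing elem.tag to 'a:t' never matches: after parsing, the prefix is gone and only the URI-qualified name remains.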
Answer 2:
The way to specify the namespace inside ElementTree is:
{namespace}element
So, you should change your query to:
print xmlTree.findall('.//{a}t')
Edit:
As @mxjn pointed out, if a is a prefix and not the URI, you need to insert the URI instead of a:
print xmlTree.findall('.//{http://tempuri.org/name_space_of_a}t')
or you can supply a prefix map:
prefix_map = {"a": "http://tempuri.org/name_space_of_a"}
print xmlTree.findall('.//a:t', prefix_map)
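Applied to the original question, the prefix-map approach might look like the sketch below. It assumes the slides bind a to the standard DrawingML URI http://schemas.openxmlformats.org/drawingml/2006/main (the tempuri.org address above is only a placeholder). The names NS and SLIDE_NAME, the choice to scan the archive listing rather than count upward from slide1.xml, and the whitespace-based word split are all choices of this sketch, not part of the asker's code.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# Assumption: "a" is bound to the standard DrawingML namespace in the slides.
NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

# Matches ppt/slides/slide1.xml, slide2.xml, ... but not the _rels entries.
SLIDE_NAME = re.compile(r"ppt/slides/slide\d+\.xml$")


def get_pptx_word_count(filename):
    """Count whitespace-separated words in every <a:t> element of a .pptx."""
    wordcount = 0
    with zipfile.ZipFile(filename) as z:
        for name in z.namelist():
            if not SLIDE_NAME.match(name):
                continue
            root = ET.fromstring(z.read(name))
            for elem in root.findall(".//a:t", NS):
                # elem.text is None for an empty <a:t/>, hence the fallback.
                wordcount += len((elem.text or "").split())
    return wordcount
```

Calling get_pptx_word_count('deck.pptx') (a hypothetical filename) returns an integer. Only text runs on the slides themselves are counted; speaker notes and other parts of the archive are ignored.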
Source: https://stackoverflow.com/questions/40772297/syntaxerror-prefix-a-not-found-in-prefix-map