extract keywords form images using python

老子叫甜甜 提交于 2019-12-11 06:09:35

问题


still learning python. I am currently working on a python code that will extracts metadata (usermade keywords) from images. I already tried Pillow AND exif but this excludes the user made tags or keywords. With applist, i successfully managed to extract the metafile including my keywords but when I tried to purse it with ElementTree to extract the parts of interest, I obtain only empty data.

My xml file look like this (after some manipulation):

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:description>
            <rdf:Seq>
               <rdf:li xml:lang="x-default">South Carolina, Olivyana, Kumasi</rdf:li>
            </rdf:Seq>
         </dc:description>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>Kumasi</rdf:li>
               <rdf:li>Summer 2016</rdf:li>
               <rdf:li>Charlestone</rdf:li>
               <rdf:li>SC</rdf:li>
               <rdf:li>Beach</rdf:li>
               <rdf:li>Olivjana</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:title>
            <rdf:Seq>
               <rdf:li xml:lang="x-default">P1050365</rdf:li>
            </rdf:Seq>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:aux="http://ns.adobe.com/exif/1.0/aux/">
         <aux:SerialNumber>F360908190331</aux:SerialNumber>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

My code looks like this:

import xml.etree.ElementTree as ET
from PIL import Image, ExifTags
with Image.open("myfile.jpg") as im:
    for segment, content in im.applist:
        marker, body = content.split(b'\x00', 1)
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('"utf-8"')
print (data)

at this point it was't possible to pass this to the parser as there is an empty line returning an error:

tree = ET.parse(data)

ValueError: embedded null byte

so after removing it i saved the data in a xml file (the xml data above) and passed to the parser but obtaining no data:

tree = ET.parse('mytags.xml')
tags = tree.findall('xmpmeta/RDF/Description/subject/Bags')
print (type(tags))
print (len(tags))

<class 'list'>
0

Interestingly, it I used the tags in the form of the xml file (i.e. 'x:xmpmeta':), I receive the following error:

SyntaxError: prefix 'x' not found in prefix map

Thanks for your help.

Fabio


回答1:


Focusing only on your XML parsing not PIL metadata work, three issues are your problem:

  1. You need to define the namespace prefixes when using findall which you can do with the namespaces arg. And then your xpath must include the prefixes.
  2. When using findall do not include the root as that is the starting point but from its child downward.
  3. There is no Bags local name with plural but only Bag and its length would be one. If you want its children, go one level deeper.

Consider adjusted script:

import xml.etree.ElementTree as ET

tree = ET.parse('mytags.xml')

nmspdict = {'x':'adobe:ns:meta/',            
            'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
            'dc': 'http://purl.org/dc/elements/1.1/'}

tags = tree.findall('rdf:RDF/rdf:Description/dc:subject/rdf:Bag/rdf:li',
                    namespaces = nmspdict)

print (type(tags))
print (len(tags))

# <class 'list'>
# 6

for i in tags:
    print(i.text)
# Kumasi
# Summer 2016
# Charlestone
# SC
# Beach
# Olivjana


来源:https://stackoverflow.com/questions/42892405/extract-keywords-form-images-using-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!