Parsing all XML files in directory and all subdirectories

Question

I am new to Python, though I have some experience with Delphi. I am trying to write a script that finds all XML files in a directory (including all of its subdirectories), parses them, and saves some data (numbers) from them to a simple txt file. I then go through that txt file to create another txt file containing only the unique numbers from the first one.

I created this script:

import os
from xml.dom import minidom

#for testing purposes
directory = os.getcwd()

print("Procházím aktuální adresář, hledám XML soubory...")
print("Procházím XML soubory, hledám IČP provádějícího...")

with open ('ICP_all.txt', 'w') as SeznamICP_all:   
    for root, dirs, files in os.walk(directory):
        for file in files:
            if (file.endswith('.xml')):
                xmldoc = minidom.parse(file)
                itemlist = xmldoc.getElementsByTagName('is')
                SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')

print("Vytvářím list unikátních IČP...")

with open ('ICP_distinct.txt','w') as distinct:
    UnikatniICP = []
    with open ('ICP_all.txt','r') as SeznamICP_all:
        for line in SeznamICP_all:
            if line not in UnikatniICP:
                UnikatniICP.append(line)
                distinct.write(line)

print('Počet unikátních IČP:' + str(len(UnikatniICP)))
input('Pro ukončení stiskni libovolnou klávesu...')

It works as intended until there is a subdirectory; in that case I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'RNN38987.xml'

That happens because the file is in a subdirectory, not in the directory containing the Python script. I tried to fix it by using Path to get the absolute path of the file, but now I get a different error. Here is the script:

import os
from xml.dom import minidom
from pathlib import Path

#for testing purposes
directory = os.getcwd()

print("Procházím aktuální adresář, hledám XML soubory...")
print("Procházím XML soubory, hledám IČP provádějícího...")

with open ('ICP_all.txt', 'w') as SeznamICP_all:   
    for root, dirs, files in os.walk(directory):
        for file in files:
            if (file.endswith('.xml')):
                soubor = Path(file).resolve()
                print(soubor)
                xmldoc = minidom.parse(soubor)
                itemlist = xmldoc.getElementsByTagName('is')
                SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')

print("Vytvářím list unikátních IČP...")

with open ('ICP_distinct.txt','w') as distinct:
    UnikatniICP = []
    with open ('ICP_all.txt','r') as SeznamICP_all:
        for line in SeznamICP_all:
            if line not in UnikatniICP:
                UnikatniICP.append(line)
                distinct.write(line)

print('Počet unikátních IČP:' + str(len(UnikatniICP)))
input('Pro ukončení stiskni libovolnou klávesu...')

I don't really understand the error I am getting now, and Google is not helping either. Here is the whole log:

Procházím aktuální adresář, hledám XML soubory...
Procházím XML soubory, hledám IČP provádějícího...
C:\2_Programming\Python\IČP FINDER\src\20150225_1815_2561_1.xml
Traceback (most recent call last):
  File "C:\2_Programming\Python\IČP FINDER\src\ICP Finder.py", line 17, in <module>
    xmldoc = minidom.parse(soubor)
  File "C:\2_Programming\Python\Interpreter\lib\xml\dom\minidom.py", line 1958, in parse
    return expatbuilder.parse(file)
  File "C:\2_Programming\Python\Interpreter\lib\xml\dom\expatbuilder.py", line 913, in parse
    result = builder.parseFile(file)
  File "C:\2_Programming\Python\Interpreter\lib\xml\dom\expatbuilder.py", line 204, in parseFile
    buffer = file.read(16*1024)
AttributeError: 'WindowsPath' object has no attribute 'read'

Can you please help me out?

Answer 1:

The pattern you are looking for is like:

with open ('ICP_all.txt', 'w') as SeznamICP_all:   
    for root, dirs, files in os.walk(directory):
        for file in files:
            if (file.endswith('.xml')):
                xmldoc = minidom.parse(os.path.join(root, file))
                itemlist = xmldoc.getElementsByTagName('is')
                SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')

In each iteration of the for loop, root is the directory that contains the entries listed in files and dirs, so os.path.join(root, file) gives a path that can actually be opened from anywhere.
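To see what root holds on each iteration, here is a small self-contained sketch. It builds a throwaway directory tree (the helper name find_xml_files and the file names a.xml, sub/b.xml and notes.txt are made up for illustration) and collects the joined paths:

```python
import os
import tempfile

def find_xml_files(directory):
    """Return full paths of all .xml files under directory, recursively."""
    found = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.xml'):
                # name is a bare file name; root is the folder that contains it
                found.append(os.path.join(root, name))
    return sorted(found)

# Hypothetical layout, created only for this demonstration
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    for rel in ('a.xml', os.path.join('sub', 'b.xml'), 'notes.txt'):
        with open(os.path.join(top, rel), 'w') as f:
            f.write('<r/>')
    # Show the results relative to the top folder so they are easy to read
    relative = [os.path.relpath(p, top) for p in find_xml_files(top)]
    print(relative)
```

The file in sub is only reachable because root was joined in; the bare name b.xml on its own would be looked up in the current working directory.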

Answer 2:

As already explained in Rob's answer, your issue is that you are not joining the path: once os.walk leaves the cwd it finds files in other directories, but your code still looks for them in the cwd.

Since you are using Python 3, you have a couple of other options for finding the files. If your Python version is 3.5 or newer, you can find all the XML files by searching recursively with glob:

import glob
import os
from xml.dom import minidom

directory = os.getcwd()

with open ('ICP_all.txt', 'w') as SeznamICP_all:
    for file in glob.iglob(directory + '/**/*.xml', recursive=True):
        xmldoc = minidom.parse(file)
        itemlist = xmldoc.getElementsByTagName('is')
        SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')

Or, if you are using Python 3.4 or newer, you can use pathlib to do a recursive search:

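import os
from pathlib import Path
from xml.dom import minidom

directory = os.getcwd()

with open ('ICP_all.txt', 'w') as SeznamICP_all:
    for file in Path(directory).glob('**/*.xml'):
        # pass a plain string: the traceback in the question shows minidom rejecting a Path object
        xmldoc = minidom.parse(str(file))
        itemlist = xmldoc.getElementsByTagName('is')
        SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')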
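
Putting the pieces together, here is a minimal sketch that searches recursively with pathlib, hands minidom.parse a plain string (the traceback in the question shows it treating a Path object as an already-open file), and keeps only the unique values without going through an intermediate txt file. The function name collect_unique_icp and the sample files are hypothetical; the is element and icp attribute are taken from the question. It also skips files that have no such element, which is an assumption about what you would want.

```python
import tempfile
from pathlib import Path
from xml.dom import minidom

def collect_unique_icp(directory):
    """Return the distinct 'icp' values of the first <is> element in each file."""
    seen = {}
    for path in sorted(Path(directory).glob('**/*.xml')):
        # str() because minidom.parse expects a file name or an open file object
        xmldoc = minidom.parse(str(path))
        items = xmldoc.getElementsByTagName('is')
        if items and items[0].hasAttribute('icp'):
            # dict keys are unique and keep insertion order (Python 3.7+)
            seen.setdefault(items[0].getAttribute('icp'), None)
    return list(seen)

# Made-up sample data, only to exercise the function
with tempfile.TemporaryDirectory() as top:
    (Path(top) / 'sub').mkdir()
    samples = {'a.xml': '111', 'b.xml': '222', 'sub/c.xml': '111'}
    for rel, icp in samples.items():
        (Path(top) / rel).write_text('<root><is icp="%s"/></root>' % icp)
    unique = collect_unique_icp(top)
    print(unique)
```

Using a dict (or a set, if order does not matter) makes the membership check constant time, whereas `line not in UnikatniICP` on a list gets slower as the list grows.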

Source: https://stackoverflow.com/questions/38211588/parsing-all-xml-files-in-directory-and-all-subdirectories

Tags

python

xml

python-3.x

xml-parsing