Question
This is my txt file:
In File Name: C:\Users\naqushab\desktop\files\File 1.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size: Low: 22636 High: 0
Total Process time: 1.859000
Out File Size: Low: 77619 High: 0
In File Name: C:\Users\naqushab\desktop\files\File 2.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size: Low: 20673 High: 0
Total Process time: 3.094000
Out File Size: Low: 94485 High: 0
In File Name: C:\Users\naqushab\desktop\files\File 3.m1
Out File Name: C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size: Low: 66859 High: 0
Total Process time: 3.516000
Out File Size: Low: 217268 High: 0
I am trying to parse this to an XML format like this:
<?xml version='1.0' encoding='utf-8'?>
<root>
<filedata>
<InFileName>File 1.m1</InFileName>
<OutFileName>File 1.m2</OutFileName>
<InFileSize>22636</InFileSize>
<OutFileSize>77619</OutFileSize>
<ProcessTime>1.859000</ProcessTime>
</filedata>
<filedata>
<InFileName>File 2.m1</InFileName>
<OutFileName>File 2.m2</OutFileName>
<InFileSize>20673</InFileSize>
<OutFileSize>94485</OutFileSize>
<ProcessTime>3.094000</ProcessTime>
</filedata>
<filedata>
<InFileName>File 3.m1</InFileName>
<OutFileName>File 3.m2</OutFileName>
<InFileSize>66859</InFileSize>
<OutFileSize>217268</OutFileSize>
<ProcessTime>3.516000</ProcessTime>
</filedata>
</root>
Here is the code (I am using Python 2) with which I am trying to achieve that:
import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size: Low:
                       |Total Process time:
                       |Out File Size: Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'
        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')
            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)
But I am getting empty values. Is it possible to parse this text file to XML?
Answer 1:
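A correction to your regex. It should be
m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)
rather than the pattern you have given. The alternation itself is not the problem: | has the lowest precedence, so In File Name:|Out File Name: already matches either whole phrase. The problem is that your pattern is compiled with re.VERBOSE, which ignores unescaped whitespace inside the pattern. Your regex therefore looks for InFileName:, OutFileName: and so on, and none of those appear in the file. The pattern above is used without re.VERBOSE, so its spaces are matched literally.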
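As a quick check of the pattern above, here is a minimal sketch that applies it to a few of the sample lines. The names PATTERN and split_line are mine and purely illustrative; the code is written so that it should run on Python 3 as well as Python 2.

```python
import re

# The suggested pattern, compiled without re.VERBOSE so spaces are literal.
PATTERN = re.compile(
    '(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    '|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)


def split_line(line):
    """Return (title, value) for a recognised line, or None otherwise."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group('title'), m.group('value').strip()


print(split_line(r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1'))
print(split_line('In File Size: Low: 22636 High: 0'))
print(split_line('Total Process time: 1.859000'))
```

Note that for the size lines the value still carries the trailing "High: 0" part, and for the name lines it is the full path, so both need a little extra trimming before they go into the XML.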
Suggestion:
You can do it without a regex. xml.dom.minidom can be used to prettify your XML string.
I've added comments inline for better understanding!
Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])
Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.
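To illustrate toprettyxml on its own, here is a small sketch that builds a one-record tree with ElementTree and re-parses it with minidom for pretty-printing. The element names are taken from the question; the two-space indent is just my choice for the example.

```python
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring gives a compact one-line document;
# minidom re-parses it and emits one element per line, indented.
compact = ET.tostring(root)
pretty = minidom.parseString(compact).toprettyxml(indent='  ')
print(pretty)
```

Without the encoding argument toprettyxml returns text; passing encoding='utf-8' returns an encoded byte string instead, which matters when you write the result to a file.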
Edit
import itertools as it
[line[0] for line in it.groupby(lines)]
You can use groupby from the itertools package to collapse consecutive duplicate entries in the list lines.
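A tiny sketch of what that groupby expression does, using a made-up lines list in which several blank lines follow each other:

```python
import itertools as it

# Made-up input: runs of blank lines between the records.
lines = ['In File Name: a', '', '', '', 'In File Name: b', '', 'In File Name: c']

# groupby yields (key, group) pairs, one per run of equal neighbouring items,
# so taking only the key keeps a single copy of each run.
collapsed = [key for key, _group in it.groupby(lines)]
print(collapsed)
# ['In File Name: a', '', 'In File Name: b', '', 'In File Name: c']
```

Only neighbouring duplicates are merged; equal lines that are separated by something else are all kept.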
So,
import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file,
# with consecutive duplicate lines grouped into one
for line in it.groupby(lines):
    line = line[0]
    # an empty line marks the break between subelements
    if not line:
        # add the next subelement and use it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ":" to get the tag name
        tag = line.split(":")
        # format the tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()
        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # Python 2: filter() returns a list here
            splist = filter(None, line.split(" "))
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1]
        el.text = tag

# prettify the XML
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()

# Display for debugging
print formatedXML

# write the formatted XML to the file
with open("Performance.xml", "w+") as f:
    f.write(formatedXML)
Output: Performance.xml
<?xml version="1.0" encoding="utf-8"?>
<root>
<filedata>
<InFileName>File 1.m1</InFileName>
<OutFileName>File 1.m2</OutFileName>
<InFileSize>22636</InFileSize>
<TotalProcesstime>1.859000</TotalProcesstime>
<OutFileSize>77619</OutFileSize>
</filedata>
<filedata>
<InFileName>File 2.m1</InFileName>
<OutFileName>File 2.m2</OutFileName>
<InFileSize>20673</InFileSize>
<TotalProcesstime>3.094000</TotalProcesstime>
<OutFileSize>94485</OutFileSize>
</filedata>
<filedata>
<InFileName>File 3.m1</InFileName>
<OutFileName>File 3.m2</OutFileName>
<InFileSize>66859</InFileSize>
<TotalProcesstime>3.516000</TotalProcesstime>
<OutFileSize>217268</OutFileSize>
</filedata>
</root>
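The output above keeps the input order and uses TotalProcesstime, whereas the question asks for a ProcessTime tag placed last. If that matters, one possible variation is sketched below. Everything in it is an assumption of mine rather than part of the code above: the PREFIX_TO_TAG table, the extract and build_tree helpers, and the choice to start a new record at each "In File Name" line instead of at a blank line. It should run on Python 3 as well as Python 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from line prefix to the tag names the question asks for,
# listed in the order the question wants them written.
PREFIX_TO_TAG = [
    ('In File Name:', 'InFileName'),
    ('Out File Name:', 'OutFileName'),
    ('In File Size:', 'InFileSize'),
    ('Out File Size:', 'OutFileSize'),
    ('Total Process time:', 'ProcessTime'),
]
TAG_ORDER = [tag for _prefix, tag in PREFIX_TO_TAG]


def extract(prefix, rest):
    """Reduce the text after the prefix to the value wanted in the XML."""
    if 'File Name' in prefix:
        return rest.split('\\')[-1].strip()      # keep only the file name
    if 'File Size' in prefix:
        parts = rest.split()
        return parts[parts.index('Low:') + 1]    # the number after "Low:"
    return rest.strip()


def build_tree(lines):
    """Collect one dict per record, then emit the tags in TAG_ORDER."""
    records, current = [], None
    for line in lines:
        for prefix, tag in PREFIX_TO_TAG:
            if line.startswith(prefix):
                if tag == 'InFileName':          # assumption: starts a new record
                    current = {}
                    records.append(current)
                if current is not None:
                    current[tag] = extract(prefix, line[len(prefix):])
                break
    root = ET.Element('root')
    for record in records:
        filedata = ET.SubElement(root, 'filedata')
        for tag in TAG_ORDER:
            if tag in record:
                ET.SubElement(filedata, tag).text = record[tag]
    return root


sample = [
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1',
    r'Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2',
    'In File Size: Low: 22636 High: 0',
    'Total Process time: 1.859000',
    'Out File Size: Low: 77619 High: 0',
]
xml_text = ET.tostring(build_tree(sample)).decode('utf-8')
print(xml_text)
```

Collecting each record into a dict first is what makes the reordering possible; the result can still be passed through minidom for pretty-printing.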
Hope it helps!
Answer 2:
From the docs (emphasis is mine):
re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
So either escape the spaces in the regex or use the \s class.
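A minimal sketch of that fix, assuming you want to keep re.VERBOSE and the layout of the original pattern. The broken pattern is a shortened copy of the one in the question, included only to show that it finds nothing; the code should run on Python 3 as well as Python 2.

```python
import re

# Shortened copy of the question's pattern: under re.VERBOSE the unescaped
# spaces are dropped, so it really looks for "InFileName:" and so on.
broken = re.compile(r'''(?P<title>In File Name:
                          |Total Process time:
                        )
                        (?P<value>.*)
                     ''', re.VERBOSE)

# Same layout, but literal spaces are escaped (or written as \s),
# so re.VERBOSE keeps them.
rex = re.compile(r'''(?P<title>In\ File\ Name:
                       |Out\ File\ Name:
                       |In\ File\ Size:\s+Low:
                       |Total\ Process\ time:
                       |Out\ File\ Size:\s+Low:
                     )
                     \s*(?P<value>.*)
                  ''', re.VERBOSE)

line = 'Total Process time: 1.859000'
print(broken.search(line))                     # None
m = rex.search(line)
print('%s -> %s' % (m.group('title'), m.group('value')))
```

With the fixed pattern the rest of the question's code starts producing elements again, although the file name and size values still need trimming to match the desired output.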
Source: https://stackoverflow.com/questions/42835956/how-to-parse-a-txt-file-into-xml