I'm trying to write a Python script to extract lines from a file. The file is a text dump of Python suds output.
I want to:
Several suggestions on your code:
Stripping all non-alphanumeric characters is unnecessary and wastes time; there is no need to build linelist at all. Are you aware you can simply use the plain old string method, line.find("ArrayOf_xsd_string"), or re.search(...)?
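As a rough sketch of those options, here are three ways to test a line for the marker. The sample line is made up for illustration; the real suds dump may be formatted differently.

```python
import re

# Hypothetical sample line; the real dump may look different.
sample = '   (ArrayOf_xsd_string){\n'
marker = 'ArrayOf_xsd_string'

# str.find returns the index of the match, or -1 when it is absent.
found_by_find = sample.find(marker) >= 0

# The "in" operator asks the same question more directly.
found_by_in = marker in sample

# re.search returns a match object, or None when nothing matches.
found_by_regex = re.search(r'ArrayOf_xsd_string', sample) is not None

print(found_by_find, found_by_in, found_by_regex)
```

For a fixed substring like this one, find or in is enough; a regex only pays off if the marker varies.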
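Then as to your regex: note that _ is not covered by \W (underscore counts as a word character, so \W skips it), which means listing it explicitly in [\W_]+ is correct. The real problem is the following reassignment, which overwrites the line you just read with the compiled pattern. sub is then applied to string.printable rather than to the line, and its result is thrown away: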
    for line in f:
        line = re.compile('[\W_]+')  # overwrites the line you just read??
        line.sub('', string.printable)
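For comparison, if you did want to strip those characters, the usual pattern is to compile once under its own name and call sub on each line. This is only a sketch; the names non_alnum and clean_line are mine, not from the question.

```python
import re

# Compile once, outside the loop, under a name that does not clash with "line".
# [\W_]+ matches runs of anything that is not a letter or digit.
non_alnum = re.compile(r'[\W_]+')


def clean_line(line):
    """Return line with every run of non-alphanumeric characters removed."""
    return non_alnum.sub('', line)


print(clean_line('  "wordy_type stuff",\n'))
```

Note that sub returns a new string; it does not modify anything in place, so the result has to be kept.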
Here's my version, which reads the file directly, and also handles multiple matches:
    with open('data.txt', 'r') as f:
        theDict = {}
        found = -1
        for (lineno, line) in enumerate(f):
            if found < 0:
                if line.find('ArrayOf_xsd_string') >= 0:
                    found = lineno
                    entries = []
                continue
            # Grab following 6 lines...
            if 2 <= (lineno - found) <= 6 + 1:
                # strip whitespace (including the trailing newline) as well as punctuation
                entry = line.strip(' \t\r\n"{}[]=:,')
                entries.append(entry)
            # then create a dict with the key from line 5
            if (lineno - found) == 6 + 1:
                key = entries.pop(4)
                theDict[key] = entries
                print('%s %s' % (key, ','.join(entries)))  # comma-separated, no quotes
                #break  # if you want to end on first match
                found = -1  # to process multiple matches
And the output is exactly what you wanted (that's what ','.join(entries) was for):
123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc
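The same logic can be wrapped in a function so it is easier to reuse and test. This is a sketch only: the function name extract_records, its parameters, and the sample text are my assumptions, and the sample is a guess at the dump layout reconstructed from the output above, not real suds output.

```python
def extract_records(lines, marker='ArrayOf_xsd_string', count=6, key_index=4):
    """Map each record's key to its remaining fields.

    After a line containing marker, one line is skipped and the next
    count lines are taken as fields; the field at key_index is the key.
    """
    records = {}
    found = -1  # line number of the current marker, -1 if none
    entries = []
    for lineno, line in enumerate(lines):
        if found < 0:
            if marker in line:
                found = lineno
                entries = []
            continue
        offset = lineno - found
        if 2 <= offset <= count + 1:
            # Strip whitespace and the surrounding punctuation.
            entries.append(line.strip(' \t\r\n"{}[]=:,'))
        if offset == count + 1:
            key = entries.pop(key_index)
            records[key] = entries
            found = -1  # start looking for the next marker
    return records


# Hypothetical sample, guessed from the output shown above.
sample = '''(ArrayOf_xsd_string){
   item[] =
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }
'''

for key, fields in extract_records(sample.splitlines()).items():
    print(key, ','.join(fields))
```

Taking any iterable of lines means the function works equally on an open file object or on a list of strings.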