Text file creation issue where new lines are created when not really at EOL

Question

I am importing text data into Excel from a set of files I created in Python (converting metadata/XML records to text). It mostly works fine, except that new lines are inserted at points where the text is simply part of a paragraph. The problem arises in the file creation process.

Is it possible to clean the data automatically so that it stays in the same row until an escape/new-row character is reached?

As this site doesn't allow attachments I have attached examples here.

1. anz*_log.txt -- Raw text file where I am using "^" as the delimiter. I can make it add another character at the end of each known row, if Excel can use this to create new rows only where that character exists.
2. anz*_xml.xls -- Excel import: worksheet (*log) is the raw import data, and "cleaned" is where I have used formulas to get the values in properly.
3. rowChar_anz*log.txt -- Raw text file with ':;:' at the start of each row to show that it should be a new row (same as 1, but with an additional delimiter for the row).

This is just a test dataset, and I need to run this on thousands of files. See the issues in rows 9, 13, 54, etc.

Can I use Python (or, if necessary, Cygwin/sed) to:

Look for the "start of line" string ':;:' and the "end of line" string ';:;'
If both don't exist in a single row, append the line to the previous row
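
The two steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested solution for the real files: the names merge_broken_rows and merge_file are hypothetical, joining fragments with a single space is an assumption, and so is the UTF-8 encoding. It relies only on the ':;:' start marker. Any line that does not begin with it is treated as a continuation of the previous row, so the ';:;' end marker is not needed.

```python
import io


def merge_broken_rows(lines, start=':;:', joiner=' '):
    """Join physical lines into logical rows.

    A new row begins only on a line that starts with `start`;
    every other non-blank line is appended to the previous row.
    """
    rows = []
    for line in lines:
        text = line.rstrip('\r\n')
        if not text.strip():
            continue  # skip blank lines left behind by stray newlines
        if text.startswith(start) or not rows:
            rows.append(text)
        else:
            rows[-1] += joiner + text
    return rows


def merge_file(src_path, dst_path, start=':;:'):
    """Rewrite src_path into dst_path with one logical row per line.

    The UTF-8 encoding is an assumption; adjust it to match the logs.
    """
    with io.open(src_path, encoding='utf-8') as src:
        rows = merge_broken_rows(src, start)
    with io.open(dst_path, 'w', encoding='utf-8') as dst:
        for row in rows:
            dst.write(row + u'\n')


sample = [
    ":;:^1^a.xml^title::Some\n",
    "wrapped text^\n",
    ":;:^2^b.xml^title::Other^\n",
]
for row in merge_broken_rows(sample):
    print(row)
```

Because the whole row list is held in memory, this suits log files of modest size. For very large files the same loop could write each completed row as soon as the next start marker is seen.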

Alternatively (and ideally), can this be done while the file is being created by the following code, maybe using re.compile (as in "Query CSV and write original CSV and results to single CSV Python")?

#-------------------------------------------------------------------------------
# Name:        Convert xml data to csv with anzlic tagged data kept separate
# Purpose:  Also has an excel template to convert the data into standard columns
#
# Author:      georgec@atgis.com.au
#
# Created:     05/03/2013
# Copyright:   (c) ATGIS. georgec 2013
# Licence:     Creative Commons
#-------------------------------------------------------------------------------

import os, xml, shutil, datetime
from xml.etree import ElementTree as et

SourceDIR=r'L:\Vector_Data'
rootDir=os.getcwd()
log_name='vector'
x=0

def locatexml(SourceDIR,x, rootDir):
    xmllist=[]
    for root, dirs, files in os.walk(SourceDIR, topdown=False):
        for fl in files:
            currentFile=os.path.join(root, fl)
            ext=fl[fl.rfind('.')+1:]
            if ext=='xml':
                xmllist.append(currentFile)
                print currentFile
                x+=1
                try:
                    processxml(currentFile,x, rootDir)
                except:
                    print "Issue with file: "+ currentFile
                    log=open(rootDir+'\\'+log_name+'issue_xml_log.txt','a')
                    log.write(str(x)+'^'+currentFile+'\n')
                    log.close()  # close must be called; a bare log.close does nothing

    print "finished"
    return xmllist, x, currentFile

def processxml(currentFile,x, rootDir):
    from lxml import etree
    seperator='^'
    with open(currentFile) as f:
        tree = etree.parse(f)
    xmltaglist=[]
    for tagn in tree.iter(tag=None):
        #print tagn.tag
        xmltaglist.append(tagn.tag)
    if 'anzmeta' in str(tree.getroot()):
        log=open(rootDir+'\\'+log_name+'anzmeta_xml_log.txt','a')
        log.write(':;:'+seperator+str(x)+seperator+currentFile+seperator)
        for xmltag in xmltaglist:
            for element in tree.iter(xmltag):
                #print element[x]
                for child in element.getchildren():
                    print "{0.tag}: {0.text}".format(child)
                    log.write("{0.tag}".format(child)+"::"+"{0.text}".format(child)+seperator)
        log.write('\n')
        log.close()  # close must be called; a bare log.close does nothing
    else:
        print currentFile+" not an anzlic metadata file...logging separately"
        log=open(rootDir+'\\'+log_name+'non_anzmeta_xml_log.txt','a')
        log.write(':;:'+seperator+str(x)+seperator+currentFile+seperator)
        for xmltag in xmltaglist:
            for element in tree.iter(xmltag):
                #print element[x]
                for child in element.getchildren():
                    print "{0.tag}: {0.text}".format(child)
                    log.write("{0.tag}".format(child)+"::"+"{0.text}".format(child)+seperator)
        log.write('\n')
        log.close()  # close must be called; a bare log.close does nothing

locatexml(SourceDIR,x, rootDir)

Answer 1:

Found the answer: I simply added .replace('\n','') to the command that writes each entry. Should have thought of this a few hours ago!
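
A minimal sketch of that fix, assuming it is applied to the element text just before it is written. The helper name clean_field is hypothetical. Unlike a plain .replace('\n',''), it replaces each run of whitespace (including '\r' and '\n') with one space, so the words on either side of a line break are not glued together, and it turns a missing text (None) into an empty string. Both choices are assumptions about the desired output, not part of the original answer.

```python
def strip_newlines(value):
    """The answer's literal fix: drop '\\n' characters only."""
    return value.replace('\n', '')


def clean_field(value):
    """Assumed variant: collapse every whitespace run into one space."""
    if value is None:
        return ''  # elements without text would otherwise be written as 'None'
    return ' '.join(value.split())


text = "Vector data for\n   the north\r\nregion"
print(clean_field(text))  # Vector data for the north region
```

In processxml, the write call would then become something like log.write("{0.tag}".format(child) + "::" + clean_field(child.text) + seperator).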

Source: https://stackoverflow.com/questions/15283801/text-file-creation-issue-where-new-lines-created-when-not-really-eol

Tags

python

xml

regex

csv

cygwin