How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

问题

I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.

Sample Data :

Name        Degree      CLASS       CODE        EDU     Scores
--------------------------------------------------------------------------------------
John Marshall       CSC   78659944   89989        BE   900
Think Code DB I10   MSC  87782  1231  MS            878
Mary 200 Jones    CIVIL      98993483  32985        BE       898
John G. S  Mech 7653 54 MS 65
Silent Ghost  Python Ninja 788505  88448  MS Comp  887

Conditions :

More than one spaces should be compressed to a delimiter (pipe better? End goal is to store these files in the database).
Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and some alphabets.
First and second columns are both strings. They almost always have more than one spaces between them, so that is how we can differentiate between the 2 columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!).
The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.

Hope I made sense! I have a feeling that this task can be done in a oneliner. I don't want to loop, loop, loop :(

Muchos gracias "Pythonistas" for reading all the way and not quitting before this sentence!

回答1:

It still seems tome that there's some format in your files:

>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
    lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
    print(lst)


[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']

Regex is quite straightforward, the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in case of the first delimiter. Note that when the line wouldn't match you get an empty list as lst. That would be a read flag to bring up the user interaction described below. Also you could skip the header lines by doing:

>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
    ...  # here empty lst indicates issues with regex

Previous variants:

>>> import re
>>> for line in open(fname):
    lst = re.split(r'\s{2,}', line)
    l = len(lst)
    if l in (2,3):
        lst[l-1:] = lst[l-1].split()
    print(lst)

['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']

another thing to do is simply allow user to decide what to do with questionable entries:

if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes that for elements of name: ')     # use raw_input in py2k
    idegr = input('enter indexes that for elements of degree: ')

Uhm, I was all the time under the impression that the second element might contain spaces, since it's not the case you could just do:

>>> for line in open(fname):
    name, _, rest = line.partition('  ')
    lst = [name] + rest.split()
    print(lst)

回答2:

Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.

import re

for line in open(fname):
    name, rest = re.split('\s{2,}', line, maxsplit=1)
    print [name] + rest.split()

回答3:

This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).

Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.

Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().

Docs for expandtabs are here.

If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programatically (if that is possible) -- maybe from the heading line.

Are you sure that the second line is all "-", or are there spaces between the columns? The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:

RecordType  ID1                  ID2         Description           
----------- -------------------- ----------- ----------------------
1           12345678             123456      Widget                
4           87654321             654321      Gizmoid

and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:

sizes = map(len, dash_line.split())

If expandtabs() doesn't work, edit your question to show exactly what you do have i.e. show the result of print repr(line) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.

来源：https://stackoverflow.com/questions/3874117/how-to-strip-variable-spaces-in-each-line-of-a-text-file-based-on-special-condit

标签

python

parsing

formatting

text-parsing

delimiter