Variable number of unwanted white spaces resulting into distorted column

Question

Recently, I asked the question "Unwanted white spaces resulting into distorted column". The answer by @sharathnatraj was satisfactory and worked like a charm.

Answer was:

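import io
import re
import pandas as pd
with open('trial1.txt', 'r') as f:
    lines = f.readlines()
# Join two adjacent lowercase words (5+ letters each) so the name stays in one column
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(l)), delim_whitespace=True)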

Sample data set:

    1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937 629.15 1 --- --- --- --- --- --- --- ---
    2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---
    3 CAgNO silver cyanate 3315-16-0 149.883 --- --- --- --- --- --- --- --- --- ---
    4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---
    5 CAgN3O6 silver trinitromethanide 25987-94-4 257.894 370.95 1 --- --- --- --- --- --- --- ---
    6 CAgN3S2 silver azidodithioformate 74093-43-9 226.030 --- --- 1154.15 3 --- --- --- --- --- ---
    7 CAg2Cl3O3P silver trichloromethanephosphonate --- 413.073 --- --- --- --- --- --- --- --- --- ---
    8 CAg2N2 disilver cyanamide --- 255.757 --- --- --- --- --- --- --- --- --- ---
    9 CAg2O3 silver carbonate 534-16-7 275.741 487.15 1 --- --- --- --- --- --- --- ---
    10 CAsCl2F3 dichloro-trifluoro-methyl-arsine 421-32-9 214.833 --- --- 353.30 3 --- --- --- --- --- ---
    11 CAuN gold-i- cyanide 506-65-0 222.985 --- --- --- --- --- --- --- --- --- ---
    12 CB4 boron carbide 12069-32-8 55.255 2623.15 1 3773.15 3 --- --- --- --- --- ---
    13 CBaO3 barium carbonate 513-77-9 197.336 811.00 1 1723.15 3 --- --- --- --- --- ---
    14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2
    15 CBrClN2O4 bromochlorodinitromethane 33829-48-0 219.379 282.45 1 --- --- 20 2.3040 3 25 1.5710 2
    16 CBrCl2F bromodichlorofluoromethane 353-58-2 181.819 113.65 1 325.90 1 25 1.6960 3 25 1.5755 2
    17 CBrCl3 bromotrichloromethane 75-62-7 198.273 252.15 1 376.65 1 25 1.9940 1 25 1.5060 2
    18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2

However, I realised that the above solution only works when the chemical name is split into 2 space-separated parts. When it is split into more than 2 parts (for example row 18), the columns are distorted.

Thus, I tried modifying it as below, but it is not working:

l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})", r"\1\2\3", line) for line in lines]

With this change, row 18 is fixed, but other rows (e.g. 1 to 5) are distorted.
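The behaviour of the two patterns can be checked on single lines. The following is only a small sketch: `row_1` and `row_18` are shortened copies of rows 1 and 18 from the sample, and the variable names are made up for illustration.

```python
import re

two_words = r"([a-z]{5,})\s([a-z]{5,})"
three_words = r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})"

# Shortened copies of sample rows 1 and 18 (trailing columns dropped)
row_1 = "1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937"
row_18 = "18 CBrFO carbonic bromide fluoride 753-56-0 126.913"

# The two-word pattern joins only the first two words of a three-word name
row_18_two = re.sub(two_words, r"\1\2", row_18)
# The three-word pattern needs three words, so a two-word name is left untouched
row_1_three = re.sub(three_words, r"\1\2\3", row_1)

print(row_18_two)
print(row_1_three)
```

So each pattern handles exactly one word count, which is why fixing row 18 breaks the two-word rows.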

In my dataset, there are rows where chemical names have up to 4 spaces (not shown here).

Thus, I was wondering if there is any fix to this problem.

Answer 1:

So it seems the name column should collect all strings until we reach something that looks like a number or a run of minus signs. My approach would be this:

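import re
import pandas as pd

# A column that starts with a digit or a minus sign ends the chemical name
numeric = re.compile(r"[0-9-]+")
sep = "|"

if __name__ == "__main__":
    with open('trial1.txt', 'r') as f:
        with open('tmp.txt', 'w') as tmp_file:
            # Line 1 is treated as a header and written without merging
            for line_no, line in enumerate(f, start=1):
                # split() without arguments ignores repeated blanks and the trailing newline
                raw_cols = line.split()
                fixed_cols = []
                merging = False

                for i, col in enumerate(raw_cols):
                    if numeric.match(col):
                        merging = False
                    if merging:
                        # Still inside the name: append to the third column
                        fixed_cols[2] += " " + col
                    else:
                        fixed_cols.append(col)

                    # Index 2 holds the first word of the name; merge what follows
                    if i == 2 and line_no > 1:
                        merging = True

                tmp_file.write(sep.join(fixed_cols) + "\n")

    df = pd.read_csv("tmp.txt", sep=sep)

    print(df)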

I assume there are no pipe | symbols in the file. The intermediate result is stored in the file tmp.txt. When merging the columns, I put a blank between the parts: fixed_cols[2] += " " + col.
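The same idea can also be tried on a single line without a temporary file. This is a minimal sketch, not the code above: `split_row` and `NUMERIC` are hypothetical names, and the pattern assumes that a column made only of digits, dots and hyphens (optionally starting with `>`) marks the end of the name. A name containing such a token would be cut short.

```python
import re

# Assumption: columns such as "506-64-9", "133.884", ">573.15" or "---" end the name
NUMERIC = re.compile(r">?[0-9.\-]+")

def split_row(line, name_index=2):
    """Split a line on whitespace and merge the words of the name into one field."""
    tokens = line.split()
    head, rest = tokens[:name_index], tokens[name_index:]
    # The first word always belongs to the name; extend until a numeric-looking token
    end = 1
    while end < len(rest) and not NUMERIC.fullmatch(rest[end]):
        end += 1
    return head + [" ".join(rest[:end])] + rest[end:]

row_4 = "4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---"
row_18 = "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2"

for row in (row_4, row_18):
    print("|".join(split_row(row)))
```

Both sample rows come out with the same number of fields, with the full name in the third one.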

Answer 2:

You can try the following solution, similar to the 2nd answer to this question (that one was mine too):

Unwanted white spaces resulting into distorted column

import pandas as pd

with open('trial1.txt') as f:
    l=f.readlines()

l=[i.split() for i in l]
target=len(l[1]) # expected number of columns, taken from the second line of the file
for i in range(1,len(l)):
    if len(l[i])>target:
        # one column too many: join the two halves of the name
        l[i][2]=l[i][2]+' '+l[i][3]
        l[i].pop(3)
l=['#'.join(k) for k in l] # assuming there is no '#' anywhere in your file; otherwise use some other rare symbol that doesn't exist in your file
l=[i+'\n' for i in l]

with open('trial2.txt', 'w') as f:
    f.writelines(l)

df = pd.read_csv('trial2.txt', sep='#', index_col=0)

Some additional notes:

Take care with the target. I took the length of the first data row (l[1]) as the correct length. If that row is not correct, you must use some other row or, even better, assign target manually. In your case it is 14.
If extra spaces split the element of the 3rd column into more than 2 columns, you can extend the same logic as:
```
 if len(l[i])>target:
     l[i][2]=l[i][2]+' '+l[i][3]
     l[i].pop(3)
```

For example, if the length is 16, which means that the 3rd column is split into 3 parts, you can use this:

if len(l[i])==16:
    l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
    l[i].pop(4)
    l[i].pop(3)

And combine all of these in one if statement, like below:

if len(l[i])==16:
    l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
    l[i].pop(4)
    l[i].pop(3)
elif len(l[i])==15:
    l[i][2]=l[i][2]+' '+l[i][3]
    l[i].pop(3)

You can add as many branches as you need above this code, for length==17, length==18, etc.
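Instead of one branch per length, the merge could also be written once for any number of surplus columns. This is only a sketch under assumptions: `merge_name` is a hypothetical helper, and the target is taken as the length of the shortest row, which is correct only if at least one row has a one-word name. Assign target manually if that does not hold for your file.

```python
def merge_name(tokens, target, name_index=2):
    """Join the surplus tokens starting at name_index so the row has target fields."""
    extra = len(tokens) - target
    if extra <= 0:
        return list(tokens)  # already the right length
    end = name_index + extra + 1
    return tokens[:name_index] + [" ".join(tokens[name_index:end])] + tokens[end:]

# Rows 14 and 18 from the sample data
rows = [
    "14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2",
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2",
]
split_rows = [r.split() for r in rows]
# Assumption: the shortest row has a one-word name, so its length is the target
target = min(len(r) for r in split_rows)
fixed = [merge_name(r, target) for r in split_rows]

for r in fixed:
    print("#".join(r))
```

The joined lines can then be written to a file and read with sep='#' exactly as in the code above.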

Source: https://stackoverflow.com/questions/65456221/variable-number-of-unwanted-white-spaces-resulting-into-distorted-column

Tags

python

pandas

string

dataframe

csv