Reading files with multiple delimiters in column headers and skipping some rows at the end

Question

I am new to Python and would like to use pandas to read my data. I have searched and tried to solve this myself, but I am still struggling. Thanks in advance for your help!

I have a file, a.txt, that looks like this:

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

END***
Some other data starts from here

The first task is to assign A1, A2, A3, A4, A5, A6, A7, A8 and A9 as column names. However, the header line mixes several separators (' ', '|', '#' and ','), which makes it awkward to specify a separator when reading the file. I tried this:

import pandas as pd
import glob
filelist=glob.glob('*.txt')
print(filelist)

df = pd.read_csv(filelist,skiprows=1,skipfooter=2,skipinitialspace=True, header=0, sep=r'\| |,|#',engine='python')

But nothing seems to have happened when I check df in Spyder's data explorer.

The second task is to drop, while reading, everything from the END*** row onward, since I don't need it. The header always has the same length, but skipfooter needs the number of lines to skip, and that number differs from file to file.

Several similar questions have already been asked, but I can't make their answers work for my case:

how-to-read-txt-file-in-pandas-with-multiple-delimiters

pandas-read-delimited-file?

import-text-to-pandas-with-multiple-delimiters

pandas-ignore-all-lines-following-a-specific-string-when-reading-a-file-into-a

EDIT: about removing the data that starts at the END row

Suppose the file b.txt looks like this:

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

END123
Some other data starts from here

and I use the second solution below:

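import io
import re
import pandas as pd

# Keep only the text before the first line that starts with END.
txt = open('b.txt').read().split('\nEND')[0]
# Discard the first line, take the header line, keep the rest as data.
_, h, txt = txt.split('\n', 2)
pat = r'[|,# ]+'  # any run of '|', ',', '#' or spaces
names = re.split(pat, h.strip())

pd.read_csv(
    io.StringIO(txt),
    names=names, header=None,
    engine='python')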

I get this:

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9

Answer 1:

Split the file at the END*** marker, then read from the resulting string:

import io
import pandas as pd

# Keep only the text before the END*** line, then parse it from memory.
txt = open('test.txt').read().split('\nEND***')[0]
pd.read_csv(
    io.StringIO(txt),
    sep=r'\W+',
    skiprows=1, engine='python')

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9

Alternatively, we can parse the header explicitly and read the rest of the file as plain CSV:

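import io
import re
import pandas as pd

# Keep only the text before the END*** line.
txt = open('test.txt').read().split('\nEND***')[0]
# Discard the first line, take the header line, keep the rest as data.
_, h, txt = txt.split('\n', 2)
pat = r'[|,# ]+'  # any run of '|', ',', '#' or spaces
names = re.split(pat, h.strip())

pd.read_csv(
    io.StringIO(txt),
    names=names, header=None,
    engine='python')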

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9
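
The second approach can be wrapped in a small helper so that it works for any file where the trailing block starts with END, whatever follows the marker. This is only a sketch: the function name read_until_marker and its marker and skip_lines parameters are made up for illustration, and it assumes one line to skip before the header, as in the sample files.

```python
import io
import re

import pandas as pd


def read_until_marker(path, marker="END", skip_lines=1):
    """Skip leading lines, parse the mixed-delimiter header, stop at the marker line."""
    with open(path) as fh:
        lines = fh.read().splitlines()

    # Drop the lines that precede the header (one line, "skip1", in the sample).
    lines = lines[skip_lines:]

    # Keep everything before the first line that starts with the marker.
    body = []
    for line in lines:
        if line.startswith(marker):
            break
        body.append(line)

    # The header mixes '|', '#', ',' and whitespace; the data rows are plain CSV.
    names = re.split(r"[|#,\s]+", body[0].strip())
    data = "\n".join(body[1:])
    return pd.read_csv(io.StringIO(data), names=names, header=None)
```

pd.read_csv takes a single path or buffer rather than a list, so for several files the helper would be called once per path, for example frames = {p: read_until_marker(p) for p in glob.glob('*.txt')}.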

Answer 2:

Answering the first question:

In [182]: df = pd.read_csv(filename, sep=r'\s*(?:\||\#|\,)\s*', 
                           skiprows=1, engine='python')

In [183]: df
Out[183]:
   A1  A2  A3  A4  A5  A6 A7  A8  A9
1   2   3   4   5   6      7   8   9
1   2   3   4   5   6      7   8   9
1   2   3   4   5   6      7   8   9
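
Note that in the output above "A6 A7" stays a single column name, because the separator pattern does not treat a bare space as a delimiter. That leaves eight names for nine data fields, which appears to be why the first value of each row ends up as the index. A quick way to check a header pattern is to run it through re.split on the header line alone; the relaxed pattern below is a hypothetical variant, not part of the original answer.

```python
import re

header = " A1| A2 |A3 |A4# A5# A6 A7| A8 , A9"

# Pattern from the answer: '|', '#' or ',' with optional surrounding whitespace.
strict = r"\s*(?:\||\#|\,)\s*"
# Hypothetical variant: additionally accept a run of whitespace on its own.
relaxed = r"\s*[|#,]\s*|\s+"

strict_names = re.split(strict, header.strip())
relaxed_names = re.split(relaxed, header.strip())

print(len(strict_names), strict_names)    # "A6 A7" remains one name
print(len(relaxed_names), relaxed_names)  # "A6" and "A7" are separated
```

Because the data rows are separated by commas only, the relaxed pattern should leave them unaffected, but that is worth confirming on the real files.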

Source: https://stackoverflow.com/questions/45695040/reading-files-with-multiple-delimiter-in-column-headers-and-skipping-some-rows-a

Tags

python

pandas

delimiter

delimited-text