Concatenating data in a dataframe

Question

Hi, the following code only gives me the data from the last file. There is an issue with either the concat or the for loop. I am reading data from 2 files. Each should contain nearly 350 rows and 3 columns that satisfy the condition in the for loop, so the final data frame should be nearly 700 by 3. But it only shows data from the last file.

import glob
import pandas as pd
from pathlib import Path

path = Path(r'C:\Users\PC\Desktop\datafiles')
filenames = path.glob('*.txt')
toconcat = []
for i in filenames:
    data1 = pd.read_csv(i, sep="\t", header=None)
    data1.columns = ['number','ab','cd','as','sd','dfg']
    dataset1 = pd.DataFrame(data1.loc[data1.number==1,['number','ab','cd']])
    toconcat.append(dataset1)

result = pd.concat(toconcat)
result

But when I check result.shape, it shows 700 by 3. What is the issue here?

Answer 1:

I created a somewhat "wider" example that also passes the keys parameter (an "origin marker" for each row).

Source file Input_1.txt:

1   ab1 cd1 as1 sd1 dfg1
1   ab2 cd2 as2 sd2 dfg2
1   ab3 cd3 as3 sd3 dfg3
2   ab4 cd4 as4 sd4 dfg4

Source file Input_2.txt:

1   ab5 cd5 as5 sd5 dfg5
1   ab6 cd6 as6 sd6 dfg6
1   ab7 cd7 as7 sd7 dfg7
2   ab8 cd8 as8 sd8 dfg8

(Both of the above files are tab-separated.)

And the code:

import pandas as pd
from pathlib import Path

toconcat = []
keys = []
path = Path(r'C:\Users\...')  # Replace dots with your path
filenames = path.glob('*.txt')
for i in filenames:
    data1 = pd.read_csv(i, sep='\t', names=['number', 'ab', 'cd', 'as', 'sd', 'dfg'])
    dataset1 = data1.loc[data1.number==1, ['number', 'ab', 'cd']]
    toconcat.append(dataset1)
    keys.append(i.stem)
result = pd.concat(toconcat, keys=keys)
print(result)

Note that the column names can be passed directly to read_csv through the names parameter (as I did), instead of assigning them afterwards.
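
To illustrate the point about column names, here is a minimal sketch comparing the two ways of naming columns. The in-memory text is a hypothetical stand-in for one of the tab-separated files (only two rows, read through io.StringIO), so it can be run without touching the disk.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for one tab-separated input file.
text = "1\tab1\tcd1\tas1\tsd1\tdfg1\n2\tab4\tcd4\tas4\tsd4\tdfg4\n"
cols = ['number', 'ab', 'cd', 'as', 'sd', 'dfg']

# Two-step version (as in the question): read without a header,
# then assign the column names.
two_step = pd.read_csv(io.StringIO(text), sep='\t', header=None)
two_step.columns = cols

# One-step version (as in the answer): pass the names to read_csv.
one_step = pd.read_csv(io.StringIO(text), sep='\t', names=cols)

# Both frames should carry the same column labels and the same data.
print(list(one_step.columns) == list(two_step.columns))
print(one_step.values.tolist() == two_step.values.tolist())
```

Both variants should produce the same frame; passing names is just shorter.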

The result, for my input files, is:

           number   ab   cd
Input_1 0       1  ab1  cd1
        1       1  ab2  cd2
        2       1  ab3  cd3
Input_2 0       1  ab5  cd5
        1       1  ab6  cd6
        2       1  ab7  cd7

So your code looks OK. The only difference in my code is that the result has a MultiIndex whose top level shows the origin of each row, which makes it easier to trace what is going on.
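
As a rough sketch of how that top index level can be used for tracing: the two small frames below are hypothetical stand-ins for the filtered per-file data (the names part_1, part_2 and rows_per_file are mine, not from the question). Grouping on level 0 counts how many rows each file contributed, which is a quick way to check whether the first file really is missing.

```python
import pandas as pd

# Hypothetical stand-ins for the filtered frames read from each file.
part_1 = pd.DataFrame({'number': [1, 1, 1],
                       'ab': ['ab1', 'ab2', 'ab3'],
                       'cd': ['cd1', 'cd2', 'cd3']})
part_2 = pd.DataFrame({'number': [1, 1, 1],
                       'ab': ['ab5', 'ab6', 'ab7'],
                       'cd': ['cd5', 'cd6', 'cd7']})

# keys adds a top index level naming the origin of each row.
result = pd.concat([part_1, part_2], keys=['Input_1', 'Input_2'])

# Number of rows contributed by each source file.
rows_per_file = result.groupby(level=0).size()
print(rows_per_file)

# All rows that came from one particular file.
print(result.loc['Input_2'])
```

If every file shows up in rows_per_file with the expected count, the concatenation itself is working.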

Try my code with my input files first; the result should be just like mine.

Then replace one of my files with one of yours and run the code. After that, replace my second file with your second file as well and run the code again.

Finally, delete the keys parameter to get an ordinary index in the result.
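
A small sketch of what the index looks like without keys, again on hypothetical stand-in frames (part_1, part_2, plain and renumbered are illustrative names). Without keys, each piece keeps its own row labels, so the labels repeat. Passing ignore_index=True is an optional extra, not part of the code above, that renumbers the rows from 0.

```python
import pandas as pd

# Hypothetical stand-ins for the filtered frames read from each file.
part_1 = pd.DataFrame({'number': [1, 1, 1], 'ab': ['ab1', 'ab2', 'ab3']})
part_2 = pd.DataFrame({'number': [1, 1, 1], 'ab': ['ab5', 'ab6', 'ab7']})

# Without keys: an ordinary index, but the original labels repeat.
plain = pd.concat([part_1, part_2])
print(list(plain.index))       # [0, 1, 2, 0, 1, 2]

# Optional: ignore_index=True assigns fresh consecutive labels.
renumbered = pd.concat([part_1, part_2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3, 4, 5]
```

Repeated labels can make a result look as if rows from one file were overwritten, even though all rows are present, so checking result.shape or the renumbered index is a reasonable sanity check.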

The source of your error is probably somewhere else.

By the way: you don't need import glob, since you only use the glob method of the Path object.

Source: https://stackoverflow.com/questions/61650990/concatenation-data-in-dataframe

Tags

python