How to split a DataFrame in pandas in predefined percentages?


I have a pandas DataFrame sorted by a number of columns. Now I'd like to split the DataFrame into predefined percentages, so as to extract and name a few segments.

2 answers

醉话见心

2020-12-19 01:06

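Use numpy.split, passing the row positions at which to cut (here 20% and 50% of the way through, which gives segments of 20%, 30% and 50%):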

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
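The same cut can be written with plain positional slicing, which may be useful because np.split applied to a DataFrame has been reported to emit a FutureWarning in recent pandas releases. The following is a minimal sketch, not part of the original answer: the helper name split_by_percentages is hypothetical, and it rounds the cut positions to the nearest row rather than truncating with int().

```python
import numpy as np
import pandas as pd


def split_by_percentages(df, fractions):
    """Split df row-wise, in its existing order, at cumulative fractions.

    `fractions` gives the size of every segment except the last one,
    which receives whatever rows remain. (Hypothetical helper.)
    """
    # Turn cumulative fractions into integer row positions.
    # Rounding (instead of int()) avoids losing a row to float error.
    cuts = np.rint(np.cumsum(fractions) * len(df)).astype(int)
    bounds = [0, *cuts, len(df)]
    # Slice consecutive, non-overlapping row ranges by position.
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

# 20% / 30% / remaining 50%, the same cut points as in the sample above
a, b, c = split_by_percentages(df, [0.2, 0.3])
print(len(a), len(b), len(c))  # 4 6 10
```

Because the slices are taken in order, the sort order of the original DataFrame is preserved inside each segment.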

长发绾君心

2020-12-19 01:12

I've written a simple function that does the job; it might help you.

P.S.:

The fractions must sum to 1.

It returns len(fracs) new DataFrames, with rows assigned to each part by random sampling, so the fractions list can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).

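import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))

def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
    # compare with a tolerance: float sums are not always exact,
    # e.g. sum([0.1]*10) is 0.9999999999999999
    assert np.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))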
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        fractions_sum=sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain=remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]

print(train.shape, test.shape, val.shape)

outputs:

(79, 4) (10, 4) (10, 4)
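A shorter route to a random split is to shuffle the rows once and then cut at cumulative positions. This is an alternative sketch rather than the function above: the name shuffle_split is hypothetical, and because part sizes come from rounding the cut positions, they can differ by a row from those of the sampling approach for some inputs.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, fracs, random_state=42):
    """Randomly partition df into len(fracs) parts sized by fracs.

    (Hypothetical helper, shown as an alternative to repeated sampling.)
    """
    if not np.isclose(sum(fracs), 1.0):
        raise ValueError(f"fractions must sum to 1, got {sum(fracs)}")
    # Shuffle all rows once; frac=1 returns every row in random order.
    shuffled = df.sample(frac=1, random_state=random_state)
    # Cut positions for every boundary except the final one.
    cuts = np.rint(np.cumsum(fracs)[:-1] * len(df)).astype(int)
    bounds = [0, *cuts, len(df)]
    return [shuffled.iloc[i:j] for i, j in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))

train, test, val = shuffle_split(df, [0.8, 0.1, 0.1])
print(train.shape, test.shape, val.shape)  # (79, 4) (10, 4) (10, 4)
```

Every row lands in exactly one part, since the parts are consecutive slices of a single shuffled copy.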
