I have a pandas DataFrame sorted by a number of columns. Now I'd like to split the DataFrame into predefined percentages, so as to extract and name a few segments.
Use numpy.split:
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
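import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))
# Cut at 20% and 50% of the rows: a = first 20%, b = next 30%, c = remaining 50%
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])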
print(a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074
print(b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395
print(c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
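As far as I know, recent pandas releases emit a FutureWarning when np.split is given a DataFrame, so a positional split with iloc may be a safer way to get the same ordered segments. The following is a minimal sketch under that assumption; the helper name split_at_fractions and its cut_fractions parameter are hypothetical, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

def split_at_fractions(df, cut_fractions):
    """Split df row-wise, in its existing order, at cumulative fractions."""
    n = len(df)
    # Turn cumulative fractions into integer row positions (20 rows: 0.2 -> 4, 0.5 -> 10)
    bounds = [0] + [int(f * n) for f in cut_fractions] + [n]
    # Slice consecutive [start, stop) ranges by position; row order is preserved
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))
a, b, c = split_at_fractions(df, [0.2, 0.5])
print(len(a), len(b), len(c))  # 4 6 10
```

Like np.split, this takes cumulative cut points rather than segment sizes, so [0.2, 0.5] yields segments of 20%, 30% and 50%.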
I've written a simple function that does the job; it might help you.
It returns len(fracs) new DataFrames, drawn by random sampling without overlap, so the fractions list can be as long as you want, provided it sums to 1 (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))

def split_by_fractions(df: pd.DataFrame, fracs: list, random_state: int = 42):
    # Compare with a tolerance: a sum of floats may not be exactly 1.0
    assert np.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()  # rows not yet assigned to any split
    res = []
    for i in range(len(fracs)):
        # Rescale the requested fraction relative to the rows still unassigned
        fractions_sum = sum(fracs[i:])
        frac = fracs[i] / fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train, test, val = split_by_fractions(df, [0.8, 0.1, 0.1])  # e.g. [train, test, validation]
print(train.shape, test.shape, val.shape)
outputs:
(79, 4) (10, 4) (10, 4)
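A random split like this can also be sketched by shuffling the rows once and then cutting at the cumulative fractions, which avoids the repeated sample-and-drop loop. This is only an alternative sketch, not the function above; the name shuffle_and_split is hypothetical, and rounding the cut positions is an assumption about how leftover rows should be distributed.

```python
import numpy as np
import pandas as pd

def shuffle_and_split(df, fracs, random_state=42):
    """Shuffle the rows once, then cut at the cumulative fractions."""
    if not np.isclose(sum(fracs), 1.0):
        raise ValueError('fractions must sum to 1.0, got {}'.format(sum(fracs)))
    shuffled = df.sample(frac=1, random_state=random_state)
    n = len(shuffled)
    # Cumulative fractions -> integer cut positions; the final cut is dropped
    # so the last part takes whatever rows remain
    cuts = (np.cumsum(fracs)[:-1] * n).round().astype(int)
    bounds = [0, *cuts, n]
    return [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))
parts = shuffle_and_split(df, [0.8, 0.1, 0.1])
print([len(p) for p in parts])  # [79, 10, 10]
```

Every row lands in exactly one part, but the parts are in shuffled order, so this suits train/test style splits rather than segmenting an already sorted DataFrame.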