How to split a pandas DataFrame into predefined percentages?

不知归路 2020-12-19 00:38

I have a pandas DataFrame sorted by a number of columns. I'd now like to split it into predefined percentages, so that I can extract and name a few segments.

F

2 Answers
  • 2020-12-19 01:06

    Use numpy.split:

    a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
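The list passed to np.split holds cut positions, not segment sizes: each number is the row at which the next piece starts. A minimal sketch on a plain array (the array and variable names here are only illustrative) shows that behaviour:

```python
import numpy as np

arr = np.arange(10)

# Cut before position 2 and before position 5: three pieces result
first, middle, last = np.split(arr, [2, 5])

print(first)   # [0 1]
print(middle)  # [2 3 4]
print(last)    # [5 6 7 8 9]
```

So [int(.2*len(df)), int(.5*len(df))] yields the first 20%, the next 30%, and the remaining 50% of the rows.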
    

    Sample:

    import numpy as np
    import pandas as pd

    np.random.seed(100)
    df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
    #print (df)
    
    a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
    print (a)
              A         B         C         D         E
    0  0.543405  0.278369  0.424518  0.844776  0.004719
    1  0.121569  0.670749  0.825853  0.136707  0.575093
    2  0.891322  0.209202  0.185328  0.108377  0.219697
    3  0.978624  0.811683  0.171941  0.816225  0.274074
    
    print (b)
              A         B         C         D         E
    4  0.431704  0.940030  0.817649  0.336112  0.175410
    5  0.372832  0.005689  0.252426  0.795663  0.015255
    6  0.598843  0.603805  0.105148  0.381943  0.036476
    7  0.890412  0.980921  0.059942  0.890546  0.576901
    8  0.742480  0.630184  0.581842  0.020439  0.210027
    9  0.544685  0.769115  0.250695  0.285896  0.852395
    
    print (c)
               A         B         C         D         E
    10  0.975006  0.884853  0.359508  0.598859  0.354796
    11  0.340190  0.178081  0.237694  0.044862  0.505431
    12  0.376252  0.592805  0.629942  0.142600  0.933841
    13  0.946380  0.602297  0.387766  0.363188  0.204345
    14  0.276765  0.246536  0.173608  0.966610  0.957013
    15  0.597974  0.731301  0.340385  0.092056  0.463498
    16  0.508699  0.088460  0.528035  0.992158  0.395036
    17  0.335596  0.805451  0.754349  0.313066  0.634037
    18  0.540405  0.296794  0.110788  0.312640  0.456979
    19  0.658940  0.254258  0.641101  0.200124  0.657625
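If you would rather pass the segment sizes themselves (20%, 30%, 50%) than cumulative cut points, the same idea can be written with plain iloc slicing. This is only a sketch: split_by_percentages is a hypothetical helper name, not a pandas or NumPy function, and it assumes the fractions sum to 1. It keeps the rows in their existing order, and it may also be useful if your pandas version warns when np.split is applied to a DataFrame.

```python
from itertools import accumulate

import numpy as np
import pandas as pd


def split_by_percentages(df, fracs):
    """Split df into consecutive, order-preserving pieces sized by fracs.

    Hypothetical helper for illustration; fracs are per-segment fractions.
    """
    if abs(sum(fracs) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    n = len(df)
    # Cumulative fractions give the cut positions; the final one (1.0) is dropped
    cuts = [round(c * n) for c in list(accumulate(fracs))[:-1]]
    bounds = [0] + cuts + [n]
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

a, b, c = split_by_percentages(df, [0.2, 0.3, 0.5])
print(len(a), len(b), len(c))  # 4 6 10
```

Because the pieces are contiguous slices, concatenating them again reproduces the original DataFrame.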
    
  • 2020-12-19 01:12

    I've written a simple function that does the job; it might help you.

    P.S:

    • The fractions must sum to 1.
    • It returns len(fracs) new DataFrames, so the list of fractions can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.3, 0.2]).
    • Rows are assigned by random sampling, so the pieces do not follow the original row order.

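      import math

      import numpy as np
      import pandas as pd

      np.random.seed(100)
      df = pd.DataFrame(np.random.random((99,4)))
      
      def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
          # Compare with a tolerance: float sums are often not exactly 1.0
          assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))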
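          # Track only the index labels that have not been assigned yet
          remain = df.index.copy().to_frame()
          res = []
          for i in range(len(fracs)):
              # Rescale this fraction relative to the share still unassigned
              fractions_sum=sum(fracs[i:])
              frac = fracs[i]/fractions_sum
              # Draw the labels for this piece, then remove them from the pool
              idxs = remain.sample(frac=frac, random_state=random_state).index
              remain=remain.drop(idxs)
              res.append(idxs)
          return [df.loc[idxs] for idxs in res]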
      
      train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]
      
      print(train.shape, test.shape, val.shape)
      

      outputs:

      (79, 4) (10, 4) (10, 4)
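One caveat on the "must sum to 1" rule: fractions are floats, so a list that sums to 1 mathematically may not do so exactly in floating point. A small illustration (the list below is just an example input) of why a tolerance-based comparison such as math.isclose is the safer check:

```python
import math

fracs = [0.1] * 10  # ten equal parts, mathematically summing to 1

print(sum(fracs) == 1.0)              # False: the float sum is 0.9999999999999999
print(math.isclose(sum(fracs), 1.0))  # True: equal within the default tolerance
```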
      