Split DataFrame Randomly (dependent on unique values)

Question

I have a DataFrame df that looks like this:

|  A    |  B  | ... |
---------------------
| one   | ... | ... |
| one   | ... | ... |
| one   | ... | ... |
| two   | ... | ... |
| three | ... | ... |
| three | ... | ... |
| four  | ... | ... |
| five  | ... | ... |
| five  | ... | ... |

As you can see, column A has 5 unique values. I want to split the DataFrame randomly by those values: for example, 3 unique values in DataFrame df1 and 2 unique values in DataFrame df2. My problem is that the values repeat across rows, and I don't want the rows belonging to one value to be split across the two DataFrames.

So the resulting DataFrame could look like this:

DataFrame df1 with 3 unique values:

|  A    |  B  | ... |
---------------------
| one   | ... | ... |
| one   | ... | ... |
| one   | ... | ... |
| three | ... | ... |
| three | ... | ... |
| five  | ... | ... |
| five  | ... | ... |

DataFrame df2 with 2 unique values:

|  A    |  B  | ... |
---------------------
| two   | ... | ... |
| four  | ... | ... |

Is there an easy way to achieve this? I thought about grouping, but I'm not sure how to split from there...

Answer 1:

Setup

import pandas as pd

# Same data as in the question: column A repeats its keys, column B is a row counter.
df = pd.DataFrame({
    'A': ['one', 'one', 'one', 'two', 'three', 'three', 'four', 'five', 'five'],
    'B': [0, 1, 2, 3, 4, 5, 6, 7, 8],
})

Solution

# Pick 2 unique keys from column A for df1. You can control the split either
# by an absolute number of keys (n=) or by a fraction (frac=); check the docs for .sample().
df1_keys = df.A.drop_duplicates().sample(2)
df1 = df[df.A.isin(df1_keys)]
# Every row whose key is not in df1_keys is assigned to df2.
df2 = df[~df.A.isin(df1_keys)]

df1_keys
Out[294]: 
7    five
0     one
Name: A, dtype: object

df1
Out[295]: 
      A  B
0   one  0
1   one  1
2   one  2
7  five  7
8  five  8

df2
Out[296]: 
       A  B
3    two  3
4  three  4
5  three  5
6   four  6
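
The question asks for 3 unique values in df1 rather than 2, and the run above gives a different result each time. A minimal sketch of the same approach with an explicit key count and a fixed seed follows; the seed value 0 and the names keys and in_df1 are arbitrary choices for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'one', 'two', 'three', 'three', 'four', 'five', 'five'],
    'B': list(range(9)),
})

# One entry per distinct key in column A.
keys = df['A'].drop_duplicates()

# n=3 draws exactly three keys; passing frac=... instead would draw a share of the keys.
# random_state=0 is an arbitrary seed used only to make the draw repeatable.
df1_keys = keys.sample(n=3, random_state=0)

# Build the boolean mask once and reuse it for both halves.
in_df1 = df['A'].isin(df1_keys)
df1 = df[in_df1]
df2 = df[~in_df1]
```

Because the mask is computed once and negated for df2, every row lands in exactly one of the two DataFrames and no key is shared between them.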

Answer 2:

import numpy as np

v = df['A'].unique()  # Get the unique values
np.random.shuffle(v)  # Shuffle them in place
v1, v2 = np.array_split(v, 2)  # Split the unique values into two arrays (3 and 2 values here)

Finally, index your dataframe using the .isin() method to get the desired result.

r1 = df[df['A'].isin(v1)]
r2 = df[df['A'].isin(v2)]
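
If the split should not be as even as possible, np.array_split also accepts a list of split indices. The sketch below wraps the shuffle-and-split steps in a small function; split_by_key, n_first and seed are hypothetical names introduced here, and the seeded generator is an assumption added for repeatability.

```python
import numpy as np
import pandas as pd


def split_by_key(df, column, n_first, seed=None):
    """Hypothetical helper: put n_first randomly chosen keys of `column`
    into the first frame and all remaining keys into the second."""
    rng = np.random.default_rng(seed)
    # Shuffled copy of the distinct keys.
    keys = rng.permutation(df[column].drop_duplicates().to_numpy())
    # A list argument means "split at these indices": keys[:n_first], keys[n_first:].
    first_keys, second_keys = np.array_split(keys, [n_first])
    return df[df[column].isin(first_keys)], df[df[column].isin(second_keys)]


df = pd.DataFrame({
    'A': ['one', 'one', 'one', 'two', 'three', 'three', 'four', 'five', 'five'],
    'B': list(range(9)),
})

# seed=42 is arbitrary; omit it to get a different split on every call.
r1, r2 = split_by_key(df, 'A', n_first=3, seed=42)
```

Passing a different n_first changes how many unique values end up in the first DataFrame, while rows sharing a key always stay together.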

Source: https://stackoverflow.com/questions/44821090/split-dataframe-randomly-dependent-on-unique-values

Tags

python

pandas