Question
I have a DataFrame df that looks like this:
| A | B | ... |
---------------------
| one | ... | ... |
| one | ... | ... |
| one | ... | ... |
| two | ... | ... |
| three | ... | ... |
| three | ... | ... |
| four | ... | ... |
| five | ... | ... |
| five | ... | ... |
As you can see, column A has 5 unique values. I want to split the DataFrame randomly: for example, 3 unique values in DataFrame df1 and 2 unique values in DataFrame df2. My problem is that the values in A are repeated across rows, and I don't want the rows that share one value to be split over the two DataFrames.
So the resulting DataFrames could look like this:
DataFrame df1 with 3 unique values:
| A | B | ... |
---------------------
| one | ... | ... |
| one | ... | ... |
| one | ... | ... |
| three | ... | ... |
| three | ... | ... |
| five | ... | ... |
| five | ... | ... |
DataFrame df2 with 2 unique values:
| A | B | ... |
---------------------
| two | ... | ... |
| four | ... | ... |
Is there an easy way to achieve this? I thought about grouping, but I'm not sure how to split from there...
Answer 1:
Setup
import pandas as pd

df = pd.DataFrame({'A': {0: 'one',
1: 'one',
2: 'one',
3: 'two',
4: 'three',
5: 'three',
6: 'four',
7: 'five',
8: 'five'},
'B': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})
Solution
# Pick 2 unique keys from column A for df1. You can control the split either
# by an absolute number of keys (n) or by a fraction of them (frac). Check the docs for .sample().
df1_keys = df.A.drop_duplicates().sample(2)
df1 = df[df.A.isin(df1_keys)]
# Every row whose key is not in df1_keys is assigned to df2
df2 = df[~df.A.isin(df1_keys)]
df1_keys
Out[294]:
7 five
0 one
Name: A, dtype: object
df1
Out[295]:
A B
0 one 0
1 one 1
2 one 2
7 five 7
8 five 8
df2
Out[296]:
A B
3 two 3
4 three 4
5 three 5
6 four 6
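The comment above mentions splitting by a percentage instead of an absolute number. A minimal sketch of that variant, assuming the same example data; the 0.6 fraction (3 of the 5 keys) and the random_state seed are illustrative choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "three", "three", "four", "five", "five"],
    "B": range(9),
})

# Sample 60% of the unique keys (3 of the 5 here).
# random_state is an arbitrary seed so the split can be reproduced.
df1_keys = df["A"].drop_duplicates().sample(frac=0.6, random_state=0)

# Build the mask once and reuse it for both halves.
in_df1 = df["A"].isin(df1_keys)
df1 = df[in_df1]
df2 = df[~in_df1]
```

Because the sampling happens on the de-duplicated keys rather than on the rows, all rows that share a value of A always end up in the same DataFrame.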
Answer 2:
import numpy as np

v = df['A'].unique()           # Get the unique values of column A
np.random.shuffle(v)           # Shuffle them in place
v1, v2 = np.array_split(v, 2)  # Split the unique values into two arrays
Finally, index your dataframe using the .isin() method to get the desired result.
r1 = df[df['A'].isin(v1)]
r2 = df[df['A'].isin(v2)]
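np.array_split with 2 sections always produces near-equal halves (3 and 2 values for the 5 keys here). If the size of the first group needs to be chosen explicitly, the same shuffle-then-slice idea could be wrapped as below. This is only a sketch: split_by_unique, n_first, and seed are hypothetical names introduced for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def split_by_unique(df, column, n_first, seed=None):
    """Hypothetical helper: split df so that n_first randomly chosen
    unique values of `column` go to the first frame and the rest to the second."""
    rng = np.random.default_rng(seed)
    # permutation() returns a shuffled copy, so the input is left untouched
    shuffled = rng.permutation(df[column].drop_duplicates().to_numpy())
    in_first = df[column].isin(shuffled[:n_first])
    return df[in_first], df[~in_first]


df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "three", "three", "four", "five", "five"],
    "B": range(9),
})

# 3 unique values in r1, the remaining 2 in r2; the seed value is arbitrary
r1, r2 = split_by_unique(df, "A", n_first=3, seed=42)
```

Passing a seed makes the split repeatable; leaving it as None gives a different split on each call.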
Source: https://stackoverflow.com/questions/44821090/split-dataframe-randomly-dependent-on-unique-values