Question
I was calling .sample with random_state set to a constant, and after I used set_index it started selecting different rows. A member that was previously in the subset was dropped. I'm unsure how the seed determines which rows are selected. Is this expected, or did something go wrong?
Here is what I did:
df.set_index('id',inplace=True, verify_integrity=True)
df_small_F = df.loc[df['gender']=='F'].apply(lambda x: x.sample(n=30000, random_state=47))
df_small_M = df.loc[df['gender']=='M'].apply(lambda x: x.sample(n=30000, random_state=46))
df_small=pd.concat([df_small_F,df_small_M],verify_integrity=True)
When I sort df_small by index and print it, the rows differ from the ones I got before.
Answer 1:
Applying .sort_index() after reading in the data and before calling .sample() corrected the issue. As long as the data remains the same, this produces the same sample every time.
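A minimal sketch of why sorting helps, using made-up toy data (the frame, its size, and the seeds are assumptions for illustration, not the asker's data). The same rows stored in a different physical order are sampled with the same seed, once as-is and once after .sort_index():

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1,000 unique ids with a gender column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.arange(1000),
                   "gender": rng.choice(["M", "F"], 1000)}).set_index("id")

# The same rows, but stored in a different physical order.
shuffled = df.sample(frac=1, random_state=1)

a = df.sample(n=5, random_state=47)                     # original order
b = shuffled.sample(n=5, random_state=47)               # different order, same seed
c = shuffled.sort_index().sample(n=5, random_state=47)  # order restored first

# Compare which ids were picked in each case.
same_without_sort = a.index.equals(b.index)
same_with_sort = a.index.equals(c.index)
print(same_without_sort, same_with_sort)
```

Because the index is unique, sorting restores one canonical row order, so the seed maps to the same ids again. Without the sort, the seed still picks the same positions, but those positions now hold different ids.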
Answer 2:
When sampling rows without weights, the only things that matter are the length of the DataFrame, n (the number of rows to draw), and whether you sample with replacement. Together with the seed, these determine the .iloc positions to take, regardless of the data or the index labels.
For rows, sampling works as follows:
axis_length = self.shape[0] # DataFrame length
rs = pd.core.common.random_state(random_state)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights) # np.random.choice
return self.take(locs, axis=axis, is_copy=False)
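The excerpt above is internal pandas code and does not run on its own. A rough equivalent using only public APIs might look like the sketch below. It assumes an integer seed is turned into an np.random.RandomState, as the excerpt suggests. The helper name sample_by_position and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def sample_by_position(frame, n, seed):
    """Hypothetical helper: draw n row positions from the seed, then take them."""
    rs = np.random.RandomState(seed)
    # Only len(frame), n and the seed influence the positions drawn here.
    locs = rs.choice(len(frame), size=n, replace=False)
    return frame.iloc[locs]

# Toy frame with string labels, to show that labels play no role in the draw.
df = pd.DataFrame({"value": np.arange(50) * 10},
                  index=[f"row{i}" for i in range(50)])

manual = sample_by_position(df, n=5, seed=47)
builtin = df.sample(n=5, random_state=47)
matches = manual.equals(builtin)
print(matches)
```

If the assumption holds for your pandas version, the two results match, which confirms that reordering rows changes which labels are sampled even though the seed is fixed.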
Just to illustrate the point
Sample Data
import pandas as pd
import numpy as np
n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
# Same ids, but a constant gender (scalar, broadcast to every row) and a non-unique string index
df1 = pd.DataFrame({'id': list(range(n)), 'gender': 'M'},
                   index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)
For this seed and length, sampling will always choose position 42083 (an integer array index, i.e. df.iloc[42083]):
df.sample(n=1, random_state=123)
# id gender
#42083 42083 M
df1.sample(n=1, random_state=123)
# id gender blah
#foo 42083 M 1
df1.reset_index().shift(10).sample(n=1, random_state=123)
# index id gender blah
#42083 nan 42073.0 M 1.0
NumPy on its own gives the same position:
np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
Source: https://stackoverflow.com/questions/55360354/random-seed-chose-different-rows