Question
I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is quite sparse with respect to NaNs: there are several hundred events, and most users were invited to at most a few dozen.
I created some extra columns to measure "success", which I define simply as the % attended relative to invites:
my_data['invited'] = my_data.count(axis=1)                      # non-NaN entries per user
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']  # sum(axis=1) now includes 'invited', so subtract it back out
my_data['success'] = my_data['attended'] / my_data['invited']
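To make the computation concrete, here is a minimal runnable sketch on a hypothetical 3-user × 4-event toy matrix (the users and events are made up). Note that column order matters: by the time 'attended' is computed, sum(axis=1) also picks up the freshly added 'invited' column, which is why it is subtracted:

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: 3 users x 4 events; NaN = not invited
my_data = pd.DataFrame(
    [[1, 0, np.nan, 1],
     [np.nan, 1, 1, np.nan],
     [0, np.nan, np.nan, 1]],
    index=['u1', 'u2', 'u3'],
    columns=['e1', 'e2', 'e3', 'e4'],
)

my_data['invited'] = my_data.count(axis=1)                      # non-NaN entries per user
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']  # row sum includes 'invited'
my_data['success'] = my_data['attended'] / my_data['invited']

print(my_data[['invited', 'attended', 'success']])
```

For u1 this gives invited=3, attended=2, success=2/3, and so on.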
Assume the following is true: the success data should be normally distributed with mean 0.80 and s.d. 0.10. But when I look at the histogram of my_data['success'], it is not normal and is skewed left. Whether this assumption holds in reality is not important; I just want to solve the technical problem I pose below.
So this is my problem: there are some events which I don't think are "good", in the sense that they make the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them that makes the distribution of my_data['success'] as close to normal as possible, in the sense of "convergence in distribution".
I looked at the scikit-learn "feature selection" methods here, and "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.
Constraints: I need to keep at least half the original events.
Any help would be greatly appreciated. Please share as many details as you can, I am very new to these libraries and would love to see how to do this with my DataFrame.
Thanks!
EDIT: After looking some more at the scikit-learn feature selection approaches, "Recursive feature elimination" seems like it might make sense here too, but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean..."
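Since scikit-learn's built-in selectors score features against a supervised target, one way to encode "closeness to normal" directly is a greedy backward-elimination wrapper. This is not a scikit-learn built-in; the function names below are hypothetical, and D'Agostino's normality test (`scipy.stats.normaltest`) is one assumed choice of objective. The idea: repeatedly drop the single event whose removal most improves the normality test on the recomputed success scores, stopping once only half the events remain or no removal helps:

```python
import numpy as np
import pandas as pd
from scipy import stats

def success_from(events):
    """Recompute per-user success over a subset of event columns."""
    invited = events.count(axis=1)
    attended = events.sum(axis=1)
    return (attended / invited).dropna()  # drop users invited to none of the subset

def normality_score(s):
    """Higher is better: p-value of D'Agostino's normality test."""
    return stats.normaltest(s).pvalue

def greedy_select(events, min_keep=None):
    """Greedy backward elimination toward a more normal success distribution."""
    cols = list(events.columns)
    if min_keep is None:
        min_keep = len(cols) // 2  # constraint: keep at least half the events
    best = normality_score(success_from(events[cols]))
    while len(cols) > min_keep:
        # score the removal of each remaining event
        scores = {
            c: normality_score(success_from(events[[x for x in cols if x != c]]))
            for c in cols
        }
        drop, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break  # no single removal improves normality further
        cols.remove(drop)
        best = score
    return cols
```

This is O(n²) in the number of events, which should be tolerable for a few hundred columns; a Kolmogorov–Smirnov test against N(0.80, 0.10) (`stats.kstest`) could be swapped in if you want to target those specific parameters rather than normality in general.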
Answer 1:
Keep in mind that feature selection is for selecting features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So I am not sure feature selection is what you want: do I understand correctly that you want to remove the samples that cause the skew in your distribution?
Also, what about feature scaling, e.g., standardization, so that your data becomes normally distributed with mean = 0 and sd = 1?
The equation is simply z = (x - mean) / sd
To apply it to your DataFrame, you can simply do
my_data['success'] = (my_data['success'] - my_data['success'].mean()) / my_data['success'].std()
However, don't forget to keep the mean and SD parameters so you can transform your test data, too. Alternatively, you could use the StandardScaler from scikit-learn.
Source: https://stackoverflow.com/questions/29069909/python-feature-selection-in-sci-kit-learn-for-a-normal-distribution