Question
The function I'm trying to write takes the provided dataframe, calculates the F-statistic values, and returns them as output.
Data format of Final:
Key  Color  Strength  Fabric  Sales
a    0      1         1       10
b    1      2         2       15
Here Color, Strength, and Fabric are the independent variables, while Sales is the dependent variable.
The idea is to loop over every unique Key value, build a dataframe for that key, run a function on it, and then concatenate the per-key results into one new dataframe.
def regression():
    X=Final1.copy()
    y=Final1[['Sales']].copy()
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2, random_state=0)
    sel=f_classif(X_train, y_train)
    p_values=pd.Series(sel[0], index=X_train.columns)
    p_values=p_values.reset_index()
    pd.options.display.float_format = "{:,.2f}".format
    return p_values
Finals=[]
Finals=pd.DataFrame(Finals)
for group in Final.groupby('Key'):
    # group is a tuple where the first value is the Key and the second is the dataframe
    Final1=group[1]
    Final1=pd.DataFrame(Final1)
    result=regression()
    Finals=pd.concat([Finals, result], axis=1)
    # do xyz with result
print(Finals)
This is the code I came up with, but it throws an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-131-c3a3b53971d5> in <module>
5 Final1=group[1]
6 Final1=pd.DataFrame(Final1)
----> 7 result=regression()
8 Finals=pd.concat([Finals, result], axis=1)
9
<ipython-input-120-d5c718baaba8> in regression()
2 X=Final1.iloc[:,7:-1].copy()
3 y=Final1[['Sale Rate']].copy()
----> 4 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2, random_state=0)
5 sel=f_classif(X_train, y_train)
6 p_values=pd.Series(sel[0], index=X_train.columns)
~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
2120 n_samples = _num_samples(arrays[0])
2121 n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
-> 2122 default_test_size=0.25)
2123
2124 if shuffle is False:
~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _validate_shuffle_split(n_samples, test_size, train_size, default_test_size)
1803 'resulting train set will be empty. Adjust any of the '
1804 'aforementioned parameters.'.format(n_samples, test_size,
-> 1805 train_size)
1806 )
1807
ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
What could be going wrong with this code?
Answer 1:
A simple fix would be:
for group in Final.groupby('Key'):
    # group is a tuple where the first value is the Key and the second is the dataframe
    result = regression(group[1])
    # do xyz with result
EDIT:
You do not have to convert the group into a dataframe again; you can use it directly, because it is already a dataframe.
# this line is not necessary
Final1 = pd.DataFrame(Final1)
Judging from the error, the group that you passed into train_test_split does not have enough records: the message reports n_samples=1, so an 80/20 split leaves the train set empty. You will have to handle such errors using try/except.
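The try/except handling could look roughly like the sketch below. It assumes regression is rewritten to take the group's dataframe as an argument (as the call regression(group[1]) implies) and that the predictors are the Color, Strength, and Fabric columns from the question's sample. The helper name f_scores_by_key and the sample data are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split

FEATURES = ["Color", "Strength", "Fabric"]  # assumed predictor columns


def regression(df):
    """Return the F statistic of each feature for one group's dataframe."""
    X = df[FEATURES]
    y = df["Sales"]
    # train_test_split raises ValueError when the group is too small to split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    f_values, _p_values = f_classif(X_train, y_train)
    return pd.Series(f_values, index=FEATURES)


def f_scores_by_key(final):
    """Hypothetical helper: one column of F statistics per usable key."""
    results = {}
    for key, group in final.groupby("Key"):
        try:
            results[key] = regression(group)
        except ValueError as err:
            # Groups with too few rows are reported and skipped.
            print(f"Skipping key {key!r}: {err}")
    return pd.DataFrame(results)


# Made-up data: key "a" has 10 rows, key "b" has only 1.
final = pd.DataFrame({
    "Key": ["a"] * 10 + ["b"],
    "Color": [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1],
    "Strength": [1, 2, 1, 3, 2, 1, 3, 2, 1, 2, 2],
    "Fabric": [1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2],
    "Sales": [10, 15, 10, 15, 15, 10, 15, 10, 10, 15, 15],
})
scores = f_scores_by_key(final)
print(scores)
```

Collecting the per-key results in a dict and building the dataframe once at the end gives each column its key as a label, which the repeated pd.concat in the question does not.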
Answer 2:
The code works once I filter out all keys with fewer than 10 observations.
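One way such a filter might be written is with pandas' groupby(...).filter. This is a sketch on made-up data, not necessarily what was actually used; the MIN_OBS name is hypothetical and the threshold of 10 comes from the sentence above.

```python
import pandas as pd

MIN_OBS = 10  # hypothetical name; threshold taken from the answer

# Made-up data: key "a" has 12 rows, key "b" has 3.
final = pd.DataFrame({
    "Key": ["a"] * 12 + ["b"] * 3,
    "Sales": list(range(15)),
})

# Keep only the rows whose Key has at least MIN_OBS observations.
filtered = final.groupby("Key").filter(lambda g: len(g) >= MIN_OBS)
print(filtered["Key"].value_counts())
```

Running the per-key loop on filtered instead of final means train_test_split never sees a group too small to split.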
Source: https://stackoverflow.com/questions/62327251/applying-a-user-defined-function-to-a-dataframe