Best model for variable selection with big data?

Submitted by 故事扮演 on 2019-12-11 16:38:08

Question


I posted a question earlier about some code, but now I realize I should be broader about the general idea. Basically, I'm trying to build a statistical model with about 1000 observations and 2000 variables. I would like to determine which variables are most influential in affecting my dependent variable, with high significance. I don't plan to use the model for prediction, just for variable selection. My independent variables are binary and my dependent variable is continuous. I've tried multiple linear regression and fixed-effects models with tools such as statsmodels and scikit-learn. However, I have run into issues such as having more variables than observations. I would prefer to solve the problem in Python since I have basic knowledge of it. However, statistics is very new to me, so I don't know the best direction. Any help is appreciated.

Tree method

import pandas as pd
from sklearn import tree
from sklearn import preprocessing

data = pd.read_excel('data_file.xlsx')

# Last column is the continuous dependent variable,
# the rest are the binary independent variables.
y = data.iloc[:, -1]
X = data.iloc[:, :-1]

# Encode the continuous target as class labels so the classifier
# accepts it (every distinct value becomes its own class).
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

tree.export_graphviz(clf, out_file='tree.dot')

Or, if I output the tree to a text file, the first few lines are:

digraph Tree {
node [shape=box] ;
0 [label="X[685] <= 0.5\ngini = 0.995\nsamples = 1097\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, ...]"] ;
1 [label="X[990] <= 0.5\ngini = 0.995\nsamples = 1040\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, ...]"] ;

(The value arrays are truncated here for readability: because the continuous target was label-encoded, the classifier treats every distinct value as its own class, so each node lists hundreds of per-class counts.)

Answer 1:


I would recommend taking a closer look at the variance of your variables and keeping those with the largest range (pandas.DataFrame.var()), and eliminating the variables that correlate most strongly with others (pandas.DataFrame.corr()). As a further step, I'd suggest trying any of the methods mentioned above.
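A minimal sketch of that filtering idea, assuming the same data_file.xlsx layout as in the question (last column is the target); the cutoffs (500 kept features, |r| > 0.95) are arbitrary choices for illustration:

import numpy as np
import pandas as pd

data = pd.read_excel('data_file.xlsx')
X = data.iloc[:, :-1]

# 1) Rank features by variance and keep the most variable ones.
top = X.var().sort_values(ascending=False).index[:500]
X_kept = X[top]

# 2) Drop one feature from each highly correlated pair (|r| > 0.95),
#    using only the upper triangle so each pair is checked once.
corr = X_kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_kept.drop(columns=to_drop)

print(X_reduced.shape)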




Answer 2:


1. Variant A: Feature selection with scikit-learn

For feature selection, scikit-learn offers a lot of different approaches: https://scikit-learn.org/stable/modules/feature_selection.html

That page sums up the suggestions from the comments above.
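For example, a minimal sketch using one of those approaches, univariate selection with SelectKBest and f_regression (a reasonable scorer here since the target is continuous); k=50 is an arbitrary choice:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = pd.read_excel('data_file.xlsx')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]  # continuous target, no label encoding needed

# Score every feature with a univariate F-test against y and keep the k best.
selector = SelectKBest(score_func=f_regression, k=50)
X_selected = selector.fit_transform(X, y)

# Names of the surviving columns.
print(X.columns[selector.get_support()])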

2. Variant B: Feature selection with linear regression

You can also read off feature importance if you run a linear regression on the data: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. The attribute reg.coef_ gives you the coefficients of your features; the higher the absolute value, the more important the feature. For example, a coefficient of 0.8 marks a really important feature, whereas 0.00001 marks an unimportant one.
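A minimal sketch of reading the coefficients that way; note the caveat in the comment, which goes beyond the original answer: with 2000 features and only ~1000 observations, ordinary least squares is underdetermined, so treat the ranking with care.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_excel('data_file.xlsx')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

reg = LinearRegression().fit(X, y)

# Rank features by the absolute size of their coefficients.
# Caveat: with more features than observations the solution is not unique;
# a penalized model such as sklearn.linear_model.Lasso is more robust here.
importance = pd.Series(np.abs(reg.coef_), index=X.columns)
print(importance.sort_values(ascending=False).head(20))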

3. Variant C: PCA (not for the binary case)

Why do you want to eliminate your variables at all? I would recommend using PCA (principal component analysis): https://en.wikipedia.org/wiki/Principal_component_analysis.

The basic concept is to transform your 2000 features into a smaller space (maybe 1000 dimensions, or whatever you choose) while preserving most of the variance in the data.

Scikit-learn has a good implementation of it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
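A minimal sketch of that reduction; n_components=0.95 asks PCA to keep enough components to explain 95% of the variance (an arbitrary choice, and as the heading notes, PCA components mix the original binary variables rather than selecting among them):

import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_excel('data_file.xlsx')
X = data.iloc[:, :-1]

# Project the 2000 features onto the fewest components that
# still explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())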



Source: https://stackoverflow.com/questions/56977952/best-model-for-variable-selection-with-big-data
