feature-selection

glmulti Oversized candidate set

Submitted by 我们两清 on 2019-12-01 03:16:54
Question: Error message. SYSTEM: Win7 / 64-bit / Ultimate / 16 GB physical RAM plus virtual memory, memory.limit(32000). What does this error message mean? In glmulti(y = "y", data = mydf, xr = c("x1", : !Oversized candidate set. mydf has 3.6 million rows and 150 columns of floats. What steps can I take to work around this in glmulti? Are there any alternatives to glmulti in the R world? R/64-bit "Good Sport"

Answer 1: I have encountered the same problem; here is what I have found out so far. The number of rows does not seem to be the issue. The

Fast Information Gain computation

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-30 13:22:31
Question: I need to compute Information Gain scores for >100k features in >10k documents for text classification. The code below works fine, but on the full dataset it is very slow, taking more than an hour on a laptop. The dataset is 20newsgroups and I am using scikit-learn; the chi2 function provided in scikit-learn runs extremely fast. Any idea how to compute Information Gain faster for such a dataset?

import numpy as np

def information_gain(x, y):
    def _entropy(values):
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))  # drop zero counts
        return -np.sum(probs * np.log(probs))
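
The excerpt above is cut off by the digest, so the following is not the poster's code but a minimal vectorized sketch of the same idea: binarize the document-term matrix to term presence/absence, obtain per-class contingency counts for every feature with a single sparse matrix product, and evaluate IG = H(Class) - P(t) H(Class | t present) - P(not t) H(Class | t absent) in log base 2. All function and variable names here are illustrative.

import numpy as np
from scipy.sparse import csc_matrix

def _entropy(probs):
    # Shannon entropy along the last axis; 0 * log(0) is treated as 0.
    probs = np.asarray(probs, dtype=float)
    logs = np.zeros_like(probs)
    nz = probs > 0
    logs[nz] = np.log2(probs[nz])
    return -(probs * logs).sum(axis=-1)

def information_gain_all(X, y):
    # IG(Class; term presence) for every column of a doc-term matrix X.
    X = csc_matrix(X, dtype=np.float64)
    X.data[:] = 1.0                                # binarize to present/absent
    classes, y_idx = np.unique(y, return_inverse=True)
    n_docs = X.shape[0]
    Y = np.zeros((n_docs, classes.size))           # one-hot class indicators
    Y[np.arange(n_docs), y_idx] = 1.0
    present = np.asarray(X.T @ Y)                  # docs with term, per class
    absent = Y.sum(axis=0) - present               # docs without term, per class
    p_t = present.sum(axis=1) / n_docs             # P(term present)

    def _cond_entropy(counts):
        tot = counts.sum(axis=1, keepdims=True)
        probs = np.divide(counts, tot, out=np.zeros_like(counts), where=tot > 0)
        return _entropy(probs)

    h_class = _entropy(Y.sum(axis=0) / n_docs)
    return h_class - (p_t * _cond_entropy(present)
                      + (1.0 - p_t) * _cond_entropy(absent))

Because the per-feature Python loop is replaced by one X.T @ Y product, this should be far faster on the >100k-feature scale the question describes.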

Information Gain calculation with Scikit-learn

Submitted by 五迷三道 on 2019-11-30 04:50:59
Question: I am using scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy. Using Weka, this can be accomplished with InfoGainAttribute. But I haven't found this measure in scikit-learn. However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition in Wikipedia. Is it possible to use a specific setting for mutual
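
On the premise stated above, that IG(Class; Attribute) equals the mutual information, one option is scikit-learn's mutual_info_classif with discrete_features=True, which computes exactly that discrete mutual information per column. A sketch under that assumption (the two-category slice is just to keep the example small; values come back in nats rather than bits, and get_feature_names_out needs scikit-learn >= 1.0, with older versions using get_feature_names):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

# Small two-category slice of 20newsgroups so the example runs quickly.
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
vec = CountVectorizer(binary=True)        # attributes as present/absent
X = vec.fit_transform(train.data)
ig = mutual_info_classif(X, train.target, discrete_features=True)

# Top 10 attributes by information gain (in nats).
terms = vec.get_feature_names_out()
for i in ig.argsort()[::-1][:10]:
    print(f"{terms[i]}: {ig[i]:.4f}")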

How is the feature score (importance) in the XGBoost package calculated?

Submitted by 懵懂的女人 on 2019-11-30 04:38:04
Question: The command xgb.importance returns a graph of feature importance measured by an f score. What does this f score represent and how is it calculated? Output: graph of feature importance.

Answer 1 (T. Scharf): This is a metric that simply sums up how many times each feature is split on. It is analogous to the Frequency metric in the R version. https://cran.r-project.org/web/packages/xgboost/xgboost.pdf It is about as basic a feature importance metric as you can get, i.e. how many times was this variable split on? The code for this method shows it is simply adding up the presence of a given feature in all the
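
For illustration, the same counting metric is exposed in the Python xgboost package as importance_type="weight" on Booster.get_score. A small self-contained sketch on made-up toy data (all names and parameters are illustrative):

import numpy as np
import xgboost as xgb

# Toy data purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)

model = xgb.XGBClassifier(n_estimators=20, max_depth=3).fit(X, y)

# "weight" counts how many times each feature appears as a split,
# i.e. the f score / Frequency-style metric described above.
# Features never split on are simply absent from the returned dict.
print(model.get_booster().get_score(importance_type="weight"))

Note that "weight" counts splits, not their quality, so the most frequently split-on features are not necessarily the most predictive.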

Recursive feature elimination on Random Forest using scikit-learn

Submitted by 不羁的心 on 2019-11-30 03:55:03
Question: I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'. Random forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem. Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the
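
For what it's worth, recent scikit-learn releases no longer raise this error: RFE/RFECV fall back to feature_importances_ when an estimator has no coef_. A sketch under that assumption, with cross-validated ROC AUC standing in for the OOB ROC the question asks about (the synthetic data and column names are illustrative):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the poster's DataFrame.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# Cross-validated ROC AUC replaces OOB ROC here.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFECV(rf, step=1, cv=5, scoring="roc_auc").fit(df, y)

print(list(df.columns[selector.support_]))   # names of the selected features

On older scikit-learn versions, the usual workaround was a thin subclass of RandomForestClassifier that exposed coef_ as an alias for feature_importances_.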

Random Forest Feature Importance Chart using Python

Submitted by 断了今生、忘了曾经 on 2019-11-29 20:41:59
Question: I am working with RandomForestRegressor in Python and I want to create a chart that will illustrate the ranking of feature importance. This is the code I used:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

MT = pd.read_csv("MT_reduced.csv")
df = MT.reset_index(drop=False)
columns2 = df.columns.tolist()

# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]

# Store the variable we'll be predicting on.
target = "Property_crime_rate"

# Let's randomly split our data with 80% as the
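
The excerpt stops before the plotting step, so here is a generic sketch of the usual finish on synthetic data rather than the poster's CSV: fit the regressor, sort feature_importances_, and draw a horizontal bar chart (all names are illustrative).

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

order = np.argsort(model.feature_importances_)   # least to most important
plt.barh(np.array(names)[order], model.feature_importances_[order])
plt.xlabel("Feature importance")
plt.title("RandomForestRegressor feature ranking")
plt.tight_layout()
plt.show()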

R caret package rfe never finishes: error "task 1 failed - replacement has length zero"

Submitted by ぃ、小莉子 on 2019-11-29 11:56:17
Question: I recently started to look into the caret package for a model I'm developing; I'm using the latest version. As the first step, I decided to use it for feature selection. The data I'm using has about 760 features and 10k observations. I created a simple function based on the online training material. Unfortunately, I consistently get an error, so the process never finishes. Here is the code that produces the error. In this example I am using a small subset of features; I started with the full set. I've also changed the subsets, the number of folds, and the repeats, to no avail. I know it will be
