Weka's PCA is taking too long to run


隐瞒了意图╮ 2021-01-31 12:44

I am trying to use Weka for feature selection using the PCA algorithm.

My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensiona

3 answers

不要未来只要你来 (original poster)

2021-01-31 13:20
After digging into the Weka code, I found that the bottleneck is building the covariance matrix and then computing its eigenvectors. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
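As a rough illustration of why these two steps dominate, the sketch below estimates the memory for the covariance matrix and the work for the eigendecomposition. It is not taken from the Weka source: it assumes the matrix is stored densely as 8-byte doubles and that the eigendecomposition scales as O(d^3). The class name `PcaCostEstimate` and the reduced size of 1000 attributes are hypothetical, chosen only for comparison.

```java
// Back-of-the-envelope cost of dense PCA on d attributes.
// Assumptions: dense d x d matrix of 8-byte doubles, O(d^3) eigendecomposition.
final class PcaCostEstimate {

    /** Bytes needed to hold a dense d x d covariance matrix of doubles. */
    static long covarianceBytes(long d) {
        return d * d * Double.BYTES;
    }

    /** Rough operation count for a dense eigendecomposition, up to a constant factor. */
    static long eigenOps(long d) {
        return d * d * d;
    }

    public static void main(String[] args) {
        long full = 9000;    // attribute count from the question
        long reduced = 1000; // hypothetical size after a first filtering pass

        System.out.printf("full:    %d MB, ~%.2e ops%n",
                covarianceBytes(full) / 1_000_000, (double) eigenOps(full));
        System.out.printf("reduced: %d MB, ~%.2e ops%n",
                covarianceBytes(reduced) / 1_000_000, (double) eigenOps(reduced));
        System.out.printf("eigendecomposition speed-up factor: %d%n",
                eigenOps(full) / eigenOps(reduced));
    }
}
```

Under these assumptions the full matrix alone needs about 648 MB, and cutting the attribute count by a factor of 9 cuts the cubic term by a factor of 729.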

The solution I came up with was to first reduce the dimensionality with a fast method (I used an information gain ranker plus filtering based on document frequency), and then to run PCA on the reduced feature set to reduce it further.

The code is more complex, but it essentially comes down to this:
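```
import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

// Stage 1: rank attributes by information gain and keep the top FIRST_SIZE_REDUCTION.
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet); // my own helper (not shown)
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimenstions(instances, candidates); // my own helper (not shown)

// Stage 2: run PCA on the already-reduced attribute set.
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);
```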
However, when I cross-validated for estimated accuracy, this method scored worse than using greedy information gain with a ranker.
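The document-frequency filter mentioned as part of the first stage does not appear in the snippet. A minimal sketch of one way such a filter could work is given below. It uses plain Java arrays rather than the Weka API, and it is an assumption about the technique, not the code actually used here: the class `DocumentFrequencyFilter`, its methods, and the `minDf` threshold are all hypothetical names. Rows are samples, columns are attributes, and a column is kept when it is non-zero in at least `minDf` rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical document-frequency filter on a dense sample-by-attribute matrix.
final class DocumentFrequencyFilter {

    /** Returns the indices of columns that are non-zero in at least minDf rows. */
    static int[] selectColumns(double[][] data, int minDf) {
        if (data.length == 0) {
            return new int[0];
        }
        int numCols = data[0].length;
        int[] df = new int[numCols];
        // Count, per column, how many rows carry a non-zero value.
        for (double[] row : data) {
            for (int j = 0; j < numCols; j++) {
                if (row[j] != 0.0) {
                    df[j]++;
                }
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < numCols; j++) {
            if (df[j] >= minDf) {
                kept.add(j);
            }
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Copies only the selected columns into a new, narrower matrix. */
    static double[][] project(double[][] data, int[] cols) {
        double[][] out = new double[data.length][cols.length];
        for (int i = 0; i < data.length; i++) {
            for (int k = 0; k < cols.length; k++) {
                out[i][k] = data[i][cols[k]];
            }
        }
        return out;
    }
}
```

A call such as `DocumentFrequencyFilter.project(data, DocumentFrequencyFilter.selectColumns(data, 2))` would drop every attribute that occurs in fewer than two samples before the more expensive ranking and PCA steps run.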