Weka's PCA is taking too long to run

隐瞒了意图╮ 2021-01-31 12:44

I am trying to use Weka for feature selection using the PCA algorithm.

My original feature space contains ~9000 attributes across 2700 samples.
I tried to reduce dimensiona

3 answers
  •  不要未来只要你来
    2021-01-31 13:20

    After digging into the Weka code, I found that the bottleneck is building the covariance matrix and then computing the eigenvectors of that matrix. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
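
    To put rough numbers on that bottleneck, here is a back-of-the-envelope sketch. It assumes a dense d x d covariance matrix of 8-byte doubles and an eigendecomposition costing on the order of d^3 operations; Weka's actual storage and solver may differ, and the class and method names are made up for illustration.

```java
// Rough cost estimate for PCA on d attributes.
// Assumptions (not taken from Weka's source): dense d x d matrix of doubles,
// cubic-time eigendecomposition.
final class PcaCostEstimate {

    /** Bytes needed for a dense d x d matrix of 8-byte doubles. */
    static long covarianceBytes(long d) {
        return d * d * Double.BYTES;
    }

    /** Order-of-magnitude operation count for an O(d^3) eigendecomposition. */
    static long eigenOps(long d) {
        return d * d * d;
    }

    public static void main(String[] args) {
        long d = 9000; // attribute count from the question
        System.out.printf("covariance matrix: %d bytes%n", covarianceBytes(d));
        System.out.printf("eigendecomposition: ~%d operations%n", eigenOps(d));
    }
}
```

    For the ~9000 attributes in the question this gives 648,000,000 bytes for the matrix and about 7.3 x 10^11 operations for the decomposition. Cutting d by a factor of 10 before PCA shrinks the first figure by 100x and the second by 1000x.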

    The solution I came up with was to first reduce the dimensionality with a fast method (I used an information gain ranker, plus filtering based on document frequency), and then run PCA on the reduced attribute set to reduce it further.

    The code is more complex, but it essentially comes down to this:

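    import java.util.Arrays;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.PrincipalComponents;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;

    // Stage 1: rank all attributes by information gain and keep the top FIRST_SIZE_REDUCTION.
    Ranker ranker = new Ranker();
    InfoGainAttributeEval ig = new InfoGainAttributeEval();
    Instances instances = SamplesManager.asWekaInstances(trainSet);
    ig.buildEvaluator(instances);
    int[] firstAttributes = ranker.search(ig, instances);
    int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
    instances = reduceDimenstions(instances, candidates); // own helper, not shown

    // Stage 2: run PCA on the reduced attribute set and keep the top components.
    PrincipalComponents pca = new PrincipalComponents();
    pca.setVarianceCovered(var);
    ranker = new Ranker();
    ranker.setNumToSelect(numFeatures);
    AttributeSelection selection = new AttributeSelection();
    selection.setEvaluator(pca);
    selection.setSearch(ranker);
    selection.SelectAttributes(instances);
    instances = selection.reduceDimensionality(instances);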
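
    The document-frequency filtering mentioned earlier is not part of the snippet above. A minimal sketch of one way it could work, on a plain double[][] rather than Weka's Instances, follows. The class name, method name, and the minDf threshold are all hypothetical, and treating any non-zero value as "the term occurs in this document" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Keeps only attributes that occur (are non-zero) in enough samples.
// Illustrative only: not the answer author's code and not a Weka API.
final class DocumentFrequencyFilter {

    /**
     * Returns the indices of columns that are non-zero in at least minDf rows.
     * rows[i][j] is the value of attribute j in sample i.
     */
    static int[] keepByDocumentFrequency(double[][] rows, int minDf) {
        if (rows.length == 0) {
            return new int[0];
        }
        int numAttributes = rows[0].length;
        int[] df = new int[numAttributes];
        // Count, per attribute, the number of samples in which it is non-zero.
        for (double[] row : rows) {
            for (int j = 0; j < numAttributes; j++) {
                if (row[j] != 0.0) {
                    df[j]++;
                }
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < numAttributes; j++) {
            if (df[j] >= minDf) {
                kept.add(j);
            }
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

    This is a single pass over the data, so it is cheap compared with PCA, and the surviving indices can be fed to whatever helper drops the other columns.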
    

    However, when I cross-validated for estimated accuracy, this method scored worse than simply using greedy information gain with a ranker.
