I am trying to use Weka for feature selection using PCA algorithm.
My original feature space contains ~9000 attributes, in 2700 samples.
I tried to reduce dimensiona
After deepening in the WEKA code, the bottle neck is creating the covariance matrix, and then calculating the eigenvectors for this matrix. Even trying to switch to sparsed matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.
The solution I came up with was first reduce the dimensionality using a first fast method (I used information gain ranker, and filtering based on document frequencey), and then use PCA on the reduced dimensionality to reduce it farther.
The code is more complex, but it essentially comes down to this:
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
firstAttributes = ranker.search(ig,instances);
candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimenstions(instances, candidates)
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances );
instances = selection.reduceDimensionality(wekaInstances);
However, this method scored worse then using a greedy information gain and a ranker, when I cross-validated for estimated accuracy.