How to optimize SciKit one-class training time?

廉价感情. Posted on 2019-12-11 07:02:53

Question


Essentially, my question is the same as "SciKit One-class SVM classifier training time increases exponentially with size of training data", but no one there has figured out the problem.

Training runs fine with tens of thousands of samples, but hundreds of thousands take a very long time. I want to run it on tens of millions, but I don't want to wait a day and a half (maybe even more) for nothing to come of it. Is there a faster way to do this, or should I use something else?


Answer 1:


I'm very junior in this field, so take this with a grain of salt.

Isolation Forests appear to be an efficient solution for outlier detection. They have been shown to perform well against other popular algorithms [Liu, 2008]. Also, according to the scikit-learn documentation, One-class SVMs are sensitive to outliers in the training data. The anomalies in your Class 1 could overlap with Class 2 and cause data to be mislabeled. Taking subsets of your samples and using them to train an ensemble of SVMs might avoid this (and still save you time, depending on the size of the subsets), but Isolation Forests do this subsampling naturally.
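As a rough illustration of the Isolation Forest suggestion, here is a minimal sketch using scikit-learn's `IsolationForest`. The synthetic Gaussian data, the helper name `fit_isolation_forest`, and the parameter values are assumptions made for the example, not something taken from the question; they would need tuning on real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def fit_isolation_forest(X, max_samples=256, n_estimators=100, seed=0):
    """Fit an Isolation Forest (hypothetical helper for this example).

    Each tree is built on a random subsample of `max_samples` rows,
    so the cost of building the trees does not grow with the full
    size of the training set the way a kernel SVM's does.
    """
    model = IsolationForest(
        n_estimators=n_estimators,
        max_samples=max_samples,
        contamination="auto",
        random_state=seed,
    )
    model.fit(X)
    return model


# Synthetic stand-in for the real training data (an assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 5))
model = fit_isolation_forest(X_train)

# Five ordinary points plus one point far from the training distribution.
X_new = np.vstack([rng.normal(size=(5, 5)), np.full((1, 5), 8.0)])
labels = model.predict(X_new)        # +1 = inlier, -1 = outlier
scores = model.score_samples(X_new)  # lower = more anomalous
print(labels)
print(scores)
```

Because every tree only sees `max_samples` rows, training time here is driven mostly by `n_estimators` rather than by the number of rows. Passing `n_jobs=-1` to `IsolationForest` may speed things up further on a multi-core machine.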

For further reading, this seems like a good review paper on the topic: http://www.robots.ox.ac.uk/~davidc/pubs/NDreview2014.pdf

It mentions clustering and distance-based methods, which may be applicable in your case. I think it's best to do a lot of reading and make sure you understand the strengths and weaknesses of the different algorithms, especially since I'm still in the process of doing that myself and couldn't give solid advice even if I knew the specifics of your problem.

A note on distance-based algorithms: I know some are optimized, but I think the general complaint is that they have high computational complexity. Many clustering, distance-based, and probability-based algorithms also struggle with high-dimensional data.



Source: https://stackoverflow.com/questions/45472516/how-to-optimize-scikit-one-class-training-time
