apache-spark-ml

SPARK ML, Naive Bayes classifier: high probability prediction for one class

孤街浪徒 submitted on 2019-12-02 03:49:28
Question: I am using Spark ML to optimise a Naive Bayes multi-class classifier. I have about 300 categories and I am classifying text documents. The training set is balanced enough and there are about 300 training examples for each category. All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that, when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is

Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

假如想象 submitted on 2019-12-02 02:46:23
Question: I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount. In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to

How to eval spark.ml model without DataFrames/SparkContext?

≯℡__Kan透↙ submitted on 2019-12-02 02:25:36
With Spark MLLib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it with a vector of features. It seems like with Spark ML, predict is now called transform and only acts on a DataFrame. Is there any way to build a DataFrame outside of Spark, since it seems like one needs a SparkContext to build a DataFrame? Am I missing something? Re: Is there any way to build a DataFrame outside of Spark? It is not possible. DataFrames live inside SQLContext with it living in SparkContext. Perhaps you could work it

Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

a 夏天 submitted on 2019-12-02 01:30:16
I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount. In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch. In trying to debug, I scaled my data down to around ~10k rows

Spark ML StringIndexer Different Labels Training/Testing

别等时光非礼了梦想. submitted on 2019-12-02 01:08:05
I'm using Scala and am using StringIndexer to assign indices to each category in my training set. It assigns indices based on the frequency of each category. The problem is that in my testing data the category frequencies are different, so StringIndexer assigns different indices to the categories, which prevents me from evaluating the model (Random Forest) correctly. I am processing the training/testing data in the exact same way, and don't save the model. I have tried manually creating labels (by getting the index of the category), but get this error java.lang
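The mismatch described above can be reproduced without Spark. The following is a minimal plain-Python sketch, not Spark code: `fit_label_index` and `apply_label_index` are hypothetical helpers that imitate frequency-descending indexing (the alphabetical tie-break is an assumption made here for determinism). It suggests why fitting the indexer once on the training labels and reusing that fitted mapping on the test labels keeps the indices consistent. In Spark terms this would correspond to calling fit on the training DataFrame and using the resulting StringIndexerModel to transform both sets.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def apply_label_index(mapping, labels):
    """Translate labels to indices using an already fitted mapping."""
    return [mapping[label] for label in labels]


# Hypothetical data: the same categories, but with different frequencies.
train_labels = ["a", "a", "a", "b", "b", "c"]
test_labels = ["c", "c", "c", "b", "b", "a"]

# Fitted once on the training labels.
train_mapping = fit_label_index(train_labels)

# Refitting on the test labels reproduces the problem: "a" and "c" swap indices.
refit_mapping = fit_label_index(test_labels)

# Reusing the training mapping keeps every category on its training index.
test_indices = apply_label_index(train_mapping, test_labels)
```

With this data the refitted mapping disagrees with the training mapping for "a" and "c", while `test_indices` stays aligned with the indices the model was trained on.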

SPARK ML, Naive Bayes classifier: high probability prediction for one class

浪尽此生 submitted on 2019-12-02 00:52:23
I am using Spark ML to optimise a Naive Bayes multi-class classifier. I have about 300 categories and I am classifying text documents. The training set is balanced enough and there are about 300 training examples for each category. All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that, when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero). What are
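One plausible contributor to the near-1 probabilities described above is how a multinomial Naive Bayes score is built: per-word log-likelihoods are summed over every word occurrence, so small per-word differences between classes grow with document length before the scores are normalised. The sketch below is plain Python, not the Spark implementation; the two classes, the two-word vocabulary, and all probabilities are hypothetical values chosen only to show the effect.

```python
import math


def posterior(log_priors, log_likelihoods, word_counts):
    """Normalise multinomial Naive Bayes-style log scores into probabilities."""
    scores = [
        log_prior + sum(count * ll for count, ll in zip(word_counts, class_lls))
        for log_prior, class_lls in zip(log_priors, log_likelihoods)
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical model: equal priors, mildly different word probabilities.
log_priors = [math.log(0.5), math.log(0.5)]
log_likelihoods = [
    [math.log(0.6), math.log(0.4)],  # class 0
    [math.log(0.4), math.log(0.6)],  # class 1
]

# Same word proportions (2:1), but the second document is 100 times longer.
short_doc = posterior(log_priors, log_likelihoods, [2, 1])
long_doc = posterior(log_priors, log_likelihoods, [200, 100])
```

For the three-word document the first class gets a posterior of about 0.6; for the 300-word document with the same proportions it is indistinguishable from 1. Under this reading, extreme probabilities on long text documents reflect the independence assumption more than genuine certainty.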

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

拟墨画扇 submitted on 2019-12-01 22:46:27
I'm running a Bernoulli Naive Bayes using this code: val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L) val training = splits(0).cache() val test = splits(1) val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli") My question is how I can get the probability of membership of class 0 (or 1) and compute the AUC. I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code: val numIterations = 100 val model = SVMWithSGD.train(training, numIterations) model.clearThreshold() // Compute raw scores on the test set. val labelAndPreds = test
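Once a per-record score for the positive class is available (for example the class-1 posterior), the AUC itself is just the probability that a randomly chosen positive record is scored above a randomly chosen negative one. The sketch below computes that directly in plain Python; `auc_from_scores` and the sample pairs are hypothetical and are not part of Spark. Inside Spark, the mllib BinaryClassificationMetrics class takes (score, label) pairs for the same purpose.

```python
def auc_from_scores(scores_and_labels):
    """Area under the ROC curve from (score, label) pairs with labels 0 or 1.

    Counts, over all positive/negative pairs, how often the positive record
    has the higher score; ties count as half a win.
    """
    positives = [score for score, label in scores_and_labels if label == 1]
    negatives = [score for score, label in scores_and_labels if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative record")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical (score, label) pairs, e.g. class-1 probability and true label.
sample = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.3, 0), (0.2, 0)]
auc = auc_from_scores(sample)
```

The pairwise loop is quadratic, which is fine for a sanity check on a small sample but not for a full test set.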

How to get classification probabilities from MultilayerPerceptronClassifier?

爱⌒轻易说出口 submitted on 2019-12-01 19:32:15
This seems most related to: How to get the probability per instance in classifications models in spark.mllib. I'm doing a classification task with Spark ML, building a MultilayerPerceptronClassifier. Once I build a model, I can get a predicted class given an input vector, but I can't get the probability for each output class. The above listing indicates that NaiveBayesModel supports this functionality as of Spark 1.5.0 (using a predictProbabilities method). I would like to get at this functionality for the MLPC. Is there a way I can hack at it to get my probabilities? Will it be included in 1

How to set parameters for a custom PySpark Transformer once it's a stage in a fitted ML Pipeline?

心已入冬 submitted on 2019-12-01 13:59:39
I've written a custom ML Pipeline Estimator and Transformer for my own Python algorithm by following the pattern shown here. However, in that example all the parameters needed by _transform() were conveniently passed into the Model/Transformer by the estimator's _fit() method. But my transformer has several parameters that control the way the transform is applied. These parameters are specific to the transformer, so it would feel odd to pass them into the estimator in advance along with the estimator-specific parameters used for fitting the model. I can work around this by adding extra Params

How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?

浪尽此生 submitted on 2019-12-01 12:56:51
I'm trying to tune the hyper-parameters of a Spark (PySpark) ALS model with TrainValidationSplit. It works well, but I want to know which combination of hyper-parameters is the best. How do I get the best params after evaluation? from pyspark.ml.recommendation import ALS from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder from pyspark.ml.evaluation import RegressionEvaluator df = sqlCtx.createDataFrame( [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)], ["user", "item", "rating"], ) df_test = sqlCtx.createDataFrame( [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1
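The selection step asked about above comes down to pairing each candidate parameter map with its validation metric and picking the best pair. The sketch below shows only that step in plain Python; `best_param_map`, the parameter dictionaries, and the metric values are hypothetical stand-ins, not output from a real run. As far as I know, recent PySpark versions expose the metrics on the fitted model as validationMetrics, in the same order as the grid returned by getEstimatorParamMaps(), and the evaluator's isLargerBetter() indicates the direction; treat that mapping as an assumption to check against the Spark version in use.

```python
def best_param_map(param_maps, metrics, larger_is_better):
    """Return the (param_map, metric) pair with the best validation metric.

    param_maps and metrics must be in the same order, one metric per map.
    """
    if not param_maps or len(param_maps) != len(metrics):
        raise ValueError("param_maps and metrics must be non-empty and equally long")
    pick = max if larger_is_better else min
    best_index = pick(range(len(metrics)), key=lambda index: metrics[index])
    return param_maps[best_index], metrics[best_index]


# Hypothetical grid and RMSE values (lower is better for RMSE).
param_maps = [
    {"rank": 5, "regParam": 0.1},
    {"rank": 10, "regParam": 0.1},
    {"rank": 10, "regParam": 1.0},
]
metrics = [1.12, 0.97, 1.05]

best_params, best_rmse = best_param_map(param_maps, metrics, larger_is_better=False)
```

Passing `larger_is_better` explicitly avoids silently choosing the worst model when the metric is an error measure such as RMSE.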