apache-spark-ml

How to prepare training data in mllib

喜欢而已 submitted on 2019-12-21 13:48:12
Question: TL;DR: How do I use mllib to train on my wiki data (text & category) for prediction against tweets? I'm having trouble figuring out how to convert my tokenized wiki data so it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in
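A minimal PySpark sketch of the kind of pipeline the question describes (Tokenizer → HashingTF → IDF → NaiveBayes). Column names, numFeatures, and the wiki_df/tweet_df DataFrames are illustrative assumptions, not the asker's actual code:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
    from pyspark.ml.classification import NaiveBayes

    # wiki_df is assumed to have columns "text" (article body) and "category" (string label)
    indexer = StringIndexer(inputCol="category", outputCol="label")
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20000)
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    nb = NaiveBayes(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[indexer, tokenizer, hashing_tf, idf, nb])
    model = pipeline.fit(wiki_df)             # train on the wiki data
    predictions = model.transform(tweet_df)   # tweet_df must provide the same "text" column

The key point is that the same fitted pipeline (vocabulary hashing, IDF weights, label index) must be reused on the tweets; refitting any stage on the tweets would make the feature spaces incompatible.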

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

岁酱吖の submitted on 2019-12-20 20:16:46
Question: I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. With scikit-learn it takes far less time. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering whether the problem lies in my code, since I am fairly new to Spark. Here it is: df = pd.read_csv(http://archive.ics.uci
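Runtime here is usually dominated by the number of models fitted (grid size × folds), not by the 20MB of data. A hedged sketch showing how that count adds up; the grid values and column names are assumptions:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    # A 3 x 3 grid with 5 folds trains 3 * 3 * 5 = 45 forests, so shrinking the
    # grid (or the fold count) cuts the wall-clock time almost linearly.
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50, 100])
            .addGrid(rf.maxDepth, [5, 10, 15])
            .build())

    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                        numFolds=5,
                        parallelism=4)  # parallelism is available from Spark 2.3 onward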

What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

£可爱£侵袭症+ submitted on 2019-12-20 12:23:37
Question: After I trained a LogisticRegressionModel, I transformed the test data DataFrame with it and got the prediction DataFrame. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction | probability | prediction? Answer 1: rawPrediction is typically the model's direct confidence calculation. From the Spark docs: Raw prediction for each possible label. The meaning of
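A small sketch of inspecting the three columns for a LogisticRegressionModel; the DataFrame names are assumptions. For logistic regression, rawPrediction holds the per-class margins, probability is those margins mapped through the logistic/softmax function so they sum to 1, and prediction is the index of the largest value:

    from pyspark.ml.classification import LogisticRegression

    lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
    pred = lr_model.transform(test_df)

    # rawPrediction: un-normalized per-class confidence (the margins for LR);
    # probability: the normalized version of rawPrediction; prediction: its argmax.
    pred.select("rawPrediction", "probability", "prediction").show(5, truncate=False)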

Caching intermediate results in Spark ML pipeline

好久不见. submitted on 2019-12-20 12:22:53
Question: Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining algorithm stages and hyper-parameter grid search. Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. The importance of this feature arises when the pipeline involves computation-intensive stages. For example, in my case I use a huge sparse matrix to perform multiple moving
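One common workaround, sketched here under assumed stage and DataFrame names, is to split the pipeline so the expensive feature stages are fitted and cached once, and only the cheap estimator stage is re-run during the grid search:

    from pyspark.ml import Pipeline

    # Fit the expensive, parameter-free feature stages once...
    feature_pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf])
    feature_model = feature_pipeline.fit(train_df)
    featurized = feature_model.transform(train_df).cache()  # keep the costly intermediate result in memory

    # ...then run the hyper-parameter search over the remaining estimator only,
    # so each grid point reuses the cached features instead of recomputing them.
    cv_model = cross_validator.fit(featurized)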

How to get classification probabilities from MultilayerPerceptronClassifier?

痴心易碎 submitted on 2019-12-20 01:43:38
Question: This seems most related to: How to get the probability per instance in classification models in spark.mllib. I'm doing a classification task with spark ml, building a MultilayerPerceptronClassifier. Once I build a model, I can get a predicted class for an input vector, but I can't get the probability for each output class. The question above indicates that NaiveBayesModel supports this functionality as of Spark 1.5.0 (via a predictProbabilities method). I would like to get at this
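The 1.x MultilayerPerceptronClassificationModel did not expose probabilities; from roughly Spark 2.3 onward the model is a probabilistic classifier and its transform output includes probability and rawPrediction columns. A sketch under that assumption, with illustrative layer sizes and column names:

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    mlp = MultilayerPerceptronClassifier(layers=[4, 8, 3], labelCol="label", featuresCol="features")
    model = mlp.fit(train_df)

    # In Spark 2.3+ transform emits "probability" alongside "prediction";
    # in 1.5/1.6 only "prediction" is available for this classifier.
    model.transform(test_df).select("probability", "prediction").show(5, truncate=False)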

Spark DataFrame handling empty String in OneHotEncoder

社会主义新天地 submitted on 2019-12-19 17:45:31
Question: I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error "requirement failed: Cannot have an empty string for name.". Is there a way I can get around this? I could reproduce the error with the example provided on the Spark ml page: val df = sqlContext.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), //<- original example has "a" here (4, "a"), (5, "c") )).toDF("id", "category") val
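One common workaround is to replace empty categories with an explicit placeholder before indexing and encoding, so the attribute name is never the empty string. A Python sketch of the same idea as the Scala example above; the placeholder value and column names are assumptions:

    from pyspark.sql import functions as F
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Map "" (and whitespace-only values) to a visible placeholder category.
    cleaned = df.withColumn(
        "category",
        F.when(F.trim(F.col("category")) == "", "__EMPTY__").otherwise(F.col("category")))

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")

    indexed = indexer.fit(cleaned).transform(cleaned)
    # Note: on Spark 3.x OneHotEncoder is an Estimator, so use encoder.fit(indexed).transform(indexed).
    encoded = encoder.transform(indexed)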

How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?

偶尔善良 submitted on 2019-12-19 10:17:11
Question: I'm trying to tune the hyper-parameters of a Spark (PySpark) ALS model with TrainValidationSplit. It works well, but I want to know which combination of hyper-parameters is best. How do I get the best params after evaluation? from pyspark.ml.recommendation import ALS from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder from pyspark.ml.evaluation import RegressionEvaluator df = sqlCtx.createDataFrame( [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
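A hedged sketch of one way to read the winning combination back out: TrainValidationSplitModel exposes bestModel and validationMetrics, and validationMetrics lines up index-for-index with the param grid. The variable names (model, param_grid) are assumptions matching the question's setup:

    # model = tvs.fit(df) as in the question; param_grid is the ParamGridBuilder output.
    metrics = model.validationMetrics
    best_idx = min(range(len(metrics)), key=lambda i: metrics[i])  # min for RMSE; use max for accuracy-style metrics
    best_params = param_grid[best_idx]
    print({param.name: value for param, value in best_params.items()})  # the winning hyper-parameters
    print(model.bestModel.rank)  # e.g. the chosen rank, read directly off the fitted best ALS model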

Why doesn't spark.ml implement any of the spark.mllib algorithms?

冷暖自知 submitted on 2019-12-18 19:01:45
Question: In the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods, only spark.mllib
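Until the DataFrame-based API caught up, the usual workaround was to drop from the DataFrame down to an RDD for the missing algorithm. A sketch under the assumption of a Spark 1.x/2.x setup with an array<string> column named "items":

    from pyspark.mllib.fpm import FPGrowth

    # Convert the DataFrame column of item arrays into the RDD the RDD-based API expects.
    transactions_rdd = transactions_df.select("items").rdd.map(lambda row: row.items)

    fp_model = FPGrowth.train(transactions_rdd, minSupport=0.2, numPartitions=10)
    for itemset in fp_model.freqItemsets().collect():
        print(itemset.items, itemset.freq)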

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

六月ゝ 毕业季﹏ submitted on 2019-12-18 11:29:38
Question: I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it, in order to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the HashingTF ml library outputs a vector when given a features column of a DataFrame. temp_df = sqlContext.createDataFrame(temp_rdd, StructType([ StructField("label", DoubleType(), False), StructField("tokens", ArrayType(StringType()),
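The key is to declare the vector column with VectorUDT in the schema. A hedged sketch, assuming the RDD contains (String, SparseVector) tuples as described; on Spark 1.x the type lives in pyspark.mllib.linalg, on 2.x+ in pyspark.ml.linalg, and the vectors must come from the matching package:

    from pyspark.sql.types import StructType, StructField, StringType
    from pyspark.ml.linalg import VectorUDT  # use pyspark.mllib.linalg.VectorUDT on Spark 1.x

    schema = StructType([
        StructField("label", StringType(), False),
        StructField("features", VectorUDT(), False),
    ])

    # rdd is assumed to hold (String, SparseVector) tuples as in the question.
    df = sqlContext.createDataFrame(rdd, schema)
    df.printSchema()  # features: vector (nullable = false)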

Spark: OneHot encoder and storing Pipeline (feature dimension issue)

穿精又带淫゛_ submitted on 2019-12-18 09:24:43
Question: We have a pipeline (2.0.1) consisting of multiple feature transformation stages. Some of these stages are OneHot encoders. Idea: turn an integer-based category into n independent features. When training the pipeline model and using it to predict, everything works fine. However, storing the trained pipeline model and reloading it causes issues: the stored 'trained' OneHot encoder does not keep track of how many categories there are. Loading it now causes issues: when the loaded model is used to
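A sketch of the usual persistence pattern this question is working toward: fit the pipeline once on training data that contains every category, persist the fitted PipelineModel, and never refit the encoder stages on a new batch. The path and DataFrame names are assumptions:

    from pyspark.ml import Pipeline, PipelineModel

    # Fit once so the StringIndexer/OneHot stages see all categories,
    # then persist the *fitted* model rather than the unfitted Pipeline.
    pipeline_model = pipeline.fit(train_df)
    pipeline_model.write().overwrite().save("/models/onehot_pipeline")

    # Reloading restores the stages as fitted; apply them to new data without refitting,
    # so the feature dimension stays fixed regardless of which categories the batch contains.
    restored = PipelineModel.load("/models/onehot_pipeline")
    predictions = restored.transform(new_df)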