apache-spark-mllib

How to get the maxDepth from a Spark RandomForestRegressionModel

拥有回忆 submitted on 2019-12-11 06:47:12
Question: In Spark (2.1.0) I've used a CrossValidator to train a RandomForestRegressor, using ParamGridBuilder for maxDepth and numTrees:

paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [2, 4, 6, 8, 10]) \
    .addGrid(rf.numTrees, [10, 20, 40, 50]) \
    .build()

After training, I can get the best number of trees:

regressor = cvModel.bestModel.stages[len(cvModel.bestModel.stages) - 1]
print(regressor.getNumTrees)

but I can't work out how to get the best maxDepth. I've read the documentation and I don
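
One common way to recover the winning grid values, sketched here in PySpark and not taken from the original post, is to pair each parameter combination with its cross-validation metric; cvModel is assumed to be the fitted CrossValidatorModel from the question.

# Pair every grid combination with its average CV metric and keep the best one.
# Use min() instead of max() when the evaluator's metric is an error such as RMSE.
best_params, best_metric = max(
    zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics),
    key=lambda pair: pair[1],
)
for param, value in best_params.items():
    print(param.name, "=", value)   # prints the chosen maxDepth and numTrees

A frequently mentioned workaround is to read the value off the best stage directly, e.g. regressor._java_obj.getMaxDepth(), but that relies on a private attribute of the Python wrapper.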

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark and after searching the web we were not able to find much. It would be great if you could guide us. We thought of using flatMap, then realized it won't help; then we tried a couple of foreach loops but could not figure out how to go forward, as each string has to be compared against all the others. Like in the dataset below:

RowFactory.create(0, "Hi I heard about Spark"), RowFactory
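
A minimal sketch of one way to compare every string against every other one: cross-join the DataFrame with itself and score each pair with a UDF. The question's code is Java, but this illustration uses PySpark (crossJoin needs Spark 2.1+) and assumes the third-party jellyfish package is available on the executors; the sample data and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
import jellyfish  # assumed dependency; older releases name the function jaro_winkler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "Hi I heard about Spark"), (1, "I wish Java could use case classes")],
    ["id", "sentence"])

# Wrap the similarity function as a UDF so it can be applied to each pair of rows.
jaro_winkler = udf(lambda a, b: float(jellyfish.jaro_winkler_similarity(a, b)), DoubleType())

pairs = (df.alias("a")
           .crossJoin(df.alias("b"))
           .where(col("a.id") < col("b.id"))    # keep each unordered pair once
           .withColumn("score", jaro_winkler(col("a.sentence"), col("b.sentence"))))
pairs.select(col("a.sentence"), col("b.sentence"), "score").show(truncate=False)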

Spark: add a new fitted stage to an existing PipelineModel without fitting again

☆樱花仙子☆ submitted on 2019-12-11 05:33:46
Question: I have a saved PipelineModel:

pipe_model = pipe.fit(df_train)
pipe_model.write().overwrite().save("/user/pipe_text_2")

And now I want to add to this pipeline a new, already fitted PipelineModel:

pipe_model = PipelineModel.load("/user/pipe_text_2")
df2 = pipe_model.transform(df1)
kmeans = KMeans(k=20)
pipe2 = Pipeline(stages=[kmeans])
pipe_model2 = pipe2.fit(df2)

Is that possible without fitting it again, in order to obtain a new PipelineModel rather than a new Pipeline? The ideal thing would be the
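
A PipelineModel can be rebuilt directly from a list of already-fitted stages, so the earlier stages never need to be refit. A minimal PySpark sketch reusing the names from the question; the output path below is hypothetical.

from pyspark.ml import PipelineModel

old_model = PipelineModel.load("/user/pipe_text_2")      # the previously saved pipeline
kmeans_model = pipe_model2.stages[-1]                    # the already-fitted KMeansModel

# Concatenate the fitted transformers into a new PipelineModel; nothing is refit.
combined = PipelineModel(stages=old_model.stages + [kmeans_model])
combined.write().overwrite().save("/user/pipe_text_3")   # hypothetical path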

Spark ML KMeans gives: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (vector) => int)

别说谁变了你拦得住时间么 submitted on 2019-12-11 05:03:59
Question: I try to load the KMeansModel and then get the label out of it. Here is the code that I have written:

val kMeansModel = KMeansModel.load(trainedMlModel.mlModelFilePath)
val arrayOfElements = measurePoint.measurements.map(a => a._2).toSeq
println(s"ArrayOfELements::::$arrayOfElements")
val arrayDF = sparkContext.parallelize(arrayOfElements).toDF()
arrayDF.show()
val vectorDF = new VectorAssembler().setInputCols(arrayDF.columns).setOutputCol("features").transform(arrayDF)
vectorDF.printSchema
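
This exception often points to a mismatch between the assembled "features" vector and what the model was trained on (different columns, hence a different dimensionality), or to nulls in the input. A small PySpark sanity-check sketch, with a placeholder model path and input_df standing in for the question's arrayDF:

from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler

model = KMeansModel.load("/models/kmeans")                  # placeholder path
expected = len(model.clusterCenters()[0])                   # dimensionality seen at training time

assembled = (VectorAssembler(inputCols=input_df.columns, outputCol="features")
             .transform(input_df))
actual = len(assembled.first()["features"])

if actual != expected:
    raise ValueError("model expects %d features but got %d" % (expected, actual))

predictions = model.transform(assembled)                    # adds the cluster label column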

Create a DataFrame in Spark Streaming

痴心易碎 submitted on 2019-12-11 04:22:49
Question: I've connected a Kafka stream to Spark, and I've trained an Apache Spark MLlib model to make predictions based on streamed text. My problem is that to get a prediction I need to pass a DataFrame.

//kafka stream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
//load mllib model
val model = PipelineModel.load(modelPath)
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    //to get a prediction need to pass DF
    val
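
One common pattern, sketched here in PySpark rather than the question's Scala, is to build the DataFrame once per micro-batch inside foreachRDD and run a single transform over it, instead of trying to score record by record. The model path and the "text" column name are placeholders and must match what the pipeline was trained with.

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

model = PipelineModel.load("/models/text_pipeline")          # placeholder path

def predict_batch(rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    # One DataFrame per micro-batch, so the fitted pipeline runs in a single transform().
    batch_df = spark.createDataFrame(rdd.map(lambda text: (text,)), ["text"])
    model.transform(batch_df).select("text", "prediction").show(truncate=False)

# The direct Kafka stream yields (key, value) pairs in the Python API; adjust the
# value extraction if your Kafka integration exposes records differently.
stream.map(lambda record: record[1]).foreachRDD(predict_batch)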

How to combine or merge two sparse vectors in Spark using Java?

筅森魡賤 submitted on 2019-12-11 03:18:05
Question: I used the Java API, i.e. Apache Spark 1.2.0, and created two sparse vectors as follows:

Vector v1 = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
Vector v2 = Vectors.sparse(2, new int[]{0, 1}, new double[]{4, 5});

How can I get a new vector v3 that is formed by combining v1 and v2, so that the result is (5, [0,2,3,4], [1.0, 3.0, 4.0, 5.0])?

Answer 1: I found this problem has been open for a year and is still pending. Here, I solved it by writing a helper function myself, as
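
The idea behind such a helper is to shift the second vector's indices by the first vector's size and build a new sparse vector from the combined indices and values. The question targets the Java API on Spark 1.2.0; the sketch below shows the same idea in PySpark (pyspark.ml.linalg in recent releases, pyspark.mllib.linalg in old ones).

from pyspark.ml.linalg import Vectors

def concat_sparse(v1, v2):
    # v2's indices are shifted by v1.size so the two vectors are laid out end to end.
    indices = list(v1.indices) + [int(i) + v1.size for i in v2.indices]
    values = list(v1.values) + list(v2.values)
    return Vectors.sparse(v1.size + v2.size, indices, values)

v1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
v2 = Vectors.sparse(2, [0, 1], [4.0, 5.0])
print(concat_sparse(v1, v2))   # (5,[0,2,3,4],[1.0,3.0,4.0,5.0])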

Issue with Spark MLlib that causes probability and prediction to be the same for everything

天大地大妈咪最大 submitted on 2019-12-11 02:29:47
Question: I'm learning how to use machine learning with Spark MLlib with the purpose of doing sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip That dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet. This is my current PySpark code:

import csv
from pyspark.sql import Row
from
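
When every row gets the same probability and prediction, the first things worth checking are the label column (is it really varying and numeric?) and the feature extraction (is the tweet text tokenized before hashing?). As a reference point only, not the poster's code, here is a minimal DataFrame-based text-classification pipeline in PySpark; train_df and test_df with a "text" column and a numeric "label" column are assumed.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 16)
idf = IDF(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(maxIter=20)

model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(train_df)
model.transform(test_df).select("text", "probability", "prediction").show(5, False)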

How best to fit many Spark ML models

試著忘記壹切 submitted on 2019-12-11 02:25:33
Question: (PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes.) I'd like to run a bootstrapping analysis, with each bootstrap sample running on a dataset that's too large to fit on a single executor. The naive approach I was going to start with is:

create spark dataframe of training dataset
for i in (1, 1000):
    use df.sample() to create a sample_df
    train the model (logistic classifier) on sample_df

Although each individual model is fit across the cluster, this doesn't seem to be
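
Here is the naive loop from the question written out in PySpark so the bottleneck is visible: every fit is a full cluster job and the jobs run strictly one after another. A df with "features" and "label" columns is assumed; sampling with replacement at fraction 1.0 approximates a bootstrap resample.

from pyspark.ml.classification import LogisticRegression

coefficients = []
for i in range(1000):
    sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
    model = LogisticRegression(maxIter=10).fit(sample_df)     # one cluster job per iteration
    coefficients.append(model.coefficients.toArray())

Because each individual fit may leave the cluster under-utilised, a common refinement is to submit several fits concurrently from separate driver threads; jobs issued from the same SparkContext can run in parallel, particularly with the FAIR scheduler enabled.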

How to save the results of a model to a text file?

℡╲_俬逩灬. submitted on 2019-12-11 02:22:14
Question: I am trying to save the frequent itemsets generated from the model to a text file. The code follows the FPGrowth example in the Spark MLlib library. Using saveAsTextFile directly on the model writes the RDD locations and not the actual values.

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("/home/ponny/Freq")
val data1 = sc.textFile("/home/ponny/Scala_Examples/test.txt")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
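
The itemsets have to be mapped to strings before saveAsTextFile; writing the RDD of FreqItemset objects directly is what produces the unhelpful output. A PySpark sketch of the same idea, with the input path from the question; the minSupport/numPartitions values and the output path are placeholders.

from pyspark.mllib.fpm import FPGrowth

transactions = (sc.textFile("/home/ponny/Freq")
                  .map(lambda s: s.strip().split(" ")))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)

# freqItemsets() is an RDD of FreqItemset(items, freq); render each one as a line of text.
(model.freqItemsets()
      .map(lambda fi: "{}\t{}".format(",".join(fi.items), fi.freq))
      .saveAsTextFile("/home/ponny/freq_itemsets_out"))    # hypothetical output path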

How to get the probabilities of classes in Spark Naive Bayes classifier?

好久不见. submitted on 2019-12-11 02:07:14
Question: I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance I need to get the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:

val thetaMatrix = new DenseMatrix(model.labels.length, model.theta(0).length, model.theta.flatten, true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map { p =>
  val prob = thetaMatrix.multiply(p.features)
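
The question works against the RDD-based mllib NaiveBayesModel, where per-class scores have to be reconstructed from theta and pi by hand. A simpler route, shown below as a PySpark sketch with assumed train_df/test_df DataFrames, is the DataFrame-based API, which exposes the class probabilities directly in a "probability" column.

from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(train_df)                       # expects 'features' and 'label' columns

# 'probability' holds one entry per class label; 'prediction' is the arg-max class.
model.transform(test_df).select("probability", "prediction").show(truncate=False)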