apache-spark-mllib

How to get the maxDepth from a Spark RandomForestRegressionModel

拥有回忆 submitted on 2019-12-11 06:47:12
Question: In Spark (2.1.0) I've used a CrossValidator to train a RandomForestRegressor, using ParamGridBuilder for maxDepth and numTrees:

paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [2, 4, 6, 8, 10]) \
    .addGrid(rf.numTrees, [10, 20, 40, 50]) \
    .build()

After training, I can get the best number of trees:

regressor = cvModel.bestModel.stages[len(cvModel.bestModel.stages) - 1]
print(regressor.getNumTrees)

but I can't work out how to get the best maxDepth. I've read the documentation and I don
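
One common way to recover the winning grid values, sketched here in PySpark and not taken from the original post, is to pair each parameter combination with its cross-validation metric; cvModel is assumed to be the fitted CrossValidatorModel from the question.

# Pair every grid combination with its average CV metric and keep the best one.
# Use min() instead of max() when the evaluator's metric is an error such as RMSE.
best_params, best_metric = max(
    zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics),
    key=lambda pair: pair[1],
)
for param, value in best_params.items():
    print(param.name, "=", value)   # prints the chosen maxDepth and numTrees

A frequently mentioned workaround is to read the value off the best stage directly, e.g. regressor._java_obj.getMaxDepth(), but that relies on a private attribute of the Python wrapper.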

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark and after searching the web we were not able to find much. It would be great if you could guide us. We thought of using flatMap, then realized it won't help; then we tried a couple of foreach loops but could not figure out how to go forward, as each string has to be compared against all the others. Like in the dataset below:

RowFactory.create(0, "Hi I heard about Spark"), RowFactory
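
A minimal sketch of one way to compare every string against every other one: cross-join the DataFrame with itself and score each pair with a UDF. The question's code is Java, but this illustration uses PySpark (crossJoin needs Spark 2.1+) and assumes the third-party jellyfish package is available on the executors; the sample data and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
import jellyfish  # assumed dependency; older releases name the function jaro_winkler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "Hi I heard about Spark"), (1, "I wish Java could use case classes")],
    ["id", "sentence"])

# Wrap the similarity function as a UDF so it can be applied to each pair of rows.
jaro_winkler = udf(lambda a, b: float(jellyfish.jaro_winkler_similarity(a, b)), DoubleType())

pairs = (df.alias("a")
           .crossJoin(df.alias("b"))
           .where(col("a.id") < col("b.id"))    # keep each unordered pair once
           .withColumn("score", jaro_winkler(col("a.sentence"), col("b.sentence"))))
pairs.select(col("a.sentence"), col("b.sentence"), "score").show(truncate=False)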

Spark: add a new fitted stage to an existing PipelineModel without fitting again

☆樱花仙子☆ submitted on 2019-12-11 05:33:46
Question: I have a saved PipelineModel:

pipe_model = pipe.fit(df_train)
pipe_model.write().overwrite().save("/user/pipe_text_2")

And now I want to add to this pipeline a new, already fitted PipelineModel:

pipe_model = PipelineModel.load("/user/pipe_text_2")
df2 = pipe_model.transform(df1)
kmeans = KMeans(k=20)
pipe2 = Pipeline(stages=[kmeans])
pipe_model2 = pipe2.fit(df2)

Is that possible without fitting it again, in order to obtain a new PipelineModel rather than a new Pipeline? The ideal thing would be the
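
A PipelineModel can be rebuilt directly from a list of already-fitted stages, so the earlier stages never need to be refit. A minimal PySpark sketch reusing the names from the question; the output path below is hypothetical.

from pyspark.ml import PipelineModel

old_model = PipelineModel.load("/user/pipe_text_2")      # the previously saved pipeline
kmeans_model = pipe_model2.stages[-1]                    # the already-fitted KMeansModel

# Concatenate the fitted transformers into a new PipelineModel; nothing is refit.
combined = PipelineModel(stages=old_model.stages + [kmeans_model])
combined.write().overwrite().save("/user/pipe_text_3")   # hypothetical path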

Spark ML KMeans gives: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (vector) => int)

别说谁变了你拦得住时间么 submitted on 2019-12-11 05:03:59
Question: I try to load the KMeansModel and then get the label out of it. Here is the code that I have written:

val kMeansModel = KMeansModel.load(trainedMlModel.mlModelFilePath)
val arrayOfElements = measurePoint.measurements.map(a => a._2).toSeq
println(s"ArrayOfELements::::$arrayOfElements")
val arrayDF = sparkContext.parallelize(arrayOfElements).toDF()
arrayDF.show()
val vectorDF = new VectorAssembler().setInputCols(arrayDF.columns).setOutputCol("features").transform(arrayDF)
vectorDF.printSchema
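
This exception often points to a mismatch between the assembled "features" vector and what the model was trained on (different columns, hence a different dimensionality), or to nulls in the input. A small PySpark sanity-check sketch, with a placeholder model path and input_df standing in for the question's arrayDF:

from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler

model = KMeansModel.load("/models/kmeans")                  # placeholder path
expected = len(model.clusterCenters()[0])                   # dimensionality seen at training time

assembled = (VectorAssembler(inputCols=input_df.columns, outputCol="features")
             .transform(input_df))
actual = len(assembled.first()["features"])

if actual != expected:
    raise ValueError("model expects %d features but got %d" % (expected, actual))

predictions = model.transform(assembled)                    # adds the cluster label column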

Create a DataFrame in Spark Streaming

痴心易碎 submitted on 2019-12-11 04:22:49
Question: I've connected a Kafka stream to Spark, and I've trained an Apache Spark MLlib model to make predictions based on streamed text. My problem is that to get a prediction I need to pass a DataFrame.

//kafka stream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
//load mllib model
val model = PipelineModel.load(modelPath)
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    //to get a prediction need to pass DF
    val
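
One common pattern, sketched here in PySpark rather than the question's Scala, is to build the DataFrame once per micro-batch inside foreachRDD and run a single transform over it, instead of trying to score record by record. The model path and the "text" column name are placeholders and must match what the pipeline was trained with.

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

model = PipelineModel.load("/models/text_pipeline")          # placeholder path

def predict_batch(rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    # One DataFrame per micro-batch, so the fitted pipeline runs in a single transform().
    batch_df = spark.createDataFrame(rdd.map(lambda text: (text,)), ["text"])
    model.transform(batch_df).select("text", "prediction").show(truncate=False)

# The direct Kafka stream yields (key, value) pairs in the Python API; adjust the
# value extraction if your Kafka integration exposes records differently.
stream.map(lambda record: record[1]).foreachRDD(predict_batch)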

How to combine or merge two sparse vectors in Spark using Java?

筅森魡賤 submitted on 2019-12-11 03:18:05
Question: I used the Java API, i.e. Apache Spark 1.2.0, and created two sparse vectors as follows:

Vector v1 = Vectors.sparse(3, new int[]{0, 2}, new double[]{1.0, 3.0});
Vector v2 = Vectors.sparse(2, new int[]{0, 1}, new double[]{4, 5});

How can I get a new vector v3 that is formed by combining v1 and v2, so that the result is (5, [0,2,3,4], [1.0, 3.0, 4.0, 5.0])?

Answer 1: I found this problem has been open for a year and is still pending. Here, I solved it by writing a helper function myself, as
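
The idea behind such a helper is to shift the second vector's indices by the first vector's size and build a new sparse vector from the combined indices and values. The question targets the Java API on Spark 1.2.0; the sketch below shows the same idea in PySpark (pyspark.ml.linalg in recent releases, pyspark.mllib.linalg in old ones).

from pyspark.ml.linalg import Vectors

def concat_sparse(v1, v2):
    # v2's indices are shifted by v1.size so the two vectors are laid out end to end.
    indices = list(v1.indices) + [int(i) + v1.size for i in v2.indices]
    values = list(v1.values) + list(v2.values)
    return Vectors.sparse(v1.size + v2.size, indices, values)

v1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
v2 = Vectors.sparse(2, [0, 1], [4.0, 5.0])
print(concat_sparse(v1, v2))   # (5,[0,2,3,4],[1.0,3.0,4.0,5.0])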

Issue with Spark MLlib that causes probability and prediction to be the same for everything

天大地大妈咪最大 submitted on 2019-12-11 02:29:47
Question: I'm learning how to use machine learning with Spark MLlib with the purpose of doing sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip That dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet. This is my current PySpark code:

import csv
from pyspark.sql import Row
from
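
When every row gets the same probability and prediction, the first things worth checking are the label column (is it really varying and numeric?) and the feature extraction (is the tweet text tokenized before hashing?). As a reference point only, not the poster's code, here is a minimal DataFrame-based text-classification pipeline in PySpark; train_df and test_df with a "text" column and a numeric "label" column are assumed.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 16)
idf = IDF(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(maxIter=20)

model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(train_df)
model.transform(test_df).select("text", "probability", "prediction").show(5, False)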

How best to fit many Spark ML models

試著忘記壹切 submitted on 2019-12-11 02:25:33
Question: (PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes.) I'd like to run a bootstrapping analysis, with each bootstrap sample running on a dataset that's too large to fit on a single executor. The naive approach I was going to start with is:

create spark dataframe of training dataset
for i in (1, 1000):
    use df.sample() to create a sample_df
    train the model (logistic classifier) on sample_df

Although each individual model is fit across the cluster, this doesn't seem to be
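
Here is the naive loop from the question written out in PySpark so the bottleneck is visible: every fit is a full cluster job and the jobs run strictly one after another. A df with "features" and "label" columns is assumed; sampling with replacement at fraction 1.0 approximates a bootstrap resample.

from pyspark.ml.classification import LogisticRegression

coefficients = []
for i in range(1000):
    sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
    model = LogisticRegression(maxIter=10).fit(sample_df)     # one cluster job per iteration
    coefficients.append(model.coefficients.toArray())

Because each individual fit may leave the cluster under-utilised, a common refinement is to submit several fits concurrently from separate driver threads; jobs issued from the same SparkContext can run in parallel, particularly with the FAIR scheduler enabled.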

How to save the results of a model to a text file?

℡╲_俬逩灬. submitted on 2019-12-11 02:22:14
Question: I am trying to save the frequent itemsets generated from the model to a text file. The code follows the FPGrowth example in the Spark MLlib library. Using saveAsTextFile directly on the model writes the RDD locations and not the actual values.

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("/home/ponny/Freq")
val data1 = sc.textFile("/home/ponny/Scala_Examples/test.txt")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
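
The itemsets have to be mapped to strings before saveAsTextFile; writing the RDD of FreqItemset objects directly is what produces the unhelpful output. A PySpark sketch of the same idea, with the input path from the question; the minSupport/numPartitions values and the output path are placeholders.

from pyspark.mllib.fpm import FPGrowth

transactions = (sc.textFile("/home/ponny/Freq")
                  .map(lambda s: s.strip().split(" ")))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)

# freqItemsets() is an RDD of FreqItemset(items, freq); render each one as a line of text.
(model.freqItemsets()
      .map(lambda fi: "{}\t{}".format(",".join(fi.items), fi.freq))
      .saveAsTextFile("/home/ponny/freq_itemsets_out"))    # hypothetical output path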

How to get the probabilities of classes in Spark Naive Bayes classifier?

好久不见. submitted on 2019-12-11 02:07:14
Question: I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance I need to get the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:

val thetaMatrix = new DenseMatrix(model.labels.length, model.theta(0).length, model.theta.flatten, true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map { p =>
  val prob = thetaMatrix.multiply(p.features)
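
The question works against the RDD-based mllib NaiveBayesModel, where per-class scores have to be reconstructed from theta and pi by hand. A simpler route, shown below as a PySpark sketch with assumed train_df/test_df DataFrames, is the DataFrame-based API, which exposes the class probabilities directly in a "probability" column.

from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(train_df)                       # expects 'features' and 'label' columns

# 'probability' holds one entry per class label; 'prediction' is the arg-max class.
model.transform(test_df).select("probability", "prediction").show(truncate=False)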