apache-spark-ml

Cannot run RandomForestClassifier from spark ML on a simple example

无人久伴 submitted on 2019-12-12 08:15:09
Question: I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide. Here is the code:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = sqlContext

Failed to load class for data source: Libsvm in spark ML pyspark/scala

為{幸葍}努か submitted on 2019-12-12 04:16:32
Question: When I try to import a libsvm file in pyspark/scala using sqlContext.read.format("libsvm").load, I get the following error: "Failed to load class for data source: Libsvm." At the same time, if I use MLUtils.loadLibSVMFile it works perfectly fine. I need to use both Spark ML (to get class probabilities) and MLlib for an evaluation. I have attached the error screenshot. This is a MapR cluster, Spark version 1.5.2.

Answer 1: The libsvm data source format is only available since Spark 1.6.

Answer 2:

pyspark add new column field with the data frame row number

心不动则不痛 submitted on 2019-12-12 02:54:09
Question: Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings.

df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]), columns=['user','movie','rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user movie rating
aa@gmail.com 2 3
aa@gmail.com 5 5
bb@gmail.com 8 2
cc@gmail.com 9 3

My first doubt: PySpark MLlib doesn't accept emails, am I correct? Because of this I need
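As the asker suspects, MLlib's ALS recommender expects integer user and item ids, so string emails must first be mapped to numbers (spark.ml's StringIndexer does this over a DataFrame column). A minimal pure-Python sketch of the mapping idea, using the sample data from the question:

```python
def index_column(values):
    """Assign each distinct value a dense integer id, in first-seen order."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

users = ["aa@gmail.com", "aa@gmail.com", "bb@gmail.com", "cc@gmail.com"]
ids = index_column(users)
numeric = [ids[u] for u in users]  # dense ids usable as ALS user ids
```

The resulting integer column replaces the email column before feeding the ratings to ALS; the `mapping` dict is kept around to translate recommendations back to emails.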

How to create a Spark DataFrame inside a custom PySpark ML Pipeline _transform() method?

隐身守侯 submitted on 2019-12-11 16:25:32
Question: In Spark's ML Pipelines, a transformer's transform() method takes a Spark DataFrame and returns a DataFrame. My custom _transform() method uses the DataFrame that's passed in to create an RDD before processing it. This means the results of my algorithm have to be converted back into a DataFrame before being returned from _transform(). So how should I create the DataFrame from the RDD inside _transform()? Normally I would use SparkSession.createDataFrame(). But this means passing a

Logistic regression with spark ml (data frames)

牧云@^-^@ submitted on 2019-12-11 13:07:33
Question: I wrote the following code for logistic regression; I want to use the pipeline API provided by spark.ml. However, it gives me an error when I try to print the coefficients and intercept. I am also having trouble computing the confusion matrix and other metrics like precision and recall.

#Logistic Regression:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.types
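For the metrics part of the question: once the fitted model's predictions are pulled to the driver as (label, prediction) pairs (e.g. via predictions.select("label", "prediction").collect() in PySpark), the confusion matrix, precision, and recall can be tallied directly. A pure-Python sketch for the binary case (Spark's own pyspark.mllib.evaluation.MulticlassMetrics offers the same figures without collecting):

```python
from collections import Counter

def confusion_and_metrics(pairs):
    """pairs: iterable of (label, prediction) tuples from a binary classifier."""
    counts = Counter(pairs)
    tp = counts[(1.0, 1.0)]  # true positives
    fp = counts[(0.0, 1.0)]  # false positives
    fn = counts[(1.0, 0.0)]  # false negatives
    tn = counts[(0.0, 0.0)]  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# hypothetical collected (label, prediction) pairs
m = confusion_and_metrics([(1.0, 1.0), (1.0, 0.0), (0.0, 1.0),
                           (0.0, 0.0), (1.0, 1.0)])
```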

Spark Scala Error while saving DataFrame to Hive

こ雲淡風輕ζ submitted on 2019-12-11 12:16:34
Question: I have built a DataFrame by combining multiple Arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Below are the code and the error I got. I tried adding the case class both outside and inside the main def, but I still get the same error.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, DataFrame}
import org.apache.spark.ml.feature.RFormula
import java.text._
import java.util.Date
import org.apache.hadoop

Spark 2.1.0, ML RandomForest: java.lang.UnsupportedOperationException: empty.maxBy

自古美人都是妖i submitted on 2019-12-11 09:05:14
Question: I am trying to fit an ML CrossValidator on a DataFrame with the following schema:

root
 |-- userID: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

I get a java.lang.UnsupportedOperationException: empty.maxBy when I fit the CrossValidator. I have read this bug report; it says that this exception happens when there are no features: "In the case of empty features we fail with a better error message stating: DecisionTree requires number of
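Per the bug report quoted above, the empty.maxBy failure occurs when rows carry an empty features vector, so filtering such rows out before fitting avoids it. A pure-Python sketch of the check, with feature vectors shown as plain lists (in Spark one would filter on the vector's size instead):

```python
# hypothetical (userID, features, label) rows; the second has no features
rows = [("u1", [0.5, 1.2], 1.0),
        ("u2", [], 0.0),
        ("u3", [3.3, 0.1], 1.0)]

# keep only rows whose feature vector is non-empty before fitting
valid = [r for r in rows if len(r[1]) > 0]
```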

How to provide multiple columns to setInputCol()

岁酱吖の submitted on 2019-12-11 07:37:46
Question: I am very new to Spark machine learning. I want to pass multiple columns as features; in the code below I am only passing the Date column to features, but now I want to pass the Userid and Date columns. I tried to use Vector, but it only supports the Double data type, while in my case I have Int and String. I would be thankful for any suggestion, solution, or code example that fulfills my requirement. Code:

case class LabeledDocument(Userid: Double, Date: String, label: Double)
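The usual spark.ml approach here is to run the string column through a StringIndexer (which turns each distinct string into a double) and then combine the numeric columns with a VectorAssembler into a single features vector. A pure-Python sketch of what that indexing-then-assembling amounts to; StringIndexer orders labels by descending frequency, and the alphabetical tie-breaking used here is an assumption that may differ from a given Spark version:

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a double, most frequent first
    (mirrors the default ordering of spark.ml's StringIndexer)."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

# hypothetical (Userid, Date) rows from the question's schema
rows = [(4.0, "2017-01-01"), (7.0, "2017-01-02"), (9.0, "2017-01-01")]
date_index = string_index([d for _, d in rows])

# "assemble": one numeric feature vector per row, as VectorAssembler would
features = [[userid, date_index[d]] for userid, d in rows]
```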

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we were not able to find much. It would be great if you could guide us. We thought of using flatMap, then realized it wouldn't help; then we tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below.

RowFactory.create(0, "Hi I heard about Spark"),
RowFactory
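Spark does not ship a Jaro-Winkler function, but the metric itself is small enough to implement directly and wrap in a UDF, with the all-pairs comparison expressed as a crossJoin of the dataset with itself. A plain-Python sketch of the standard variant (scaling factor p = 0.1, prefix capped at 4 characters), independent of any Spark API:

```python
def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # match window around each position
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):      # greedily pair up matching characters
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                      # count transpositions among matches
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    jaro = (matches / len1 + matches / len2 + (matches - t) / matches) / 3
    prefix = 0                       # Winkler bonus for a shared prefix
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```

Registered as a UDF, this can score every pair produced by crossing the Dataset with itself, which replaces the nested foreach loops the asker attempted; note the cross join is quadratic in the number of rows.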

Spark|ML|Random Forest|Load trained model from .txt of RandomForestClassificationModel. toDebugString

こ雲淡風輕ζ submitted on 2019-12-11 05:42:59
Question: Using Spark 1.6 and the ML library, I am saving the result of a trained RandomForestClassificationModel using toDebugString:

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
val stringModel = rfModel.toDebugString
//save stringModel into a .txt file on the driver

My idea is to later read the .txt file and reload the trained random forest. Is that possible? Thanks!

Answer 1: That won't work. toDebugString is merely debug info to understand how it's