apache-spark-ml

Cannot run RandomForestClassifier from spark ML on a simple example

无人久伴 submitted on 2019-12-12 08:15:09
Question: I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide. Here is the code:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = sqlContext

Failed to load class for data source: Libsvm in spark ML pyspark/scala

為{幸葍}努か submitted on 2019-12-12 04:16:32
Question: When I try to import a libsvm file in pyspark/scala using sqlContext.read.format("libsvm").load, I get the following error: "Failed to load class for data source: Libsvm." At the same time, if I use MLUtils.loadLibSVMFile it works perfectly fine. I need to use both Spark ML (to get class probabilities) and MLlib for an evaluation. I have attached the error screenshot. This is a MapR cluster, Spark version 1.5.2.

Answer 1: The libsvm data source format is only available since Spark 1.6.

Answer 2:

pyspark add new column field with the data frame row number

心不动则不痛 submitted on 2019-12-12 02:54:09
Question: Hi, I'm trying to build a recommendation system with Spark. I have a data frame with users' emails and movie ratings.

df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]), columns=['user','movie','rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user movie rating
aa@gmail.com 2 3
aa@gmail.com 5 5
bb@gmail.com 8 2
cc@gmail.com 9 3

My first doubt: PySpark MLlib doesn't accept emails, am I correct? Because of this I need
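As the asker suspects, MLlib's ALS recommender expects integer user and item ids, so string emails must first be mapped to numbers (spark.ml's StringIndexer does this over a DataFrame column). A minimal pure-Python sketch of the mapping idea, using the sample data from the question:

```python
def index_column(values):
    """Assign each distinct value a dense integer id, in first-seen order."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

users = ["aa@gmail.com", "aa@gmail.com", "bb@gmail.com", "cc@gmail.com"]
ids = index_column(users)
numeric = [ids[u] for u in users]  # dense ids usable as ALS user ids
```

The resulting integer column replaces the email column before feeding the ratings to ALS; the `mapping` dict is kept around to translate recommendations back to emails.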

How to create a Spark DataFrame inside a custom PySpark ML Pipeline _transform() method?

隐身守侯 submitted on 2019-12-11 16:25:32
Question: In Spark's ML Pipelines, a transformer's transform() method takes a Spark DataFrame and returns a DataFrame. My custom _transform() method uses the DataFrame that's passed in to create an RDD before processing it. This means the results of my algorithm have to be converted back into a DataFrame before being returned from _transform(). So how should I create the DataFrame from the RDD inside _transform()? Normally I would use SparkSession.createDataFrame(). But this means passing a

Logistic regression with spark ml (data frames)

牧云@^-^@ submitted on 2019-12-11 13:07:33
Question: I wrote the following code for logistic regression; I want to use the pipeline API provided by spark.ml. However, it gives me an error when I try to print the coefficients and intercept. I am also having trouble computing the confusion matrix and other metrics like precision and recall.

#Logistic Regression:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.types
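For the metrics part of the question: once the fitted model's predictions are pulled to the driver as (label, prediction) pairs (e.g. via predictions.select("label", "prediction").collect() in PySpark), the confusion matrix, precision, and recall can be tallied directly. A pure-Python sketch for the binary case (Spark's own pyspark.mllib.evaluation.MulticlassMetrics offers the same figures without collecting):

```python
from collections import Counter

def confusion_and_metrics(pairs):
    """pairs: iterable of (label, prediction) tuples from a binary classifier."""
    counts = Counter(pairs)
    tp = counts[(1.0, 1.0)]  # true positives
    fp = counts[(0.0, 1.0)]  # false positives
    fn = counts[(1.0, 0.0)]  # false negatives
    tn = counts[(0.0, 0.0)]  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# hypothetical collected (label, prediction) pairs
m = confusion_and_metrics([(1.0, 1.0), (1.0, 0.0), (0.0, 1.0),
                           (0.0, 0.0), (1.0, 1.0)])
```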

Spark Scala Error while saving DataFrame to Hive

こ雲淡風輕ζ submitted on 2019-12-11 12:16:34
Question: I have built a DataFrame by combining multiple Arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Below are the code and the error I got. I tried adding the case class both outside and inside the main def, but I still get the same error.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, DataFrame}
import org.apache.spark.ml.feature.RFormula
import java.text._
import java.util.Date
import org.apache.hadoop

Spark 2.1.0, ML RandomForest: java.lang.UnsupportedOperationException: empty.maxBy

自古美人都是妖i submitted on 2019-12-11 09:05:14
Question: I am trying to fit an ML CrossValidator on a DataFrame with the following schema:

root
 |-- userID: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

I get a java.lang.UnsupportedOperationException: empty.maxBy when I fit the CrossValidator. I have read this bug report; it says that this exception happens when there are no features: "In the case of empty features we fail with a better error message stating: DecisionTree requires number of
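Per the bug report quoted above, the empty.maxBy failure occurs when rows carry an empty features vector, so filtering such rows out before fitting avoids it. A pure-Python sketch of the check, with feature vectors shown as plain lists (in Spark one would filter on the vector's size instead):

```python
# hypothetical (userID, features, label) rows; the second has no features
rows = [("u1", [0.5, 1.2], 1.0),
        ("u2", [], 0.0),
        ("u3", [3.3, 0.1], 1.0)]

# keep only rows whose feature vector is non-empty before fitting
valid = [r for r in rows if len(r[1]) > 0]
```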

How to provide multiple columns to setInputCol()

岁酱吖の submitted on 2019-12-11 07:37:46
Question: I am very new to Spark machine learning. I want to pass multiple columns as features; in the code below I am only passing the Date column to features, but now I want to pass the Userid and Date columns. I tried to use Vector, but it only supports the Double data type, while in my case I have Int and String. I would be thankful for any suggestion, solution, or code example that fulfills my requirement. Code:

case class LabeledDocument(Userid: Double, Date: String, label: Double)
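The usual spark.ml approach here is to run the string column through a StringIndexer (which turns each distinct string into a double) and then combine the numeric columns with a VectorAssembler into a single features vector. A pure-Python sketch of what that indexing-then-assembling amounts to; StringIndexer orders labels by descending frequency, and the alphabetical tie-breaking used here is an assumption that may differ from a given Spark version:

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a double, most frequent first
    (mirrors the default ordering of spark.ml's StringIndexer)."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

# hypothetical (Userid, Date) rows from the question's schema
rows = [(4.0, "2017-01-01"), (7.0, "2017-01-02"), (9.0, "2017-01-01")]
date_index = string_index([d for _, d in rows])

# "assemble": one numeric feature vector per row, as VectorAssembler would
features = [[userid, date_index[d]] for userid, d in rows]
```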

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we were not able to find much. It would be great if you could guide us. We thought of using flatMap, then realized it wouldn't help; then we tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below.

RowFactory.create(0, "Hi I heard about Spark"),
RowFactory
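Spark does not ship a Jaro-Winkler function, but the metric itself is small enough to implement directly and wrap in a UDF, with the all-pairs comparison expressed as a crossJoin of the dataset with itself. A plain-Python sketch of the standard variant (scaling factor p = 0.1, prefix capped at 4 characters), independent of any Spark API:

```python
def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # match window around each position
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):      # greedily pair up matching characters
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                      # count transpositions among matches
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    jaro = (matches / len1 + matches / len2 + (matches - t) / matches) / 3
    prefix = 0                       # Winkler bonus for a shared prefix
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```

Registered as a UDF, this can score every pair produced by crossing the Dataset with itself, which replaces the nested foreach loops the asker attempted; note the cross join is quadratic in the number of rows.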

Spark|ML|Random Forest|Load trained model from .txt of RandomForestClassificationModel. toDebugString

こ雲淡風輕ζ submitted on 2019-12-11 05:42:59
Question: Using Spark 1.6 and the ML library, I am saving the result of a trained RandomForestClassificationModel using toDebugString:

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
val stringModel = rfModel.toDebugString
//save stringModel into a .txt file on the driver

My idea is to later read the .txt file and reload the trained random forest. Is that possible? Thanks!

Answer 1: That won't work. toDebugString is merely debug info to understand how it's