apache-spark-mllib

How to plot ROC curve and precision-recall curve from BinaryClassificationMetrics

寵の児 posted on 2019-12-10 13:14:33
Question: I was trying to plot the ROC curve and the precision-recall curve on a graph. The points are generated from Spark MLlib's BinaryClassificationMetrics. Following the Spark documentation at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html, I get output like this:

[(1.0,1.0), (0.0,0.4444444444444444)] - Precision
[(1.0,1.0), (0.0,1.0)] - Recall
[(1.0,1.0), (0.0,0.6153846153846153)] - F1Measure
[(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)] - Precision-Recall curve
[(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)] - ROC
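One way to get from those metrics to an actual plot is to bring the scored points to the driver and draw them locally. The sketch below is a PySpark variant of that idea; scoreAndLabels is a hypothetical RDD of (score, label) pairs, and the curves are recomputed locally with scikit-learn and drawn with matplotlib, since as far as I know the Python wrapper of BinaryClassificationMetrics exposes only the area metrics. This only works when the scored data fits in driver memory.

# Minimal sketch, assuming `scoreAndLabels` is an RDD of (score, label) pairs
# small enough to collect, and scikit-learn and matplotlib are installed on the driver.
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from sklearn.metrics import roc_curve, precision_recall_curve
import matplotlib.pyplot as plt

metrics = BinaryClassificationMetrics(scoreAndLabels)
print("Area under ROC = %s" % metrics.areaUnderROC)
print("Area under PR  = %s" % metrics.areaUnderPR)

# Collect the scored points and recompute the curve points locally.
local = scoreAndLabels.collect()
scores = [s for s, _ in local]
labels = [l for _, l in local]

fpr, tpr, _ = roc_curve(labels, scores)
precision, recall, _ = precision_recall_curve(labels, scores)

plt.figure()
plt.plot(fpr, tpr)              # ROC curve
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")

plt.figure()
plt.plot(recall, precision)     # precision-recall curve
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()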

Calculating standard error of estimate, Wald-Chi Square statistic, p-value with logistic regression in Spark

懵懂的女人 posted on 2019-12-10 13:07:59
Question: I was trying to build a logistic regression model on some sample data. The output we can get from the model is the weights of the features used to build it. I could not find a Spark API for the standard error of estimate, the Wald chi-square statistic, p-values, etc. I am pasting my code below as an example:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg
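The RDD-based LogisticRegressionWithLBFGS really does return only the weights. If switching to the DataFrame-based API is acceptable, GeneralizedLinearRegression with a binomial family (Spark 2.0+) reports standard errors, Wald/t statistics and p-values through its training summary. A minimal PySpark sketch, assuming a DataFrame df with "label" and "features" columns:

# Sketch, assuming Spark >= 2.0 and a DataFrame `df` with a 0/1 "label"
# column and a Vector "features" column.
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="label", featuresCol="features")
model = glr.fit(df)
summary = model.summary

print("Coefficients:      ", model.coefficients)
print("Standard errors:   ", summary.coefficientStandardErrors)
print("Wald/t statistics: ", summary.tValues)
print("p-values:          ", summary.pValues)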

After installing sparknlp, cannot import sparknlp

[亡魂溺海] posted on 2019-12-10 11:33:20
Question: The following ran successfully on a Cloudera CDSW cluster gateway:

import pyspark
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.2.3")
    .getOrCreate()
)

which produces this output:

Ivy Default Cache set to: /home/cdsw/.ivy2/cache
The jars for the packages stored in: /home/cdsw/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0
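When import sparknlp then fails, the usual explanation is that spark.jars.packages only fetches the JVM jar through Ivy; it does not put the sparknlp Python module on the PYTHONPATH. A hedged sketch of the common remedy, assuming the Python package is published as spark-nlp and installable with pip:

# Sketch of the usual fix (assumption: the Python module ships separately
# from the jar). Install it on the driver/gateway first, e.g.:
#   pip install spark-nlp
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.2.3")
         .getOrCreate())

import sparknlp  # succeeds only once the Python package is installed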

Debugging large task sizes in Spark MLlib

亡梦爱人 posted on 2019-12-10 09:56:23
Question: In Apache Spark (Scala shell), I am attempting:

val model = ALS.trainImplicit(training, rank, numIter)

where training is a million-row file partitioned into 100 partitions, rank=20, and numIter=20. I get a string of messages of the form:

WARN scheduler.TaskSetManager: Stage 2175 contains a task of very large size (101 KB). The maximum recommended task size is 100 KB.

How do I go about debugging this? I've heard broadcast variables are useful in reducing task size, but in this case there's no
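Two things that are often worth trying against this warning are using more (smaller) partitions, so each serialized task stays small, and setting a checkpoint directory, since the lineage that accumulates across ALS iterations is a frequent source of steadily growing task sizes. A PySpark sketch of both, with illustrative values only:

# Sketch; `training` is assumed to be an RDD of Rating objects, and the
# partition count and checkpoint path are placeholders.
from pyspark.mllib.recommendation import ALS, Rating

training_repart = training.repartition(400)      # smaller per-task payload
sc.setCheckpointDir("/tmp/spark-checkpoints")    # truncates the growing lineage

model = ALS.trainImplicit(training_repart, rank=20, iterations=20)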

Converting a vector column in a dataframe back into an array column

南笙酒味 posted on 2019-12-09 12:41:36
Question: I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried several variants of the following udf, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)})
val result = df.withColumn("dist", toDf4(df("dist"))).select("dist")

Answer 1: I think it's easiest to do it by going to the RDD
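The question is in Scala, but the same conversion is easy to sketch in PySpark: a udf that calls Vector.toArray() and casts each element, or the built-in vector_to_array on Spark 3.0+. Assuming the dist values are whole numbers:

# Sketch: convert a Vector column back to an array-of-int column.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

to_int_array = udf(lambda v: [int(x) for x in v.toArray()] if v is not None else None,
                   ArrayType(IntegerType()))

result = df.withColumn("dist", to_int_array(col("dist")))

# On Spark 3.0+ the same thing works without a udf:
# from pyspark.ml.functions import vector_to_array
# result = df.withColumn("dist", vector_to_array("dist").cast("array<int>"))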

Scaling each column of a dataframe

天涯浪子 posted on 2019-12-09 07:36:27
Question: I am trying to scale every column of a dataframe. First I convert each column into a vector, and then I use the ML MinMaxScaler. Is there a better/more elegant way to apply the same function to each column other than simply repeating it?

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.DataFrame

val toVector = udf((vct:Double) => Vectors
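One way to avoid repeating the scaler per column is to assemble all columns into a single vector and let one MinMaxScaler handle every feature at once, since it scales each vector component independently. A PySpark sketch of that pipeline, assuming all columns of df are numeric:

# Sketch: scale every numeric column with a single MinMaxScaler.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")

scaled = Pipeline(stages=[assembler, scaler]).fit(df).transform(df)
scaled.select("scaled_features").show(truncate=False)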

How to generate tuples of (original label, predicted label) on Spark with MLlib?

坚强是说给别人听的谎言 posted on 2019-12-09 01:38:10
Question: I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel); those tuples can then be used for model evaluation purposes. What is the best way to achieve this? Thanks. Assuming parsedTrainData is an RDD of LabeledPoint:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

parsedTrainData = sc.parallelize(
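For the RDD-based tree models, the pattern used in the MLlib examples is to predict on the feature vectors and zip the result back with the original labels. A sketch, assuming model is an already-trained DecisionTreeModel and parsedTrainData is the RDD of LabeledPoint from the question:

# Sketch: build (originalLabel, predictedLabel) tuples from an RDD of
# LabeledPoint and a trained model.
predictions = model.predict(parsedTrainData.map(lambda lp: lp.features))
labelsAndPredictions = parsedTrainData.map(lambda lp: lp.label).zip(predictions)

labelsAndPredictions.take(5)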

Spark Matrix multiplication with python

我怕爱的太早我们不能终老 posted on 2019-12-09 00:13:25
Question: I am trying to do matrix multiplication using Apache Spark and Python. Here is my data:

from pyspark.mllib.linalg.distributed import RowMatrix

My RDDs of vectors:

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My matrices:

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to do the matrix multiplication, but I'm afraid it will have a long processing time. Here is my function:
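mat1 * mat2 is not defined for RowMatrix, but a distributed multiply is available on BlockMatrix, so one option is to convert both sides and multiply there. A sketch, under the assumption that BlockMatrix is available in the Python API (Spark 1.6+):

# Sketch: distributed matrix multiplication via BlockMatrix.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

def to_block_matrix(rows):
    # Attach row indices, wrap as an IndexedRowMatrix, then block it.
    indexed = rows.zipWithIndex().map(lambda x: IndexedRow(x[1], x[0]))
    return IndexedRowMatrix(indexed).toBlockMatrix()

mat1 = to_block_matrix(rows_1)    # 3 x 2
mat2 = to_block_matrix(rows_2)    # 2 x 2

product = mat1.multiply(mat2)     # 3 x 2 result, computed on the cluster
print(product.toLocalMatrix())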

Spark - MLlib linear regression intercept and weight NaN [duplicate]

一曲冷凌霜 posted on 2019-12-08 17:08:36
This question already has answers here: Spark MlLib linear regression (Linear least squares) giving random results (2 answers). Closed 3 years ago.

I have been trying to build a regression model on Spark using some custom data, and the intercept and weights are always NaN. This is my data:

data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]), LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]

Output:

(weights=[nan], intercept=nan)

However, if I use this dataset (taken from the Spark examples), it returns non-NaN weights and an intercept.

data = [LabeledPoint(0.0, [0.0]),
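A common cause of NaN weights with the SGD-based regression trainers is the step size diverging on unscaled features in the tens of thousands. Assuming LinearRegressionWithSGD is what is being used here, a sketch of one fix is to standardize the feature (or shrink the step size) before training:

# Sketch: standardize the single feature so SGD can converge instead of
# diverging to NaN on raw values around 27,000.
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]),
        LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]
rdd = sc.parallelize(data)

features = rdd.map(lambda lp: lp.features)
labels = rdd.map(lambda lp: lp.label)

scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)).map(lambda p: LabeledPoint(p[0], p[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=1000, step=0.1, intercept=True)
print(model.weights, model.intercept)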

Formatting data for Spark ML

萝らか妹 posted on 2019-12-08 09:33:46
Question: I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD, but I fail when formatting it so that it can then be used by an ML algorithm (here K-Means). The error is

Exception in thread "main" java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,false) is not supported.

It happens when using VectorAssembler.

val generatedData = KMeansDataGenerator.generateKMeansRDD(sc, numPoints = 1000, k = 5, d = 3, r = 5, numPartitions
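VectorAssembler does not accept ArrayType columns (that is exactly what the error says), so the usual workaround is to skip it and build the Vector column directly from the generated RDD. The question is in Scala, but here is a PySpark sketch of the same idea, assuming generatedData is an RDD whose elements are arrays of doubles:

# Sketch: turn an RDD of double arrays into a DataFrame of Vectors for K-Means.
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

df = generatedData.map(lambda arr: (Vectors.dense(arr),)).toDF(["features"])

kmeans = KMeans(k=5, featuresCol="features")
model = kmeans.fit(df)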