apache-spark-mllib

How to plot ROC curve and precision-recall curve from BinaryClassificationMetrics

寵の児 posted on 2019-12-10 13:14:33
Question: I was trying to plot the ROC curve and the precision-recall curve on a graph. The points are generated from Spark MLlib's BinaryClassificationMetrics. Following the Spark documentation at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html, I get output like this:

[(1.0,1.0), (0.0,0.4444444444444444)] - Precision
[(1.0,1.0), (0.0,1.0)] - Recall
[(1.0,1.0), (0.0,0.6153846153846153)] - F1Measure
[(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)] - Precision-Recall curve
[(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)] - ROC
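One way to get from those metrics to an actual plot is to bring the scored points to the driver and draw them locally. The sketch below is a PySpark variant of that idea; scoreAndLabels is a hypothetical RDD of (score, label) pairs, and the curves are recomputed locally with scikit-learn and drawn with matplotlib, since as far as I know the Python wrapper of BinaryClassificationMetrics exposes only the area metrics. This only works when the scored data fits in driver memory.

# Minimal sketch, assuming `scoreAndLabels` is an RDD of (score, label) pairs
# small enough to collect, and scikit-learn and matplotlib are installed on the driver.
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from sklearn.metrics import roc_curve, precision_recall_curve
import matplotlib.pyplot as plt

metrics = BinaryClassificationMetrics(scoreAndLabels)
print("Area under ROC = %s" % metrics.areaUnderROC)
print("Area under PR  = %s" % metrics.areaUnderPR)

# Collect the scored points and recompute the curve points locally.
local = scoreAndLabels.collect()
scores = [s for s, _ in local]
labels = [l for _, l in local]

fpr, tpr, _ = roc_curve(labels, scores)
precision, recall, _ = precision_recall_curve(labels, scores)

plt.figure()
plt.plot(fpr, tpr)              # ROC curve
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")

plt.figure()
plt.plot(recall, precision)     # precision-recall curve
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()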

Calculating standard error of estimate, Wald-Chi Square statistic, p-value with logistic regression in Spark

懵懂的女人 posted on 2019-12-10 13:07:59
Question: I was trying to build a logistic regression model on some sample data. The output we can get from the model is the weights of the features used to build it. I could not find a Spark API for the standard error of estimate, the Wald chi-square statistic, p-values, etc. I am pasting my code below as an example:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg
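The RDD-based LogisticRegressionWithLBFGS really does return only the weights. If switching to the DataFrame-based API is acceptable, GeneralizedLinearRegression with a binomial family (Spark 2.0+) reports standard errors, Wald/t statistics and p-values through its training summary. A minimal PySpark sketch, assuming a DataFrame df with "label" and "features" columns:

# Sketch, assuming Spark >= 2.0 and a DataFrame `df` with a 0/1 "label"
# column and a Vector "features" column.
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="label", featuresCol="features")
model = glr.fit(df)
summary = model.summary

print("Coefficients:      ", model.coefficients)
print("Standard errors:   ", summary.coefficientStandardErrors)
print("Wald/t statistics: ", summary.tValues)
print("p-values:          ", summary.pValues)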

After installing sparknlp, cannot import sparknlp

[亡魂溺海] posted on 2019-12-10 11:33:20
Question: The following ran successfully on a Cloudera CDSW cluster gateway:

import pyspark
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.2.3")
    .getOrCreate()
)

which produces this output:

Ivy Default Cache set to: /home/cdsw/.ivy2/cache
The jars for the packages stored in: /home/cdsw/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0
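When import sparknlp then fails, the usual explanation is that spark.jars.packages only fetches the JVM jar through Ivy; it does not put the sparknlp Python module on the PYTHONPATH. A hedged sketch of the common remedy, assuming the Python package is published as spark-nlp and installable with pip:

# Sketch of the usual fix (assumption: the Python module ships separately
# from the jar). Install it on the driver/gateway first, e.g.:
#   pip install spark-nlp
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.2.3")
         .getOrCreate())

import sparknlp  # succeeds only once the Python package is installed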

Debugging large task sizes in Spark MLlib

亡梦爱人 posted on 2019-12-10 09:56:23
Question: In Apache Spark (Scala shell), I am attempting:

val model = ALS.trainImplicit(training, rank, numIter)

where training is a million-row file partitioned into 100 partitions, rank=20, and numIter=20. I get a string of messages of the form:

WARN scheduler.TaskSetManager: Stage 2175 contains a task of very large size (101 KB). The maximum recommended task size is 100 KB.

How do I go about debugging this? I've heard broadcast variables are useful in reducing task size, but in this case there's no
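Two things that are often worth trying against this warning are using more (smaller) partitions, so each serialized task stays small, and setting a checkpoint directory, since the lineage that accumulates across ALS iterations is a frequent source of steadily growing task sizes. A PySpark sketch of both, with illustrative values only:

# Sketch; `training` is assumed to be an RDD of Rating objects, and the
# partition count and checkpoint path are placeholders.
from pyspark.mllib.recommendation import ALS, Rating

training_repart = training.repartition(400)      # smaller per-task payload
sc.setCheckpointDir("/tmp/spark-checkpoints")    # truncates the growing lineage

model = ALS.trainImplicit(training_repart, rank=20, iterations=20)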

Converting a vector column in a dataframe back into an array column

南笙酒味 posted on 2019-12-09 12:41:36
Question: I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried several variants of the following udf, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)})
val result = df.withColumn("dist", toDf4(df("dist"))).select("dist")

Answer 1: I think it's easiest to do it by going to the RDD
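The question is in Scala, but the same conversion is easy to sketch in PySpark: a udf that calls Vector.toArray() and casts each element, or the built-in vector_to_array on Spark 3.0+. Assuming the dist values are whole numbers:

# Sketch: convert a Vector column back to an array-of-int column.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

to_int_array = udf(lambda v: [int(x) for x in v.toArray()] if v is not None else None,
                   ArrayType(IntegerType()))

result = df.withColumn("dist", to_int_array(col("dist")))

# On Spark 3.0+ the same thing works without a udf:
# from pyspark.ml.functions import vector_to_array
# result = df.withColumn("dist", vector_to_array("dist").cast("array<int>"))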

Scaling each column of a dataframe

天涯浪子 posted on 2019-12-09 07:36:27
Question: I am trying to scale every column of a dataframe. First I convert each column into a vector, and then I use the ML MinMaxScaler. Is there a better/more elegant way to apply the same function to each column other than simply repeating it?

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.DataFrame

val toVector = udf((vct:Double) => Vectors
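One way to avoid repeating the scaler per column is to assemble all columns into a single vector and let one MinMaxScaler handle every feature at once, since it scales each vector component independently. A PySpark sketch of that pipeline, assuming all columns of df are numeric:

# Sketch: scale every numeric column with a single MinMaxScaler.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")

scaled = Pipeline(stages=[assembler, scaler]).fit(df).transform(df)
scaled.select("scaled_features").show(truncate=False)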

How to generate tuples of (original label, predicted label) on Spark with MLlib?

坚强是说给别人听的谎言 posted on 2019-12-09 01:38:10
Question: I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel); those tuples can then be used for model evaluation purposes. What is the best way to achieve this? Thanks. Assuming parsedTrainData is an RDD of LabeledPoint:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

parsedTrainData = sc.parallelize(
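For the RDD-based tree models, the pattern used in the MLlib examples is to predict on the feature vectors and zip the result back with the original labels. A sketch, assuming model is an already-trained DecisionTreeModel and parsedTrainData is the RDD of LabeledPoint from the question:

# Sketch: build (originalLabel, predictedLabel) tuples from an RDD of
# LabeledPoint and a trained model.
predictions = model.predict(parsedTrainData.map(lambda lp: lp.features))
labelsAndPredictions = parsedTrainData.map(lambda lp: lp.label).zip(predictions)

labelsAndPredictions.take(5)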

Spark Matrix multiplication with python

我怕爱的太早我们不能终老 posted on 2019-12-09 00:13:25
Question: I am trying to do matrix multiplication using Apache Spark and Python. Here is my data:

from pyspark.mllib.linalg.distributed import RowMatrix

My RDDs of vectors:

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My matrices:

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to do the matrix multiplication, but I'm afraid it will have a long processing time. Here is my function:
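mat1 * mat2 is not defined for RowMatrix, but a distributed multiply is available on BlockMatrix, so one option is to convert both sides and multiply there. A sketch, under the assumption that BlockMatrix is available in the Python API (Spark 1.6+):

# Sketch: distributed matrix multiplication via BlockMatrix.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

def to_block_matrix(rows):
    # Attach row indices, wrap as an IndexedRowMatrix, then block it.
    indexed = rows.zipWithIndex().map(lambda x: IndexedRow(x[1], x[0]))
    return IndexedRowMatrix(indexed).toBlockMatrix()

mat1 = to_block_matrix(rows_1)    # 3 x 2
mat2 = to_block_matrix(rows_2)    # 2 x 2

product = mat1.multiply(mat2)     # 3 x 2 result, computed on the cluster
print(product.toLocalMatrix())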

Spark - MLlib linear regression intercept and weight NaN [duplicate]

一曲冷凌霜 posted on 2019-12-08 17:08:36
This question already has answers here: Spark MlLib linear regression (Linear least squares) giving random results (2 answers). Closed 3 years ago.

I have been trying to build a regression model on Spark using some custom data, and the intercept and weights are always NaN. This is my data:

data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]), LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]

Output:

(weights=[nan], intercept=nan)

However, if I use this dataset (taken from the Spark examples), it returns non-NaN weights and an intercept.

data = [LabeledPoint(0.0, [0.0]),
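A common cause of NaN weights with the SGD-based regression trainers is the step size diverging on unscaled features in the tens of thousands. Assuming LinearRegressionWithSGD is what is being used here, a sketch of one fix is to standardize the feature (or shrink the step size) before training:

# Sketch: standardize the single feature so SGD can converge instead of
# diverging to NaN on raw values around 27,000.
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]),
        LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]
rdd = sc.parallelize(data)

features = rdd.map(lambda lp: lp.features)
labels = rdd.map(lambda lp: lp.label)

scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)).map(lambda p: LabeledPoint(p[0], p[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=1000, step=0.1, intercept=True)
print(model.weights, model.intercept)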

Formatting data for Spark ML

萝らか妹 posted on 2019-12-08 09:33:46
Question: I'm new to Spark and Spark ML. I generated some data with the function KMeansDataGenerator.generateKMeansRDD, but I fail when formatting it so that it can then be used by an ML algorithm (here K-Means). The error is

Exception in thread "main" java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,false) is not supported.

It happens when using VectorAssembler.

val generatedData = KMeansDataGenerator.generateKMeansRDD(sc, numPoints = 1000, k = 5, d = 3, r = 5, numPartitions
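VectorAssembler does not accept ArrayType columns (that is exactly what the error says), so the usual workaround is to skip it and build the Vector column directly from the generated RDD. The question is in Scala, but here is a PySpark sketch of the same idea, assuming generatedData is an RDD whose elements are arrays of doubles:

# Sketch: turn an RDD of double arrays into a DataFrame of Vectors for K-Means.
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

df = generatedData.map(lambda arr: (Vectors.dense(arr),)).toDF(["features"])

kmeans = KMeans(k=5, featuresCol="features")
model = kmeans.fit(df)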