apache-spark-mllib

What type should the dense vector be when using a UDF in PySpark? [duplicate]

…衆ロ難τιáo~ submitted on 2019-12-01 04:37:03
Question: This question already has an answer here: How to convert ArrayType to DenseVector in PySpark DataFrame? (1 answer). Closed last year. I want to convert a List to a Vector in PySpark and then use that column as input to a machine learning model for training. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return from my udf function?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib
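The linked duplicate's approach: in Spark 1.6 the DataFrame vector type lives in pyspark.mllib.linalg as VectorUDT (pyspark.ml.linalg only appeared in 2.0), so a UDF can declare VectorUDT() as its return type and return Vectors.dense(...). A minimal sketch, assuming a DataFrame df with an array<double> column named "features_list" (both names are assumptions, not from the question):

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

# UDF that turns a list/array column into an MLlib DenseVector.
list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

df_vec = df.withColumn("features", list_to_vector(df["features_list"]))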

Apache Spark — MLlib — Collaborative filtering

南楼画角 submitted on 2019-12-01 03:53:24
I'm trying to use MLlib for my collaborative filtering. I encounter the following error in my Scala program when I run it in Apache Spark 1.0.0.

14/07/15 16:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/15 16:16:31 WARN LoadSnappy: Snappy native library not loaded
14/07/15 16:16:31 INFO FileInputFormat: Total input paths to process : 1
14/07/15 16:16:38 WARN TaskSetManager: Lost TID 10 (task 80.0:0)
14/07/15 16:16:38 WARN TaskSetManager: Loss was due to java.lang.UnsatisfiedLinkError
java.lang
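For reference, a minimal sketch of what an MLlib collaborative-filtering job typically looks like (shown here in PySpark; the original program is Scala, the file path, format, and parameters are placeholders, and the UnsatisfiedLinkError itself concerns the Snappy native library rather than the ALS API):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-sketch")

# Parse "user,product,rating" lines into Rating objects (assumed file layout).
ratings = (sc.textFile("data/ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))

model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)
print(model.predict(1, 2))  # predicted rating for user 1, product 2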

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

无人久伴 submitted on 2019-12-01 03:47:41
I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of a dense Vector of Doubles):

scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]

scala> scaledDataOnly_pruned.show(5)
+--------------------+
|            features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+

A straight conversion to RDD yields an instance of org.apache.spark.rdd.RDD[org
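The conversion boils down to mapping each Row to the vector it contains. A sketch of the idea in PySpark (the original question is Scala, where the analogous step pulls the vector out of each Row, e.g. with getAs; scaledDataOnly_pruned is the asker's DataFrame):

# The single vector-typed column sits in field 0 of each Row.
features_rdd = scaledDataOnly_pruned.rdd.map(lambda row: row[0])

print(features_rdd.first())  # DenseVector([-0.0948..., ...])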

RandomForestClassifier was given input with invalid label column error in Apache Spark

ぐ巨炮叔叔 submitted on 2019-12-01 03:46:55
Question: I am trying to find accuracy using 5-fold cross-validation with a Random Forest classifier model in Scala, but I am getting the following error while running:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

I get the above error at the line val cvModel = cv.fit(trainingData). The code I used for cross-validation of the data set using random forest is as follows:

import org
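The error message points at the fix: the classifier needs label metadata that records the number of classes, which a fitted StringIndexer attaches to the indexed label column. A minimal sketch of that arrangement (shown in PySpark for illustration; the original code is Scala, and the column names are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier

# Indexing the raw label adds the class-count metadata RandomForestClassifier needs.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
pipeline = Pipeline(stages=[label_indexer, rf])

# model = pipeline.fit(trainingData)  # trainingData is the asker's DataFrame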

How to generate tuples of (original label, predicted label) on Spark with MLlib?

余生颓废 submitted on 2019-12-01 00:42:07
I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel); those tuples can then be used for model evaluation. What is the best way to achieve this? Thanks. Assuming parsedTrainData is an RDD of LabeledPoint:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0,-12.0,23.0]),
                                  LabeledPoint(3.0, [-1.0,12.0,-23.0])])
model = DecisionTree
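One standard MLlib pattern is to predict over the whole features RDD and then zip the predictions back with the original labels (PySpark tree models cannot be called from inside a map). A sketch continuing the asker's setup, with made-up training parameters:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0, -12.0, 23.0]),
                                  LabeledPoint(3.0, [-1.0, 12.0, -23.0])])
model = DecisionTree.trainClassifier(parsedTrainData, numClasses=4,
                                     categoricalFeaturesInfo={})

# Predict over the features, then pair each original label with its prediction.
predictions = model.predict(parsedTrainData.map(lambda lp: lp.features))
labels_and_preds = parsedTrainData.map(lambda lp: lp.label).zip(predictions)
print(labels_and_preds.collect())  # [(1.0, ...), (3.0, ...)]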

How to do Stratified sampling with Spark DataFrames? [duplicate]

蓝咒 submitted on 2019-12-01 00:33:33
This question already has an answer here: Stratified sampling in Spark (2 answers). I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5; until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK

Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then
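One interim approach on 1.3 is to drop to the underlying RDD, sample by key there, and rebuild the DataFrame. A sketch, assuming a stratum column literally named "key" and made-up fractions (PySpark exposes sampleByKey on pair RDDs; sampleByKeyExact is available through the Scala/Java API):

# Per-stratum sampling fractions, keyed by the values of the "key" column.
fractions = {"a": 0.5, "b": 0.1}

sampled_rdd = (df.rdd
                 .keyBy(lambda row: row.key)               # stratify on "key"
                 .sampleByKey(False, fractions, seed=42)   # approximate per-key sampling
                 .values())

sampled_df = sqlContext.createDataFrame(sampled_rdd, df.schema)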

RDD transformations and actions can only be invoked by the driver

时间秒杀一切 submitted on 2019-11-30 22:46:52
Error: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data.map(x => x.user).distinct().count()
  val userRecs: RDD[(Int, Set[Int], Set[Int])] = test_data.groupBy
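The general fix for SPARK-5063 is to run the inner action on the driver first (or restructure the computation as a join) and then close over the plain result inside the transformation. A sketch of the exact pattern the error message describes, in PySpark (rdd1 and rdd2 are placeholders; the asker's code is Scala):

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([("a", 10), ("b", 20)])

# Wrong: referencing rdd2 inside a transformation on rdd1 triggers SPARK-5063.
# bad = rdd1.map(lambda x: rdd2.values().count() * x)

# Right: run the action on the driver, then use the plain value in the map.
n = rdd2.values().count()
scaled = rdd1.map(lambda x: n * x)
print(scaled.collect())  # [2, 4, 6]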

Save Apache Spark MLlib model in Python [duplicate]

懵懂的女人 submitted on 2019-11-30 20:32:24
This question already has an answer here: How to save and load MLLib model in Apache Spark? (1 answer). I am trying to save a fitted model to a file in Spark. I have a Spark cluster which trains a RandomForest model, and I would like to save the fitted model and reuse it on another machine. I read some posts on the web which recommend Java serialization. I am doing the equivalent in Python, but it does not work. What is the trick?

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=nb_tree, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=depth)
output
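Rather than Java or Python serialization, MLlib's tree-ensemble models have had native save/load since Spark 1.3, which is the usual way to move a fitted model between machines. A sketch, assuming Spark 1.3+ and a placeholder path:

from pyspark.mllib.tree import RandomForest, RandomForestModel

# model = RandomForest.trainRegressor(...)   # as in the question
model.save(sc, "hdfs:///models/my_rf_model")  # writes the model to a directory

# On the other machine (same Spark/MLlib version):
same_model = RandomForestModel.load(sc, "hdfs:///models/my_rf_model")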
