apache-spark-mllib

How to use Spark MLlib/Pipelines to build one model per user [duplicate]

北城余情 submitted on 2019-12-11 01:20:01
Question: This question already has an answer here: Run ML algorithm inside map function in Spark (1 answer). Closed last year. I want to train different models for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines? If not, what is the easiest/cleanest way to train multiple, separate models for each user? Answer 1: Unfortunately Spark ML doesn't provide a built-in "one model per user" concept, but you can implement custom logic as you wish. I see two …
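The answer is cut off above, but the usual "custom logic" route is to fit the same Pipeline once per user. Below is a minimal Scala sketch, not taken from the original thread; the userId column name and the use of a generic Pipeline are assumptions, since the question shows no schema.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.DataFrame

// Fit one PipelineModel per user by filtering the training data on a (hypothetical) userId column.
def trainPerUser(data: DataFrame, pipeline: Pipeline): Map[Long, PipelineModel] = {
  val userIds = data.select("userId").distinct().collect().map(_.getLong(0))
  userIds.map { id =>
    id -> pipeline.fit(data.filter(data("userId") === id))
  }.toMap
}
```

This collects only the distinct ids to the driver and launches one fit per user, which is reasonable for a modest number of users; with very many users, grouping the data by user and training small models locally on the executors scales better.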

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

谁说我不能喝 submitted on 2019-12-11 00:54:12
Question: I'm struggling to understand how the conversion among RDDs, Datasets and DataFrames works. I'm pretty new to Spark, and I get stuck every time I need to move from one data model to another (especially from RDDs to Datasets and DataFrames). Could anyone explain the right way to do it? As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a …
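For the concrete case in the question, a common route is to wrap each vector in a Tuple1 so Spark can derive a schema, and name the resulting column features, which ML KMeans expects by default. A sketch (the k value is just a placeholder):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val vectors: RDD[Vector] = ???                        // the RDD[org.apache.spark.ml.linalg.Vector] from the question
val df = vectors.map(Tuple1.apply).toDF("features")  // DataFrame with a single vector column named "features"

val kmeansModel = new KMeans().setK(3).fit(df)        // k = 3 is a placeholder
```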

Spark MLlib Word2Vec Error: The vocabulary size should be > 0

荒凉一梦 submitted on 2019-12-11 00:18:39
Question: I am trying to implement word vectorization using Spark's MLlib. I am following the example given here. I have a bunch of sentences that I want to give as input to train the model, but I am not sure whether this model takes sentences or just takes all the words as a sequence of strings. My input is as below: scala> v.take(5) res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, …
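Two things are worth checking here, hedged because the excerpt is cut off: the printed List([WrappedArray(...)]) suggests each sentence is wrapped one level too deep, while the RDD-based Word2Vec wants one flat Seq[String] of tokens per sentence; and its default minCount of 5 drops rare words, which can leave an empty vocabulary on a small corpus. A sketch of an input shape that trains, assuming an existing SparkContext sc (the tokens are illustrative):

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// One flat Seq[String] of tokens per sentence; no extra nesting around the word arrays.
val sentences: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("big", "baller", "shoe"),
  Seq("since", "eliud", "win", "quick", "fact")
))

val model = new Word2Vec()
  .setMinCount(1)   // default is 5; words below the threshold are discarded,
                    // which can trigger "The vocabulary size should be > 0" on small inputs
  .fit(sentences)
```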

Convert from DataFrame to JavaPairRDD<Long, Vector>

元气小坏坏 submitted on 2019-12-10 23:12:39
Question: I'm trying to implement the LDA algorithm using Apache Spark with the Java API. The method LDA().run() accepts a JavaPairRDD documents parameter. In Scala I can create the RDD[(Long, Vector)] as follows: val countVectors = cvModel.transform(filteredTokens) .select("docId", "features") .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) } .cache() and then feed it into LDA: lda.run(countVectors) But in the Java API I get a CountVectorizerModel by using the following code: CountVectorizerModel cvModel …

';' expected but 'import' found - Scala and Spark

时光毁灭记忆、已成空白 submitted on 2019-12-10 19:28:30
Question: I'm trying to work with Spark and Scala, compiling a standalone application. I don't know why I'm getting this error: topicModel.scala:2: ';' expected but 'import' found. [error] import org.apache.spark.mllib.clustering.LDA [error] ^ [error] one error found [error] (compile:compileIncremental) Compilation failed This is the build.sbt code: name := "topicModel" version := "1.0" scalaVersion := "2.11.6" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" libraryDependencies += …
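The excerpt only shows build.sbt, but the error points at topicModel.scala line 2: the parser expected whatever precedes the import to be terminated before an import can start, since a Scala source file may only contain package clauses, imports, and class/object/trait definitions at the top level. The skeleton below is an assumption about the intended app, not the asker's code; it simply shows a layout where the imports come first.

```scala
// topicModel.scala — imports at the very top of the file, before any definitions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA

object TopicModel {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("topicModel")
    val sc = new SparkContext(conf)
    // ... build an RDD[(Long, Vector)] corpus and call new LDA().setK(10).run(corpus)
    sc.stop()
  }
}
```

Note also that org.apache.spark.mllib.clustering.LDA lives in the spark-mllib artifact, so the build needs libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.1" in addition to spark-core.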

How to print best model params in Apache Spark Pipeline?

删除回忆录丶 submitted on 2019-12-10 17:34:51
Question: I'm using the Pipeline API of Apache Spark for validation of parameters. I'm building a TrainValidationSplitModel like this: Pipeline pipeline = ... ParamMap[] paramGrid = ... TrainValidationSplit trainValidationSplit = new TrainValidationSplit().setEstimator(pipeline).setEvaluator(new MulticlassClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8); TrainValidationSplitModel model = trainValidationSplit.fit(training); My question is: how can I extract and print the params of …
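The question uses the Java API, but TrainValidationSplitModel exposes the same members there, so here is a Scala sketch of two ways to get at the parameters (model stands for the trainValidationSplit.fit(training) result from the question):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.tuning.TrainValidationSplitModel

val model: TrainValidationSplitModel = ???   // the fitted model from the question

// 1) Pair every candidate ParamMap with the validation metric it achieved.
model.getEstimatorParamMaps
  .zip(model.validationMetrics)
  .foreach { case (params, metric) => println(s"$params -> $metric") }

// 2) Inspect the fitted stages of the winning pipeline.
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
bestPipeline.stages.foreach(stage => println(stage.extractParamMap()))
```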

How do I preserve the key or index of the input to the Spark HashingTF() function?

青春壹個敷衍的年華 submitted on 2019-12-10 16:57:50
Question: Based on the Spark 1.4 documentation (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html) I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done, but the input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this: documents = sc.textFile("...").map(lambda line: line.split(" ")) hashingTF = HashingTF() …
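The question is in PySpark; the same idea expressed in Scala (to keep one language across these sketches) is to carry a (key, tokens) pair and hash only the values, so the document id rides along. This is a sketch, not the thread's answer: the "..." path is the question's own placeholder, zipWithIndex is just one way to mint an id, and an existing SparkContext sc is assumed.

```scala
import org.apache.spark.mllib.feature.HashingTF

// Keep the document key next to its tokens and transform only the values,
// so each term-frequency vector stays tied to its source document.
val docs = sc.textFile("...")
  .zipWithIndex()
  .map { case (line, id) => (id, line.split(" ").toSeq) }

val hashingTF = new HashingTF()
val tf = docs.mapValues(tokens => hashingTF.transform(tokens))   // RDD[(Long, org.apache.spark.mllib.linalg.Vector)]
```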

Kolmogorov Smirnov Test in Spark (Python) not working?

*爱你&永不变心* submitted on 2019-12-10 16:19:09
Question: I was doing a normality test in Python spark-ml and saw what I think is a bug. Here is the setup: I have a data set that is normalized (range -1 to 1). When I do a histogram, I can clearly see that the data is NOT normal: >>> prices_norm.histogram(10) ([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [226, 269, 119, 95, 52, 26, 8, 2, 2, 5]) When I run the Kolmogorov-Smirnov test I get the following results: >>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm") …
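The excerpt stops before the test results, but one thing worth checking: with only "norm" and no parameters, kolmogorovSmirnovTest compares the sample against a standard normal N(0, 1), not a normal fitted to the data. A Scala sketch of passing the fitted parameters explicitly (pricesNorm stands in for the RDD[Double] from the question):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val pricesNorm: RDD[Double] = ???   // the normalized prices from the question

// Test against N(mean, stddev) estimated from the sample instead of the default N(0, 1).
val mean = pricesNorm.mean()
val stddev = pricesNorm.stdev()
val testResult = Statistics.kolmogorovSmirnovTest(pricesNorm, "norm", mean, stddev)
println(testResult)
```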

Why does the spark-ml ALS model return NaN and negative predictions?

孤者浪人 submitted on 2019-12-10 16:18:46
Question: I'm trying to use ALS from spark-ml with implicit ratings. I noticed that some predictions given by my trained model are negative or NaN. Why is that? Answer 1: Apache Spark provides an option to force a non-negativity constraint on ALS. To remove these negative values, you just need to set Python: nonnegative=True, Scala: setNonnegative(true) when creating your ALS model, i.e.: >>> als = ALS(rank=10, maxIter=5, seed=0, nonnegative=True) Non-negative matrix factorization (NMF or …
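For completeness, here is the Scala form of the same fix. The NaN predictions typically come from users or items that were absent from the training split; on Spark 2.2+ the coldStartStrategy setter drops those rows. That setter is an addition in this sketch, not part of the quoted answer.

```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)
  .setMaxIter(5)
  .setSeed(0L)
  .setImplicitPrefs(true)        // the question uses implicit ratings
  .setNonnegative(true)          // forbid negative factors, and hence negative predictions
  .setColdStartStrategy("drop")  // Spark 2.2+: drop NaN predictions for unseen users/items
```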

Combining Spark Streaming + MLlib

家住魔仙堡 submitted on 2019-12-10 13:56:38
Question: I've tried to use a Random Forest model to predict a stream of examples, but it appears that I cannot use that model to classify them. Here is the code used in PySpark: sc = SparkContext(appName="App") model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150) ssc = StreamingContext(sc, 1) lines = ssc.socketTextStream(hostname, int(port)) parsedLines = lines.map(parse) parsedLines.pprint() predictions = …
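The snippet is cut off at the prediction step, which is where this usually breaks in PySpark: the Python tree models delegate to the JVM, so model.predict cannot be used inside a per-record map; the common workaround is to score whole RDDs, e.g. parsedLines.transform(lambda rdd: model.predict(rdd)). In Scala the per-record call is fine. The sketch below makes several assumptions not in the original: an existing SparkContext sc, a placeholder host and port, and input lines that parse as comma-separated doubles.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val model: RandomForestModel = ???                 // e.g. RandomForest.trainClassifier(...), as in the question
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder host and port
val features = lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val predictions = features.map(v => model.predict(v)) // the trained model is serialized to the executors

predictions.print()
ssc.start()
ssc.awaitTermination()
```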