apache-spark-mllib

How to use Spark MLlib/Pipelines to build one model per user [duplicate]

北城余情 submitted on 2019-12-11 01:20:01
Question: This question already has an answer here: Run ML algorithm inside map function in Spark (1 answer). Closed last year. I want to train different models for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines? If not, what is the easiest/cleanest way to train multiple, separate models for each user? Answer 1: Unfortunately Spark ML doesn't provide a built-in "one model per user" concept, but you can implement custom logic as you wish. I see two …
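The answer is cut off above, but the usual "custom logic" route is to fit the same Pipeline once per user. Below is a minimal Scala sketch, not taken from the original thread; the userId column name and the use of a generic Pipeline are assumptions, since the question shows no schema.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.DataFrame

// Fit one PipelineModel per user by filtering the training data on a (hypothetical) userId column.
def trainPerUser(data: DataFrame, pipeline: Pipeline): Map[Long, PipelineModel] = {
  val userIds = data.select("userId").distinct().collect().map(_.getLong(0))
  userIds.map { id =>
    id -> pipeline.fit(data.filter(data("userId") === id))
  }.toMap
}
```

This collects only the distinct ids to the driver and launches one fit per user, which is reasonable for a modest number of users; with very many users, grouping the data by user and training small models locally on the executors scales better.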

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

谁说我不能喝 submitted on 2019-12-11 00:54:12
Question: I'm struggling to understand how the conversion among RDDs, Datasets and DataFrames works. I'm pretty new to Spark, and I get stuck every time I need to move from one data model to another (especially from RDDs to Datasets and DataFrames). Could anyone explain the right way to do it? As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a …
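For the concrete case in the question, a common route is to wrap each vector in a Tuple1 so Spark can derive a schema, and name the resulting column features, which ML KMeans expects by default. A sketch (the k value is just a placeholder):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val vectors: RDD[Vector] = ???                        // the RDD[org.apache.spark.ml.linalg.Vector] from the question
val df = vectors.map(Tuple1.apply).toDF("features")  // DataFrame with a single vector column named "features"

val kmeansModel = new KMeans().setK(3).fit(df)        // k = 3 is a placeholder
```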

Spark MLlib Word2Vec Error: The vocabulary size should be > 0

荒凉一梦 submitted on 2019-12-11 00:18:39
Question: I am trying to implement word vectorization using Spark's MLlib. I am following the example given here. I have a bunch of sentences that I want to give as input to train the model, but I am not sure whether this model takes sentences or just takes all the words as a sequence of strings. My input is as below: scala> v.take(5) res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, …
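Two things are worth checking here, hedged because the excerpt is cut off: the printed List([WrappedArray(...)]) suggests each sentence is wrapped one level too deep, while the RDD-based Word2Vec wants one flat Seq[String] of tokens per sentence; and its default minCount of 5 drops rare words, which can leave an empty vocabulary on a small corpus. A sketch of an input shape that trains, assuming an existing SparkContext sc (the tokens are illustrative):

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// One flat Seq[String] of tokens per sentence; no extra nesting around the word arrays.
val sentences: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("big", "baller", "shoe"),
  Seq("since", "eliud", "win", "quick", "fact")
))

val model = new Word2Vec()
  .setMinCount(1)   // default is 5; words below the threshold are discarded,
                    // which can trigger "The vocabulary size should be > 0" on small inputs
  .fit(sentences)
```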

Convert from DataFrame to JavaPairRDD<Long, Vector>

元气小坏坏 submitted on 2019-12-10 23:12:39
Question: I'm trying to implement the LDA algorithm using Apache Spark with the Java API. The method LDA().run() accepts a JavaPairRDD documents parameter. In Scala I can create the RDD[(Long, Vector)] as follows: val countVectors = cvModel.transform(filteredTokens) .select("docId", "features") .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) } .cache() and then feed it into LDA: lda.run(countVectors) But in the Java API I get a CountVectorizerModel by using the following code: CountVectorizerModel cvModel …

';' expected but 'import' found - Scala and Spark

时光毁灭记忆、已成空白 submitted on 2019-12-10 19:28:30
Question: I'm trying to work with Spark and Scala, compiling a standalone application. I don't know why I'm getting this error: topicModel.scala:2: ';' expected but 'import' found. [error] import org.apache.spark.mllib.clustering.LDA [error] ^ [error] one error found [error] (compile:compileIncremental) Compilation failed This is the build.sbt code: name := "topicModel" version := "1.0" scalaVersion := "2.11.6" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" libraryDependencies += …
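The excerpt only shows build.sbt, but the error points at topicModel.scala line 2: the parser expected whatever precedes the import to be terminated before an import can start, since a Scala source file may only contain package clauses, imports, and class/object/trait definitions at the top level. The skeleton below is an assumption about the intended app, not the asker's code; it simply shows a layout where the imports come first.

```scala
// topicModel.scala — imports at the very top of the file, before any definitions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA

object TopicModel {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("topicModel")
    val sc = new SparkContext(conf)
    // ... build an RDD[(Long, Vector)] corpus and call new LDA().setK(10).run(corpus)
    sc.stop()
  }
}
```

Note also that org.apache.spark.mllib.clustering.LDA lives in the spark-mllib artifact, so the build needs libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.1" in addition to spark-core.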

How to print best model params in Apache Spark Pipeline?

删除回忆录丶 submitted on 2019-12-10 17:34:51
Question: I'm using the Pipeline API of Apache Spark for validation of parameters. I'm building a TrainValidationSplitModel like this: Pipeline pipeline = ... ParamMap[] paramGrid = ... TrainValidationSplit trainValidationSplit = new TrainValidationSplit().setEstimator(pipeline).setEvaluator(new MulticlassClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8); TrainValidationSplitModel model = trainValidationSplit.fit(training); My question is: how can I extract and print the params of …
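The question uses the Java API, but TrainValidationSplitModel exposes the same members there, so here is a Scala sketch of two ways to get at the parameters (model stands for the trainValidationSplit.fit(training) result from the question):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.tuning.TrainValidationSplitModel

val model: TrainValidationSplitModel = ???   // the fitted model from the question

// 1) Pair every candidate ParamMap with the validation metric it achieved.
model.getEstimatorParamMaps
  .zip(model.validationMetrics)
  .foreach { case (params, metric) => println(s"$params -> $metric") }

// 2) Inspect the fitted stages of the winning pipeline.
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
bestPipeline.stages.foreach(stage => println(stage.extractParamMap()))
```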

How do I preserve the key or index of the input to the Spark HashingTF() function?

青春壹個敷衍的年華 submitted on 2019-12-10 16:57:50
Question: Based on the Spark 1.4 documentation (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html) I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done, but the input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this: documents = sc.textFile("...").map(lambda line: line.split(" ")) hashingTF = HashingTF() …
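The question is in PySpark; the same idea expressed in Scala (to keep one language across these sketches) is to carry a (key, tokens) pair and hash only the values, so the document id rides along. This is a sketch, not the thread's answer: the "..." path is the question's own placeholder, zipWithIndex is just one way to mint an id, and an existing SparkContext sc is assumed.

```scala
import org.apache.spark.mllib.feature.HashingTF

// Keep the document key next to its tokens and transform only the values,
// so each term-frequency vector stays tied to its source document.
val docs = sc.textFile("...")
  .zipWithIndex()
  .map { case (line, id) => (id, line.split(" ").toSeq) }

val hashingTF = new HashingTF()
val tf = docs.mapValues(tokens => hashingTF.transform(tokens))   // RDD[(Long, org.apache.spark.mllib.linalg.Vector)]
```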

Kolmogorov Smirnov Test in Spark (Python) not working?

*爱你&永不变心* submitted on 2019-12-10 16:19:09
Question: I was doing a normality test in Python spark-ml and saw what I think is a bug. Here is the setup: I have a data set that is normalized (range -1 to 1). When I do a histogram, I can clearly see that the data is NOT normal: >>> prices_norm.histogram(10) ([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [226, 269, 119, 95, 52, 26, 8, 2, 2, 5]) When I run the Kolmogorov-Smirnov test I get the following results: >>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm") …
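The excerpt stops before the test results, but one thing worth checking: with only "norm" and no parameters, kolmogorovSmirnovTest compares the sample against a standard normal N(0, 1), not a normal fitted to the data. A Scala sketch of passing the fitted parameters explicitly (pricesNorm stands in for the RDD[Double] from the question):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val pricesNorm: RDD[Double] = ???   // the normalized prices from the question

// Test against N(mean, stddev) estimated from the sample instead of the default N(0, 1).
val mean = pricesNorm.mean()
val stddev = pricesNorm.stdev()
val testResult = Statistics.kolmogorovSmirnovTest(pricesNorm, "norm", mean, stddev)
println(testResult)
```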

Why does the spark-ml ALS model return NaN and negative predictions?

孤者浪人 submitted on 2019-12-10 16:18:46
Question: I'm trying to use ALS from spark-ml with implicit ratings. I noticed that some predictions given by my trained model are negative or NaN. Why is that? Answer 1: Apache Spark provides an option to force a non-negativity constraint on ALS. To remove these negative values, you just need to set Python: nonnegative=True, Scala: setNonnegative(true) when creating your ALS model, i.e.: >>> als = ALS(rank=10, maxIter=5, seed=0, nonnegative=True) Non-negative matrix factorization (NMF or …
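For completeness, here is the Scala form of the same fix. The NaN predictions typically come from users or items that were absent from the training split; on Spark 2.2+ the coldStartStrategy setter drops those rows. That setter is an addition in this sketch, not part of the quoted answer.

```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)
  .setMaxIter(5)
  .setSeed(0L)
  .setImplicitPrefs(true)        // the question uses implicit ratings
  .setNonnegative(true)          // forbid negative factors, and hence negative predictions
  .setColdStartStrategy("drop")  // Spark 2.2+: drop NaN predictions for unseen users/items
```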

Combining Spark Streaming + MLlib

家住魔仙堡 submitted on 2019-12-10 13:56:38
Question: I've tried to use a Random Forest model to predict a stream of examples, but it appears that I cannot use that model to classify them. Here is the code used in PySpark: sc = SparkContext(appName="App") model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150) ssc = StreamingContext(sc, 1) lines = ssc.socketTextStream(hostname, int(port)) parsedLines = lines.map(parse) parsedLines.pprint() predictions = …
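The snippet is cut off at the prediction step, which is where this usually breaks in PySpark: the Python tree models delegate to the JVM, so model.predict cannot be used inside a per-record map; the common workaround is to score whole RDDs, e.g. parsedLines.transform(lambda rdd: model.predict(rdd)). In Scala the per-record call is fine. The sketch below makes several assumptions not in the original: an existing SparkContext sc, a placeholder host and port, and input lines that parse as comma-separated doubles.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val model: RandomForestModel = ???                 // e.g. RandomForest.trainClassifier(...), as in the question
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder host and port
val features = lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val predictions = features.map(v => model.predict(v)) // the trained model is serialized to the executors

predictions.print()
ssc.start()
ssc.awaitTermination()
```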