apache-spark-mllib

How to store the text file on the Master?

Submitted by 帅比萌擦擦* on 2019-12-13 02:38:12

Question: I am using standalone clusters to run the ALS algorithm. The predictions are stored to a text file using saveAsTextFile(path), but the text file ends up on the cluster nodes. I want to store the text file on the master.

Answer 1: That is expected behavior: path is resolved on the machine where it is executed, i.e. the slaves. I'd recommend either using a cluster filesystem (e.g. HDFS) or calling .collect() on your data so you can save it locally on the master. Beware of OOM if your data is large.
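A minimal sketch of the two options from the answer, assuming `predictions` is the RDD of ALS predictions (the variable name and both paths are hypothetical):

```scala
// Option 1: write to a shared cluster filesystem such as HDFS.
predictions.saveAsTextFile("hdfs:///user/me/als-predictions")

// Option 2: collect to the driver and write a plain local file on the master.
// Only safe when the predictions fit in driver memory.
val writer = new java.io.PrintWriter("/home/me/als-predictions.txt")
try predictions.collect().foreach(p => writer.println(p))
finally writer.close()
```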

Queries with streaming sources must be executed with writeStream.start();;

Submitted by 旧城冷巷雨未停 on 2019-12-12 20:37:41

Question: I am trying to read data from Kafka using Spark Structured Streaming and predict from the incoming data. I'm using a model which I have trained using Spark ML.

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local")
  .getOrCreate()

import spark.implicits._

val toString = udf((payload: Array[Byte]) => new String(payload))
val sentenceDataFrame = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicname1")
  .load(
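The error itself points at the fix: a streaming DataFrame cannot be materialized with batch actions; the query has to be started through writeStream. A minimal sketch, assuming `predictions` is the streaming DataFrame produced by applying the trained model to the Kafka stream above (the sink choice is illustrative):

```scala
val query = predictions.writeStream
  .outputMode("append")
  .format("console")   // any real sink (kafka, parquet, ...) is started the same way
  .start()

query.awaitTermination()
```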

How to declare a sparse Vector in Spark with Scala?

Submitted by …衆ロ難τιáo~ on 2019-12-12 19:33:15

Question: I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.

Test file:

1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10

Code:

val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy(_(0))

This…
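For reference, mllib.linalg.Vectors.sparse has two forms; a minimal sketch (the size 10 and the index/value pairs are made up, loosely based on the test file):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Seq-based form: a Seq of (index, value) pairs.
val v1 = Vectors.sparse(10, Seq((2, 4.0), (3, 5.0), (4, 8.0)))

// Array-based form: parallel arrays of indices and values.
val v2 = Vectors.sparse(10, Array(2, 3, 4), Array(4.0, 5.0, 8.0))
```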

How to print best model params in pyspark pipeline

Submitted by 允我心安 on 2019-12-12 19:08:10

Question: This question is similar to this one. I would like to print the best model params after doing a TrainValidationSplit in pyspark. I cannot find the piece of text the other user uses to answer the question, because I'm working in Jupyter and the log disappears from the terminal... Part of the code is:

pca = PCA(inputCol = 'features')
dt = DecisionTreeRegressor(featuresCol=pca.getOutputCol(), labelCol="energy")
pipe = Pipeline(stages=[pca, dt])
paramgrid = ParamGridBuilder().addGrid(pca.k, range(1
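The question targets PySpark, but the Scala API it wraps shows the shape of one possible answer: the fitted model exposes bestModel, whose pipeline stages carry their resolved params. A sketch under that assumption, with `tvsModel` standing in for the fitted TrainValidationSplitModel:

```scala
import org.apache.spark.ml.PipelineModel

// The best model is the fitted pipeline chosen by the parameter search.
val best = tvsModel.bestModel.asInstanceOf[PipelineModel]
best.stages.foreach { stage =>
  // Prints e.g. the chosen pca.k and the decision tree's params.
  println(s"${stage.uid}: ${stage.extractParamMap()}")
}
```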

Convert Scala FP-growth RDD output to a DataFrame

Submitted by 为君一笑 on 2019-12-12 13:40:13

Question: https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth (sample_fpgrowth.txt can be found here: https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt)

I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the result, which is an RDD, into a data frame. Both model.freqItemsets and model.generateAssociationRules(minConfidence) are RDDs; please explain that in detail with the example given in my question.
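The original answer is cut off in this excerpt; as a hedged sketch, one way to turn both RDDs into DataFrames, assuming `spark` is an active SparkSession and `model`/`minConfidence` come from the linked example:

```scala
import spark.implicits._

// Each frequent itemset has an items array and a frequency count.
val freqItemsetsDF = model.freqItemsets
  .map(is => (is.items.mkString(","), is.freq))
  .toDF("items", "freq")

// Each rule has an antecedent, a consequent and a confidence.
val rulesDF = model.generateAssociationRules(minConfidence)
  .map(r => (r.antecedent.mkString(","), r.consequent.mkString(","), r.confidence))
  .toDF("antecedent", "consequent", "confidence")
```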

Sparse vector RDD in pyspark

Submitted by 余生长醉 on 2019-12-12 12:26:31

Question: I have been implementing the TF-IDF method described here with Python/PySpark, using mllib.feature: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both training and testing, i.e. bag-of-words representations called tfidf_train and tfidf_test. The IDF is shared between both and is based solely on the training data. My question…
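The question text is truncated here, but the setup it describes (an IDF fitted on training data only, applied to both sets) looks like this in the RDD-based API. A Scala sketch of the same flow the PySpark code would follow, assuming `trainDocs` and `testDocs` are RDD[Seq[String]] of tokenized documents:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Term frequencies for both sets with the same hashing scheme.
val hashingTF = new HashingTF()
val tfTrain = hashingTF.transform(trainDocs)
val tfTest  = hashingTF.transform(testDocs)

// Fit the IDF on the training term frequencies only, then apply it to both sets.
val idfModel   = new IDF().fit(tfTrain)
val tfidfTrain = idfModel.transform(tfTrain)
val tfidfTest  = idfModel.transform(tfTest)
```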

Creating Spark dataframe from numpy matrix

Submitted by 本小妞迷上赌 on 2019-12-12 10:30:21

Question: It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a logit model. I ran the tutorial successfully and would like to pass my own data into it. I've tried this:

%pyspark

import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0, 2, size=(1000)),
                     np.random.randn(1000),
                     3*np.random.randn(1000)+2,
                     6*np.random.randn(1000)-2]).reshape(1000, -1)
df = map(lambda x:
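The question is PySpark/numpy specific, so this is only a rough Scala analogue of the end goal: a toy DataFrame of (label, features) rows for a logit model. Column names, sizes and distributions mirror the snippet but are otherwise assumptions, and `spark` is assumed to be an active SparkSession:

```scala
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random

import spark.implicits._

val rng = new Random(42)
// 1000 rows: a random 0/1 label and three roughly Gaussian features.
val toyDF = Seq.fill(1000) {
  val label    = rng.nextInt(2).toDouble
  val features = Vectors.dense(rng.nextGaussian(),
                               3 * rng.nextGaussian() + 2,
                               6 * rng.nextGaussian() - 2)
  (label, features)
}.toDF("label", "features")
```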

How to set cutoff while training the data in Random Forest in Spark

Submitted by 放肆的年华 on 2019-12-12 06:01:09

Question: I am using Spark MLlib to train data for classification using the Random Forest algorithm. MLlib provides a RandomForest class which has a trainClassifier method that does what is required. Can I set a threshold value while training the data set, similar to the cutoff option provided in R's randomForest package? http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

I found that MLlib's RandomForest class only provides options to pass the number of trees, impurity, number of classes, etc.
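The RDD-based trainClassifier has no such option; as a hedged alternative, the DataFrame-based spark.ml RandomForestClassifier exposes per-class thresholds applied at prediction time, which plays a role similar to R's cutoff. Column names and values below are illustrative only:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  // With thresholds (0.3, 0.7), class 1 is predicted only when its probability exceeds 0.7.
  .setThresholds(Array(0.3, 0.7))
```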

Unable to serialize logistic regression in MLeap

Submitted by 拈花ヽ惹草 on 2019-12-12 04:49:19

Question: java.lang.AssertionError: assertion failed: This op only supports binary logistic regression

I am trying to serialize a Spark pipeline in MLeap. I am using Tokenizer, HashingTF and LogisticRegression in my pipeline. When I try to serialize my pipeline I get the above error. Here is the code I am using to serialize the pipeline:

val pipeline = Pipeline(pipelineConfig)
val model = pipeline.fit(data)
(for (bf <- managed(BundleFile("jar:file:/tmp/abc.model.twitter.zip"))) yield {
  model
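The original answer is not shown in this excerpt; a hedged guess based on the assertion text is that the fitted LogisticRegressionModel ended up multinomial, while the serializer only handles the binary case. One thing to check before bundling (Spark 2.1+ API, column names assumed):

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFamily("binomial")     // avoid an implicitly multinomial model
  .setFeaturesCol("features")
  .setLabelCol("label")

// After fitting, the model's numClasses should be 2 for a binary-only op.
```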

Failed to load class for data source: Libsvm in Spark ML (pyspark/scala)

Submitted by 為{幸葍}努か on 2019-12-12 04:16:32

Question: When I try to import a libsvm file in pyspark/scala using sqlContext.read.format("libsvm").load, I get the following error: "Failed to load class for data source: Libsvm." At the same time, if I use MLUtils.loadLibSVMFile it works perfectly fine. I need to use both Spark ML (to get class probabilities) and MLlib for an evaluation. I have attached the error screenshot. This is a MapR cluster, Spark version 1.5.2.

Answer 1: The libsvm source format is available since version 1.6 of Spark.
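Given answer 1 (the data source only exists from 1.6), one hedged workaround on 1.5.2 is to keep using MLUtils, which the question says already works, and convert the result to a DataFrame for Spark ML. The path below is hypothetical:

```scala
import org.apache.spark.mllib.util.MLUtils

// Spark 1.5.x: no "libsvm" data source; load an RDD[LabeledPoint] and convert it.
val points = MLUtils.loadLibSVMFile(sc, "/path/to/data.libsvm")
val df = sqlContext.createDataFrame(points)   // columns: label, features

// Spark 1.6+: the data source form from the question works as written.
// val df = sqlContext.read.format("libsvm").load("/path/to/data.libsvm")
```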