apache-spark-mllib

How to store the text file on the Master?

Submitted by 帅比萌擦擦* on 2019-12-13 02:38:12

Question: I am using standalone clusters to run the ALS algorithm. The predictions are stored to a text file using saveAsTextFile(path), but the text file ends up on the cluster nodes. I want to store the text file on the master.

Answer 1: That is expected behavior: path is resolved on the machine where it is executed, i.e. the slaves. I'd recommend either using a cluster filesystem (e.g. HDFS) or calling .collect() on your data so you can save it locally on the master. Beware of OOM if your data is large.
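A minimal sketch of the two options from the answer, assuming `predictions` is the RDD of ALS predictions (the variable name and both paths are hypothetical):

```scala
// Option 1: write to a shared cluster filesystem such as HDFS.
predictions.saveAsTextFile("hdfs:///user/me/als-predictions")

// Option 2: collect to the driver and write a plain local file on the master.
// Only safe when the predictions fit in driver memory.
val writer = new java.io.PrintWriter("/home/me/als-predictions.txt")
try predictions.collect().foreach(p => writer.println(p))
finally writer.close()
```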

Queries with streaming sources must be executed with writeStream.start();;

Submitted by 旧城冷巷雨未停 on 2019-12-12 20:37:41

Question: I am trying to read data from Kafka using Spark Structured Streaming and predict from the incoming data. I'm using a model which I have trained using Spark ML.

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local")
  .getOrCreate()

import spark.implicits._

val toString = udf((payload: Array[Byte]) => new String(payload))
val sentenceDataFrame = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicname1")
  .load(
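The error itself points at the fix: a streaming DataFrame cannot be materialized with batch actions; the query has to be started through writeStream. A minimal sketch, assuming `predictions` is the streaming DataFrame produced by applying the trained model to the Kafka stream above (the sink choice is illustrative):

```scala
val query = predictions.writeStream
  .outputMode("append")
  .format("console")   // any real sink (kafka, parquet, ...) is started the same way
  .start()

query.awaitTermination()
```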

How to declare a sparse Vector in Spark with Scala?

Submitted by …衆ロ難τιáo~ on 2019-12-12 19:33:15

Question: I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.

Test file:

1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10

Code:

val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy(_(0))

This…
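For reference, mllib.linalg.Vectors.sparse has two forms; a minimal sketch (the size 10 and the index/value pairs are made up, loosely based on the test file):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Seq-based form: a Seq of (index, value) pairs.
val v1 = Vectors.sparse(10, Seq((2, 4.0), (3, 5.0), (4, 8.0)))

// Array-based form: parallel arrays of indices and values.
val v2 = Vectors.sparse(10, Array(2, 3, 4), Array(4.0, 5.0, 8.0))
```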

How to print best model params in pyspark pipeline

Submitted by 允我心安 on 2019-12-12 19:08:10

Question: This question is similar to this one. I would like to print the best model params after doing a TrainValidationSplit in pyspark. I cannot find the piece of text the other user uses to answer the question, because I'm working in Jupyter and the log disappears from the terminal... Part of the code is:

pca = PCA(inputCol = 'features')
dt = DecisionTreeRegressor(featuresCol=pca.getOutputCol(), labelCol="energy")
pipe = Pipeline(stages=[pca, dt])
paramgrid = ParamGridBuilder().addGrid(pca.k, range(1
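The question targets PySpark, but the Scala API it wraps shows the shape of one possible answer: the fitted model exposes bestModel, whose pipeline stages carry their resolved params. A sketch under that assumption, with `tvsModel` standing in for the fitted TrainValidationSplitModel:

```scala
import org.apache.spark.ml.PipelineModel

// The best model is the fitted pipeline chosen by the parameter search.
val best = tvsModel.bestModel.asInstanceOf[PipelineModel]
best.stages.foreach { stage =>
  // Prints e.g. the chosen pca.k and the decision tree's params.
  println(s"${stage.uid}: ${stage.extractParamMap()}")
}
```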

Convert Scala FP-growth RDD output to a DataFrame

Submitted by 为君一笑 on 2019-12-12 13:40:13

Question: https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth (sample_fpgrowth.txt can be found here: https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt)

I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the result, which is an RDD, into a data frame. Both model.freqItemsets and model.generateAssociationRules(minConfidence) are RDDs; please explain that in detail with the example given in my question.
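The original answer is cut off in this excerpt; as a hedged sketch, one way to turn both RDDs into DataFrames, assuming `spark` is an active SparkSession and `model`/`minConfidence` come from the linked example:

```scala
import spark.implicits._

// Each frequent itemset has an items array and a frequency count.
val freqItemsetsDF = model.freqItemsets
  .map(is => (is.items.mkString(","), is.freq))
  .toDF("items", "freq")

// Each rule has an antecedent, a consequent and a confidence.
val rulesDF = model.generateAssociationRules(minConfidence)
  .map(r => (r.antecedent.mkString(","), r.consequent.mkString(","), r.confidence))
  .toDF("antecedent", "consequent", "confidence")
```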

Sparse vector RDD in pyspark

Submitted by 余生长醉 on 2019-12-12 12:26:31

Question: I have been implementing the TF-IDF method described here with Python/PySpark, using mllib.feature: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both training and testing, i.e. bag-of-words representations called tfidf_train and tfidf_test. The IDF is shared between both and is based solely on the training data. My question…
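The question text is truncated here, but the setup it describes (an IDF fitted on training data only, applied to both sets) looks like this in the RDD-based API. A Scala sketch of the same flow the PySpark code would follow, assuming `trainDocs` and `testDocs` are RDD[Seq[String]] of tokenized documents:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Term frequencies for both sets with the same hashing scheme.
val hashingTF = new HashingTF()
val tfTrain = hashingTF.transform(trainDocs)
val tfTest  = hashingTF.transform(testDocs)

// Fit the IDF on the training term frequencies only, then apply it to both sets.
val idfModel   = new IDF().fit(tfTrain)
val tfidfTrain = idfModel.transform(tfTrain)
val tfidfTest  = idfModel.transform(tfTest)
```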

Creating Spark dataframe from numpy matrix

Submitted by 本小妞迷上赌 on 2019-12-12 10:30:21

Question: It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a logit model. I ran the tutorial successfully and would like to pass my own data into it. I've tried this:

%pyspark

import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0, 2, size=(1000)),
                     np.random.randn(1000),
                     3*np.random.randn(1000)+2,
                     6*np.random.randn(1000)-2]).reshape(1000, -1)
df = map(lambda x:
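The question is PySpark/numpy specific, so this is only a rough Scala analogue of the end goal: a toy DataFrame of (label, features) rows for a logit model. Column names, sizes and distributions mirror the snippet but are otherwise assumptions, and `spark` is assumed to be an active SparkSession:

```scala
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random

import spark.implicits._

val rng = new Random(42)
// 1000 rows: a random 0/1 label and three roughly Gaussian features.
val toyDF = Seq.fill(1000) {
  val label    = rng.nextInt(2).toDouble
  val features = Vectors.dense(rng.nextGaussian(),
                               3 * rng.nextGaussian() + 2,
                               6 * rng.nextGaussian() - 2)
  (label, features)
}.toDF("label", "features")
```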

How to set cutoff while training the data in Random Forest in Spark

Submitted by 放肆的年华 on 2019-12-12 06:01:09

Question: I am using Spark MLlib to train data for classification using the Random Forest algorithm. MLlib provides a RandomForest class which has a trainClassifier method that does what is required. Can I set a threshold value while training the data set, similar to the cutoff option provided in R's randomForest package? http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

I found that MLlib's RandomForest class only provides options to pass the number of trees, impurity, number of classes, etc.
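The RDD-based trainClassifier has no such option; as a hedged alternative, the DataFrame-based spark.ml RandomForestClassifier exposes per-class thresholds applied at prediction time, which plays a role similar to R's cutoff. Column names and values below are illustrative only:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  // With thresholds (0.3, 0.7), class 1 is predicted only when its probability exceeds 0.7.
  .setThresholds(Array(0.3, 0.7))
```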

Unable to serialize logistic regression in MLeap

Submitted by 拈花ヽ惹草 on 2019-12-12 04:49:19

Question: java.lang.AssertionError: assertion failed: This op only supports binary logistic regression

I am trying to serialize a Spark pipeline in MLeap. I am using Tokenizer, HashingTF and LogisticRegression in my pipeline. When I try to serialize my pipeline I get the above error. Here is the code I am using to serialize the pipeline:

val pipeline = Pipeline(pipelineConfig)
val model = pipeline.fit(data)
(for (bf <- managed(BundleFile("jar:file:/tmp/abc.model.twitter.zip"))) yield {
  model
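The original answer is not shown in this excerpt; a hedged guess based on the assertion text is that the fitted LogisticRegressionModel ended up multinomial, while the serializer only handles the binary case. One thing to check before bundling (Spark 2.1+ API, column names assumed):

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFamily("binomial")     // avoid an implicitly multinomial model
  .setFeaturesCol("features")
  .setLabelCol("label")

// After fitting, the model's numClasses should be 2 for a binary-only op.
```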

Failed to load class for data source: Libsvm in Spark ML (pyspark/scala)

Submitted by 為{幸葍}努か on 2019-12-12 04:16:32

Question: When I try to import a libsvm file in pyspark/scala using sqlContext.read.format("libsvm").load, I get the following error: "Failed to load class for data source: Libsvm." At the same time, if I use MLUtils.loadLibSVMFile it works perfectly fine. I need to use both Spark ML (to get class probabilities) and MLlib for an evaluation. I have attached the error screenshot. This is a MapR cluster, Spark version 1.5.2.

Answer 1: The libsvm source format is available since version 1.6 of Spark.
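Given answer 1 (the data source only exists from 1.6), one hedged workaround on 1.5.2 is to keep using MLUtils, which the question says already works, and convert the result to a DataFrame for Spark ML. The path below is hypothetical:

```scala
import org.apache.spark.mllib.util.MLUtils

// Spark 1.5.x: no "libsvm" data source; load an RDD[LabeledPoint] and convert it.
val points = MLUtils.loadLibSVMFile(sc, "/path/to/data.libsvm")
val df = sqlContext.createDataFrame(points)   // columns: label, features

// Spark 1.6+: the data source form from the question works as written.
// val df = sqlContext.read.format("libsvm").load("/path/to/data.libsvm")
```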