apache-spark-mllib

Spark MLlib FPGrowth job fails with Memory Error

Question: I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

    from pyspark.mllib.fpm import FPGrowth

    data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)

    # Perform any RDD operation
    for item in model.freqItemsets().toLocalIterator():
        pass  # do something with item

I find that …
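
Worth noting: toLocalIterator() still funnels every frequent itemset through the driver, one partition at a time. A minimal sketch of an alternative, assuming the memory error comes from materializing the result set on the driver (the output path is hypothetical):

    # Keep the potentially huge result set distributed: write it out with an
    # RDD action instead of iterating over it on the driver.
    model.freqItemsets() \
         .map(lambda fi: (fi.items, fi.freq)) \
         .saveAsTextFile("/Users/me/associationtestproject/data/freqitemsets")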

PySpark & MLlib: Random Forest Feature Importances

Question: I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark? Here's the sample code provided in the documentation to get us started; however, it makes no mention of feature importances:

    from pyspark.mllib.tree import RandomForest
    …
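
For context, the RDD-based pyspark.mllib API never exposed importances, but the DataFrame-based spark.ml API does (Spark 2.0+). A minimal sketch, assuming a training DataFrame with the usual "features" and "label" columns (the data path is illustrative):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical training data in libsvm format (features + label columns).
    train = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20)
    model = rf.fit(train)

    # A SparseVector of per-feature importances, indexed by feature position.
    print(model.featureImportances)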

How to extract best parameters from a CrossValidatorModel

Question: I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) using ParamGridBuilder in the Pipeline. Then the following line of code builds the best model:

    val cvModel = crossval.fit(training.toDF)

Now, I want to know which parameters (numFeatures, regParam) from the ParamGridBuilder produced the best model. I already used the …
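
A hedged PySpark sketch of the usual approach (the question itself is in Scala, and this assumes Spark 2.x, where CrossValidatorModel exposes avgMetrics): inspect the fitted stages of bestModel, or pair each candidate ParamMap with its cross-validation metric:

    # Assumes cvModel is a fitted CrossValidatorModel wrapping a Pipeline.
    best_pipeline = cvModel.bestModel

    # Parameters of each fitted stage, e.g. numFeatures / regParam.
    for stage in best_pipeline.stages:
        print(stage, stage.extractParamMap())

    # Every candidate ParamMap alongside its averaged cross-validation metric.
    for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
        print(metric, params)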

How to convert RDD of dense vector into DataFrame in pyspark?

Question: I have a DenseVector RDD like this:

    >>> frequencyDenseVectors.collect()
    [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
     DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]

I want to convert this into a DataFrame. I tried like this:

    >>> spark.createDataFrame …
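
A minimal sketch of the usual fix: createDataFrame cannot infer a schema from bare vectors, so wrap each DenseVector in a one-element tuple so every RDD element becomes a single-column row (the column name is an assumption):

    df = spark.createDataFrame(
        frequencyDenseVectors.map(lambda v: (v,)),
        ["features"]
    )
    df.printSchema()  # the "features" column carries the vector type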

How to cross validate RandomForest model?

Question: I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually?

Answer 1: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml …
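
A hedged PySpark sketch of the same idea (the answer above is in Scala): wrap the classifier in a CrossValidator together with a parameter grid and an evaluator; the grid values here are illustrative, not tuned:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [10, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())
    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                        numFolds=3)
    # `train` is an assumed, already preprocessed DataFrame.
    cvModel = cv.fit(train)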

Spark MLlib TF-IDF implementation for LogisticRegression

Question: I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my job for MLlib in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason, IDFModel only accepts a JavaRDD as input for the transform method, not a simple Vector. How can I use the given classes to model a TF-IDF vector for my LabeledPoints? Note: the document lines are in the format [Label; Text]. Here is my code so far:

    // 1.) Load the documents
    JavaRDD<String> data = sc.textFile("/home …
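
A hedged PySpark sketch of the usual mllib TF-IDF pattern (the question is in Java, but the API shape is the same): transform the whole RDD with the IDFModel, then zip the labels back on afterwards; the input path is hypothetical:

    from pyspark.mllib.feature import HashingTF, IDF
    from pyspark.mllib.regression import LabeledPoint

    # Hypothetical input; lines of the form "label;text" as in the question.
    raw = sc.textFile("/home/user/documents.txt").map(lambda line: line.split(";"))
    labels = raw.map(lambda parts: float(parts[0]))
    tokens = raw.map(lambda parts: parts[1].split(" "))

    tf = HashingTF().transform(tokens)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)  # transform takes an RDD, not one Vector

    # Both RDDs derive from `raw` via map, so zip pairs them element-wise.
    training = labels.zip(tfidf).map(lambda p: LabeledPoint(p[0], p[1]))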

Spark Multiclass Classification Example

Question: Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible since the latest version, according to the documentation.

Answer 1: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF …
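
For the other basic option, a hedged PySpark sketch: binary-only estimators can be lifted to multiclass with the OneVsRest wrapper (Spark 2.0+); `trainDf` is an assumed DataFrame with "features" and a multiclass "label" column:

    from pyspark.ml.classification import LogisticRegression, OneVsRest

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    ovr = OneVsRest(classifier=lr)   # trains one binary model per class
    model = ovr.fit(trainDf)
    predictions = model.transform(trainDf)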

Spark MLlib linear regression (linear least squares) giving random results

Question: I'm new to Spark and machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working. I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression (section LinearRegressionWithSGD). Here is the code:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression …
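
A hedged sketch of a common culprit, shown in PySpark: with SGD, the default step size of 1.0 often diverges on unscaled data, so the usual fixes are a smaller step, more iterations, or standardized features (the values below are illustrative, not tuned):

    from pyspark.mllib.regression import LinearRegressionWithSGD

    # `parsed` is an assumed RDD[LabeledPoint], as in the documentation example.
    model = LinearRegressionWithSGD.train(parsed, iterations=1000, step=0.001)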

Spark MLlib LDA, how to infer the topic distribution of a new unseen document?

Question: I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution in a new, unseen document.

Answer 1: As of Spark 1.5, this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method, and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where …
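
In the DataFrame-based API (Spark 2.0+), the same task is a single transform. A minimal PySpark sketch, with `trainDf` and `unseenDf` as assumed DataFrames of document count vectors in a "features" column:

    from pyspark.ml.clustering import LDA

    lda = LDA(k=10, featuresCol="features")
    model = lda.fit(trainDf)

    # transform adds a "topicDistribution" column for the unseen documents.
    model.transform(unseenDf).select("topicDistribution").show(truncate=False)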

How to handle categorical features with spark-ml?

Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to …
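
A minimal sketch of the standard spark-ml pattern for this (Spark 2.x OneHotEncoder signature; the column names are assumptions): StringIndexer maps string categories to indices, OneHotEncoder expands them into sparse vectors, and VectorAssembler builds the single featuresCol the classifiers expect:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
    assembler = VectorAssembler(inputCols=["categoryVec", "numericFeature"],
                                outputCol="features")

    pipeline = Pipeline(stages=[indexer, encoder, assembler])
    prepared = pipeline.fit(df).transform(df)  # `df` is an assumed input DataFrame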