apache-spark-mllib

Spark MLlib FPGrowth job fails with Memory Error

Question: I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

    from pyspark.mllib.fpm import FPGrowth

    data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)

    # Perform any RDD operation
    for item in model.freqItemsets().toLocalIterator():
        pass  # do something with item

I find that …
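
Worth noting: toLocalIterator() still funnels every frequent itemset through the driver, one partition at a time. A minimal sketch of an alternative, assuming the memory error comes from materializing the result set on the driver (the output path is hypothetical):

    # Keep the potentially huge result set distributed: write it out with an
    # RDD action instead of iterating over it on the driver.
    model.freqItemsets() \
         .map(lambda fi: (fi.items, fi.freq)) \
         .saveAsTextFile("/Users/me/associationtestproject/data/freqitemsets")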

PySpark & MLlib: Random Forest Feature Importances

Question: I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark? Here's the sample code provided in the documentation to get us started; however, it makes no mention of feature importances:

    from pyspark.mllib.tree import RandomForest
    …
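
For context, the RDD-based pyspark.mllib API never exposed importances, but the DataFrame-based spark.ml API does (Spark 2.0+). A minimal sketch, assuming a training DataFrame with the usual "features" and "label" columns (the data path is illustrative):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical training data in libsvm format (features + label columns).
    train = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20)
    model = rf.fit(train)

    # A SparseVector of per-feature importances, indexed by feature position.
    print(model.featureImportances)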

How to extract best parameters from a CrossValidatorModel

Question: I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) using ParamGridBuilder in the Pipeline. Then the following line of code builds the best model:

    val cvModel = crossval.fit(training.toDF)

Now, I want to know which parameters (numFeatures, regParam) from the ParamGridBuilder produced the best model. I already used the …
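
A hedged PySpark sketch of the usual approach (the question itself is in Scala, and this assumes Spark 2.x, where CrossValidatorModel exposes avgMetrics): inspect the fitted stages of bestModel, or pair each candidate ParamMap with its cross-validation metric:

    # Assumes cvModel is a fitted CrossValidatorModel wrapping a Pipeline.
    best_pipeline = cvModel.bestModel

    # Parameters of each fitted stage, e.g. numFeatures / regParam.
    for stage in best_pipeline.stages:
        print(stage, stage.extractParamMap())

    # Every candidate ParamMap alongside its averaged cross-validation metric.
    for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
        print(metric, params)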

How to convert RDD of dense vector into DataFrame in pyspark?

Question: I have a DenseVector RDD like this:

    >>> frequencyDenseVectors.collect()
    [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
     DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]

I want to convert this into a DataFrame. I tried like this:

    >>> spark.createDataFrame …
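
A minimal sketch of the usual fix: createDataFrame cannot infer a schema from bare vectors, so wrap each DenseVector in a one-element tuple so every RDD element becomes a single-column row (the column name is an assumption):

    df = spark.createDataFrame(
        frequencyDenseVectors.map(lambda v: (v,)),
        ["features"]
    )
    df.printSchema()  # the "features" column carries the vector type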

How to cross validate RandomForest model?

Question: I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually?

Answer 1: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml …
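
A hedged PySpark sketch of the same idea (the answer above is in Scala): wrap the classifier in a CrossValidator together with a parameter grid and an evaluator; the grid values here are illustrative, not tuned:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [10, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())
    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                        numFolds=3)
    # `train` is an assumed, already preprocessed DataFrame.
    cvModel = cv.fit(train)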

Spark MLlib TF-IDF implementation for LogisticRegression

Question: I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my job for MLlib in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason, IDFModel only accepts a JavaRDD as input for the transform method, not a simple Vector. How can I use the given classes to model a TF-IDF vector for my LabeledPoints? Note: the document lines are in the format [Label; Text]. Here is my code so far:

    // 1.) Load the documents
    JavaRDD<String> data = sc.textFile("/home …
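
A hedged PySpark sketch of the usual mllib TF-IDF pattern (the question is in Java, but the API shape is the same): transform the whole RDD with the IDFModel, then zip the labels back on afterwards; the input path is hypothetical:

    from pyspark.mllib.feature import HashingTF, IDF
    from pyspark.mllib.regression import LabeledPoint

    # Hypothetical input; lines of the form "label;text" as in the question.
    raw = sc.textFile("/home/user/documents.txt").map(lambda line: line.split(";"))
    labels = raw.map(lambda parts: float(parts[0]))
    tokens = raw.map(lambda parts: parts[1].split(" "))

    tf = HashingTF().transform(tokens)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)  # transform takes an RDD, not one Vector

    # Both RDDs derive from `raw` via map, so zip pairs them element-wise.
    training = labels.zip(tfidf).map(lambda p: LabeledPoint(p[0], p[1]))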

Spark Multiclass Classification Example

Question: Do you guys know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I just know that it is possible since the latest version, according to the documentation.

Answer 1: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF …
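
For the other basic option, a hedged PySpark sketch: binary-only estimators can be lifted to multiclass with the OneVsRest wrapper (Spark 2.0+); `trainDf` is an assumed DataFrame with "features" and a multiclass "label" column:

    from pyspark.ml.classification import LogisticRegression, OneVsRest

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    ovr = OneVsRest(classifier=lr)   # trains one binary model per class
    model = ovr.fit(trainDf)
    predictions = model.transform(trainDf)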

Spark MLlib linear regression (linear least squares) giving random results

Question: I'm new to Spark and machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working. I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression (section LinearRegressionWithSGD). Here is the code:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression …
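
A hedged sketch of a common culprit, shown in PySpark: with SGD, the default step size of 1.0 often diverges on unscaled data, so the usual fixes are a smaller step, more iterations, or standardized features (the values below are illustrative, not tuned):

    from pyspark.mllib.regression import LinearRegressionWithSGD

    # `parsed` is an assumed RDD[LabeledPoint], as in the documentation example.
    model = LinearRegressionWithSGD.train(parsed, iterations=1000, step=0.001)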

Spark MLlib LDA, how to infer the topic distribution of a new unseen document?

Question: I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution in a new, unseen document.

Answer 1: As of Spark 1.5, this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method, and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where …
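
In the DataFrame-based API (Spark 2.0+), the same task is a single transform. A minimal PySpark sketch, with `trainDf` and `unseenDf` as assumed DataFrames of document count vectors in a "features" column:

    from pyspark.ml.clustering import LDA

    lda = LDA(k=10, featuresCol="features")
    model = lda.fit(trainDf)

    # transform adds a "topicDistribution" column for the unseen documents.
    model.transform(unseenDf).select("topicDistribution").show(truncate=False)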

How to handle categorical features with spark-ml?

Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to …
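
A minimal sketch of the standard spark-ml pattern for this (Spark 2.x OneHotEncoder signature; the column names are assumptions): StringIndexer maps string categories to indices, OneHotEncoder expands them into sparse vectors, and VectorAssembler builds the single featuresCol the classifiers expect:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
    assembler = VectorAssembler(inputCols=["categoryVec", "numericFeature"],
                                outputCol="features")

    pipeline = Pipeline(stages=[indexer, encoder, assembler])
    prepared = pipeline.fit(df).transform(df)  # `df` is an assumed input DataFrame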