apache-spark-mllib

Use of foreachActive for Spark Vector in Java

Submitted by 谁都会走 on 2020-01-01 19:41:32

Question: How do I write simple Java code that iterates over the active elements of a sparse vector? Let's say we have the following Vector: Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); I tried a lambda and Function2 (from three different imports) but always failed. If you use Function2, please provide the necessary import.

Answer 1: Adrian, here is how you can use the foreachActive method on the sparse Vector: AbstractFunction2<Object, Object, BoxedUnit> f = new AbstractFunction2<Object, …
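
For reference, a self-contained Java version of the approach the truncated answer begins (the println body is illustrative): foreachActive takes a Scala (Int, Double) => Unit, which from Java is most easily supplied by subclassing scala.runtime.AbstractFunction2 and returning BoxedUnit.UNIT.

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import scala.runtime.AbstractFunction2;
    import scala.runtime.BoxedUnit;

    public class ForeachActiveDemo {
        public static void main(String[] args) {
            Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});

            // foreachActive expects a Scala Function2[Int, Double, Unit]; from Java
            // the usual bridge is AbstractFunction2 with boxed arguments.
            AbstractFunction2<Object, Object, BoxedUnit> f =
                    new AbstractFunction2<Object, Object, BoxedUnit>() {
                        @Override
                        public BoxedUnit apply(Object index, Object value) {
                            System.out.println("index: " + index + ", value: " + value);
                            return BoxedUnit.UNIT;
                        }
                    };

            sv.foreachActive(f);  // visits only the active entries: (0, 1.0) and (2, 3.0)
        }
    }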

How can I build a CoordinateMatrix in Spark using a DataFrame?

Submitted by 给你一囗甜甜゛ on 2020-01-01 11:58:10

Question: I am trying to use the Spark implementation of the ALS algorithm for recommender systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because, if there is no interaction between a user and an item, the corresponding value …
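
A CoordinateMatrix is a natural fit for this kind of sparse interaction matrix: each (userId, itemId, rating) row becomes one MatrixEntry, and absent user-item pairs are implicitly zero. A minimal Java sketch, assuming the three columns are numeric (the column names come from the question):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
    import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class InteractionMatrix {
        // Builds a sparse CoordinateMatrix from a DataFrame with numeric
        // userId, itemId and rating columns; missing pairs are implicitly 0.
        public static CoordinateMatrix fromRatings(Dataset<Row> df) {
            JavaRDD<MatrixEntry> entries = df.toJavaRDD().map(row -> new MatrixEntry(
                    ((Number) row.getAs("userId")).longValue(),   // row index
                    ((Number) row.getAs("itemId")).longValue(),   // column index
                    ((Number) row.getAs("rating")).doubleValue()  // cell value
            ));
            return new CoordinateMatrix(entries.rdd());
        }
    }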

Apache Spark MLlib: how to build labeled points for string features?

Submitted by ぃ、小莉子 on 2020-01-01 07:38:19

Question: I am trying to build a NaiveBayes classifier with Spark's MLlib that takes a set of documents as input. I'd like to use several things as features (i.e., authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e., it looks like LabeledPoint[Double, List[Pair[Double,Double]]]. Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]]. I could make …
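
One standard way to get from string features to the numeric indices a LabeledPoint needs is the hashing trick, which MLlib exposes as HashingTF. A short Java sketch (the feature strings and bucket count are illustrative assumptions, not the asker's data):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.mllib.feature.HashingTF;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class StringFeaturesDemo {
        public static void main(String[] args) {
            // Hash each string feature into a fixed-size sparse vector.
            HashingTF tf = new HashingTF(10000);  // bucket count is a tuning choice

            List<String> features = Arrays.asList("author:alice", "tag:spark", "category:ml");
            Vector v = tf.transform(features);

            // The label (1.0 here) would be the document's class.
            LabeledPoint lp = new LabeledPoint(1.0, v);
            System.out.println(lp);
        }
    }

Prefixing each feature with its field name ("author:", "tag:") keeps identical strings from different fields in distinct buckets, at the cost of occasional hash collisions.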

Predicting class probabilities for Gradient Boosted Trees in Spark using the tree output

Submitted by 限于喜欢 on 2020-01-01 05:29:09

Question: It is known that GBTs in Spark give you predicted labels as of now. I was thinking of trying to compute predicted probabilities for a class (say, all the instances falling under a certain leaf). The code to build the GBTs:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache…
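
For binary classification trained with log loss, a common workaround (a sketch under that assumption, not an official MLlib API in this version) is to treat the weighted sum of the individual tree outputs as a margin and squash it with the logistic function; the factor 2 mirrors MLlib's log-loss formulation. In Java:

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;

    public final class GbtProbability {
        // P(class = 1) for a binary GBT trained with log loss:
        // margin = weighted sum of tree predictions, p = 1 / (1 + e^(-2 * margin)).
        public static double probabilityOfOne(GradientBoostedTreesModel model, Vector features) {
            DecisionTreeModel[] trees = model.trees();
            double[] weights = model.treeWeights();
            double margin = 0.0;
            for (int i = 0; i < trees.length; i++) {
                margin += weights[i] * trees[i].predict(features);
            }
            return 1.0 / (1.0 + Math.exp(-2.0 * margin));
        }
    }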

Anomaly detection with PCA in Spark

Submitted by 好久不见. on 2019-12-31 03:36:07

Question: I read the article "Anomaly detection with Principal Component Analysis (PCA)". The article states the following:

• The PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the …
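
The Mahalanobis distance referred to above is d(x) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the mean reading and Σ the covariance matrix. MLlib ships no matrix inverse, so the sketch below (an illustration, not the article's code) uses the common diagonal-covariance simplification, which reduces the score to a root of summed squared z-scores; per-column mean and variance can come from Statistics.colStats on the data RDD.

    public final class DiagonalMahalanobis {
        // Mahalanobis-style anomaly score under a diagonal covariance:
        // sqrt( sum_i (x_i - mean_i)^2 / var_i ). Larger = more anomalous.
        public static double score(double[] x, double[] mean, double[] variance) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - mean[i];
                sum += (diff * diff) / variance[i];
            }
            return Math.sqrt(sum);
        }
    }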

Spark 1.5.1, MLlib Random Forest Probability

Submitted by 北战南征 on 2019-12-30 07:12:22

Question: I am using Spark 1.5.1 with MLlib. I built a random forest model using MLlib and now use the model to make predictions. I can find the predicted category (0.0 or 1.0) using the .predict function. However, I can't find a function to retrieve the probability (see the attached screenshot). I thought the Spark 1.5.1 random forest would provide the probability; am I missing anything here?

Answer 1: Unfortunately the feature is not available in the older Spark MLlib 1.5.1. You can, however, find it in the recent …
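
Until an upgrade is possible, the usual workaround for the RDD-based API is to compute the class-1 probability yourself as the fraction of trees voting for that class (a sketch of that idea, not a 1.5.1 API):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    public final class ForestProbability {
        // Approximates P(class = 1) as the share of trees that vote 1.0.
        // The newer DataFrame-based RandomForestClassifier exposes a
        // probability column directly.
        public static double probabilityOfOne(RandomForestModel model, Vector features) {
            DecisionTreeModel[] trees = model.trees();
            int votes = 0;
            for (DecisionTreeModel tree : trees) {
                if (tree.predict(features) == 1.0) {
                    votes++;
                }
            }
            return (double) votes / trees.length;
        }
    }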

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

Submitted by 不想你离开。 on 2019-12-30 01:20:09

Question: I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark is not able to work with that kind of variable directly. In R there is a simple way to deal with this kind of problem: I transform the variable into a factor (categories), so R creates a set of columns coded as {0,1} indicator variables. How can I do this with Spark?

Answer 1: Using VectorIndexer, you may tell the indexer the number …
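
Besides VectorIndexer, the spark.ml pipeline stages StringIndexer and OneHotEncoder reproduce R's factor-to-indicator encoding directly. A Java sketch against the Spark 2.x API (in Spark 3.x, OneHotEncoder is an Estimator and needs fit() first; the sample data here is made up):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.feature.OneHotEncoder;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class OneHotDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("one-hot").getOrCreate();

            List<Row> data = Arrays.asList(
                    RowFactory.create("a"), RowFactory.create("b"), RowFactory.create("a"));
            StructType schema = new StructType(new StructField[] {
                    new StructField("category", DataTypes.StringType, false, Metadata.empty())});
            Dataset<Row> df = spark.createDataFrame(data, schema);

            // Map each category string to a numeric index ...
            Dataset<Row> indexed = new StringIndexer()
                    .setInputCol("category").setOutputCol("categoryIndex")
                    .fit(df).transform(df);

            // ... then expand the index into a sparse {0,1} indicator vector.
            Dataset<Row> encoded = new OneHotEncoder()
                    .setInputCol("categoryIndex").setOutputCol("categoryVec")
                    .transform(indexed);

            encoded.show(false);
            spark.stop();
        }
    }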

How to convert type Row into Vector to feed to KMeans

Submitted by 泄露秘密 on 2019-12-30 00:39:49

Question: When I try to feed df2 to KMeans, I get the following error:

    clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random")

The error I get: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector. df2 is a DataFrame created as follows:

    df = sqlContext.read.json("data/ALS3.json")
    df2 = df.select('latitude','longitude')
    df2.show()

    |  latitude|  longitude|
    |60.1643075| 24.9460844|
    |60.4686748| 22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans …
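
The RDD-based KMeans.train expects an RDD of mllib Vectors, not Rows, so each row has to be mapped to a vector first. The question uses PySpark; for consistency with the other examples here, the equivalent conversion sketched in Java (column names taken from the question):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class RowsToKMeans {
        // Maps (latitude, longitude) rows to dense vectors and clusters them.
        public static KMeansModel cluster(Dataset<Row> df2, int k, int maxIterations) {
            JavaRDD<Vector> points = df2.toJavaRDD().map(row -> Vectors.dense(
                    ((Number) row.getAs("latitude")).doubleValue(),
                    ((Number) row.getAs("longitude")).doubleValue()));
            points.cache();  // KMeans makes several passes over the data
            return KMeans.train(points.rdd(), k, maxIterations);
        }
    }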