apache-spark-mllib

Use of foreachActive for Spark Vector in Java

Submitted by 谁都会走 on 2020-01-01 19:41:32

Question: How do I write simple Java code that iterates over the active elements of a sparse vector? Let's say we have the following Vector: Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); I tried a lambda and Function2 (from three different imports) but always failed. If you use Function2, please provide the necessary import.

Answer 1: Adrian, here is how you can use the foreachActive method on the sparse Vector: AbstractFunction2<Object, Object, BoxedUnit> f = new AbstractFunction2<Object, …
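
For reference, a self-contained Java version of the approach the truncated answer begins (the println body is illustrative): foreachActive takes a Scala (Int, Double) => Unit, which from Java is most easily supplied by subclassing scala.runtime.AbstractFunction2 and returning BoxedUnit.UNIT.

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import scala.runtime.AbstractFunction2;
    import scala.runtime.BoxedUnit;

    public class ForeachActiveDemo {
        public static void main(String[] args) {
            Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});

            // foreachActive expects a Scala Function2[Int, Double, Unit]; from Java
            // the usual bridge is AbstractFunction2 with boxed arguments.
            AbstractFunction2<Object, Object, BoxedUnit> f =
                    new AbstractFunction2<Object, Object, BoxedUnit>() {
                        @Override
                        public BoxedUnit apply(Object index, Object value) {
                            System.out.println("index: " + index + ", value: " + value);
                            return BoxedUnit.UNIT;
                        }
                    };

            sv.foreachActive(f);  // visits only the active entries: (0, 1.0) and (2, 3.0)
        }
    }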

How can I build a CoordinateMatrix in Spark using a DataFrame?

Submitted by 给你一囗甜甜゛ on 2020-01-01 11:58:10

Question: I am trying to use the Spark implementation of the ALS algorithm for recommender systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because, if there is no interaction between a user and an item, the corresponding value …
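
A CoordinateMatrix is a natural fit for this kind of sparse interaction matrix: each (userId, itemId, rating) row becomes one MatrixEntry, and absent user-item pairs are implicitly zero. A minimal Java sketch, assuming the three columns are numeric (the column names come from the question):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
    import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class InteractionMatrix {
        // Builds a sparse CoordinateMatrix from a DataFrame with numeric
        // userId, itemId and rating columns; missing pairs are implicitly 0.
        public static CoordinateMatrix fromRatings(Dataset<Row> df) {
            JavaRDD<MatrixEntry> entries = df.toJavaRDD().map(row -> new MatrixEntry(
                    ((Number) row.getAs("userId")).longValue(),   // row index
                    ((Number) row.getAs("itemId")).longValue(),   // column index
                    ((Number) row.getAs("rating")).doubleValue()  // cell value
            ));
            return new CoordinateMatrix(entries.rdd());
        }
    }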

Apache Spark MLlib: how to build labeled points for string features?

Submitted by ぃ、小莉子 on 2020-01-01 07:38:19

Question: I am trying to build a NaiveBayes classifier with Spark's MLlib that takes a set of documents as input. I'd like to use several things as features (i.e., authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e., it looks like LabeledPoint[Double, List[Pair[Double,Double]]]. Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]]. I could make …
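
One standard way to get from string features to the numeric indices a LabeledPoint needs is the hashing trick, which MLlib exposes as HashingTF. A short Java sketch (the feature strings and bucket count are illustrative assumptions, not the asker's data):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.mllib.feature.HashingTF;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class StringFeaturesDemo {
        public static void main(String[] args) {
            // Hash each string feature into a fixed-size sparse vector.
            HashingTF tf = new HashingTF(10000);  // bucket count is a tuning choice

            List<String> features = Arrays.asList("author:alice", "tag:spark", "category:ml");
            Vector v = tf.transform(features);

            // The label (1.0 here) would be the document's class.
            LabeledPoint lp = new LabeledPoint(1.0, v);
            System.out.println(lp);
        }
    }

Prefixing each feature with its field name ("author:", "tag:") keeps identical strings from different fields in distinct buckets, at the cost of occasional hash collisions.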

Predicting class probabilities for Gradient Boosted Trees in Spark using the tree output

Submitted by 限于喜欢 on 2020-01-01 05:29:09

Question: It is known that GBTs in Spark give you predicted labels as of now. I was thinking of trying to compute predicted probabilities for a class (say, all the instances falling under a certain leaf). The code to build the GBTs:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache…
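
For binary classification trained with log loss, a common workaround (a sketch under that assumption, not an official MLlib API in this version) is to treat the weighted sum of the individual tree outputs as a margin and squash it with the logistic function; the factor 2 mirrors MLlib's log-loss formulation. In Java:

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;

    public final class GbtProbability {
        // P(class = 1) for a binary GBT trained with log loss:
        // margin = weighted sum of tree predictions, p = 1 / (1 + e^(-2 * margin)).
        public static double probabilityOfOne(GradientBoostedTreesModel model, Vector features) {
            DecisionTreeModel[] trees = model.trees();
            double[] weights = model.treeWeights();
            double margin = 0.0;
            for (int i = 0; i < trees.length; i++) {
                margin += weights[i] * trees[i].predict(features);
            }
            return 1.0 / (1.0 + Math.exp(-2.0 * margin));
        }
    }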

Anomaly detection with PCA in Spark

Submitted by 好久不见. on 2019-12-31 03:36:07

Question: I read the article "Anomaly detection with Principal Component Analysis (PCA)". The article states the following:

• The PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the …
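
The Mahalanobis distance referred to above is d(x) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the mean reading and Σ the covariance matrix. MLlib ships no matrix inverse, so the sketch below (an illustration, not the article's code) uses the common diagonal-covariance simplification, which reduces the score to a root of summed squared z-scores; per-column mean and variance can come from Statistics.colStats on the data RDD.

    public final class DiagonalMahalanobis {
        // Mahalanobis-style anomaly score under a diagonal covariance:
        // sqrt( sum_i (x_i - mean_i)^2 / var_i ). Larger = more anomalous.
        public static double score(double[] x, double[] mean, double[] variance) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - mean[i];
                sum += (diff * diff) / variance[i];
            }
            return Math.sqrt(sum);
        }
    }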

Spark 1.5.1, MLlib Random Forest Probability

Submitted by 北战南征 on 2019-12-30 07:12:22

Question: I am using Spark 1.5.1 with MLlib. I built a random forest model using MLlib and now use the model to make predictions. I can find the predicted category (0.0 or 1.0) using the .predict function. However, I can't find a function to retrieve the probability (see the attached screenshot). I thought the Spark 1.5.1 random forest would provide the probability; am I missing anything here?

Answer 1: Unfortunately the feature is not available in the older Spark MLlib 1.5.1. You can, however, find it in the recent …
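
Until an upgrade is possible, the usual workaround for the RDD-based API is to compute the class-1 probability yourself as the fraction of trees voting for that class (a sketch of that idea, not a 1.5.1 API):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    public final class ForestProbability {
        // Approximates P(class = 1) as the share of trees that vote 1.0.
        // The newer DataFrame-based RandomForestClassifier exposes a
        // probability column directly.
        public static double probabilityOfOne(RandomForestModel model, Vector features) {
            DecisionTreeModel[] trees = model.trees();
            int votes = 0;
            for (DecisionTreeModel tree : trees) {
                if (tree.predict(features) == 1.0) {
                    votes++;
                }
            }
            return (double) votes / trees.length;
        }
    }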

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

Submitted by 不想你离开。 on 2019-12-30 01:20:09

Question: I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark is not able to work with that kind of variable directly. In R there is a simple way to deal with this kind of problem: I transform the variable into a factor (categories), so R creates a set of columns coded as {0,1} indicator variables. How can I do this with Spark?

Answer 1: Using VectorIndexer, you may tell the indexer the number …
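
Besides VectorIndexer, the spark.ml pipeline stages StringIndexer and OneHotEncoder reproduce R's factor-to-indicator encoding directly. A Java sketch against the Spark 2.x API (in Spark 3.x, OneHotEncoder is an Estimator and needs fit() first; the sample data here is made up):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.feature.OneHotEncoder;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class OneHotDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("one-hot").getOrCreate();

            List<Row> data = Arrays.asList(
                    RowFactory.create("a"), RowFactory.create("b"), RowFactory.create("a"));
            StructType schema = new StructType(new StructField[] {
                    new StructField("category", DataTypes.StringType, false, Metadata.empty())});
            Dataset<Row> df = spark.createDataFrame(data, schema);

            // Map each category string to a numeric index ...
            Dataset<Row> indexed = new StringIndexer()
                    .setInputCol("category").setOutputCol("categoryIndex")
                    .fit(df).transform(df);

            // ... then expand the index into a sparse {0,1} indicator vector.
            Dataset<Row> encoded = new OneHotEncoder()
                    .setInputCol("categoryIndex").setOutputCol("categoryVec")
                    .transform(indexed);

            encoded.show(false);
            spark.stop();
        }
    }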

How to convert type Row into Vector to feed to KMeans

Submitted by 泄露秘密 on 2019-12-30 00:39:49

Question: When I try to feed df2 to KMeans, I get the following error:

    clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random")

The error I get: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector. df2 is a DataFrame created as follows:

    df = sqlContext.read.json("data/ALS3.json")
    df2 = df.select('latitude','longitude')
    df2.show()

    |  latitude|  longitude|
    |60.1643075| 24.9460844|
    |60.4686748| 22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans …
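
The RDD-based KMeans.train expects an RDD of mllib Vectors, not Rows, so each row has to be mapped to a vector first. The question uses PySpark; for consistency with the other examples here, the equivalent conversion sketched in Java (column names taken from the question):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class RowsToKMeans {
        // Maps (latitude, longitude) rows to dense vectors and clusters them.
        public static KMeansModel cluster(Dataset<Row> df2, int k, int maxIterations) {
            JavaRDD<Vector> points = df2.toJavaRDD().map(row -> Vectors.dense(
                    ((Number) row.getAs("latitude")).doubleValue(),
                    ((Number) row.getAs("longitude")).doubleValue()));
            points.cache();  // KMeans makes several passes over the data
            return KMeans.train(points.rdd(), k, maxIterations);
        }
    }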