apache-spark-mllib

What hashing function does Spark use for HashingTF and how do I duplicate it?

╄→尐↘猪︶ㄣ Submitted on 2019-12-29 08:46:16
Question: Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms. 1) What function does it use to do the hashing? 2) How can I achieve the same hashed value from Python? 3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?
Answer 1: If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows: def indexOf(self, term):
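As a rough way to reproduce that indexOf in plain Python, here is a minimal sketch assuming the older RDD-based pyspark.mllib HashingTF behavior (built-in hash() modulo numFeatures, with 2^20 as the assumed default); the DataFrame-based spark.ml HashingTF later switched to MurmurHash3, so values will not match there, and on Python 3 the built-in hash of strings is seed-randomized unless PYTHONHASHSEED is pinned.

```python
# Minimal sketch (assumption): mimic the old pyspark.mllib HashingTF bucket
# computation -- Python's built-in hash() modulo the number of features.
# Newer spark.ml HashingTF uses MurmurHash3, so this will NOT match it.
def index_of(term, num_features=1 << 20):
    """Return the hash bucket for a single term."""
    return hash(term) % num_features

def term_frequencies(document, num_features=1 << 20):
    """Map a tokenized document to {bucket: count}, mimicking HashingTF.transform."""
    freq = {}
    for term in document:
        i = index_of(term, num_features)
        freq[i] = freq.get(i, 0) + 1
    return freq

print(index_of("spark"))                               # bucket for a single term
print(term_frequencies(["spark", "mllib", "spark"]))   # sparse term-frequency map
```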

How can I evaluate the implicit feedback ALS algorithm for recommendations in Apache Spark?

南笙酒味 Submitted on 2019-12-28 11:58:33
Question: How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have much meaning?
Answer 1: To answer this question, you'll need to go back to the original paper that defined implicit feedback and the ALS algorithm: Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren, and Chris Volinsky. What is implicit feedback? In the absence of
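For reference, the evaluation that paper uses is rank-based rather than RMSE. Below is a minimal plain-Python sketch of Mean Percentile Ranking (the expected-percentile-ranking measure from Hu, Koren and Volinsky); the `held_out` layout and the `ranked_items` helper are assumptions for illustration, not a Spark API.

```python
# Hedged sketch: Mean Percentile Ranking (MPR) for implicit feedback.
# Lower is better; a value around 0.5 means the ranking is no better than random.
def mean_percentile_ranking(held_out, ranked_items):
    """
    held_out:     iterable of (user, item, r_ui) tuples from a test split,
                  where r_ui is the observed implicit strength (e.g. play count).
    ranked_items: function user -> list of ALL items ordered by predicted
                  preference, best first (hypothetical helper).
    """
    num, den = 0.0, 0.0
    for user, item, r_ui in held_out:
        ranking = ranked_items(user)
        # percentile rank of the held-out item in this user's recommendation list
        rank_ui = ranking.index(item) / (len(ranking) - 1)
        num += r_ui * rank_ui
        den += r_ui
    return num / den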

Create labeledPoints from Spark DataFrame in Python

冷暖自知 Submitted on 2019-12-28 11:49:35
Question: What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column but I can refer to it by its column name, 'status'? I create the Python dataframe with this .map() function: def parsePoint(line): listmp = list(line.split('\t')) dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose() dataframe.insert(0, 'status', dataframe['accepted']) if 'NULL' in dataframe.columns: dataframe = dataframe
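A minimal sketch of the usual pattern for this, assuming `df` is a Spark DataFrame whose label column is named 'status' and whose remaining columns are numeric; the variable names are placeholders, not the asker's code.

```python
# Hedged sketch: build LabeledPoints from a Spark DataFrame whose label column
# is 'status' (not necessarily the first column).
from pyspark.mllib.regression import LabeledPoint

feature_cols = [c for c in df.columns if c != 'status']   # everything except the label

labeled = df.rdd.map(
    lambda row: LabeledPoint(row['status'], [row[c] for c in feature_cols])
)

print(labeled.take(2))
```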

Why does foreachRDD not populate DataFrame with new content using StreamingContext.textFileStream?

别说谁变了你拦得住时间么 Submitted on 2019-12-25 16:59:10
Question: My problem is that, as I change my code into streaming mode and put my data frame into the foreach loop, the data frame shows an empty table! It doesn't fill! I also cannot put it into assembler.transform(). The error is: Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U]. Unspecified value parameter mapFunc. val dataFrame = Train_DStream.map() My train.csv file is like below:
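The compiler error itself just says that map() was called without a function. As a hedged illustration of the overall pattern (in PySpark rather than the asker's Scala), each micro-batch from textFileStream arrives as an RDD, so the DataFrame has to be rebuilt inside foreachRDD; the directory path and CSV layout below are assumptions.

```python
# Hedged sketch: rebuild a DataFrame per micro-batch inside foreachRDD.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("stream-to-df").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=10)

lines = ssc.textFileStream("/tmp/train_dir")           # hypothetical input directory

def handle_batch(time, rdd):
    if rdd.isEmpty():                                   # many batches contain no new files
        return
    rows = rdd.map(lambda line: line.split(","))        # map() must be given a parsing function
    df = spark.createDataFrame(rows)                    # columns default to _1, _2, ...; add a schema as needed
    df.show()

lines.foreachRDD(handle_batch)
ssc.start()
ssc.awaitTermination()
```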

Why do my weights get normalized when I perform Logistic Regression with SGD in Spark?

核能气质少年 Submitted on 2019-12-25 09:27:22
Question: I recently asked a question because I was confused about the weights I was receiving for the synthetic dataset I created. The answer I received was that the weights are being normalized. You can look at the details here. I'm wondering why LogisticRegressionWithSGD gives normalized weights whereas everything is fine in the case of LBFGS in the same Spark implementation. Is it possible that the weights weren't converging after all? Weights I'm getting: [0.466521045342,0.699614292387,0.932673108363,0
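One quick sanity check (my own suggestion, not from the thread): if the SGD weights are merely a rescaled version of what LBFGS returns, the two vectors point in the same direction, which for a linear classifier means the same decision boundary up to the intercept. A small NumPy sketch; the LBFGS values below are purely hypothetical, for illustration only.

```python
# Hedged sketch: compare the *direction* of two weight vectors rather than
# their raw values. A cosine similarity near 1 means one vector is (almost)
# a positive rescaling of the other.
import numpy as np

def cosine_similarity(w1, w2):
    w1, w2 = np.asarray(w1), np.asarray(w2)
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

sgd_weights = [0.466521045342, 0.699614292387, 0.932673108363]  # truncated values from the question
lbfgs_weights = [4.67, 6.99, 9.33]                              # hypothetical, for illustration only
print(cosine_similarity(sgd_weights, lbfgs_weights))            # ~1.0 -> same direction
```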

Spark-Shell— error: object jblas is not a member of package org (Windows)

旧时模样 Submitted on 2019-12-25 09:21:38
Question: I am running code in the Spark shell on Windows: import org.jblas.DoubleMatrix The error which I am getting is: error: object jblas is not a member of package org I researched on Stack Overflow, but the answer is available for Linux systems only. Any help will be greatly appreciated. Kind regards, Innocent
Answer 1: You should add jblas to your classpath when you start up spark-shell, such as: bin/spark-shell --packages org.jblas:jblas:1.2.4-SNAPSHOT Then, Ivy in the Spark distribution will load jblas

Issues with Logistic Regression for multiclass classification using PySpark

女生的网名这么多〃 Submitted on 2019-12-25 08:58:10
Question: I am trying to use Logistic Regression to classify datasets whose feature vectors are SparseVectors. For the full code base and error log, please check my GitHub repo. Case 1: I tried using the ML pipeline as follows: # imported library from ML from pyspark.ml.feature import HashingTF from pyspark.ml import Pipeline from pyspark.ml.classification import LogisticRegression print(type(trainingData)) # for checking only print(trainingData.take(2)) # for checking the data type lr = LogisticRegression
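Continuing the imports in the excerpt, here is a hedged sketch of what such a pipeline usually looks like for multiclass text classification; the column names ('text', 'label'), the Tokenizer stage, the parameter values, and `testData` are assumptions rather than the asker's actual setup, and family="multinomial" requires Spark 2.1+.

```python
# Hedged sketch: pyspark.ml Pipeline for multiclass logistic regression over text.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")            # assumed input column
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=20, regParam=0.01, family="multinomial")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(trainingData)        # trainingData: DataFrame with 'text' and 'label' columns
predictions = model.transform(testData)   # testData: hypothetical DataFrame with the same schema
predictions.select("label", "prediction").show(5)
```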

Apache Spark's RDD[Vector] Immutability issue

别来无恙 Submitted on 2019-12-25 06:43:56
Question: I know that RDDs are immutable and therefore their values cannot be changed, but I see the following behaviour: I wrote an implementation of the FuzzyCMeans algorithm (https://github.com/salexln/FinalProject_FCM) and now I'm testing it, so I run the following example: import org.apache.spark.mllib.clustering.FuzzyCMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt") val parsedData = data.map(s => Vectors.dense(s.split(' '

What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?

可紊 Submitted on 2019-12-25 05:51:21
Question: I know what k-means is and I also understand what the k-means++ algorithm is. I believe the only change is the way the initial k centers are found. In the ++ version we initially choose a center and, using a probability distribution, we choose the remaining k-1 centers. In the MLlib algorithm for k-means, what is the initializationSteps parameter?
Answer 1: To be precise, k-means++ is an algorithm for choosing initial centers and it doesn't describe a whole training process. MLlib k-means is using k-means
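To make the parameter concrete, a hedged sketch using the RDD-based pyspark.mllib API: initializationSteps sets how many rounds of the k-means|| (parallel k-means++) seeding phase run before the regular Lloyd iterations start. Here `sc` is assumed to be an existing SparkContext and the toy data is made up.

```python
# Hedged sketch: passing initializationSteps to MLlib's k-means.
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

points = sc.parallelize([
    Vectors.dense([0.0, 0.0]), Vectors.dense([0.1, 0.1]),
    Vectors.dense([9.0, 9.0]), Vectors.dense([9.1, 9.1]),
])

model = KMeans.train(
    points,
    k=2,
    maxIterations=20,
    initializationMode="k-means||",   # the parallel k-means++ variant
    initializationSteps=5,            # rounds of the k-means|| oversampling phase
)
print(model.clusterCenters)
```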

spark pipeline vector assembler drop other columns

回眸只為那壹抹淺笑 Submitted on 2019-12-25 04:38:12
Question: A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

id | hour | mobile | userFeatures     | clicked | features
---|------|--------|------------------|---------|-----------------------------
0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all the previous features. Is it better / more performant if the other columns are removed, e.g. only the label/id and features are
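A hedged sketch of the pattern the question describes, based on the VectorAssembler example in the Spark docs: assemble, then select() only the columns downstream stages need. Whether this helps performance depends on the query plan; Spark's optimizer typically prunes unused columns on its own, so the main gain is usually clarity. Here `df` is assumed to be the DataFrame from the docs example.

```python
# Hedged sketch: drop the raw feature columns after assembling 'features'.
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features",
)

assembled = assembler.transform(df)                   # df: DataFrame with the columns shown above
slim = assembled.select("id", "clicked", "features")  # keep only id, label, and assembled features
slim.show(truncate=False)
```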