apache-spark-mllib

How to prepare data into a LibSVM format from DataFrame?

折月煮酒 submitted on 2019-11-29 22:32:53

I want to produce LibSVM-format data. I have already shaped my DataFrame into the desired layout (shown in the figure), but I do not know how to convert it to LibSVM format. The LibSVM line I am aiming for is user item:rating. This is what I have so far:

    val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
      val fields = line.split(",")
      (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }
    val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
    val usergroup = user.groupByKey
    val data = usergroup.map{
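A minimal PySpark sketch of one way to emit such lines (the question itself uses Scala; the output directory and the assumption that each CSV row is user,item,rating are mine):

    # Build "user item:rating item:rating ..." lines from the raw ratings CSV.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    ratings = (sc.textFile("/user/ubuntu/kang/0829/rawRatings.csv")
                 .map(lambda line: line.split(","))
                 .map(lambda f: (int(f[0]), (int(f[1]), float(f[2])))))

    def to_libsvm_line(user, pairs):
        # LibSVM expects feature indices in ascending order.
        feats = " ".join("%d:%s" % (item, rate) for item, rate in sorted(pairs))
        return "%d %s" % (user, feats)

    libsvm_lines = ratings.groupByKey().map(lambda kv: to_libsvm_line(kv[0], kv[1]))
    libsvm_lines.saveAsTextFile("/user/ubuntu/kang/0829/ratings_libsvm")

If the end goal is feeding MLlib directly, building an RDD of LabeledPoint and writing it with pyspark.mllib.util.MLUtils.saveAsLibSVMFile is another option.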

Issues with Logistic Regression for multiclass classification using PySpark

為{幸葍}努か submitted on 2019-11-29 22:29:42

Question: I am trying to use Logistic Regression to classify datasets whose feature vectors are SparseVectors. For the full code base and error log, please check my GitHub repo.

Case 1: I tried using the ML pipeline as follows:

    # imported libraries from ML
    from pyspark.ml.feature import HashingTF
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression

    print(type(trainingData))    # for checking only
    print(trainingData.take(2))  # to inspect the data type
    lr = LogisticRegression
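A minimal sketch of the pipeline wiring described above, assuming trainingData is a DataFrame (not an RDD) with a tokenized "words" column and a numeric "label" column (both column names are assumptions). Note that, as far as I know, ml.classification.LogisticRegression handles multiclass labels natively only from Spark 2.1 onward; on older versions OneVsRest is the usual workaround.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF
    from pyspark.ml.classification import LogisticRegression

    # HashingTF turns the token arrays into sparse feature vectors.
    hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18)
    lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)

    pipeline = Pipeline(stages=[hashingTF, lr])
    model = pipeline.fit(trainingData)
    predictions = model.transform(trainingData)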

Spark SQL removing white spaces

折月煮酒 submitted on 2019-11-29 18:17:49

I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white space, but when I emit the CSV the leading and trailing white space is gone. Is there a way I can retain the spaces? I tried several options such as ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json

    {"key" : "k1", "value1": "Good String", "value2": "Good String"}
    {"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
    {"key" : "k1", "value1": "with tab\t", "value2": "with tab\t"}

output.csv

    _corrupt_record,key,value1,value2
    ,k1
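One thing worth checking (a sketch, not a verified fix for this exact setup): in recent Spark versions (2.2+, if I recall correctly) the CSV *writer* also honours ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace, and on the write side they default to true, so the options must be set to false on the writer rather than the reader. The output directory name here is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("input.json")

    # Write-side options default to true, which trims values; disable them to keep spaces.
    (df.write
       .option("ignoreLeadingWhiteSpace", "false")
       .option("ignoreTrailingWhiteSpace", "false")
       .option("header", "true")
       .csv("output_dir"))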

pyspark Linear Regression Example from official documentation - Bad results?

只谈情不闲聊 submitted on 2019-11-29 16:49:47

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here). I also found this question on Stack Overflow, which is essentially the same as mine. The answer suggests tweaking the step size, which I also tried, but the results are still as poor as without tweaking the step size. The code I'm using looks like this:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

    # Load and parse the data
    def parsePoint(line):
        values = [float(x) for x in
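A sketch of a variant that often helps with this symptom (hedged; not a guaranteed fix for this dataset): standardize the features before training, since SGD-based linear regression is sensitive to feature scale regardless of the step size. The file path and parsing follow the official lpsa.data example.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
    from pyspark.mllib.feature import StandardScaler

    sc = SparkContext.getOrCreate()

    def parsePoint(line):
        values = [float(x) for x in line.replace(",", " ").split(" ")]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile("data/mllib/ridge-data/lpsa.data").map(parsePoint)

    # Standardize features, then rebuild the labeled points.
    labels = data.map(lambda p: p.label)
    features = data.map(lambda p: p.features)
    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    scaled = labels.zip(scaler.transform(features)) \
                   .map(lambda lv: LabeledPoint(lv[0], lv[1]))

    model = LinearRegressionWithSGD.train(scaled, iterations=200, step=0.1)
    mse = scaled.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()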

Non-integer ids in Spark MLlib ALS

六眼飞鱼酱① submitted on 2019-11-29 16:43:06

I'd like to use:

    val ratings = data.map(_.split(',') match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toFloat)
    })
    val model = ALS.train(ratings, rank, numIterations, alpha)

However, the user ids I receive are stored as Long, and converting them to Int may produce errors. How can I solve this problem?

You can use one of the ML implementations which support Long ids. The RDD version is significantly less user friendly than the other implementations:

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.ml.recommendation.ALS.Rating

    val ratings = sc.parallelize
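A different workaround than the RDD-based answer quoted above, sketched in PySpark under assumed column names ("user", "item", "rating" on a DataFrame ratings_df): re-index the Long ids into the Int range with StringIndexer, train on the indexed columns, and join the index mapping back afterwards.

    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS

    # Map arbitrary Long/String ids to dense indexes in the Int range.
    user_indexer = StringIndexer(inputCol="user", outputCol="userIndex")
    item_indexer = StringIndexer(inputCol="item", outputCol="itemIndex")

    indexed = user_indexer.fit(ratings_df).transform(ratings_df)
    indexed = item_indexer.fit(indexed).transform(indexed)

    als = ALS(userCol="userIndex", itemCol="itemIndex", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1)
    model = als.fit(indexed)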

What hashing function does Spark use for HashingTF and how do I duplicate it?

丶灬走出姿态 submitted on 2019-11-29 15:36:51

Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each term. 1) What function does it use to do the hashing? 2) How can I reproduce the same hashed value from Python? 3) If I want to compute the hashed output for a single input, without computing the term frequency, how can I do this?

If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:

    def indexOf(self, term):
        """ Returns the index of the input term. """
        return hash(term) % self.numFeatures

As you can see it is just
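A minimal pure-Python sketch reproducing the bucketing shown above. This mirrors the pyspark.mllib HashingTF quoted in the source; the JVM implementations hash differently (the newer ml version uses MurmurHash3), so values will not generally match across languages, and under Python 3 PYTHONHASHSEED must be fixed for string hashes to be reproducible between runs.

    def index_of(term, num_features=1 << 20):
        # Same bucketing as the pyspark.mllib HashingTF source quoted above.
        return hash(term) % num_features

    def hashing_tf(terms, num_features=1 << 20):
        # Count how many terms fall into each bucket (the term-frequency vector).
        freqs = {}
        for term in terms:
            i = index_of(term, num_features)
            freqs[i] = freqs.get(i, 0.0) + 1.0
        return freqs

    bucket = index_of("spark")                    # single term, no term frequency
    tf = hashing_tf(["spark", "mllib", "spark"])  # sparse {bucket: count} dict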

Predict Class Probabilities in Spark RandomForestClassifier

 ̄綄美尐妖づ submitted on 2019-11-29 14:29:09

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see prediction classes instead of probabilities. According to this issue link, the issue is resolved, and it leads to this GitHub pull request and this. However, it seems it was resolved only in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do it, please share your thoughts or solutions. Thanks!

eliasah: I have already answered a similar question
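For reference, this is roughly what the post-1.5 API looks like (a sketch assuming Spark 1.5 or later, where the classifier emits a probability column; on 1.4.1 that column is simply not produced, so upgrading or porting the workaround from the linked pull request is needed). The DataFrame and column names are assumptions.

    from pyspark.ml.classification import RandomForestClassifier

    # train_df / test_df are assumed DataFrames with "label" and "features" columns.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=50, probabilityCol="probability")
    model = rf.fit(train_df)
    model.transform(test_df).select("prediction", "probability").show(5, False)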

Why is Spark Mllib KMeans algorithm extremely slow?

拟墨画扇 submitted on 2019-11-29 14:20:47

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans and it is extremely slow. In fact the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but it still takes more than 35 minutes, and I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the number of workers/nodes, and SVD on a 1 million x ~300,000 column matrix takes ~3 minutes, but when it
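One thing commonly suggested for this symptom (an assumption, not a verified fix for this dataset): the default k-means|| initialization can dominate the runtime when k is large, so switching to random initialization is worth trying.

    from pyspark.mllib.clustering import KMeans

    # vectors_rdd is assumed to be the cached RDD of feature vectors.
    model = KMeans.train(vectors_rdd, k=5000, maxIterations=20,
                         initializationMode="random")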

Forward fill missing values in Spark/Python

你。 submitted on 2019-11-29 14:00:24

I am attempting to fill in missing values in my Spark DataFrame with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this something Spark can do? Can it do it for multiple columns? If so, how? If not, any suggestions for alternative approaches within the whole Hadoop suite of tools? Thanks!

Romeo Kienzler: I've found a solution that works without additional coding by using a Window here. So Jeff was right, there is a solution. Full code below; I'll briefly explain what it does
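Not the quoted answer's full code, but a minimal sketch of the window-function idea it refers to, assuming Spark 2.x, an ordering column "id", and a column "value" to fill (both column names are assumptions); add partitionBy(...) to the window if the fill should reset per group, and repeat the withColumn for each column to fill.

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Carry the last non-null "value" forward along the "id" ordering.
    w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    filled = df.withColumn("value_filled",
                           F.last("value", ignorenulls=True).over(w))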

Apache Spark: How to create a matrix from a DataFrame?

浪尽此生 submitted on 2019-11-29 13:42:15

Question: I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD?

    > imagerdd = traindf.map(lambda row: map(float, row.image))
    > mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)

    Traceback (most recent call last):
      File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
        mat = DenseMatrix(numRows=206456, numCols=10,
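A sketch of one alternative route (hedged; it assumes a newer PySpark, roughly 2.2+, where RowMatrix exposes PCA from Python): a local DenseMatrix cannot wrap an RDD, but a distributed RowMatrix can, and PCA can then run on it directly.

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    # traindf is the DataFrame from the question; row.image is its array column.
    rows = traindf.rdd.map(lambda row: Vectors.dense([float(x) for x in row.image]))
    mat = RowMatrix(rows)

    pcs = mat.computePrincipalComponents(10)   # local matrix of the top 10 components
    projected = mat.multiply(pcs)              # data projected onto those components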