apache-spark-mllib

RDD transformations and actions can only be invoked by the driver

Submitted by 和自甴很熟 on 2019-11-30 17:48:49
Question:

Error: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data
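
The standard fix is to pull the nested action out onto the driver, so that the map closure only captures a plain value. A minimal Scala sketch under that assumption (hypothetical RDDs, local mode):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-5063-fix").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(Seq((1, "a"), (2, "b")))

// Run the inner action once, on the driver, outside any transformation.
val rdd2Count = rdd2.values.count()

// The closure now captures a plain Long rather than an RDD reference,
// so it can be serialized to executors without hitting SPARK-5063.
val scaled = rdd1.map(x => rdd2Count * x)
scaled.collect().foreach(println)
```

Applied to the question's computeRatio, that means running any count or lookup on the inner RDD before the surrounding transformation and passing the result in as a number.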

Column name with dot spark

Submitted by 江枫思渺然 on 2019-11-30 17:46:18
I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val column = df.columns.map(c => s"`${c}`")
val rows = new VectorAssembler().setInputCols(column).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml
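
One common workaround is to rename the columns up front so that no later stage has to deal with the dots at all; backtick-escaping, as above, is the other route. A sketch of the renaming approach, assuming the same df (the underscore separator is a hypothetical choice):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Replace dots with underscores in every column name.
val cleaned = df.toDF(df.columns.map(_.replace(".", "_")): _*)

// VectorAssembler now sees plain identifiers and needs no escaping.
val rows = new VectorAssembler()
  .setInputCols(cleaned.columns)
  .setOutputCol("vs")
  .transform(cleaned)
  .select("vs")
  .rdd
```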

Why doesn't spark.ml implement any of spark.mllib's algorithms?

Submitted by 痞子三分冷 on 2019-11-30 17:32:47
Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms. If DataFrames are better than RDDs and the referred guide
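
Until a given algorithm lands in spark.ml, the usual bridge is to drop from the DataFrame to its underlying RDD and call the spark.mllib implementation directly. A sketch with FP-growth, assuming a hypothetical DataFrame column items of type Array[String]:

```scala
import org.apache.spark.mllib.fpm.FPGrowth

// spark.mllib wants an RDD[Array[String]], so convert the DataFrame column first.
val transactions = df.select("items").rdd.map(_.getSeq[String](0).toArray)

val model = new FPGrowth()
  .setMinSupport(0.2)   // hypothetical support threshold
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
}
```

(The gap has also been closing over time; spark.ml gained its own FPGrowth in Spark 2.2, for example.)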

Save Spark org.apache.spark.mllib.linalg.Matrix to a file

Submitted by 强颜欢笑 on 2019-11-30 15:55:16
Question: The result of correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations).

val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")

I would like to save the result into a file. How can I do this?

Answer 1: Here is a simple and effective approach to save the Matrix to HDFS and specify the separator. (The transpose is used since .toArray is in column-major format.)

val localMatrix: List[Array[Double]] = correlMatrix
  .transpose  // Transpose since .toArray is column major
  .toArray
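
One way to finish that approach without involving an RDD at all: a correlation matrix is only numCols x numCols, so it can be rebuilt row by row and written straight from the driver. A sketch with a hypothetical output path:

```scala
import java.io.{File, PrintWriter}

// Transpose first because .toArray is column-major; grouping by numCols
// then yields the rows of the original matrix.
val localRows: Iterator[Array[Double]] =
  correlMatrix.transpose.toArray.grouped(correlMatrix.numCols)

val writer = new PrintWriter(new File("correlations.csv"))  // hypothetical path
try localRows.foreach(row => writer.println(row.mkString(",")))
finally writer.close()
```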

run spark as java web application

Submitted by 北慕城南 on 2019-11-30 15:30:57
I have used Spark ML and was able to get reasonable accuracy in prediction for my business problem. The data is not huge, and I was able to transform the input (basically a CSV file) using Stanford NLP and run Naive Bayes for prediction on my local machine. I want to run this prediction service as a simple Java main program, or along with a simple MVC web application. Currently I run my prediction using the spark-submit command. Instead, can I create the Spark context and DataFrames from my servlet/controller class? I could not find any documentation on such scenarios. Kindly advise
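
Yes: spark-submit is only one launcher. A JVM application can own its own SparkSession as long as the Spark jars are on its classpath. A minimal sketch (local mode, hypothetical names) that a servlet or controller could call into:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PredictionService {
  // One lazily created session shared by all requests; building a new
  // session per request would be prohibitively expensive.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("prediction-service")
    .master("local[*]")  // or a real cluster URL
    .getOrCreate()

  def loadInput(csvPath: String): DataFrame =
    spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath)
}
```

The main operational caveat is jar conflicts between Spark's dependencies and the web container's, so this tends to be easiest with an embedded server.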

ALS model - predicted full_u * v^t * v ratings are very high

Submitted by 半腔热情 on 2019-11-30 13:41:46
I'm predicting ratings in between processes that batch-train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user = int(p[0]),
        product = int(p[1]),
        rating = float(p[2]),
    )).cache()

from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam =
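
One common explanation for the inflated predictions is that the raw product full_u * V * V^T skips the regularized least-squares solve that ALS itself performs when fitting a user's factors. A Breeze sketch of that fold-in, with hypothetical names and shapes (the question's code is PySpark, but the algebra is identical):

```scala
import breeze.linalg.{inv, DenseMatrix => BDM, DenseVector => BDV}

// v: (numItems x rank) item-factor matrix; ratings: the new user's vector.
def foldInPredictions(v: BDM[Double], ratings: BDV[Double], lambda: Double): BDV[Double] = {
  val rank = v.cols
  // Solve the same regularized problem ALS solves per user:
  //   u = (V^T V + lambda * I)^-1 * V^T * r
  val u = inv(v.t * v + BDM.eye[Double](rank) * lambda) * (v.t * ratings)
  v * u  // predictions stay on the training-rating scale
}
```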

Apache Spark: How to create a matrix from a DataFrame?

Submitted by 有些话、适合烂在心里 on 2019-11-30 08:49:01
I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD?

> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)

Traceback (most recent call last):
  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815,
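
A DenseMatrix is a local, driver-side structure, so it cannot take an RDD as its values; for data this size the distributed RowMatrix is the usual fit, and it supports PCA directly. The question is PySpark, but the same idea in Scala (hypothetical column image, assumed to already hold doubles):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each DataFrame row becomes one matrix row; nothing is materialized locally.
val vectors = traindf.rdd.map(row => Vectors.dense(row.getAs[Seq[Double]]("image").toArray))
val mat = new RowMatrix(vectors)

val pcs = mat.computePrincipalComponents(10) // top-10 components, returned as a local Matrix
val projected = mat.multiply(pcs)            // rows projected into the PCA space
```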

How to encode categorical features in Apache Spark

Submitted by 妖精的绣舞 on 2019-11-30 07:04:02
Question: I have a set of data based on which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I have to create the dense vectors (with the 0s) first. In other
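
The spark.ml way to get those sparse vectors without ever building dense ones is StringIndexer followed by one-hot encoding. A sketch assuming Spark 3's OneHotEncoder and a DataFrame df with string columns user, class, and product:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index each categorical column, then one-hot encode the indices.
val indexers = Array("user", "product").map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}Idx")
}
val encoder = new OneHotEncoder()
  .setInputCols(Array("userIdx", "productIdx"))
  .setOutputCols(Array("userVec", "productVec")) // sparse vectors, zeros implicit

val assembler = new VectorAssembler()
  .setInputCols(Array("userVec", "productVec"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ Array(encoder, assembler)
val prepared = new Pipeline().setStages(stages).fit(df).transform(df)
```

With roughly 1M users and 1M products the one-hot vectors are huge but sparse, which is exactly the representation MLlib's classifiers expect.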

How to convert ArrayType to DenseVector in PySpark DataFrame?

Submitted by 馋奶兔 on 2019-11-30 07:01:17
Question: I'm getting the following error trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

My features column contains an array of floating-point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame or do I need to convert to an
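
The direct fix is a small UDF that wraps each array in an ml DenseVector, so the column's type becomes the VectorUDT the pipeline demands. The question is PySpark; a Scala sketch of the same move, with hypothetical column names:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Array[Double] -> DenseVector; the resulting column carries VectorUDT.
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVec = df.withColumn("features_vec", toVector(col("features")))
```

PySpark has the same shape: a udf returning pyspark.ml.linalg.Vectors.dense, declared with VectorUDT() as its return type.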