apache-spark-mllib

RDD transformations and actions can only be invoked by the driver

Submitted by 和自甴很熟 on 2019-11-30 17:48:49
Question:

Error: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

def computeRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {
  val numDistinctUsers = test_data
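
The standard fix is to pull the nested action out onto the driver, so that the map closure only captures a plain value. A minimal Scala sketch under that assumption (hypothetical RDDs, local mode):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-5063-fix").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(Seq((1, "a"), (2, "b")))

// Run the inner action once, on the driver, outside any transformation.
val rdd2Count = rdd2.values.count()

// The closure now captures a plain Long rather than an RDD reference,
// so it can be serialized to executors without hitting SPARK-5063.
val scaled = rdd1.map(x => rdd2Count * x)
scaled.collect().foreach(println)
```

Applied to the question's computeRatio, that means running any count or lookup on the inner RDD before the surrounding transformation and passing the result in as a number.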

Column name with dot spark

Submitted by 江枫思渺然 on 2019-11-30 17:46:18
I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val column = df.columns.map(c => s"`${c}`")
val rows = new VectorAssembler().setInputCols(column).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml
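
One common workaround is to rename the columns up front so that no later stage has to deal with the dots at all; backtick-escaping, as above, is the other route. A sketch of the renaming approach, assuming the same df (the underscore separator is a hypothetical choice):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Replace dots with underscores in every column name.
val cleaned = df.toDF(df.columns.map(_.replace(".", "_")): _*)

// VectorAssembler now sees plain identifiers and needs no escaping.
val rows = new VectorAssembler()
  .setInputCols(cleaned.columns)
  .setOutputCol("vs")
  .transform(cleaned)
  .select("vs")
  .rdd
```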

Why doesn't spark.ml implement any of spark.mllib's algorithms?

Submitted by 痞子三分冷 on 2019-11-30 17:32:47
Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms. If DataFrames are better than RDDs and the referred guide
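
Until a given algorithm lands in spark.ml, the usual bridge is to drop from the DataFrame to its underlying RDD and call the spark.mllib implementation directly. A sketch with FP-growth, assuming a hypothetical DataFrame column items of type Array[String]:

```scala
import org.apache.spark.mllib.fpm.FPGrowth

// spark.mllib wants an RDD[Array[String]], so convert the DataFrame column first.
val transactions = df.select("items").rdd.map(_.getSeq[String](0).toArray)

val model = new FPGrowth()
  .setMinSupport(0.2)   // hypothetical support threshold
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
}
```

(The gap has also been closing over time; spark.ml gained its own FPGrowth in Spark 2.2, for example.)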

Save Spark org.apache.spark.mllib.linalg.Matrix to a file

Submitted by 强颜欢笑 on 2019-11-30 15:55:16
Question: The result of correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations).

val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")

I would like to save the result into a file. How can I do this?

Answer 1: Here is a simple and effective approach to save the Matrix to HDFS and specify the separator. (The transpose is used since .toArray is in column-major format.)

val localMatrix: List[Array[Double]] = correlMatrix
  .transpose  // Transpose since .toArray is column major
  .toArray
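
One way to finish that approach without involving an RDD at all: a correlation matrix is only numCols x numCols, so it can be rebuilt row by row and written straight from the driver. A sketch with a hypothetical output path:

```scala
import java.io.{File, PrintWriter}

// Transpose first because .toArray is column-major; grouping by numCols
// then yields the rows of the original matrix.
val localRows: Iterator[Array[Double]] =
  correlMatrix.transpose.toArray.grouped(correlMatrix.numCols)

val writer = new PrintWriter(new File("correlations.csv"))  // hypothetical path
try localRows.foreach(row => writer.println(row.mkString(",")))
finally writer.close()
```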

run spark as java web application

Submitted by 北慕城南 on 2019-11-30 15:30:57
I have used Spark ML and was able to get reasonable accuracy in prediction for my business problem. The data is not huge, and I was able to transform the input (basically a CSV file) using Stanford NLP and run Naive Bayes for prediction on my local machine. I want to run this prediction service as a simple Java main program, or along with a simple MVC web application. Currently I run my prediction using the spark-submit command. Instead, can I create the Spark context and DataFrames from my servlet/controller class? I could not find any documentation on such scenarios. Kindly advise
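
Yes: spark-submit is only one launcher. A JVM application can own its own SparkSession as long as the Spark jars are on its classpath. A minimal sketch (local mode, hypothetical names) that a servlet or controller could call into:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PredictionService {
  // One lazily created session shared by all requests; building a new
  // session per request would be prohibitively expensive.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("prediction-service")
    .master("local[*]")  // or a real cluster URL
    .getOrCreate()

  def loadInput(csvPath: String): DataFrame =
    spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath)
}
```

The main operational caveat is jar conflicts between Spark's dependencies and the web container's, so this tends to be easiest with an embedded server.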

ALS model - predicted full_u * v^t * v ratings are very high

Submitted by 半腔热情 on 2019-11-30 13:41:46
I'm predicting ratings in between processes that batch-train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user = int(p[0]),
        product = int(p[1]),
        rating = float(p[2]),
    )).cache()

from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam =
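
One common explanation for the inflated predictions is that the raw product full_u * V * V^T skips the regularized least-squares solve that ALS itself performs when fitting a user's factors. A Breeze sketch of that fold-in, with hypothetical names and shapes (the question's code is PySpark, but the algebra is identical):

```scala
import breeze.linalg.{inv, DenseMatrix => BDM, DenseVector => BDV}

// v: (numItems x rank) item-factor matrix; ratings: the new user's vector.
def foldInPredictions(v: BDM[Double], ratings: BDV[Double], lambda: Double): BDV[Double] = {
  val rank = v.cols
  // Solve the same regularized problem ALS solves per user:
  //   u = (V^T V + lambda * I)^-1 * V^T * r
  val u = inv(v.t * v + BDM.eye[Double](rank) * lambda) * (v.t * ratings)
  v * u  // predictions stay on the training-rating scale
}
```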

Apache Spark: How to create a matrix from a DataFrame?

Submitted by 有些话、适合烂在心里 on 2019-11-30 08:49:01
I have a DataFrame in Apache Spark with an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD?

> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)

Traceback (most recent call last):
  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815,
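
A DenseMatrix is a local, driver-side structure, so it cannot take an RDD as its values; for data this size the distributed RowMatrix is the usual fit, and it supports PCA directly. The question is PySpark, but the same idea in Scala (hypothetical column image, assumed to already hold doubles):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each DataFrame row becomes one matrix row; nothing is materialized locally.
val vectors = traindf.rdd.map(row => Vectors.dense(row.getAs[Seq[Double]]("image").toArray))
val mat = new RowMatrix(vectors)

val pcs = mat.computePrincipalComponents(10) // top-10 components, returned as a local Matrix
val projected = mat.multiply(pcs)            // rows projected into the PCA space
```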

How to encode categorical features in Apache Spark

Submitted by 妖精的绣舞 on 2019-11-30 07:04:02
Question: I have a set of data based on which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I have to create the dense vectors (with the 0s) first. In other
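
The spark.ml way to get those sparse vectors without ever building dense ones is StringIndexer followed by one-hot encoding. A sketch assuming Spark 3's OneHotEncoder and a DataFrame df with string columns user, class, and product:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index each categorical column, then one-hot encode the indices.
val indexers = Array("user", "product").map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}Idx")
}
val encoder = new OneHotEncoder()
  .setInputCols(Array("userIdx", "productIdx"))
  .setOutputCols(Array("userVec", "productVec")) // sparse vectors, zeros implicit

val assembler = new VectorAssembler()
  .setInputCols(Array("userVec", "productVec"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ Array(encoder, assembler)
val prepared = new Pipeline().setStages(stages).fit(df).transform(df)
```

With roughly 1M users and 1M products the one-hot vectors are huge but sparse, which is exactly the representation MLlib's classifiers expect.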

How to convert ArrayType to DenseVector in PySpark DataFrame?

Submitted by 馋奶兔 on 2019-11-30 07:01:17
Question: I'm getting the following error trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

My features column contains an array of floating-point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame or do I need to convert to an
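
The direct fix is a small UDF that wraps each array in an ml DenseVector, so the column's type becomes the VectorUDT the pipeline demands. The question is PySpark; a Scala sketch of the same move, with hypothetical column names:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Array[Double] -> DenseVector; the resulting column carries VectorUDT.
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVec = df.withColumn("features_vec", toVector(col("features")))
```

PySpark has the same shape: a udf returning pyspark.ml.linalg.Vectors.dense, declared with VectorUDT() as its return type.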