apache-spark-mllib

Apache Spark MLLib - Running KMeans with IDF-TF vectors - Java heap space

谁说胖子不能爱 submitted on 2019-12-03 17:29:14
I'm trying to run KMeans in MLlib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
    at scala.reflect.ManifestFactory$$anon$12.newArray
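
One commonly cited cause is that KMeans materializes k dense cluster centers at the full hashed dimension (1 << 20 by default for HashingTF), which alone can exhaust the heap. Below is a hedged Scala sketch of the pipeline with a smaller feature dimension; the 1 << 18 value, the helper name, and the hand-off from the Lucene analyzer are assumptions, not the asker's code.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// tokens: RDD[Seq[String]] produced by the Lucene analyzer (not shown here)
def clusterDocuments(tokens: RDD[Seq[String]], k: Int) = {
  // Keep the hashed dimension modest: KMeans allocates k dense centers of
  // this size, so 2^18 instead of the default 2^20 cuts memory noticeably.
  val hashingTF = new HashingTF(1 << 18)
  val tf = hashingTF.transform(tokens).cache()
  val idfModel = new IDF(minDocFreq = 2).fit(tf)
  val tfidf = idfModel.transform(tf).cache()
  KMeans.train(tfidf, k, 20) // 20 iterations
}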

How to use spark Naive Bayes classifier for text classification with IDF?

谁说胖子不能爱 submitted on 2019-12-03 15:35:34
I want to convert text documents into feature vectors using TF-IDF, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that, I get rid of the labels, and it seems impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole
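
A hedged Scala sketch of one common workaround: keep the label alongside each document, hash the documents individually, fit IDF on the feature RDD alone, and rely on transform preserving row order to zip the labels back. The "label<TAB>text" file layout and the helper name are assumptions made for illustration.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

def trainNB(sc: SparkContext, path: String) = {
  // Assumed layout: "<label>\t<text>" per line
  val labeledDocs = sc.textFile(path).map { line =>
    val Array(label, text) = line.split("\t", 2)
    (label.toDouble, text.split(" ").toSeq)
  }
  val hashingTF = new HashingTF()
  val tf = labeledDocs.map { case (_, words) => hashingTF.transform(words) }.cache()
  val idfModel = new IDF().fit(tf)
  // IDFModel.transform keeps the original row order, so labels can be zipped back
  val training = labeledDocs.map(_._1).zip(idfModel.transform(tf)).map {
    case (label, features) => LabeledPoint(label, features)
  }
  NaiveBayes.train(training)
}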

Converting a vector column in a dataframe back into an array column

本小妞迷上赌 submitted on 2019-12-03 14:13:07
I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried using several variants of the following udf, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a) })
val result = df.withColumn("dist", toDf4(df("dist"))).select("dist")

Daniel Darabos: I think it's easiest to do it by going to the RDD API and then back.

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark
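
For reference, a hedged sketch of a UDF-based conversion, assuming the column holds org.apache.spark.mllib.linalg.Vector values (as it does in Spark 1.x DataFrames) and that df is the DataFrame shown above:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// Return the vector's values as a Scala array; Spark maps it to an array<double> column
val toArray = udf((v: Vector) => v.toArray)
// Cast the doubles if an array of integers is really what is wanted
val toIntArray = udf((v: Vector) => v.toArray.map(_.toInt))

val result = df.withColumn("dist", toIntArray(df("dist")))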

Mllib dependency error

◇◆丶佛笑我妖孽 submitted on 2019-12-03 12:30:29
Question: I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program:

Object Mllib is not a member of package org.apache.spark

Then I realized that I have to add MLlib as a dependency, as follows:

version := "1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)

But here I got an error that says:

unresolved dependency spark-core_2.10
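
For reference, a minimal complete build.sbt sketch, assuming sbt 0.13+ with the default Maven Central resolvers and the Spark 1.1.0 release quoted above; the project name is a placeholder. The point is that %% appends the Scala binary version (_2.10), so scalaVersion and the published artifacts have to agree.

name := "simple-mllib-app"   // placeholder project name

version := "1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)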

PCA in Spark MLlib and Spark ML

浪子不回头ぞ submitted on 2019-12-03 11:56:43
Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility. My question is very concrete and related to PCA. The MLlib implementation seems to have a limitation on the number of columns: "spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors." Also, if you look at the Java code example, there is also this: "The number of columns should be
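
For comparison, a hedged sketch of the DataFrame-based spark.ml PCA API mentioned above; the column names, the value of k, and the existence of a DataFrame df with a vector column are placeholder assumptions.

import org.apache.spark.ml.feature.PCA

// df is assumed to have a vector column named "features"
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3) // number of principal components to keep
val pcaModel = pca.fit(df)
val projected = pcaModel.transform(df)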

Spark MLlib - trainImplicit warning

爷,独闯天下 submitted on 2019-12-03 11:52:02
I keep seeing these warnings when using trainImplicit:

WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB.

And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same. All these warnings come from ALS iterations, from flatMap and also from aggregate; for instance, this is the origin of the stage where the flatMap is showing these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):

org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation
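
A hedged Scala sketch of how the number of ALS blocks can be set explicitly when calling trainImplicit; the rank, lambda, alpha and block values are placeholders. Splitting the factors across more blocks is one way to shrink per-task size, though it is not guaranteed to silence the warning.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def train(ratings: RDD[Rating]) = {
  val rank = 10
  val iterations = 10
  val lambda = 0.01
  val blocks = 200   // more, smaller blocks => smaller serialized tasks
  val alpha = 1.0
  ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
}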

What is the right way to save/load models in Spark/PySpark

让人想犯罪 __ submitted on 2019-12-03 11:50:39
Question: I'm working with Spark 1.3.0 using PySpark and MLlib, and I need to save and load my models. I use code like this (taken from the official documentation):

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0],
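
The question is PySpark, but for reference here is a hedged Scala sketch of the model persistence API that later Spark releases (1.4 and up) expose on matrix factorization models; the function name and path are placeholders, and the 1.3.0 availability asked about above is exactly what is in doubt.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def saveAndReload(sc: SparkContext, ratings: RDD[Rating], path: String): MatrixFactorizationModel = {
  val model = ALS.train(ratings, 10, 20)
  model.save(sc, path)                     // writes model metadata plus user/product factors
  MatrixFactorizationModel.load(sc, path)  // reads them back into an equivalent model
}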

Scaling each column of a dataframe

独自空忆成欢 submitted on 2019-12-03 09:37:34
I am trying to scale every column of a dataframe. First I convert each column into a vector, and then I use the ml MinMaxScaler. Is there a better/more elegant way to apply the same function to each column other than simply repeating it?

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.DataFrame

val toVector = udf((vct: Double) => Vectors.dense(Array(vct)))
val df = (Seq((1,5,3),(4,2,9),(7,8,6))).toDF("A","B","C")
val dfVec = df.withColumn
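
A hedged sketch of one way to avoid repeating the scaler per column: fold over the column names, wrapping each in a one-element vector and fitting a MinMaxScaler for it. The helper name and the "_vec"/"_scaled" suffixes are assumptions, and the scaled values stay wrapped in one-element vectors.

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

val asVector = udf((x: Double) => Vectors.dense(Array(x)))

def scaleAllColumns(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df) { (acc, c) =>
    // Wrap the column in a vector, scale it, then drop the intermediate column
    val withVec = acc.withColumn(c + "_vec", asVector(acc(c).cast("double")))
    val scaler = new MinMaxScaler().setInputCol(c + "_vec").setOutputCol(c + "_scaled")
    scaler.fit(withVec).transform(withVec).drop(c + "_vec")
  }

// e.g. scaleAllColumns(df, Seq("A", "B", "C"))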

The value of “spark.yarn.executor.memoryOverhead” setting?

佐手、 submitted on 2019-12-03 08:42:30
Question: Should the value of spark.yarn.executor.memoryOverhead in a Spark job on YARN be allocated to the app, or is it just a max value?

Answer 1: spark.yarn.executor.memoryOverhead is just the max value. The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn
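
A small hedged sketch of the arithmetic, assuming the commonly documented defaults of a 10% factor with a 384 MB floor (older releases used a lower factor), written here as Scala for illustration:

// Rough model of the per-executor container size YARN is asked for,
// assuming overhead defaults to max(384 MB, 10% of executor memory).
val executorMemoryMb = 8 * 1024                    // --executor-memory 8g
val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)
val containerMb = executorMemoryMb + overheadMb    // about 9011 MB requested from YARN

// Setting it explicitly instead, e.g. on spark-submit:
//   --conf spark.yarn.executor.memoryOverhead=1024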

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

吃可爱长大的小学妹 submitted on 2019-12-03 08:17:14
I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset) a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. By comparison, scikit-learn takes much, much less. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark. Here it is:

df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
dataframe = sqlContext.createDataFrame
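
The question's code is PySpark, but as an illustration of where the time usually goes, here is a hedged Scala sketch of an equivalent pipeline with a deliberately small parameter grid and few cross-validation folds; every column name and parameter value below is a placeholder, since each extra grid point multiplies the number of models trained.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// df is assumed to already have a "features" vector column and a "label" column
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Keep the grid small while exploring: 2 x 2 = 4 grid points here
val grid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(new Pipeline().setStages(Array(rf)))
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)           // 3 folds x 4 grid points = 12 model fits

val model = cv.fit(df)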