apache-spark-mllib

Apache Spark MLLib - Running KMeans with IDF-TF vectors - Java heap space

谁说胖子不能爱 submitted on 2019-12-03 17:29:14
I'm trying to run KMeans in MLlib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
    at scala.reflect.ManifestFactory$$anon$12.newArray
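
One commonly cited cause is that KMeans materializes k dense cluster centers at the full hashed dimension (1 << 20 by default for HashingTF), which alone can exhaust the heap. Below is a hedged Scala sketch of the pipeline with a smaller feature dimension; the 1 << 18 value, the helper name, and the hand-off from the Lucene analyzer are assumptions, not the asker's code.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

// tokens: RDD[Seq[String]] produced by the Lucene analyzer (not shown here)
def clusterDocuments(tokens: RDD[Seq[String]], k: Int) = {
  // Keep the hashed dimension modest: KMeans allocates k dense centers of
  // this size, so 2^18 instead of the default 2^20 cuts memory noticeably.
  val hashingTF = new HashingTF(1 << 18)
  val tf = hashingTF.transform(tokens).cache()
  val idfModel = new IDF(minDocFreq = 2).fit(tf)
  val tfidf = idfModel.transform(tf).cache()
  KMeans.train(tfidf, k, 20) // 20 iterations
}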

How to use spark Naive Bayes classifier for text classification with IDF?

谁说胖子不能爱 submitted on 2019-12-03 15:35:34
I want to convert text documents into feature vectors using TF-IDF, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that, I get rid of the labels, and it seems impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole
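
A hedged Scala sketch of one common workaround: keep the label alongside each document, hash the documents individually, fit IDF on the feature RDD alone, and rely on transform preserving row order to zip the labels back. The "label<TAB>text" file layout and the helper name are assumptions made for illustration.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

def trainNB(sc: SparkContext, path: String) = {
  // Assumed layout: "<label>\t<text>" per line
  val labeledDocs = sc.textFile(path).map { line =>
    val Array(label, text) = line.split("\t", 2)
    (label.toDouble, text.split(" ").toSeq)
  }
  val hashingTF = new HashingTF()
  val tf = labeledDocs.map { case (_, words) => hashingTF.transform(words) }.cache()
  val idfModel = new IDF().fit(tf)
  // IDFModel.transform keeps the original row order, so labels can be zipped back
  val training = labeledDocs.map(_._1).zip(idfModel.transform(tf)).map {
    case (label, features) => LabeledPoint(label, features)
  }
  NaiveBayes.train(training)
}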

Converting a vector column in a dataframe back into an array column

本小妞迷上赌 submitted on 2019-12-03 14:13:07
I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried using several variants of the following udf, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a) })
val result = df.withColumn("dist", toDf4(df("dist"))).select("dist")

Daniel Darabos: I think it's easiest to do it by going to the RDD API and then back.

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark
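
For reference, a hedged sketch of a UDF-based conversion, assuming the column holds org.apache.spark.mllib.linalg.Vector values (as it does in Spark 1.x DataFrames) and that df is the DataFrame shown above:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// Return the vector's values as a Scala array; Spark maps it to an array<double> column
val toArray = udf((v: Vector) => v.toArray)
// Cast the doubles if an array of integers is really what is wanted
val toIntArray = udf((v: Vector) => v.toArray.map(_.toInt))

val result = df.withColumn("dist", toIntArray(df("dist")))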

Mllib dependency error

◇◆丶佛笑我妖孽 submitted on 2019-12-03 12:30:29
Question: I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program:

Object Mllib is not a member of package org.apache.spark

Then I realized that I have to add MLlib as a dependency, as follows:

version := "1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)

But here I got an error that says:

unresolved dependency spark-core_2.10
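
For reference, a minimal complete build.sbt sketch, assuming sbt 0.13+ with the default Maven Central resolvers and the Spark 1.1.0 release quoted above; the project name is a placeholder. The point is that %% appends the Scala binary version (_2.10), so scalaVersion and the published artifacts have to agree.

name := "simple-mllib-app"   // placeholder project name

version := "1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0"
)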

PCA in Spark MLlib and Spark ML

浪子不回头ぞ submitted on 2019-12-03 11:56:43
Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility. My question is very concrete and related to PCA. The MLlib implementation seems to have a limitation on the number of columns: "spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors." Also, if you look at the Java code example, there is also this: "The number of columns should be
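
For comparison, a hedged sketch of the DataFrame-based spark.ml PCA API mentioned above; the column names, the value of k, and the existence of a DataFrame df with a vector column are placeholder assumptions.

import org.apache.spark.ml.feature.PCA

// df is assumed to have a vector column named "features"
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3) // number of principal components to keep
val pcaModel = pca.fit(df)
val projected = pcaModel.transform(df)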

Spark MLlib - trainImplicit warning

爷,独闯天下 submitted on 2019-12-03 11:52:02
I keep seeing these warnings when using trainImplicit:

WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB.

And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same. All these warnings come from ALS iterations, from flatMap and also from aggregate; for instance, this is the origin of the stage where the flatMap is showing these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):

org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation
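
A hedged Scala sketch of how the number of ALS blocks can be set explicitly when calling trainImplicit; the rank, lambda, alpha and block values are placeholders. Splitting the factors across more blocks is one way to shrink per-task size, though it is not guaranteed to silence the warning.

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def train(ratings: RDD[Rating]) = {
  val rank = 10
  val iterations = 10
  val lambda = 0.01
  val blocks = 200   // more, smaller blocks => smaller serialized tasks
  val alpha = 1.0
  ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
}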

What is the right way to save/load models in Spark/PySpark

让人想犯罪 __ submitted on 2019-12-03 11:50:39
Question: I'm working with Spark 1.3.0 using PySpark and MLlib, and I need to save and load my models. I use code like this (taken from the official documentation):

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
testdata = ratings.map(lambda p: (p[0],
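
The question is PySpark, but for reference here is a hedged Scala sketch of the model persistence API that later Spark releases (1.4 and up) expose on matrix factorization models; the function name and path are placeholders, and the 1.3.0 availability asked about above is exactly what is in doubt.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def saveAndReload(sc: SparkContext, ratings: RDD[Rating], path: String): MatrixFactorizationModel = {
  val model = ALS.train(ratings, 10, 20)
  model.save(sc, path)                     // writes model metadata plus user/product factors
  MatrixFactorizationModel.load(sc, path)  // reads them back into an equivalent model
}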

Scaling each column of a dataframe

独自空忆成欢 submitted on 2019-12-03 09:37:34
I am trying to scale every column of a dataframe. First I convert each column into a vector, and then I use the ml MinMaxScaler. Is there a better/more elegant way to apply the same function to each column other than simply repeating it?

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.DataFrame

val toVector = udf((vct: Double) => Vectors.dense(Array(vct)))
val df = (Seq((1,5,3),(4,2,9),(7,8,6))).toDF("A","B","C")
val dfVec = df.withColumn
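
A hedged sketch of one way to avoid repeating the scaler per column: fold over the column names, wrapping each in a one-element vector and fitting a MinMaxScaler for it. The helper name and the "_vec"/"_scaled" suffixes are assumptions, and the scaled values stay wrapped in one-element vectors.

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

val asVector = udf((x: Double) => Vectors.dense(Array(x)))

def scaleAllColumns(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df) { (acc, c) =>
    // Wrap the column in a vector, scale it, then drop the intermediate column
    val withVec = acc.withColumn(c + "_vec", asVector(acc(c).cast("double")))
    val scaler = new MinMaxScaler().setInputCol(c + "_vec").setOutputCol(c + "_scaled")
    scaler.fit(withVec).transform(withVec).drop(c + "_vec")
  }

// e.g. scaleAllColumns(df, Seq("A", "B", "C"))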

The value of “spark.yarn.executor.memoryOverhead” setting?

佐手、 submitted on 2019-12-03 08:42:30
Question: Should the value of spark.yarn.executor.memoryOverhead in a Spark job on YARN be allocated to the app, or is it just a max value?

Answer 1: spark.yarn.executor.memoryOverhead is just the max value. The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn
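
A small hedged sketch of the arithmetic, assuming the commonly documented defaults of a 10% factor with a 384 MB floor (older releases used a lower factor), written here as Scala for illustration:

// Rough model of the per-executor container size YARN is asked for,
// assuming overhead defaults to max(384 MB, 10% of executor memory).
val executorMemoryMb = 8 * 1024                    // --executor-memory 8g
val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)
val containerMb = executorMemoryMb + overheadMb    // about 9011 MB requested from YARN

// Setting it explicitly instead, e.g. on spark-submit:
//   --conf spark.yarn.executor.memoryOverhead=1024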

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

吃可爱长大的小学妹 submitted on 2019-12-03 08:17:14
I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset) a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. By comparison, scikit-learn takes much, much less. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark. Here it is:

df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
dataframe = sqlContext.createDataFrame
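
The question's code is PySpark, but as an illustration of where the time usually goes, here is a hedged Scala sketch of an equivalent pipeline with a deliberately small parameter grid and few cross-validation folds; every column name and parameter value below is a placeholder, since each extra grid point multiplies the number of models trained.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// df is assumed to already have a "features" vector column and a "label" column
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Keep the grid small while exploring: 2 x 2 = 4 grid points here
val grid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(new Pipeline().setStages(Array(rf)))
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)           // 3 folds x 4 grid points = 12 model fits

val model = cv.fit(df)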