apache-spark-mllib

Spark Java IllegalArgumentException at org.apache.xbean.asm5.ClassReader

那年仲夏 submitted on 2020-01-24 03:30:30
Question: I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData): Exception in thread "main" java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) at org.apache
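
One frequently reported cause, offered here as an assumption to verify rather than a confirmed diagnosis: Spark 2.3.x bundles a shaded ASM 5 (org.apache.xbean.asm5) that cannot parse class files newer than Java 8, so running the driver on Java 9 or later fails inside ClosureCleaner with exactly this bare IllegalArgumentException. A trivial sketch for checking which JVM actually runs the application:

```scala
// Minimal check: with Spark 2.3.x this should print a 1.8.x version, since the
// bundled ASM 5 used by ClosureCleaner cannot read Java 9+ class files.
object JavaVersionCheck {
  def main(args: Array[String]): Unit =
    println(System.getProperty("java.version"))
}
```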

Explode sparse features vector into separate columns

陌路散爱 submitted on 2020-01-23 12:34:50
Question: In my Spark DataFrame I have a column containing the output of a CountVectorizer transformation, in sparse vector format. What I am trying to do is to 'explode' this column back into a dense vector and then into its component columns (so that it can be used for scoring by an external model). I know there are 40 features in the column, hence, following this example, I have tried: import org.apache.spark.sql.functions.udf import org.apache.spark.mllib.linalg.Vector // convert sparse vector
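
A minimal sketch of one way to do this with the DataFrame-based (ml) vector type, which is what CountVectorizer produces; the local SparkSession, the toy 5-feature data, and the f0..f4 column names are illustrative assumptions:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("explode-vector").master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-in for the CountVectorizer output column: 5 features instead of 40.
val df = Seq(
  Tuple1(Vectors.sparse(5, Array(1, 3), Array(1.0, 2.0))),
  Tuple1(Vectors.sparse(5, Array(0, 4), Array(3.0, 1.0)))
).toDF("features")

// UDF that densifies the sparse vector into a plain array of doubles.
val toArray = udf((v: Vector) => v.toArray)

// One output column per feature position.
val exploded = df
  .withColumn("arr", toArray(col("features")))
  .select((0 until 5).map(i => col("arr").getItem(i).alias(s"f$i")): _*)

exploded.show()
```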

Spark DataFrame not respecting schema and considering everything as String

情到浓时终转凉″ submitted on 2020-01-22 19:56:23
Question: I am facing a problem I have failed to get past for ages now. I am on Spark 1.4 and Scala 2.10, and I cannot upgrade at the moment (big distributed infrastructure). I have a file with a few hundred columns, only 2 of which are strings and the rest all Long. I want to convert this data into a Label/Features DataFrame. I have been able to get it into LibSVM format, but I just cannot get it into a Label/Features format, the reason being that I am not able to use toDF() as shown here https:/
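
A sketch of one route that avoids toDF() entirely on Spark 1.4: map the raw rows into mllib LabeledPoint objects and pass the RDD to SQLContext.createDataFrame, which derives the columns from the case-class fields. The file name, delimiter, and label position are assumptions, and the two string columns would still need to be indexed to numbers first:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setAppName("label-features").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Assumed layout: comma-separated, label in column 0, numeric features afterwards.
val points = sc.textFile("data.csv").map { line =>
  val cols = line.split(",")
  LabeledPoint(cols(0).toDouble, Vectors.dense(cols.tail.map(_.toDouble)))
}

// createDataFrame on a case-class RDD sidesteps the toDF() implicits;
// LabeledPoint's fields become the "label" and "features" columns.
val df = sqlContext.createDataFrame(points)
df.printSchema()
```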

How to encode string values into numeric values in Spark DataFrame

三世轮回 submitted on 2020-01-22 02:42:33
Question: I have a DataFrame with two columns: df = Col1 Col2 aaa bbb ccc aaa. I want to encode the String values into numeric values. I managed to do it this way: import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer} val indexer1 = new StringIndexer() .setInputCol("Col1") .setOutputCol("Col1Index") .fit(df) val indexer2 = new StringIndexer() .setInputCol("Col2") .setOutputCol("Col2Index") .fit(df) val indexed1 = indexer1.transform(df) val indexed2 = indexer2.transform(df) val encoder1 = new
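
A sketch of the usual way to avoid writing one indexer and encoder per column by hand: generate the stages from the column list and chain them in a single Pipeline. Column and output names mirror the question; note that the OneHotEncoder API differs slightly across Spark versions (in 3.x it is the estimator formerly called OneHotEncoderEstimator), so treat this as an outline rather than version-exact code:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("string-indexing").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("Col1", "Col2")

// One StringIndexer plus one OneHotEncoder per column, generated from the column list.
val stages: Array[PipelineStage] = Seq("Col1", "Col2").flatMap { c =>
  Seq(
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}Index"),
    new OneHotEncoder().setInputCol(s"${c}Index").setOutputCol(s"${c}Vec")
  )
}.toArray

val encoded = new Pipeline().setStages(stages).fit(df).transform(df)
encoded.show(false)
```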

How does Spark keep track of the splits in randomSplit?

蓝咒 submitted on 2020-01-21 12:46:09
Question: This question explains how Spark's random split works: How does Spark's RDD.randomSplit actually split the RDD. But I don't understand how Spark keeps track of which values went to one split so that those same values don't also go to the second split. If we look at the implementation of randomSplit: def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = { // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its // constituent partitions each time
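
A conceptual sketch of the mechanism (not Spark's actual code, which samples per partition): every split re-reads the same data seeded identically, so each row gets the same pseudorandom draw in every pass, and split k keeps only the rows whose draw falls inside its own slice of [0, 1) defined by the normalized cumulative weights. Because the slices are disjoint, no row can land in two splits:

```scala
import scala.util.Random

// Toy data plus the weights and seed one might pass to randomSplit.
val data = (1 to 10).toList
val weights = Array(0.7, 0.3)
val seed = 42L

// Normalized cumulative boundaries, here roughly [0.0, 0.7, 1.0].
val bounds = weights.map(_ / weights.sum).scanLeft(0.0)(_ + _)

// Each "pass" replays the same random sequence from the same seed,
// so element i always receives the same value x.
def draws(): List[Double] = {
  val rng = new Random(seed)
  data.map(_ => rng.nextDouble())
}

// Split k keeps rows with bounds(k) <= x < bounds(k + 1).
val splits = bounds.sliding(2).map { case Array(lo, hi) =>
  data.zip(draws()).collect { case (v, x) if x >= lo && x < hi => v }
}.toList

println(splits) // disjoint lists that together cover the original data
```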

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

≯℡__Kan透↙ submitted on 2020-01-16 06:54:00
Question: I'm following the documentation example "Example: Estimator, Transformer, and Param" and I got this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at SimpleApp$.main(hw.scala:75) Line 75 is the call to sqlContext.createDataFrame(): import java.util.Random import org.apache.log4j.Logger import org
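
This NoSuchMethodError on scala.reflect.api.JavaUniverse.runtimeMirror is the classic symptom of a Scala binary-version mismatch: the application was compiled against a different Scala major version (2.10 vs 2.11) than the Spark artifacts it runs with. A minimal build.sbt sketch, with the concrete version numbers as illustrative assumptions, keeping scalaVersion aligned with the Scala version your Spark distribution was built for:

```scala
// build.sbt (sketch): keep the Scala major version and the Spark artifacts in sync.
name := "SimpleApp"

scalaVersion := "2.10.5" // must match the Scala version of your Spark build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.5.0" % "provided"
)
```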

Is it possible to obtain class probabilities using GradientBoostedTrees with spark mllib?

十年热恋 submitted on 2020-01-16 01:10:55
Question: I am currently working with Spark MLlib. I have created a text classifier using the gradient boosting algorithm with the class GradientBoostedTrees: Gradient Boosted Trees. Currently I obtain the predictions to know the class of new elements, but I would like to obtain the class probabilities (the value of the output before the hard decision). In other MLlib algorithms like logistic regression you can remove the threshold from the classifier to obtain the class probabilities, but I can not find
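
One workaround that is often suggested is to bypass model.predict and compute the raw boosted score yourself from the model's trees and treeWeights, then map it through a sigmoid. The sketch below assumes a binary classifier trained with log loss (the factor of 2 in the exponent comes from that loss); it is an approximation rather than an official API:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Raw additive score of the ensemble: weighted sum of the individual tree outputs.
def margin(model: GradientBoostedTreesModel, features: Vector): Double =
  model.trees.zip(model.treeWeights).map { case (tree, w) => w * tree.predict(features) }.sum

// Assumed mapping for binary classification with log loss: positive-class
// probability as the logistic transform of twice the margin.
def probability(model: GradientBoostedTreesModel, features: Vector): Double =
  1.0 / (1.0 + math.exp(-2.0 * margin(model, features)))
```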

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago. In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only
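
For what it is worth, the deprecated streaming class does expose the current coefficients through latestModel() after each batch. A minimal sketch; the text stream source, the 3-feature dimension, and the hyperparameters are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-lr").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Assumed input lines: "label,f1 f2 f3" with three numeric features.
val training = ssc.textFileStream("data/train").map { line =>
  val Array(label, feats) = line.split(",")
  LabeledPoint(label.toDouble, Vectors.dense(feats.split(" ").map(_.toDouble)))
}

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))
  .setStepSize(0.1)

model.trainOn(training)

// Coefficients of the most recently updated model, readable after every batch.
training.foreachRDD { _ => println(model.latestModel().weights) }

ssc.start()
ssc.awaitTermination()
```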

KMeans|| for sentiment analysis on Spark

这一生的挚爱 submitted on 2020-01-15 03:05:08
Question: I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in a 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization the algorithm ran for 3 hours! But with the random initialization strategy it took about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM. K ~= 4000
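
Both behaviours come from the same initialization setting: k-means|| initialization grows expensive as k and initializationSteps increase, while random initialization is cheap but lower quality. A sketch of the relevant knobs on the mllib KMeans API; the input path and most values are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("kmeans-init").setMaster("local[4]"))

// Assumed input: one 100-dimensional vector per line, space-separated.
val vectors = sc.textFile("word_vectors.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val kmeans = new KMeans()
  .setK(4000)
  .setMaxIterations(20)
  // Keep k-means|| but lower its cost by reducing the initialization steps,
  // or switch to random initialization entirely:
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)
  .setInitializationSteps(2)
  // .setInitializationMode(KMeans.RANDOM)

val model = kmeans.run(vectors)
println(model.clusterCenters.length)
```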