apache-spark-ml

Field “features” does not exist. SparkML

我怕爱的太早我们不能终老 submitted on 2019-12-06 19:12:03
Question: I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct data types on the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(df
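
The error usually means the DataFrame passed to fit() has no "features" vector column (and often no "label" column either). Below is a minimal sketch, not from the original post, that reads the CSV with a header, treats the first column as the label and assembles the remaining columns into a "features" vector; the path is reused from the question, everything else (SparkSession `spark`, numeric feature columns) is an assumption.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

// `spark` is the SparkSession available in Zeppelin / spark-shell on Spark 2.x
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

val labelCol = raw.columns.head          // first column becomes the label
val featureCols = raw.columns.tail       // assumes the remaining columns are numeric

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val prepared = assembler.transform(raw)
  .withColumnRenamed(labelCol, "label")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(prepared)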

pyspark: CrossValidator does not work

北城余情 submitted on 2019-12-06 11:53:45
Question: I'm trying to tune the parameters of an ALS model, but the CrossValidator always chooses the first parameter combination as the best option.

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from math import sqrt
from operator import add

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("Myapp")
        .set("spark.executor.memory", "2g"))
sc =
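
One common reason the grid search degenerates to the first parameter set with ALS is that cold-start users or items produce NaN predictions, which turns the evaluator's RMSE into NaN for every grid point. The sketch below is in Scala (the pyspark API mirrors it) and is an assumption, not the poster's code; the column names, grid values and `ratings` DataFrame are illustrative.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setColdStartStrategy("drop")          // Spark 2.2+: drop NaN predictions before evaluation

val paramGrid = new ParamGridBuilder()
  .addGrid(als.rank, Array(8, 12))
  .addGrid(als.regParam, Array(0.05, 0.1, 0.3))
  .build()

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(ratings)            // `ratings` is a hypothetical DataFrame of (userId, movieId, rating)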

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

回眸只為那壹抹淺笑 submitted on 2019-12-06 10:59:32
Question: Suppose I have a Pipeline like this:

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures,
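
CrossValidator optimizes the single metric its evaluator reports, so precision and recall can be obtained by choosing the metric on a MulticlassClassificationEvaluator and then evaluating the selected model. The following is a sketch only: it assumes `pipeline` is the one above, `paramGrid` is the built Array[ParamMap], and `trainingData`/`testData` are hypothetical splits.

import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")     // the metric CrossValidator optimizes during the grid search

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(trainingData)

// precision and recall of the selected model on a held-out test set
val predictions = cvModel.transform(testData)
val precision = evaluator.setMetricName("weightedPrecision").evaluate(predictions)
val recall    = evaluator.setMetricName("weightedRecall").evaluate(predictions)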

Tagging columns as Categorical in Spark

て烟熏妆下的殇ゞ submitted on 2019-12-06 09:59:30
I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process. My questions are: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo which indicated which columns are categorical. How does ml.tree.RF know which ones are, since that parameter is not present? Also, StringIndexer maps categories to
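
In spark.ml the tree learners read categorical information from ML-attribute metadata attached to the features column rather than from a categoricalFeaturesInfo map: StringIndexer writes nominal metadata on its output column, VectorAssembler carries that metadata into the vector, and VectorIndexer can additionally mark low-cardinality columns as categorical. A hedged sketch of such a pipeline follows; the column names are made up for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.ml.classification.RandomForestClassifier

val indexer = new StringIndexer().setInputCol("colour").setOutputCol("colourIdx")

val assembler = new VectorAssembler()
  .setInputCols(Array("colourIdx", "price"))
  .setOutputCol("rawFeatures")

val vecIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(10)                  // columns with at most 10 distinct values are treated as categorical

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, vecIndexer, rf))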

Get Column Names after columnSimilarities() Spark Scala

a 夏天 submitted on 2019-12-06 07:25:17
I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame:

// RDD
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row(2.0, 7.0, 1.0),
    Row(3.5, 2.5, 0.0),
    Row(7.0, 5.9, 0.0)
  )
)

// Schema
val schema = new StructType()
  .add(StructField("item_1", DoubleType, true))
  .add(StructField("item_2", DoubleType, true))
  .add(StructField("item_3", DoubleType, true))

// Data frame
val df =
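
columnSimilarities() lives on mllib's RowMatrix and returns a CoordinateMatrix whose entries are indexed by column position, so the names can be mapped back through the original column order. The sketch below continues from the data frame above; the conversion details are an assumption, not the poster's code.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val colNames = df.columns                // Array("item_1", "item_2", "item_3")

// convert each Row into an mllib dense vector and wrap the RDD in a RowMatrix
val mat = new RowMatrix(df.rdd.map(row =>
  Vectors.dense(colNames.map(n => row.getAs[Double](n)))))

// each MatrixEntry is (column index i, column index j, cosine similarity)
val similaritiesWithNames = mat.columnSimilarities().entries.map { e =>
  (colNames(e.i.toInt), colNames(e.j.toInt), e.value)
}
similaritiesWithNames.collect().foreach(println)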

Spark RandomForest training StackOverflow error

别说谁变了你拦得住时间么 submitted on 2019-12-06 05:59:25
I am training my model and I get a StackOverflowError whenever I increase maxDepth above 12. Everything works correctly for 5, 10 and 11. I am using Spark 2.0.2 (and I cannot upgrade it for the next couple of weeks). I have more than 3M data points, 200 features and 2500 trees, and I would like to improve the accuracy by increasing the max depth. Is there a way to overcome this problem?

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 789.0 failed 4 times, most recent failure: Lost task 92.3 in stage 789.0 (TID 66903, 10.0.0.11): java
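
A StackOverflowError at large maxDepth is often a JVM thread-stack or long-lineage issue rather than a data problem. Two workarounds worth trying are raising the thread stack size and checkpointing periodically during training; the sketch below is not a confirmed fix for this exact job, and the paths and sizes are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.RandomForestClassifier

// larger thread stacks for executors; note this must be set before the SparkContext
// is created, and the driver's -Xss usually has to go on the spark-submit command line
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xss16m")

sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints")   // hypothetical path

val rf = new RandomForestClassifier()
  .setNumTrees(2500)
  .setMaxDepth(15)
  .setCheckpointInterval(10)             // break the RDD lineage every 10 iterations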

Applying IndexToString to features vector in Spark

半腔热情 submitted on 2019-12-06 02:56:30
Question: Context: I have a data frame where all categorical values have been indexed using StringIndexer.

val categoricalColumns = df.schema.collect {
  case StructField(name, StringType, nullable, meta) => name
}
val categoryIndexers = categoricalColumns.map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}

Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).

val assembler = new VectorAssembler()
  .setInputCols(dfIndexed
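
IndexToString operates on a single numeric column, not on the assembled features vector, so the indexed values have to be mapped back per original column, reusing the labels stored in each fitted StringIndexerModel. A minimal sketch, assuming a single hypothetical "category" column:

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndexed")
val indexerModel = indexer.fit(df)
val indexed = indexerModel.transform(df)

val converter = new IndexToString()
  .setInputCol("categoryIndexed")
  .setOutputCol("categoryOriginal")
  .setLabels(indexerModel.labels)        // reuse the labels learned by the StringIndexerModel

val restored = converter.transform(indexed)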

Handling NULL values in Spark StringIndexer

本秂侑毒 submitted on 2019-12-06 00:32:55
I have a dataset with some categorical string columns and I want to represent them as doubles. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values it threw a java.lang.NullPointerException and did not work. For better understanding, here is my code:

for (col <- cols) {
  out_name = col ++ "_"
  var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
  var indexed = indexer.fit(df).transform(df)
  df = (indexed.withColumn(col, indexed(out_name))).drop(out_name)
}

So how can I solve this NULL data problem with
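
One way around the NullPointerException, sketched below rather than taken from the post, is to replace NULLs with an explicit placeholder category before indexing; newer Spark versions also offer setHandleInvalid on StringIndexer to skip or keep invalid entries. The placeholder value is arbitrary, and `cols` is assumed to be the sequence of categorical column names from the loop above.

import org.apache.spark.ml.feature.StringIndexer

// fill NULLs in the categorical columns so StringIndexer never sees a null
val filled = df.na.fill("__MISSING__", cols)

val indexed = cols.foldLeft(filled) { (current, c) =>
  new StringIndexer()
    .setInputCol(c)
    .setOutputCol(c + "_idx")
    .fit(current)
    .transform(current)
}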

SPARK, ML, Tuning, CrossValidator: access the metrics

早过忘川 submitted on 2019-12-05 20:17:31
Question: In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(10)
val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes. Is it possible to access
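
The CrossValidatorModel exposes only the averaged evaluation metric per parameter combination (avgMetrics); per-fold values are not retained, so the averages plus the fitted best pipeline are what can be read back. A sketch under those assumptions:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.NaiveBayesModel

// one averaged metric per ParamMap, in the same order as the parameter grid
cvModel.avgMetrics.zip(cvModel.getEstimatorParamMaps).foreach { case (metric, params) =>
  println(s"$params -> $metric")
}

// the best fitted pipeline, e.g. to inspect the NaiveBayes stage
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val nbModel = bestPipeline.stages.last.asInstanceOf[NaiveBayesModel]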

How to deserialize Pipeline model in spark.ml?

孤街浪徒 submitted on 2019-12-05 17:53:39
I have serialized a Spark ML Pipeline model that consists of a number of Transformers (org.apache.spark.ml.Transformer) and several logistic regression learners (org.apache.spark.ml.classification.LogisticRegression). It all works fine on my Windows machine where I created the model. I serialized the model to disk using java.io.ObjectOutputStream and read it back in using java.io.ObjectInputStream. It all works fine via sbt and my corresponding unit tests. However, when I assemble my code into a jar and try to run the same code in the Spark shell on my server, I get a ClassNotFoundException
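
An alternative to Java serialization, sketched here rather than taken from the post, is the built-in spark.ml persistence, which writes the fitted pipeline to a path and reloads it without depending on the exact classpath layout that java.io serialization is sensitive to. The paths and variable names are illustrative.

import org.apache.spark.ml.{Pipeline, PipelineModel}

val fitted: PipelineModel = pipeline.fit(trainingData)
fitted.write.overwrite().save("hdfs:///models/my-pipeline")   // hypothetical path

// later, e.g. in the Spark shell on the server
val reloaded = PipelineModel.load("hdfs:///models/my-pipeline")
val scored = reloaded.transform(newData)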