apache-spark-ml

Field “features” does not exist. SparkML

我怕爱的太早我们不能终老 submitted on 2019-12-06 19:12:03
Question: I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct data types on the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(df
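
The error usually means the DataFrame passed to fit() has no "features" vector column (and often no "label" column either). Below is a minimal sketch, not from the original post, that reads the CSV with a header, treats the first column as the label and assembles the remaining columns into a "features" vector; the path is reused from the question, everything else (SparkSession `spark`, numeric feature columns) is an assumption.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

// `spark` is the SparkSession available in Zeppelin / spark-shell on Spark 2.x
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

val labelCol = raw.columns.head          // first column becomes the label
val featureCols = raw.columns.tail       // assumes the remaining columns are numeric

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val prepared = assembler.transform(raw)
  .withColumnRenamed(labelCol, "label")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(prepared)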

pyspark: CrossValidator does not work

北城余情 submitted on 2019-12-06 11:53:45
Question: I'm trying to tune the parameters of an ALS model, but the CrossValidator always chooses the first parameter combination as the best option.

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from math import sqrt
from operator import add

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("Myapp")
        .set("spark.executor.memory", "2g"))
sc =
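
One common reason the grid search degenerates to the first parameter set with ALS is that cold-start users or items produce NaN predictions, which turns the evaluator's RMSE into NaN for every grid point. The sketch below is in Scala (the pyspark API mirrors it) and is an assumption, not the poster's code; the column names, grid values and `ratings` DataFrame are illustrative.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setColdStartStrategy("drop")          // Spark 2.2+: drop NaN predictions before evaluation

val paramGrid = new ParamGridBuilder()
  .addGrid(als.rank, Array(8, 12))
  .addGrid(als.regParam, Array(0.05, 0.1, 0.3))
  .build()

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(ratings)            // `ratings` is a hypothetical DataFrame of (userId, movieId, rating)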

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

回眸只為那壹抹淺笑 submitted on 2019-12-06 10:59:32
Question: Suppose I have a Pipeline like this:

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures,
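
CrossValidator optimizes the single metric its evaluator reports, so precision and recall can be obtained by choosing the metric on a MulticlassClassificationEvaluator and then evaluating the selected model. The following is a sketch only: it assumes `pipeline` is the one above, `paramGrid` is the built Array[ParamMap], and `trainingData`/`testData` are hypothetical splits.

import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")     // the metric CrossValidator optimizes during the grid search

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(trainingData)

// precision and recall of the selected model on a held-out test set
val predictions = cvModel.transform(testData)
val precision = evaluator.setMetricName("weightedPrecision").evaluate(predictions)
val recall    = evaluator.setMetricName("weightedRecall").evaluate(predictions)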

Tagging columns as Categorical in Spark

て烟熏妆下的殇ゞ submitted on 2019-12-06 09:59:30
I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process. My questions are: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo which indicated which columns are categorical. How does ml.tree.RF know which ones are, since that parameter is not present? Also, StringIndexer maps categories to
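
In spark.ml the tree learners read categorical information from ML-attribute metadata attached to the features column rather than from a categoricalFeaturesInfo map: StringIndexer writes nominal metadata on its output column, VectorAssembler carries that metadata into the vector, and VectorIndexer can additionally mark low-cardinality columns as categorical. A hedged sketch of such a pipeline follows; the column names are made up for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.ml.classification.RandomForestClassifier

val indexer = new StringIndexer().setInputCol("colour").setOutputCol("colourIdx")

val assembler = new VectorAssembler()
  .setInputCols(Array("colourIdx", "price"))
  .setOutputCol("rawFeatures")

val vecIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(10)                  // columns with at most 10 distinct values are treated as categorical

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, vecIndexer, rf))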

Get Column Names after columnSimilarities() Spark Scala

a 夏天 submitted on 2019-12-06 07:25:17
I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results, in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame:

// RDD
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row(2.0, 7.0, 1.0),
    Row(3.5, 2.5, 0.0),
    Row(7.0, 5.9, 0.0)
  )
)

// Schema
val schema = new StructType()
  .add(StructField("item_1", DoubleType, true))
  .add(StructField("item_2", DoubleType, true))
  .add(StructField("item_3", DoubleType, true))

// Data frame
val df =
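
columnSimilarities() lives on mllib's RowMatrix and returns a CoordinateMatrix whose entries are indexed by column position, so the names can be mapped back through the original column order. The sketch below continues from the data frame above; the conversion details are an assumption, not the poster's code.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val colNames = df.columns                // Array("item_1", "item_2", "item_3")

// convert each Row into an mllib dense vector and wrap the RDD in a RowMatrix
val mat = new RowMatrix(df.rdd.map(row =>
  Vectors.dense(colNames.map(n => row.getAs[Double](n)))))

// each MatrixEntry is (column index i, column index j, cosine similarity)
val similaritiesWithNames = mat.columnSimilarities().entries.map { e =>
  (colNames(e.i.toInt), colNames(e.j.toInt), e.value)
}
similaritiesWithNames.collect().foreach(println)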

Spark RandomForest training StackOverflow error

别说谁变了你拦得住时间么 submitted on 2019-12-06 05:59:25
I am training my model and I get a StackOverflowError whenever I increase maxDepth above 12. Everything works correctly for 5, 10 and 11. I am using Spark 2.0.2 (and I cannot upgrade it for the next couple of weeks). I have more than 3M data points, 200 features and 2500 trees, and I would like to improve the accuracy by increasing the max depth. Is there a way to overcome this problem?

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 789.0 failed 4 times, most recent failure: Lost task 92.3 in stage 789.0 (TID 66903, 10.0.0.11): java
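
A StackOverflowError at large maxDepth is often a JVM thread-stack or long-lineage issue rather than a data problem. Two workarounds worth trying are raising the thread stack size and checkpointing periodically during training; the sketch below is not a confirmed fix for this exact job, and the paths and sizes are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.RandomForestClassifier

// larger thread stacks for executors; note this must be set before the SparkContext
// is created, and the driver's -Xss usually has to go on the spark-submit command line
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xss16m")

sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints")   // hypothetical path

val rf = new RandomForestClassifier()
  .setNumTrees(2500)
  .setMaxDepth(15)
  .setCheckpointInterval(10)             // break the RDD lineage every 10 iterations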

Applying IndexToString to features vector in Spark

半腔热情 submitted on 2019-12-06 02:56:30
Question: Context: I have a data frame where all categorical values have been indexed using StringIndexer.

val categoricalColumns = df.schema.collect {
  case StructField(name, StringType, nullable, meta) => name
}
val categoryIndexers = categoricalColumns.map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}

Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).

val assembler = new VectorAssembler()
  .setInputCols(dfIndexed
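
IndexToString operates on a single numeric column, not on the assembled features vector, so the indexed values have to be mapped back per original column, reusing the labels stored in each fitted StringIndexerModel. A minimal sketch, assuming a single hypothetical "category" column:

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndexed")
val indexerModel = indexer.fit(df)
val indexed = indexerModel.transform(df)

val converter = new IndexToString()
  .setInputCol("categoryIndexed")
  .setOutputCol("categoryOriginal")
  .setLabels(indexerModel.labels)        // reuse the labels learned by the StringIndexerModel

val restored = converter.transform(indexed)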

Handling NULL values in Spark StringIndexer

本秂侑毒 submitted on 2019-12-06 00:32:55
I have a dataset with some categorical string columns and I want to represent them as doubles. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values it threw a java.lang.NullPointerException and did not work. For better understanding, here is my code:

for (col <- cols) {
  out_name = col ++ "_"
  var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
  var indexed = indexer.fit(df).transform(df)
  df = (indexed.withColumn(col, indexed(out_name))).drop(out_name)
}

So how can I solve this NULL data problem with
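
One way around the NullPointerException, sketched below rather than taken from the post, is to replace NULLs with an explicit placeholder category before indexing; newer Spark versions also offer setHandleInvalid on StringIndexer to skip or keep invalid entries. The placeholder value is arbitrary, and `cols` is assumed to be the sequence of categorical column names from the loop above.

import org.apache.spark.ml.feature.StringIndexer

// fill NULLs in the categorical columns so StringIndexer never sees a null
val filled = df.na.fill("__MISSING__", cols)

val indexed = cols.foldLeft(filled) { (current, c) =>
  new StringIndexer()
    .setInputCol(c)
    .setOutputCol(c + "_idx")
    .fit(current)
    .transform(current)
}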

SPARK, ML, Tuning, CrossValidator: access the metrics

早过忘川 submitted on 2019-12-05 20:17:31
Question: In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(10)
val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes. Is it possible to access
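
The CrossValidatorModel exposes only the averaged evaluation metric per parameter combination (avgMetrics); per-fold values are not retained, so the averages plus the fitted best pipeline are what can be read back. A sketch under those assumptions:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.NaiveBayesModel

// one averaged metric per ParamMap, in the same order as the parameter grid
cvModel.avgMetrics.zip(cvModel.getEstimatorParamMaps).foreach { case (metric, params) =>
  println(s"$params -> $metric")
}

// the best fitted pipeline, e.g. to inspect the NaiveBayes stage
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val nbModel = bestPipeline.stages.last.asInstanceOf[NaiveBayesModel]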

How to deserialize Pipeline model in spark.ml?

孤街浪徒 submitted on 2019-12-05 17:53:39
I have serialized a Spark ML Pipeline model that consists of a number of Transformers (org.apache.spark.ml.Transformer) and several logistic regression learners (org.apache.spark.ml.classification.LogisticRegression). It all works fine on my Windows machine where I created the model. I serialized the model to disk using java.io.ObjectOutputStream and read it back in using java.io.ObjectInputStream. It all works fine via sbt and my corresponding unit tests. However, when I assemble my code into a jar and try to run the same code in the Spark shell on my server, I get a ClassNotFoundException
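
An alternative to Java serialization, sketched here rather than taken from the post, is the built-in spark.ml persistence, which writes the fitted pipeline to a path and reloads it without depending on the exact classpath layout that java.io serialization is sensitive to. The paths and variable names are illustrative.

import org.apache.spark.ml.{Pipeline, PipelineModel}

val fitted: PipelineModel = pipeline.fit(trainingData)
fitted.write.overwrite().save("hdfs:///models/my-pipeline")   // hypothetical path

// later, e.g. in the Spark shell on the server
val reloaded = PipelineModel.load("hdfs:///models/my-pipeline")
val scored = reloaded.transform(newData)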