apache-spark-ml

spark.ml StringIndexer throws 'Unseen label' on fit()

Submitted by 这一生的挚爱 on 2019-12-17 04:34:19
Question: I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, IPython notebook. First, this hardly has anything to do with Spark, ML, StringIndexer: handling unseen labels. The exception is thrown while fitting a pipeline to a dataset, not while transforming it. And suppressing the exception might not be a solution here, since I'm afraid the dataset gets badly corrupted in that case. My dataset is about 800Mb uncompressed, so it might be hard
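A minimal sketch of two common workarounds, assuming a Spark version whose StringIndexer supports handleInvalid; the names df, train, test, and "category" are hypothetical. Either fit the indexer on the full dataset so every label is seen, or tell it to skip unseen labels:

```python
from pyspark.ml.feature import StringIndexer

# Hypothetical names: `df` is the full dataset, `train`/`test` are splits of it,
# and "category" is the string column being indexed.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")   # drop rows carrying unseen labels
indexer_model = indexer.fit(df)                 # fitting on all data avoids unseen labels entirely
train_indexed = indexer_model.transform(train)
test_indexed = indexer_model.transform(test)
```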

Dropping a nested column from Spark DataFrame

Submitted by 故事扮演 on 2019-12-17 04:31:56
Question: I have a DataFrame with the schema root |-- label: string (nullable = true) |-- features: struct (nullable = true) | |-- feat1: string (nullable = true) | |-- feat2: string (nullable = true) | |-- feat3: string (nullable = true) While I am able to filter the data frame using val data = rawData .filter( !(rawData("features.feat1") <=> "100") ) I am unable to drop the column using val data = rawData .drop("features.feat1") Am I doing something wrong here? I also tried
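drop() does not reach into struct fields, so one workaround (a sketch, assuming the schema quoted above) is to rebuild the features struct without feat1:

```python
from pyspark.sql import functions as F

# Rebuild the `features` struct keeping only the fields we want; `rawData` is
# the DataFrame with the schema quoted above.
data = rawData.withColumn(
    "features",
    F.struct(
        F.col("features.feat2").alias("feat2"),
        F.col("features.feat3").alias("feat3"),
    ),
)
```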

How to access element of a VectorUDT column in a Spark DataFrame?

Submitted by 99封情书 on 2019-12-17 03:17:50
Question: I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say the first element? I've tried doing the following from pyspark.sql.functions import udf first_elem_udf = udf(lambda row: row.values[0]) df.select(first_elem_udf(df.features)).show() but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. Same error if I do first_elem_udf = first_elem_udf(lambda row: row.toArray()[0])
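The PickleException typically comes from the UDF returning a NumPy scalar; a hedged sketch that casts to a plain Python float and declares the return type:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# `df.features` is assumed to be a VectorUDT column as described above.
first_elem = udf(lambda v: float(v[0]), DoubleType())  # plain float avoids the numpy pickling issue
df.select(first_elem(df["features"]).alias("first_feature")).show()
```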

How to handle categorical features with spark-ml?

Submitted by 天涯浪子 on 2019-12-17 02:34:30
Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to
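A sketch of the usual StringIndexer → OneHotEncoder → VectorAssembler pipeline; the column names are hypothetical, and the OneHotEncoder API differs slightly across Spark versions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Hypothetical columns: "cat1" is categorical, "num1" is numeric, "label" is the target.
indexer = StringIndexer(inputCol="cat1", outputCol="cat1_idx")
encoder = OneHotEncoder(inputCol="cat1_idx", outputCol="cat1_vec")
assembler = VectorAssembler(inputCols=["cat1_vec", "num1"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(train_df)   # `train_df` assumed to hold the raw columns
```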

pyspark extract ROC curve?

Submitted by 人盡茶涼 on 2019-12-14 03:56:34
Question: Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not Python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html Is that right? I can certainly think of ways to implement it, but I have to imagine it’s faster if there’s a pre-built function. I’m working with 3 million scores and a few dozen models, so speed matters. Thanks! Answer 1: As long as the ROC curve is a plot of FPR against TPR, you can extract the
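The answer above is cut off; as one hedged alternative (assuming Spark 2.x and a binary LogisticRegression model), the training summary already exposes the ROC points as a small DataFrame:

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)          # `train_df` is a hypothetical training DataFrame

roc_df = model.summary.roc        # DataFrame with columns FPR and TPR
roc_points = roc_df.collect()     # small enough to collect for plotting
```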

Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?

Submitted by 匆匆过客 on 2019-12-13 04:38:41
Question: I am currently using the SGDClassifier provided by the scikit-learn library. When I use the fit method I can set the sample_weight parameter: Weights applied to individual samples. If not provided, uniform weights are assumed. These weights will be multiplied with class_weight (passed through the constructor) if class_weight is specified. I want to switch to PySpark and use the LogisticRegression class. However, I cannot find a parameter similar to sample_weight. There is a weightCol
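For reference, a minimal sketch of how weightCol is wired in, which plays roughly the role of scikit-learn's sample_weight; the numeric column name "weight" is hypothetical:

```python
from pyspark.ml.classification import LogisticRegression

# "weight" is a hypothetical numeric column holding the per-row sample weight.
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")
model = lr.fit(df)
```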

How can I access computed metrics for each fold in a CrossValidatorModel

Submitted by 懵懂的女人 on 2019-12-13 03:37:54
Question: How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics, but is it possible to get the raw results on each fold to look at, e.g., the variance of the results? I am using Spark 2.0.0. Answer 1: Studying the Spark code here For the folds, you can do the iteration yourself like this: val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed)) //K-folding operation starting //for each fold you have
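The answer's MLUtils.kFold is Scala-only; a hand-rolled pyspark sketch (hypothetical column names, and `pipeline` assumed defined elsewhere) that records one metric per fold:

```python
from pyspark.sql import functions as F
from pyspark.ml.evaluation import BinaryClassificationEvaluator

num_folds = 3
folded = df.withColumn("fold", (F.rand(seed=42) * num_folds).cast("int"))
evaluator = BinaryClassificationEvaluator(labelCol="label")

fold_metrics = []
for k in range(num_folds):
    train = folded.filter(F.col("fold") != k)
    test = folded.filter(F.col("fold") == k)
    model = pipeline.fit(train)                     # `pipeline` assumed defined elsewhere
    fold_metrics.append(evaluator.evaluate(model.transform(test)))
```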

Queries with streaming sources must be executed with writeStream.start();;

Submitted by 旧城冷巷雨未停 on 2019-12-12 20:37:41
Question: I am trying to read data from Kafka using Spark Structured Streaming and predict from incoming data. I'm using a model which I have trained using Spark ML. val spark = SparkSession .builder() .appName("Spark SQL basic example") .master("local") .getOrCreate() import spark.implicits._ val toString = udf((payload: Array[Byte]) => new String(payload)) val sentenceDataFrame = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe", "topicname1") .load(
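The error means a streaming DataFrame was hit with a batch action; a pyspark sketch of the intended pattern (the Kafka connector package, topic name, and pre-trained `pipeline_model` are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("streaming-predict")
         .master("local")
         .getOrCreate())

stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "topicname1")
             .load()
             .select(F.col("value").cast("string").alias("sentence")))

predictions = pipeline_model.transform(stream_df)   # `pipeline_model` trained offline

query = (predictions.writeStream
         .format("console")
         .outputMode("append")
         .start())                                   # start() is what the error asks for
query.awaitTermination()
```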

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

Submitted by 情到浓时终转凉″ on 2019-12-12 17:45:19
Question: This is a well-known limitation [1] of Structured Streaming that I'm trying to get around using a custom sink. In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models and streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]. I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with the prediction, and write to Kafka. An obvious
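One workaround sketch, under the assumptions that the kernel is Gaussian and that the per-key sample points and bandwidths can be broadcast: evaluate the density directly in a UDF so no RDD operation runs inside the stream. The column "value" and the samples_by_key dict are hypothetical:

```python
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical: samples_by_key maps each id to (list_of_sample_points, bandwidth).
samples_bc = spark.sparkContext.broadcast(samples_by_key)

def kde_score(key, x):
    # Gaussian kernel density estimate evaluated at x for this key's samples.
    pts, h = samples_bc.value[key]
    norm = 1.0 / (len(pts) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in pts)

kde_udf = F.udf(kde_score, DoubleType())
scored = streamingData.withColumn("prediction", kde_udf(F.col("id1"), F.col("value")))
```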

Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

Submitted by 和自甴很熟 on 2019-12-12 08:55:09
Question: I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error: Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2) at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
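A common cause of this error on a multi-machine cluster is a checkpoint directory on node-local storage; a sketch (hypothetical HDFS path, 10-second batches) of pointing the streaming checkpoint at storage every executor can see:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StatefulNetworkWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval
ssc.checkpoint("hdfs:///user/spark/checkpoints")    # shared storage, not a local path
```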