apache-spark-ml

spark.ml StringIndexer throws 'Unseen label' on fit()

Submitted by 这一生的挚爱 on 2019-12-17 04:34:19
Question: I'm preparing a toy spark.ml example. Spark version 1.6.0, running on top of Oracle JDK version 1.8.0_65, pyspark, IPython notebook. First, this hardly has anything to do with Spark, ML, StringIndexer: handling unseen labels. The exception is thrown while fitting a pipeline to a dataset, not while transforming it. And suppressing the exception might not be a solution here, since I'm afraid the dataset gets badly corrupted in that case. My dataset is about 800Mb uncompressed, so it might be hard
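A minimal sketch of two common workarounds, assuming a Spark version whose StringIndexer supports handleInvalid; the names df, train, test, and "category" are hypothetical. Either fit the indexer on the full dataset so every label is seen, or tell it to skip unseen labels:

```python
from pyspark.ml.feature import StringIndexer

# Hypothetical names: `df` is the full dataset, `train`/`test` are splits of it,
# and "category" is the string column being indexed.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")   # drop rows carrying unseen labels
indexer_model = indexer.fit(df)                 # fitting on all data avoids unseen labels entirely
train_indexed = indexer_model.transform(train)
test_indexed = indexer_model.transform(test)
```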

Dropping a nested column from Spark DataFrame

Submitted by 故事扮演 on 2019-12-17 04:31:56
Question: I have a DataFrame with the schema root |-- label: string (nullable = true) |-- features: struct (nullable = true) | |-- feat1: string (nullable = true) | |-- feat2: string (nullable = true) | |-- feat3: string (nullable = true) While I am able to filter the data frame using val data = rawData .filter( !(rawData("features.feat1") <=> "100") ) I am unable to drop the column using val data = rawData .drop("features.feat1") Am I doing something wrong here? I also tried
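drop() does not reach into struct fields, so one workaround (a sketch, assuming the schema quoted above) is to rebuild the features struct without feat1:

```python
from pyspark.sql import functions as F

# Rebuild the `features` struct keeping only the fields we want; `rawData` is
# the DataFrame with the schema quoted above.
data = rawData.withColumn(
    "features",
    F.struct(
        F.col("features.feat2").alias("feat2"),
        F.col("features.feat3").alias("feat3"),
    ),
)
```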

How to access element of a VectorUDT column in a Spark DataFrame?

Submitted by 99封情书 on 2019-12-17 03:17:50
Question: I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say the first element? I've tried doing the following from pyspark.sql.functions import udf first_elem_udf = udf(lambda row: row.values[0]) df.select(first_elem_udf(df.features)).show() but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. Same error if I do first_elem_udf = first_elem_udf(lambda row: row.toArray()[0])
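The PickleException typically comes from the UDF returning a NumPy scalar; a hedged sketch that casts to a plain Python float and declares the return type:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# `df.features` is assumed to be a VectorUDT column as described above.
first_elem = udf(lambda v: float(v[0]), DoubleType())  # plain float avoids the numpy pickling issue
df.select(first_elem(df["features"]).alias("first_feature")).show()
```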

How to handle categorical features with spark-ml?

Submitted by 天涯浪子 on 2019-12-17 02:34:30
Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to
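A sketch of the usual StringIndexer → OneHotEncoder → VectorAssembler pipeline; the column names are hypothetical, and the OneHotEncoder API differs slightly across Spark versions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Hypothetical columns: "cat1" is categorical, "num1" is numeric, "label" is the target.
indexer = StringIndexer(inputCol="cat1", outputCol="cat1_idx")
encoder = OneHotEncoder(inputCol="cat1_idx", outputCol="cat1_vec")
assembler = VectorAssembler(inputCols=["cat1_vec", "num1"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(train_df)   # `train_df` assumed to hold the raw columns
```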

pyspark extract ROC curve?

Submitted by 人盡茶涼 on 2019-12-14 03:56:34
Question: Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not Python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html Is that right? I can certainly think of ways to implement it, but I have to imagine it’s faster if there’s a pre-built function. I’m working with 3 million scores and a few dozen models, so speed matters. Thanks! Answer 1: As long as the ROC curve is a plot of FPR against TPR, you can extract the
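The answer above is cut off; as one hedged alternative (assuming Spark 2.x and a binary LogisticRegression model), the training summary already exposes the ROC points as a small DataFrame:

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)          # `train_df` is a hypothetical training DataFrame

roc_df = model.summary.roc        # DataFrame with columns FPR and TPR
roc_points = roc_df.collect()     # small enough to collect for plotting
```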

Is there in PySpark a parameter equivalent to scikit-learn's sample_weight?

Submitted by 匆匆过客 on 2019-12-13 04:38:41
Question: I am currently using the SGDClassifier provided by the scikit-learn library. When I use the fit method I can set the sample_weight parameter: Weights applied to individual samples. If not provided, uniform weights are assumed. These weights will be multiplied with class_weight (passed through the constructor) if class_weight is specified. I want to switch to PySpark and use the LogisticRegression class. However, I cannot find a parameter similar to sample_weight. There is a weightCol
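For reference, a minimal sketch of how weightCol is wired in, which plays roughly the role of scikit-learn's sample_weight; the numeric column name "weight" is hypothetical:

```python
from pyspark.ml.classification import LogisticRegression

# "weight" is a hypothetical numeric column holding the per-row sample weight.
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")
model = lr.fit(df)
```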

How can I access computed metrics for each fold in a CrossValidatorModel

Submitted by 懵懂的女人 on 2019-12-13 03:37:54
Question: How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics, but is it possible to get the raw results on each fold to look at, e.g., the variance of the results? I am using Spark 2.0.0. Answer 1: Studying the Spark code here For the folds, you can do the iteration yourself like this: val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed)) //K-folding operation starting //for each fold you have
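The answer's MLUtils.kFold is Scala-only; a hand-rolled pyspark sketch (hypothetical column names, and `pipeline` assumed defined elsewhere) that records one metric per fold:

```python
from pyspark.sql import functions as F
from pyspark.ml.evaluation import BinaryClassificationEvaluator

num_folds = 3
folded = df.withColumn("fold", (F.rand(seed=42) * num_folds).cast("int"))
evaluator = BinaryClassificationEvaluator(labelCol="label")

fold_metrics = []
for k in range(num_folds):
    train = folded.filter(F.col("fold") != k)
    test = folded.filter(F.col("fold") == k)
    model = pipeline.fit(train)                     # `pipeline` assumed defined elsewhere
    fold_metrics.append(evaluator.evaluate(model.transform(test)))
```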

Queries with streaming sources must be executed with writeStream.start();;

Submitted by 旧城冷巷雨未停 on 2019-12-12 20:37:41
Question: I am trying to read data from Kafka using Spark Structured Streaming and predict from incoming data. I'm using a model which I have trained using Spark ML. val spark = SparkSession .builder() .appName("Spark SQL basic example") .master("local") .getOrCreate() import spark.implicits._ val toString = udf((payload: Array[Byte]) => new String(payload)) val sentenceDataFrame = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe", "topicname1") .load(
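The error means a streaming DataFrame was hit with a batch action; a pyspark sketch of the intended pattern (the Kafka connector package, topic name, and pre-trained `pipeline_model` are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("streaming-predict")
         .master("local")
         .getOrCreate())

stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "topicname1")
             .load()
             .select(F.col("value").cast("string").alias("sentence")))

predictions = pipeline_model.transform(stream_df)   # `pipeline_model` trained offline

query = (predictions.writeStream
         .format("console")
         .outputMode("append")
         .start())                                   # start() is what the error asks for
query.awaitTermination()
```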

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

Submitted by 情到浓时终转凉″ on 2019-12-12 17:45:19
Question: This is a well-known limitation [1] of Structured Streaming that I'm trying to get around using a custom sink. In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models and streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]. I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enhance each row with the prediction, and write to Kafka. An obvious
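One workaround sketch, under the assumptions that the kernel is Gaussian and that the per-key sample points and bandwidths can be broadcast: evaluate the density directly in a UDF so no RDD operation runs inside the stream. The column "value" and the samples_by_key dict are hypothetical:

```python
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical: samples_by_key maps each id to (list_of_sample_points, bandwidth).
samples_bc = spark.sparkContext.broadcast(samples_by_key)

def kde_score(key, x):
    # Gaussian kernel density estimate evaluated at x for this key's samples.
    pts, h = samples_bc.value[key]
    norm = 1.0 / (len(pts) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in pts)

kde_udf = F.udf(kde_score, DoubleType())
scored = streamingData.withColumn("prediction", kde_udf(F.col("id1"), F.col("value")))
```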

Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

Submitted by 和自甴很熟 on 2019-12-12 08:55:09
Question: I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error: Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2) at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
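A common cause of this error on a multi-machine cluster is a checkpoint directory on node-local storage; a sketch (hypothetical HDFS path, 10-second batches) of pointing the streaming checkpoint at storage every executor can see:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StatefulNetworkWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval
ssc.checkpoint("hdfs:///user/spark/checkpoints")    # shared storage, not a local path
```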