apache-spark-ml

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Submitted by 泄露秘密 on 2019-11-29 00:54:01
Question: This question already has answers here: How to access element of a VectorUDT column in a Spark DataFrame? (2 answers). Closed 2 years ago. I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1] type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame
[2] cv_predictions_prod.select('probability').show(10, False)
+------------------------
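A minimal PySpark sketch of the usual workaround: Column.getItem does not work on a VectorUDT column, so the element access is wrapped in a UDF. It assumes the cv_predictions_prod DataFrame and probability column from the excerpt; the p1 output column name is made up for illustration.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # UDF that pulls one element out of an ML DenseVector/SparseVector.
    prob_of_class_1 = udf(lambda v: float(v[1]), DoubleType())

    cv_predictions_prod = cv_predictions_prod.withColumn("p1", prob_of_class_1("probability"))
    cv_predictions_prod.select("probability", "p1").show(10, False)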

How to map variable names to features after pipeline

Submitted by 非 Y 不嫁゛ on 2019-11-28 21:59:06
I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is: how do I map the generated weights back to the categorical variables?
def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0), (1, "b", 1.0), (2, "c", 0.0),
    (3, "d", 1.0), (4, "e", 1.0), (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()
  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
  val indexed = indexer.transform(df)
  indexed.select("id", "categoryIndex").show()
  val encoder = new
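The excerpt is Scala, but the mapping from weights back to categorical levels is usually read from the ML attribute metadata on the features column; here is a rough PySpark sketch of that idea. The pipeline_model and predictions names are hypothetical, and it assumes a StringIndexer/OneHotEncoder/VectorAssembler pipeline ending in a LogisticRegression.

    # Hypothetical names: pipeline_model is the fitted PipelineModel, predictions
    # is its transform() output carrying an assembled "features" column.
    lr_model = pipeline_model.stages[-1]          # the LogisticRegressionModel
    attrs = predictions.schema["features"].metadata["ml_attr"]["attrs"]

    # Each attribute group ("binary", "numeric", ...) lists vector slots by index.
    idx_to_name = {a["idx"]: a["name"] for group in attrs.values() for a in group}

    for idx, weight in enumerate(lr_model.coefficients.toArray()):
        print(idx_to_name.get(idx, "slot_%d" % idx), weight)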

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Submitted by 我只是一个虾纸丫 on 2019-11-28 20:33:53
This question already has an answer here: How to access element of a VectorUDT column in a Spark DataFrame? (1 answer). I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1] type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame
[2] cv_predictions_prod.select('probability').show(10, False)
+----------------------------------------+
|probability                             |
+----------------------------------------+
|[0.31559134817066054,0.6844086518293395]|
|[0
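On Spark 3.0 and later there is also a UDF-free route: convert the vector column to a plain array and index it like any other array column. A short sketch, again assuming the cv_predictions_prod DataFrame from the excerpt.

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Spark 3.0+: turn the VectorUDT column into array<double>, then index it.
    probs = cv_predictions_prod.withColumn("prob_arr", vector_to_array(col("probability")))
    probs = probs.withColumn("p1", col("prob_arr")[1])
    probs.select("probability", "p1").show(10, False)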

How to extract model hyper-parameters from spark.ml in PySpark?

Submitted by 前提是你 on 2019-11-28 19:22:22
I'm tinkering with some cross-validation code from the PySpark documentation, trying to get PySpark to tell me which model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr =
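A hedged sketch of how the selected model is usually inspected once the CrossValidator has been fit; the cvModel name is hypothetical, since the excerpt stops before the fit. On recent Spark versions extractParamMap() on bestModel reflects the tuned values; on older 1.x/2.0 releases you may have to reach into the underlying Java object instead.

    # Hypothetical continuation: cvModel = crossval.fit(dataset) with a
    # LogisticRegression estimator.
    best = cvModel.bestModel

    # Every hyper-parameter the winning model carries:
    for param, value in best.extractParamMap().items():
        print(param.name, "=", value)

    # avgMetrics lines up with the ParamGridBuilder combinations, which shows
    # how each grid point scored during cross-validation.
    print(cvModel.avgMetrics)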

How to prepare data into a LibSVM format from DataFrame?

Submitted by 心不动则不痛 on 2019-11-28 17:48:12
Question: I want to produce LibSVM format. I have shaped my dataframe into the desired layout, but I do not know how to convert it to LibSVM. The format is as shown in the figure; the desired LibSVM layout is user item:rating. This is the current code:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map{ case (user,product
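The excerpt is Scala; as a rough PySpark sketch of one way to get "user item:rating" lines, encode each row as label = user with a sparse feature vector {item: rating}, then let the built-in libsvm writer serialize it. The column names, output path, and num_items sizing are assumptions, and the writer emits 1-based feature indices.

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import col, udf

    # Hypothetical df with integer user and item columns and a double rating column.
    num_items = df.agg({"item": "max"}).first()[0] + 1

    to_sparse = udf(lambda item, rating: Vectors.sparse(num_items, [item], [rating]),
                    VectorUDT())

    libsvm_df = df.select(col("user").cast("double").alias("label"),
                          to_sparse("item", "rating").alias("features"))

    # The libsvm data source writes one "label index:value ..." line per row
    # (indices come out 1-based).
    libsvm_df.write.format("libsvm").save("/tmp/ratings_libsvm")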

KMeans clustering in PySpark

Submitted by 此生再无相见时 on 2019-11-28 17:15:06
Question: I have a Spark dataframe 'mydataframe' with many columns. I am trying to run k-means on only two of them, lat and long (latitude and longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns and then attach the cluster assignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat',
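The excerpt imports the older RDD-based pyspark.mllib API; here is a sketch of the same task with the DataFrame-based pyspark.ml API, which keeps the cluster assignment attached to the original rows. It assumes mydataframe with numeric lat and long columns and k=7 as stated; the features and cluster column names are made up.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assemble the two columns into the vector column KMeans expects.
    assembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
    with_features = assembler.transform(mydataframe)

    # k=7 clusters, as in the question; the seed is fixed only for reproducibility.
    kmeans = KMeans(k=7, seed=1, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(with_features)

    # transform() appends the cluster id to every row, so the original columns
    # stay attached to their assignment.
    clustered = model.transform(with_features)
    clustered.select("lat", "long", "cluster").show(5)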

Spark, Scala, DataFrame: create feature vectors

Submitted by 折月煮酒 on 2019-11-28 16:32:31
Question: I have a DataFrame that looks as follows:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros. So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
This is just an illustrative example; in reality I have about 200,000
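The question is tagged Scala, but the pivot-then-assemble idea is the same in either API; here is a rough PySpark sketch over the toy data above, with the ten category names hard-coded to fix the column order.

    from pyspark.sql.functions import first
    from pyspark.ml.feature import VectorAssembler

    # Pivot: one row per userID, one column per category, missing cells -> 0.
    categories = ["cat%d" % i for i in range(1, 11)]
    pivoted = (df.groupBy("userID")
                 .pivot("category", categories)
                 .agg(first("frequency"))
                 .na.fill(0))

    # Assemble the 10 category columns into a single vector column.
    assembler = VectorAssembler(inputCols=categories, outputCol="feature")
    features = assembler.transform(pivoted).select("userID", "feature")
    features.show(truncate=False)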

VectorUDT usage

Submitted by 北城以北 on 2019-11-28 14:25:33
I have to get the datatype, do a case match, and convert it to some required format. But when I use org.apache.spark.ml.linalg.VectorUDT, the compiler reports that VectorUDT is private. I specifically need org.apache.spark.ml.linalg.VectorUDT, not org.apache.spark.mllib.linalg.VectorUDT. Can someone suggest how to go about this? For org.apache.spark.ml.linalg types you should specify the schema using org.apache.spark.ml.linalg.SQLDataTypes, which provides singleton instances of the private UDT types: MatrixType for matrices (org.apache.spark.ml.linalg.Matrix). scala> org.apache.spark.ml
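For contrast, on the Python side the UDT is not private, so the "inspect the datatype, then convert" case match can be written directly against pyspark.ml.linalg.VectorUDT. A small sketch with a hypothetical features column; the array conversion needs Spark 3.0+.

    from pyspark.ml.linalg import VectorUDT
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Hypothetical df with a "features" column: branch on whether it is an ML
    # vector and, if so, convert it to a plain array<double> column.
    if isinstance(df.schema["features"].dataType, VectorUDT):
        df = df.withColumn("features_arr", vector_to_array(col("features")))

    df.printSchema()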

Error when passing data from a Dataframe into an existing ML VectorIndexerModel

Submitted by 一曲冷凌霜 on 2019-11-28 14:15:26
I have a Dataframe which I want to use for prediction with an existing model, and I get an error when calling the model's transform method. This is how I process the training data. forecast.printSchema() prints the schema of my Dataframe:
root
 |-- PM10: double (nullable = false)
 |-- rain_3h: double (nullable = false)
 |-- is_rain: double (nullable = false)
 |-- wind_deg: double (nullable = false)
 |-- wind_speed: double (nullable = false)
 |-- humidity: double (nullable = false)
 |-- is_newYear: double (nullable = false)
 |-- season: double (nullable = false)
 |-- is_rushHour: double (nullable = false)
 |--
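The excerpt stops before the error, but a frequent cause with a saved VectorIndexerModel is handing it raw columns instead of the assembled vector column it was fitted on. A hedged sketch, assuming the model expects a features vector built from the listed columns in this order; forecast comes from the excerpt, while vector_indexer_model is a name used here for illustration.

    from pyspark.ml.feature import VectorAssembler

    # Assumption: the saved VectorIndexerModel was fitted on a "features" vector
    # assembled from these columns, so the forecast DataFrame must be assembled
    # the same way (same columns, same order) before transform().
    feature_cols = ["PM10", "rain_3h", "is_rain", "wind_deg", "wind_speed",
                    "humidity", "is_newYear", "season", "is_rushHour"]

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    forecast_vec = assembler.transform(forecast)

    indexed_forecast = vector_indexer_model.transform(forecast_vec)
    indexed_forecast.printSchema()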

How to change column metadata in pyspark?

Submitted by 五迷三道 on 2019-11-28 13:50:32
How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them back in an automated way. Writing metadata through the PySpark API is not directly available unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, given a complete schema description (as described here), without converting the dataset to an RDD and back?
Example listing:
# Create DF
df.show()
# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +
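One route that avoids both RDD round-trips and schema rebuilding: PySpark 2.2+ lets Column.alias attach a metadata dict. A minimal sketch against the df from the listing; the metadata content itself is made up.

    from pyspark.sql.functions import col
    import json

    # alias() accepts a metadata dict in PySpark 2.2+, so metadata can be
    # replaced in place without recreating the schema.
    new_meta = {"comment": "nominal levels decoded elsewhere"}   # hypothetical content

    df2 = df.withColumn("features", col("features").alias("features", metadata=new_meta))

    # Read the metadata back to confirm it was attached.
    print(json.dumps(df2.schema["features"].metadata, indent=2))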