apache-spark-ml

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Submitted by 泄露秘密 on 2019-11-29 00:54:01
Question: This question already has answers here: How to access element of a VectorUDT column in a Spark DataFrame? (2 answers). Closed 2 years ago. I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1] type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame
[2] cv_predictions_prod.select('probability').show(10, False)
+------------------------
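A minimal PySpark sketch of the usual workaround: Column.getItem does not work on a VectorUDT column, so the element access is wrapped in a UDF. It assumes the cv_predictions_prod DataFrame and probability column from the excerpt; the p1 output column name is made up for illustration.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # UDF that pulls one element out of an ML DenseVector/SparseVector.
    prob_of_class_1 = udf(lambda v: float(v[1]), DoubleType())

    cv_predictions_prod = cv_predictions_prod.withColumn("p1", prob_of_class_1("probability"))
    cv_predictions_prod.select("probability", "p1").show(10, False)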

How to map variable names to features after pipeline

Submitted by 非 Y 不嫁゛ on 2019-11-28 21:59:06
I have modified the OneHotEncoder example to actually train a LogisticRegression. My question is: how do I map the generated weights back to the categorical variables?
def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0), (1, "b", 1.0), (2, "c", 0.0),
    (3, "d", 1.0), (4, "e", 1.0), (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()
  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
  val indexed = indexer.transform(df)
  indexed.select("id", "categoryIndex").show()
  val encoder = new
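The excerpt is Scala, but the mapping from weights back to categorical levels is usually read from the ML attribute metadata on the features column; here is a rough PySpark sketch of that idea. The pipeline_model and predictions names are hypothetical, and it assumes a StringIndexer/OneHotEncoder/VectorAssembler pipeline ending in a LogisticRegression.

    # Hypothetical names: pipeline_model is the fitted PipelineModel, predictions
    # is its transform() output carrying an assembled "features" column.
    lr_model = pipeline_model.stages[-1]          # the LogisticRegressionModel
    attrs = predictions.schema["features"].metadata["ml_attr"]["attrs"]

    # Each attribute group ("binary", "numeric", ...) lists vector slots by index.
    idx_to_name = {a["idx"]: a["name"] for group in attrs.values() for a in group}

    for idx, weight in enumerate(lr_model.coefficients.toArray()):
        print(idx_to_name.get(idx, "slot_%d" % idx), weight)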

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) [duplicate]

Submitted by 我只是一个虾纸丫 on 2019-11-28 20:33:53
This question already has an answer here: How to access element of a VectorUDT column in a Spark DataFrame? (1 answer). I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1] type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame
[2] cv_predictions_prod.select('probability').show(10, False)
+----------------------------------------+
|probability                             |
+----------------------------------------+
|[0.31559134817066054,0.6844086518293395]|
|[0
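On Spark 3.0 and later there is also a UDF-free route: convert the vector column to a plain array and index it like any other array column. A short sketch, again assuming the cv_predictions_prod DataFrame from the excerpt.

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Spark 3.0+: turn the VectorUDT column into array<double>, then index it.
    probs = cv_predictions_prod.withColumn("prob_arr", vector_to_array(col("probability")))
    probs = probs.withColumn("p1", col("prob_arr")[1])
    probs.select("probability", "p1").show(10, False)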

How to extract model hyper-parameters from spark.ml in PySpark?

Submitted by 前提是你 on 2019-11-28 19:22:22
I'm tinkering with some cross-validation code from the PySpark documentation, trying to get PySpark to tell me which model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr =
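A hedged sketch of how the selected model is usually inspected once the CrossValidator has been fit; the cvModel name is hypothetical, since the excerpt stops before the fit. On recent Spark versions extractParamMap() on bestModel reflects the tuned values; on older 1.x/2.0 releases you may have to reach into the underlying Java object instead.

    # Hypothetical continuation: cvModel = crossval.fit(dataset) with a
    # LogisticRegression estimator.
    best = cvModel.bestModel

    # Every hyper-parameter the winning model carries:
    for param, value in best.extractParamMap().items():
        print(param.name, "=", value)

    # avgMetrics lines up with the ParamGridBuilder combinations, which shows
    # how each grid point scored during cross-validation.
    print(cvModel.avgMetrics)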

How to prepare data into a LibSVM format from DataFrame?

Submitted by 心不动则不痛 on 2019-11-28 17:48:12
Question: I want to produce LibSVM format. I have shaped my dataframe into the desired layout, but I do not know how to convert it to LibSVM. The format is as shown in the figure; the desired LibSVM layout is user item:rating. This is the current code:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map{ case (user,product
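The excerpt is Scala; as a rough PySpark sketch of one way to get "user item:rating" lines, encode each row as label = user with a sparse feature vector {item: rating}, then let the built-in libsvm writer serialize it. The column names, output path, and num_items sizing are assumptions, and the writer emits 1-based feature indices.

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import col, udf

    # Hypothetical df with integer user and item columns and a double rating column.
    num_items = df.agg({"item": "max"}).first()[0] + 1

    to_sparse = udf(lambda item, rating: Vectors.sparse(num_items, [item], [rating]),
                    VectorUDT())

    libsvm_df = df.select(col("user").cast("double").alias("label"),
                          to_sparse("item", "rating").alias("features"))

    # The libsvm data source writes one "label index:value ..." line per row
    # (indices come out 1-based).
    libsvm_df.write.format("libsvm").save("/tmp/ratings_libsvm")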

KMeans clustering in PySpark

Submitted by 此生再无相见时 on 2019-11-28 17:15:06
Question: I have a Spark dataframe 'mydataframe' with many columns. I am trying to run k-means on only two of them, lat and long (latitude and longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns and then attach the cluster assignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat',
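The excerpt imports the older RDD-based pyspark.mllib API; here is a sketch of the same task with the DataFrame-based pyspark.ml API, which keeps the cluster assignment attached to the original rows. It assumes mydataframe with numeric lat and long columns and k=7 as stated; the features and cluster column names are made up.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Assemble the two columns into the vector column KMeans expects.
    assembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
    with_features = assembler.transform(mydataframe)

    # k=7 clusters, as in the question; the seed is fixed only for reproducibility.
    kmeans = KMeans(k=7, seed=1, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(with_features)

    # transform() appends the cluster id to every row, so the original columns
    # stay attached to their assignment.
    clustered = model.transform(with_features)
    clustered.select("lat", "long", "cluster").show(5)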

Spark, Scala, DataFrame: create feature vectors

Submitted by 折月煮酒 on 2019-11-28 16:32:31
Question: I have a DataFrame that looks as follows:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros. So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
This is just an illustrative example; in reality I have about 200,000
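The question is tagged Scala, but the pivot-then-assemble idea is the same in either API; here is a rough PySpark sketch over the toy data above, with the ten category names hard-coded to fix the column order.

    from pyspark.sql.functions import first
    from pyspark.ml.feature import VectorAssembler

    # Pivot: one row per userID, one column per category, missing cells -> 0.
    categories = ["cat%d" % i for i in range(1, 11)]
    pivoted = (df.groupBy("userID")
                 .pivot("category", categories)
                 .agg(first("frequency"))
                 .na.fill(0))

    # Assemble the 10 category columns into a single vector column.
    assembler = VectorAssembler(inputCols=categories, outputCol="feature")
    features = assembler.transform(pivoted).select("userID", "feature")
    features.show(truncate=False)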

VectorUDT usage

Submitted by 北城以北 on 2019-11-28 14:25:33
I have to get the datatype, do a case match, and convert it to some required format. But when I use org.apache.spark.ml.linalg.VectorUDT, the compiler reports that VectorUDT is private. I specifically need org.apache.spark.ml.linalg.VectorUDT, not org.apache.spark.mllib.linalg.VectorUDT. Can someone suggest how to go about this? For org.apache.spark.ml.linalg types you should specify the schema using org.apache.spark.ml.linalg.SQLDataTypes, which provides singleton instances of the private UDT types: MatrixType for matrices (org.apache.spark.ml.linalg.Matrix). scala> org.apache.spark.ml
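For contrast, on the Python side the UDT is not private, so the "inspect the datatype, then convert" case match can be written directly against pyspark.ml.linalg.VectorUDT. A small sketch with a hypothetical features column; the array conversion needs Spark 3.0+.

    from pyspark.ml.linalg import VectorUDT
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Hypothetical df with a "features" column: branch on whether it is an ML
    # vector and, if so, convert it to a plain array<double> column.
    if isinstance(df.schema["features"].dataType, VectorUDT):
        df = df.withColumn("features_arr", vector_to_array(col("features")))

    df.printSchema()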

Error when passing data from a Dataframe into an existing ML VectorIndexerModel

Submitted by 一曲冷凌霜 on 2019-11-28 14:15:26
I have a Dataframe which I want to use for prediction with an existing model, and I get an error when calling the model's transform method. This is how I process the training data. forecast.printSchema() prints the schema of my Dataframe:
root
 |-- PM10: double (nullable = false)
 |-- rain_3h: double (nullable = false)
 |-- is_rain: double (nullable = false)
 |-- wind_deg: double (nullable = false)
 |-- wind_speed: double (nullable = false)
 |-- humidity: double (nullable = false)
 |-- is_newYear: double (nullable = false)
 |-- season: double (nullable = false)
 |-- is_rushHour: double (nullable = false)
 |--
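The excerpt stops before the error, but a frequent cause with a saved VectorIndexerModel is handing it raw columns instead of the assembled vector column it was fitted on. A hedged sketch, assuming the model expects a features vector built from the listed columns in this order; forecast comes from the excerpt, while vector_indexer_model is a name used here for illustration.

    from pyspark.ml.feature import VectorAssembler

    # Assumption: the saved VectorIndexerModel was fitted on a "features" vector
    # assembled from these columns, so the forecast DataFrame must be assembled
    # the same way (same columns, same order) before transform().
    feature_cols = ["PM10", "rain_3h", "is_rain", "wind_deg", "wind_speed",
                    "humidity", "is_newYear", "season", "is_rushHour"]

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    forecast_vec = assembler.transform(forecast)

    indexed_forecast = vector_indexer_model.transform(forecast_vec)
    indexed_forecast.printSchema()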

How to change column metadata in pyspark?

Submitted by 五迷三道 on 2019-11-28 13:50:32
How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them back in an automated way. Writing metadata through the PySpark API is not directly available unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, given a complete schema description (as described here), without converting the dataset to an RDD and back?
Example listing:
# Create DF
df.show()
# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +
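One route that avoids both RDD round-trips and schema rebuilding: PySpark 2.2+ lets Column.alias attach a metadata dict. A minimal sketch against the df from the listing; the metadata content itself is made up.

    from pyspark.sql.functions import col
    import json

    # alias() accepts a metadata dict in PySpark 2.2+, so metadata can be
    # replaced in place without recreating the schema.
    new_meta = {"comment": "nominal levels decoded elsewhere"}   # hypothetical content

    df2 = df.withColumn("features", col("features").alias("features", metadata=new_meta))

    # Read the metadata back to confirm it was attached.
    print(json.dumps(df2.schema["features"].metadata, indent=2))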