apache-spark-ml

StandardScaler in Spark not working as expected

梦想的初衷 submitted on 2019-12-18 06:56:14
Question: Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler: the StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit standard deviation, while the flag withMean (false by default) will center the data prior to scaling it.

    >>> tmpdf.show(4)
    +----+----+----+------------+
    |int1|int2|int3|temp_feature|
    +----+----+----+------------+
    |   1|   2|   3|       [2.0]|
    |   7|   8|   9|       [8.0]|
    |   4
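The two behaviours that usually surprise people here are that withMean defaults to false (values are only divided by the standard deviation, never centered) and that Spark uses the corrected sample standard deviation (N - 1 denominator) rather than the population one. A minimal pyspark sketch, assuming a hypothetical tmpdf rebuilt from the truncated output above (the third row is a guess):

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    # Hypothetical reconstruction of tmpdf; only the first two rows
    # appear in the truncated question.
    tmpdf = spark.createDataFrame(
        [(1, 2, 3, Vectors.dense([2.0])),
         (7, 8, 9, Vectors.dense([8.0])),
         (4, 5, 6, Vectors.dense([5.0]))],
        ["int1", "int2", "int3", "temp_feature"])

    scaler = StandardScaler(inputCol="temp_feature", outputCol="scaled",
                            withStd=True, withMean=False)
    scaler.fit(tmpdf).transform(tmpdf).show(truncate=False)
    # With withMean=False each value is divided by the sample standard
    # deviation but not centered, so the output mean is not zero.
    # Set withMean=True to get true z-scores.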

SparkException: Values to assemble cannot be null

拜拜、爱过 submitted on 2019-12-18 03:34:06
Question: I want to use StandardScaler to normalize the features. Here is my code:

    val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
    val vectorAssembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features").transform(trainingData)
    val stdscaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false).fit(vectorAssembler)

but it threw an exception when I tried to use StandardScaler: [Stage 151:==> (9 + 2) /
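The "Values to assemble cannot be null" error comes from VectorAssembler, not StandardScaler: one of the input columns contains nulls. A minimal pyspark sketch of the usual fixes, assuming a hypothetical dataset and list of input column names:

    from pyspark.ml.feature import VectorAssembler

    input_cols = ["f1", "f2"]  # hypothetical column names

    # Option 1: drop (or impute) rows with nulls before assembling.
    clean = dataset.na.drop(subset=input_cols)
    features = VectorAssembler(inputCols=input_cols,
                               outputCol="features").transform(clean)

    # Option 2 (Spark 2.4+): let the assembler skip invalid rows itself.
    features = VectorAssembler(inputCols=input_cols, outputCol="features",
                               handleInvalid="skip").transform(dataset)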

Error when passing data from a Dataframe into an existing ML VectorIndexerModel

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-17 21:08:03
Question: I have a DataFrame which I want to use for prediction with an existing model. I get an error when using the transform method of my model. This is how I process the training data. forecast.printSchema() prints the schema of my DataFrame:

    root
     |-- PM10: double (nullable = false)
     |-- rain_3h: double (nullable = false)
     |-- is_rain: double (nullable = false)
     |-- wind_deg: double (nullable = false)
     |-- wind_speed: double (nullable = false)
     |-- humidity: double (nullable = false)
     |-- is_newYear: double
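A fitted VectorIndexerModel typically fails at transform time when the new rows contain category values it never saw while fitting, or when the features column was assembled with a different column order. A pyspark sketch of a setup that avoids both, assuming hypothetical train and forecast DataFrames with the columns listed above:

    from pyspark.ml.feature import VectorAssembler, VectorIndexer

    cols = ["rain_3h", "is_rain", "wind_deg", "wind_speed", "humidity"]
    assembler = VectorAssembler(inputCols=cols, outputCol="features")

    indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                            maxCategories=4,
                            handleInvalid="keep")  # Spark 2.3+: tolerate unseen categories
    model = indexer.fit(assembler.transform(train))

    # Reuse the *same* assembler so the vector layout of forecast
    # matches what the model was fitted on.
    predictions = model.transform(assembler.transform(forecast))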

How to convert RDD of dense vector into DataFrame in pyspark?

江枫思渺然 submitted on 2019-12-17 18:56:36
Question: I have a DenseVector RDD like this:

    >>> frequencyDenseVectors.collect()
    [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
     DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]

I want to convert this into a DataFrame. I tried like this: >>> spark.createDataFrame
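createDataFrame cannot infer a schema from bare vectors; each one has to be wrapped in a one-element tuple (or Row) so it becomes a single-column row. A minimal sketch of that fix:

    # Wrap each vector in a tuple so schema inference sees one vector column.
    df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
    df.printSchema()
    # root
    #  |-- features: vector (nullable = true)

    # For the DataFrame-based ml API in Spark 2.x, convert the mllib
    # vectors first:
    # frequencyDenseVectors.map(lambda v: (v.asML(),)).toDF(["features"])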

How to cross validate RandomForest model?

倖福魔咒の submitted on 2019-12-17 18:34:50
Question: I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same, or do I have to perform cross-validation manually?

Answer 1: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml
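A pyspark sketch of the same pattern, assuming a hypothetical preprocessed DataFrame train with features and label columns:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")

    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    cv = CrossValidator(estimator=Pipeline(stages=[rf]),
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                        numFolds=3)

    cvModel = cv.fit(train)  # cvModel.avgMetrics holds one score per grid point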

Create feature vector programmatically in Spark ML / pyspark

梦想的初衷 submitted on 2019-12-17 18:22:39
Question: I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns, i.e. as in the Iris dataset:

    (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

I'd like to use KMeans without recreating the dataset with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code. The solution I'd like to improve:

    from pyspark.mllib.linalg import Vectors
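VectorAssembler is the standard way to build the feature column, and the input columns can be picked from the schema instead of being hardcoded. A minimal sketch, assuming a hypothetical DataFrame df shaped like the Iris rows above:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # Select the numeric feature columns programmatically from the schema.
    feature_cols = [c for c, t in df.dtypes
                    if t == "double" and c != "binomial_label"]

    assembled = VectorAssembler(inputCols=feature_cols,
                                outputCol="features").transform(df)

    model = KMeans(k=3, featuresCol="features").fit(assembled)
    clustered = model.transform(assembled)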

Spark Multiclass Classification Example

前提是你 submitted on 2019-12-17 18:04:56
Question: Do you know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I only know that it is possible since the latest version, according to the documentation.

Answer 1: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF
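A pyspark sketch of both options the answer distinguishes, assuming a hypothetical DataFrame train with features and a multiclass label column:

    from pyspark.ml.classification import (LogisticRegression, OneVsRest,
                                           RandomForestClassifier)

    # Option 1: an Estimator that handles multiclass labels out of the box.
    rf_model = RandomForestClassifier(featuresCol="features",
                                      labelCol="label").fit(train)

    # Option 2: reduce a binary classifier to multiclass with one-vs-rest.
    ovr = OneVsRest(classifier=LogisticRegression(featuresCol="features",
                                                  labelCol="label"))
    ovr_model = ovr.fit(train)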

How do I convert an array (i.e. list) column to Vector

萝らか妹 submitted on 2019-12-17 15:20:58
Question: Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession):

    from pyspark.sql import Row

    source_data = [
        Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
        Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),
    ]
    df = spark.createDataFrame(source_data)

Notice that the temperatures field is a list of floats. I would like to convert these lists of floats to the MLlib type Vector, and I'd like this conversion to be expressed using the
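There is no built-in cast from an array column to a vector, so the conversion is usually expressed as a UDF around Vectors.dense. A minimal sketch continuing the snippet above:

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

    with_vectors = df.withColumn("temperature_vector",
                                 list_to_vector("temperatures"))
    with_vectors.printSchema()
    # root
    #  |-- city: string (nullable = true)
    #  |-- temperatures: array (nullable = true)
    #  |    |-- element: double (containsNull = true)
    #  |-- temperature_vector: vector (nullable = true)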

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

独自空忆成欢 submitted on 2019-12-17 07:18:49
Question: I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out. I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml.recommendation import
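A sketch of the explicit-feedback baseline the question describes, assuming a hypothetical ratings DataFrame with user, item, and rating columns:

    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop")  # Spark 2.2+: drop NaN predictions

    grid = (ParamGridBuilder()
            .addGrid(als.rank, [10, 50])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())

    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=RegressionEvaluator(metricName="rmse",
                                                      labelCol="rating"),
                        numFolds=3)
    best = cv.fit(ratings).bestModel

For implicit data (implicitPrefs=True) the sticking point is exactly the evaluator: ALS then predicts preference confidences rather than ratings, so RMSE against the raw counts is not a meaningful objective and a ranking-style metric is a better fit.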

How to find mean of grouped Vector columns in Spark SQL?

为君一笑 submitted on 2019-12-17 06:51:45
Question: I have created a RelationalGroupedDataset by calling instances.groupBy(instances.col("property_name")):

    val x = instances.groupBy(instances.col("property_name"))

How do I compose a user-defined aggregate function to perform Statistics.colStats().mean on each group? Thanks!

Answer 1: Spark >= 2.4. You can use Summarizer:

    import org.apache.spark.ml.stat.Summarizer

    val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
      .map { case (group, v) => (group, v.asML) }
      .toDF("group", "features")
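A pyspark sketch of the same Summarizer approach (Spark 2.4+), assuming a hypothetical DataFrame with a group column and an ml-vector features column:

    from pyspark.ml.stat import Summarizer

    means = df.groupBy("group").agg(
        Summarizer.mean(df.features).alias("mean_features"))
    means.show(truncate=False)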