apache-spark-ml

StandardScaler in Spark not working as expected

梦想的初衷 submitted on 2019-12-18 06:56:14
Question: Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler: the StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit standard deviation, while the flag withMean (false by default) will center the data prior to scaling it.

    >>> tmpdf.show(4)
    +----+----+----+------------+
    |int1|int2|int3|temp_feature|
    +----+----+----+------------+
    |   1|   2|   3|       [2.0]|
    |   7|   8|   9|       [8.0]|
    |   4
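The two behaviours that usually surprise people here are that withMean defaults to false (values are only divided by the standard deviation, never centered) and that Spark uses the corrected sample standard deviation (N - 1 denominator) rather than the population one. A minimal pyspark sketch, assuming a hypothetical tmpdf rebuilt from the truncated output above (the third row is a guess):

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    # Hypothetical reconstruction of tmpdf; only the first two rows
    # appear in the truncated question.
    tmpdf = spark.createDataFrame(
        [(1, 2, 3, Vectors.dense([2.0])),
         (7, 8, 9, Vectors.dense([8.0])),
         (4, 5, 6, Vectors.dense([5.0]))],
        ["int1", "int2", "int3", "temp_feature"])

    scaler = StandardScaler(inputCol="temp_feature", outputCol="scaled",
                            withStd=True, withMean=False)
    scaler.fit(tmpdf).transform(tmpdf).show(truncate=False)
    # With withMean=False each value is divided by the sample standard
    # deviation but not centered, so the output mean is not zero.
    # Set withMean=True to get true z-scores.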

SparkException: Values to assemble cannot be null

拜拜、爱过 submitted on 2019-12-18 03:34:06
Question: I want to use StandardScaler to normalize the features. Here is my code:

    val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
    val vectorAssembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features").transform(trainingData)
    val stdscaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false).fit(vectorAssembler)

but it threw an exception when I tried to use StandardScaler: [Stage 151:==> (9 + 2) /
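The "Values to assemble cannot be null" error comes from VectorAssembler, not StandardScaler: one of the input columns contains nulls. A minimal pyspark sketch of the usual fixes, assuming a hypothetical dataset and list of input column names:

    from pyspark.ml.feature import VectorAssembler

    input_cols = ["f1", "f2"]  # hypothetical column names

    # Option 1: drop (or impute) rows with nulls before assembling.
    clean = dataset.na.drop(subset=input_cols)
    features = VectorAssembler(inputCols=input_cols,
                               outputCol="features").transform(clean)

    # Option 2 (Spark 2.4+): let the assembler skip invalid rows itself.
    features = VectorAssembler(inputCols=input_cols, outputCol="features",
                               handleInvalid="skip").transform(dataset)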

Error when passing data from a Dataframe into an existing ML VectorIndexerModel

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-17 21:08:03
Question: I have a DataFrame which I want to use for prediction with an existing model. I get an error when using the transform method of my model. This is how I process the training data. forecast.printSchema() prints the schema of my DataFrame:

    root
     |-- PM10: double (nullable = false)
     |-- rain_3h: double (nullable = false)
     |-- is_rain: double (nullable = false)
     |-- wind_deg: double (nullable = false)
     |-- wind_speed: double (nullable = false)
     |-- humidity: double (nullable = false)
     |-- is_newYear: double
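A fitted VectorIndexerModel typically fails at transform time when the new rows contain category values it never saw while fitting, or when the features column was assembled with a different column order. A pyspark sketch of a setup that avoids both, assuming hypothetical train and forecast DataFrames with the columns listed above:

    from pyspark.ml.feature import VectorAssembler, VectorIndexer

    cols = ["rain_3h", "is_rain", "wind_deg", "wind_speed", "humidity"]
    assembler = VectorAssembler(inputCols=cols, outputCol="features")

    indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                            maxCategories=4,
                            handleInvalid="keep")  # Spark 2.3+: tolerate unseen categories
    model = indexer.fit(assembler.transform(train))

    # Reuse the *same* assembler so the vector layout of forecast
    # matches what the model was fitted on.
    predictions = model.transform(assembler.transform(forecast))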

How to convert RDD of dense vector into DataFrame in pyspark?

江枫思渺然 submitted on 2019-12-17 18:56:36
Question: I have a DenseVector RDD like this:

    >>> frequencyDenseVectors.collect()
    [DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
     DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
     DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]

I want to convert this into a DataFrame. I tried like this: >>> spark.createDataFrame
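createDataFrame cannot infer a schema from bare vectors; each one has to be wrapped in a one-element tuple (or Row) so it becomes a single-column row. A minimal sketch of that fix:

    # Wrap each vector in a tuple so schema inference sees one vector column.
    df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
    df.printSchema()
    # root
    #  |-- features: vector (nullable = true)

    # For the DataFrame-based ml API in Spark 2.x, convert the mllib
    # vectors first:
    # frequencyDenseVectors.map(lambda v: (v.asML(),)).toDF(["features"])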

How to cross validate RandomForest model?

倖福魔咒の submitted on 2019-12-17 18:34:50
Question: I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same, or do I have to perform cross-validation manually?

Answer 1: ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml
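A pyspark sketch of the same pattern, assuming a hypothetical preprocessed DataFrame train with features and label columns:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label")

    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    cv = CrossValidator(estimator=Pipeline(stages=[rf]),
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                        numFolds=3)

    cvModel = cv.fit(train)  # cvModel.avgMetrics holds one score per grid point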

Create feature vector programmatically in Spark ML / pyspark

梦想的初衷 submitted on 2019-12-17 18:22:39
Question: I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns, i.e. as in the Iris dataset:

    (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

I'd like to use KMeans without recreating the dataset with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code. The solution I'd like to improve:

    from pyspark.mllib.linalg import Vectors
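VectorAssembler is the standard way to build the feature column, and the input columns can be picked from the schema instead of being hardcoded. A minimal sketch, assuming a hypothetical DataFrame df shaped like the Iris rows above:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # Select the numeric feature columns programmatically from the schema.
    feature_cols = [c for c, t in df.dtypes
                    if t == "double" and c != "binomial_label"]

    assembled = VectorAssembler(inputCols=feature_cols,
                                outputCol="features").transform(df)

    model = KMeans(k=3, featuresCol="features").fit(assembled)
    clustered = model.transform(assembled)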

Spark Multiclass Classification Example

前提是你 submitted on 2019-12-17 18:04:56
Question: Do you know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I only know that it is possible since the latest version, according to the documentation.

Answer 1: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib section below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly:

    val trainRawDf = trainRaw.toDF
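A pyspark sketch of both options the answer distinguishes, assuming a hypothetical DataFrame train with features and a multiclass label column:

    from pyspark.ml.classification import (LogisticRegression, OneVsRest,
                                           RandomForestClassifier)

    # Option 1: an Estimator that handles multiclass labels out of the box.
    rf_model = RandomForestClassifier(featuresCol="features",
                                      labelCol="label").fit(train)

    # Option 2: reduce a binary classifier to multiclass with one-vs-rest.
    ovr = OneVsRest(classifier=LogisticRegression(featuresCol="features",
                                                  labelCol="label"))
    ovr_model = ovr.fit(train)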

How do I convert an array (i.e. list) column to Vector

萝らか妹 submitted on 2019-12-17 15:20:58
Question: Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession):

    from pyspark.sql import Row

    source_data = [
        Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
        Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),
    ]
    df = spark.createDataFrame(source_data)

Notice that the temperatures field is a list of floats. I would like to convert these lists of floats to the MLlib type Vector, and I'd like this conversion to be expressed using the
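There is no built-in cast from an array column to a vector, so the conversion is usually expressed as a UDF around Vectors.dense. A minimal sketch continuing the snippet above:

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

    with_vectors = df.withColumn("temperature_vector",
                                 list_to_vector("temperatures"))
    with_vectors.printSchema()
    # root
    #  |-- city: string (nullable = true)
    #  |-- temperatures: array (nullable = true)
    #  |    |-- element: double (containsNull = true)
    #  |-- temperature_vector: vector (nullable = true)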

Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

独自空忆成欢 submitted on 2019-12-17 07:18:49
Question: I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out. I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml.recommendation import
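A sketch of the explicit-feedback baseline the question describes, assuming a hypothetical ratings DataFrame with user, item, and rating columns:

    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              coldStartStrategy="drop")  # Spark 2.2+: drop NaN predictions

    grid = (ParamGridBuilder()
            .addGrid(als.rank, [10, 50])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())

    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=RegressionEvaluator(metricName="rmse",
                                                      labelCol="rating"),
                        numFolds=3)
    best = cv.fit(ratings).bestModel

For implicit data (implicitPrefs=True) the sticking point is exactly the evaluator: ALS then predicts preference confidences rather than ratings, so RMSE against the raw counts is not a meaningful objective and a ranking-style metric is a better fit.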

How to find mean of grouped Vector columns in Spark SQL?

为君一笑 submitted on 2019-12-17 06:51:45
Question: I have created a RelationalGroupedDataset by calling instances.groupBy(instances.col("property_name")):

    val x = instances.groupBy(instances.col("property_name"))

How do I compose a user-defined aggregate function to perform Statistics.colStats().mean on each group? Thanks!

Answer 1: Spark >= 2.4. You can use Summarizer:

    import org.apache.spark.ml.stat.Summarizer

    val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
      .map { case (group, v) => (group, v.asML) }
      .toDF("group", "features")
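A pyspark sketch of the same Summarizer approach (Spark 2.4+), assuming a hypothetical DataFrame with a group column and an ml-vector features column:

    from pyspark.ml.stat import Summarizer

    means = df.groupBy("group").agg(
        Summarizer.mean(df.features).alias("mean_features"))
    means.show(truncate=False)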