apache-spark-mllib | 易学教程

Spark Get only columns that have one or more null values

Question: From a DataFrame, I want to get the names of the columns that contain at least one null value. Consider the DataFrame below:

    val dataset = sparkSession.createDataFrame(Seq(
      (7, null, 18, 1.0),
      (8, "CA", null, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

    id  country  hour  clicked
    7   null     18    1
    8   "CA"     null  0
    9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

    val cols = dataset
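The answer's code is cut off in this excerpt, so the author's approach is not shown here. As a minimal sketch of the underlying logic only, the plain-Python stand-in below (no Spark involved; the function name `columns_with_nulls` and the row-tuple representation are assumptions made for illustration) scans the rows once and collects every column name whose value is None at least once.

```python
from typing import Any, List, Sequence, Tuple


def columns_with_nulls(
    columns: Sequence[str], rows: Sequence[Tuple[Any, ...]]
) -> List[str]:
    """Return the names of columns holding at least one None, in column order."""
    has_null = [False] * len(columns)
    for row in rows:
        for i, value in enumerate(row):
            if value is None:
                has_null[i] = True
    return [name for name, flag in zip(columns, has_null) if flag]


# Same data as in the question, with Scala's null written as None.
columns = ["id", "country", "hour", "clicked"]
rows = [
    (7, None, 18, 1.0),
    (8, "CA", None, 0.0),
    (9, "NZ", 15, 0.0),
]

print(columns_with_nulls(columns, rows))  # ['country', 'hour']
```

On a real DataFrame the same idea would typically be expressed as a single aggregation that counts nulls per column, so the data is scanned once rather than once per column.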

Spark.ml regressions do not calculate same models as scikit-learn

Question: I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (the data is the same, the model type is the same, the regularization is the same...). No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit-learn or spark.ml to find the same model as its counterpart? I give the sklearn code and the spark.ml code below. Both should be ready to cut
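The excerpt ends before any answer, so the actual cause is not given here. One candidate worth checking, stated as an assumption rather than a confirmed diagnosis, is that the two libraries parameterize L2 regularization differently: scikit-learn's objective is C * sum(loss) + 0.5 * w^2, while spark.ml's is mean(loss) + regParam * 0.5 * w^2 (with elasticNetParam = 0). Under that assumption, choosing C = 1 / (n * regParam) makes the two objectives proportional, so they share a minimizer. The plain-Python sketch below checks this numerically for a one-feature model without an intercept; all function names and the sample data are hypothetical.

```python
import math
from typing import Sequence


def log_loss(w: float, x: float, y: int) -> float:
    """Logistic loss for one sample with label y in {0, 1}, no intercept."""
    z = w * x
    return math.log1p(math.exp(z)) - y * z


def sklearn_style_objective(
    w: float, xs: Sequence[float], ys: Sequence[int], C: float
) -> float:
    # Assumed form: C * (sum of losses) + 0.5 * w^2
    return C * sum(log_loss(w, x, y) for x, y in zip(xs, ys)) + 0.5 * w * w


def spark_style_objective(
    w: float, xs: Sequence[float], ys: Sequence[int], reg_param: float
) -> float:
    # Assumed form: (mean of losses) + reg_param * 0.5 * w^2
    n = len(xs)
    return sum(log_loss(w, x, y) for x, y in zip(xs, ys)) / n + reg_param * 0.5 * w * w


def equivalent_C(n: int, reg_param: float) -> float:
    """C that makes the sklearn-style objective proportional to the Spark-style one."""
    return 1.0 / (n * reg_param)


xs = [0.5, -1.2, 2.0, 0.3]
ys = [1, 0, 1, 0]
reg_param = 0.1
C = equivalent_C(len(xs), reg_param)

# The two objectives should differ only by the constant factor 1 / reg_param.
checks = []
for w in (-1.0, 0.0, 0.7):
    a = sklearn_style_objective(w, xs, ys, C)
    b = spark_style_objective(w, xs, ys, reg_param) / reg_param
    checks.append(math.isclose(a, b))
print(checks)
```

spark.ml's LogisticRegression also has a `standardization` parameter that is on by default, which may be another setting to compare between the two setups.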

Why does ALS.trainImplicit give better predictions for explicit ratings?

Question: Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but it is still confusing. I am using PredictionIO 0.9.6 and the Recommendation template for collaborative filtering. The ratings in my data set are numbers between 1 and 10. When I first trained a model with the template's defaults (using ALS.train), the predictions were horrible, at least subjectively. Scores ranged up to 60.0 or so but the

pyspark - Convert sparse vector obtained after one hot encoding into columns

Question: I am using Apache Spark ML to handle categorical features with one-hot encoding. After writing the code below, I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I am unable to figure out how to convert it into columns so that I get a new, transformed DataFrame. Take this dataset for example:

    >>> fd = spark.createDataFrame(
    ...     [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer
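No answer survives in this excerpt. As an illustration of the target transformation only, the plain-Python sketch below expands a sparse vector into one dense value per position, which is the shape needed before each position can become its own column. The vector is represented here as a hypothetical (size, indices, values) triple rather than a pyspark vector object, and both function names are assumptions.

```python
from typing import Dict, List, Sequence


def sparse_to_dense(
    size: int, indices: Sequence[int], values: Sequence[float]
) -> List[float]:
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def to_named_columns(
    prefix: str, size: int, indices: Sequence[int], values: Sequence[float]
) -> Dict[str, float]:
    """Give each vector position its own column name: prefix_0, prefix_1, ..."""
    dense = sparse_to_dense(size, indices, values)
    return {f"{prefix}_{i}": v for i, v in enumerate(dense)}


# A length-2 one-hot vector with position 0 set.
print(to_named_columns("c_idx_vec", 2, [0], [1.0]))
# {'c_idx_vec_0': 1.0, 'c_idx_vec_1': 0.0}
```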

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

    mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features", layers=[inputneurons, (inputneurons*2)+1, 2])

The output layer has two neurons because this is a binary classification problem. Now I would like to get the values of the two neurons for each row in the test set, instead of just the prediction column containing either 0 or 1. I could not find anything for this in the API documentation.

Answer 1:

convert dataframe to libsvm format

Question: I have a DataFrame resulting from a SQL query:

    df1 = sqlContext.sql("select * from table_test")

I need to convert this DataFrame to libsvm format so that it can be provided as input to pyspark.ml.classification.LogisticRegression. I tried the following, but because I'm using Spark 1.5.2 it resulted in an error:

    df1.write.format("libsvm").save("data/foo")

    Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and
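The excerpt stops before any answer. Independently of Spark, the LIBSVM text format itself is simple: each line is a label followed by space-separated `index:value` pairs, with one-based ascending indices and zero-valued features usually omitted. The plain-Python sketch below formats rows that way; the function name `to_libsvm_line` and the sample rows are hypothetical, and it is not a substitute for the Spark data source mentioned in the question.

```python
from typing import List, Sequence, Tuple


def to_libsvm_line(label: float, features: Sequence[float]) -> str:
    """Format one row as 'label index:value ...' with one-based indices."""
    pairs = [
        f"{i}:{float(v)!r}"
        for i, v in enumerate(features, start=1)
        if v != 0  # zero-valued features are left out of the line
    ]
    return " ".join([repr(float(label))] + pairs)


# Hypothetical (label, dense feature list) rows.
rows: List[Tuple[float, List[float]]] = [
    (1.0, [0.0, 2.5, 0.0, 1.0]),
    (0.0, [3.0, 0.0, 0.0, 0.0]),
]

lines = [to_libsvm_line(label, feats) for label, feats in rows]
print("\n".join(lines))
# 1.0 2:2.5 4:1.0
# 0.0 1:3.0
```

Writing these lines to a text file, one per row, gives a file in the layout that LIBSVM-style loaders read.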
