apache-spark-ml

Spark ML KMeans gives: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (vector) => int)

别说谁变了你拦得住时间么 submitted on 2019-12-11 05:03:59
Question: I am trying to load the KMeansModel and then get the label out of it. Here is the code that I have written:

    val kMeansModel = KMeansModel.load(trainedMlModel.mlModelFilePath)
    val arrayOfElements = measurePoint.measurements.map(a => a._2).toSeq
    println(s"ArrayOfELements::::$arrayOfElements")
    val arrayDF = sparkContext.parallelize(arrayOfElements).toDF()
    arrayDF.show()
    val vectorDF = new VectorAssembler().setInputCols(arrayDF.columns).setOutputCol("features").transform(arrayDF)
    vectorDF.printSchema
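This exception is thrown by the model's internal prediction UDF when the features column does not match what the model was trained on (wrong vector type or dimension). Below is a minimal PySpark sketch, not the asker's code, of the load-assemble-transform flow; the model path, input data, and column names are assumptions for illustration.

    # Load a saved spark.ml KMeansModel and score an assembled feature column.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeansModel

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, 2.0), (8.0, 9.0)], ["x", "y"])

    # The model expects a spark.ml Vector column; assembling with VectorAssembler
    # from pyspark.ml (not the legacy mllib package) keeps the vector type consistent.
    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    vector_df = assembler.transform(df)

    model = KMeansModel.load("/tmp/kmeans_model")   # hypothetical path
    predictions = model.transform(vector_df)        # adds a "prediction" column
    predictions.show()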

Sparklyr split string (to string)

柔情痞子 submitted on 2019-12-11 02:40:19
Question: I am trying to split a string in sparklyr and then use it for joins/filtering. I tried the suggested approach of tokenizing the string and then separating it into new columns. Here is a reproducible example (note that I have to translate my NA, which turns into the string "NA" after copy_to, back into an actual NA; is there a way to avoid that?):

    x <- data.frame(Id=c(1,2,3,4), A=c('A-B','A-C','A-D',NA))
    df <- copy_to(sc, x, 'df')
    df %>% mutate(A = ifelse(A=='NA', NA, A)) %>% ft_regex_tokenizer(input.col="A", output.col="B
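For reference, here is a hedged sketch of the same split-then-extract idea expressed in PySpark rather than sparklyr (the question itself is about sparklyr, and the column names here are illustrative): split the string on "-" and pull the pieces into ordinary columns that can be used for joins or filters.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "A-B"), (2, "A-C"), (3, "A-D"), (4, None)], ["Id", "A"]
    )

    # Split the string column and expose each piece as its own column.
    parts = F.split(F.col("A"), "-")
    df2 = df.withColumn("A1", parts.getItem(0)).withColumn("A2", parts.getItem(1))
    df2.show()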

ML model update in spark streaming

戏子无情 submitted on 2019-12-11 00:56:37
Question: I have persisted a machine learning model in HDFS via a Spark batch job, and I am consuming it in my Spark Streaming job. Basically, the ML model is broadcast to all executors from the Spark driver. Can someone suggest how I can update the model in real time without stopping the Spark Streaming job? A new ML model will be created as and when more data points become available, but I have no idea how the NEW model should be sent to the Spark executors. Request to post some sample
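One commonly suggested workaround (not from the question) is to stop relying on a one-time broadcast and instead reload the latest persisted model at the start of each micro-batch, for example with Structured Streaming's foreachBatch. The sketch below assumes hypothetical paths, schema, and a PipelineModel; it is an illustration, not the asker's setup.

    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.getOrCreate()
    MODEL_PATH = "hdfs:///models/latest"   # the batch job overwrites this path

    def score_batch(batch_df, batch_id):
        # Reloading on the driver each batch picks up whatever the batch job last
        # wrote; the cost is one model load per micro-batch.
        model = PipelineModel.load(MODEL_PATH)
        model.transform(batch_df).write.mode("append").parquet("hdfs:///output/scored")

    stream = (spark.readStream
              .schema("f1 DOUBLE, f2 DOUBLE")       # illustrative input schema
              .parquet("hdfs:///input/stream"))

    query = (stream.writeStream
             .foreachBatch(score_batch)
             .option("checkpointLocation", "hdfs:///checkpoints/scoring")
             .start())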

PySpark reversing StringIndexer in nested array

烂漫一生 submitted on 2019-12-10 22:14:03
Question: I'm using PySpark to do collaborative filtering using ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model obliges us to do so). After I've fitted the model, I can get the top 3 recommendations for each user like so:

    recs = (
        model
        .recommendForAllUsers(3)
    )

The recs dataframe looks like this:

    +-----------+--------------------+
    |userIdIndex|     recommendations|
    +-----------+--------------------+
    |       1580|[[10096,3.6725707...
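A hedged sketch of one way to reverse the indexing: explode the recommendations array and apply IndexToString with the labels of the fitted StringIndexerModel. It assumes the ALS item column is named itemIdIndex (as in the output shown) and that the fitted item indexer, called item_indexer here, is still in scope.

    from pyspark.sql import functions as F
    from pyspark.ml.feature import IndexToString

    # One row per (user, recommended item), with the struct fields pulled out.
    exploded = (recs
                .select("userIdIndex", F.explode("recommendations").alias("rec"))
                .select("userIdIndex",
                        F.col("rec.itemIdIndex").alias("itemIdIndex"),
                        F.col("rec.rating").alias("rating")))

    # Map the numeric item index back to the original string ID.
    converter = IndexToString(inputCol="itemIdIndex", outputCol="itemId",
                              labels=item_indexer.labels)
    readable = converter.transform(exploded)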

How to interpret probability column in spark logistic regression prediction?

夙愿已清 submitted on 2019-12-10 20:28:14
Question: I'm getting predictions through spark.ml.classification.LogisticRegressionModel.predict. A number of the rows have the prediction column as 1.0 and the probability column as .04. The model.getThreshold is 0.5, so I'd assume the model is classifying everything over a 0.5 probability threshold as 1.0. How am I supposed to interpret a result with a 1.0 prediction and a probability of 0.04? Answer 1: The probability column from performing a LogisticRegression should contain a list with the same length
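A small sketch illustrating the point the answer is making: for binary logistic regression the probability column is a vector with one entry per class, so probability[0] is P(label=0) and probability[1] is P(label=1). A row predicted 1.0 whose first probability entry reads 0.04 simply has P(label=1) = 0.96. The variable predictions below is assumed to be the model's output DataFrame.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Extract the probability of the positive class from the vector column.
    p1 = F.udf(lambda v: float(v[1]), DoubleType())
    predictions = predictions.withColumn("p_label1", p1("probability"))
    predictions.select("prediction", "probability", "p_label1").show(5, truncate=False)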

Add new fitted stage to an existing PipelineModel without fitting again

江枫思渺然 submitted on 2019-12-10 19:16:28
Question: I would like to concatenate several trained Pipelines into one, which is similar to "Spark add new fitted stage to an existing PipelineModel without fitting again"; however, the solution below is for PySpark:

    pipe_model_new = PipelineModel(stages = [pipe_model, pipe_model2])
    final_df = pipe_model_new.transform(df1)

In Apache Spark 2.0, PipelineModel's constructor is marked as private, hence it cannot be called from outside. In the Pipeline class, only the "fit" method creates a "PipelineModel"
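One commonly suggested workaround, shown here as a hedged PySpark sketch (the question asks about Scala, where the same idea applies), is to wrap the already-fitted models in a new Pipeline: Pipeline.fit only fits Estimator stages, so fitted models, which are Transformers, are bundled into the resulting PipelineModel without being refitted.

    from pyspark.ml import Pipeline

    # pipe_model and pipe_model2 are already-fitted PipelineModels (Transformers),
    # so fit() does not retrain them; it only packages them into one PipelineModel.
    combined = Pipeline(stages=[pipe_model, pipe_model2]).fit(df1)
    final_df = combined.transform(df1)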

Spark ML gradient boosted trees not using all nodes

穿精又带淫゛_ submitted on 2019-12-10 18:09:03
Question: I'm using the Spark ML GBTClassifier in PySpark to train a binary classification model on a dataframe with ~400k rows and ~9k columns on an AWS EMR cluster. I'm comparing this against my current solution, which is running XGBoost on a huge EC2 instance that can fit the whole dataframe in memory. My hope was that I could train (and score new observations) much faster in Spark because it would be distributed/parallel. However, when I watch my cluster (through Ganglia) I see that only 3-4 nodes have active
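A hedged sketch of a common first check in this situation (not from the question): confirm the training DataFrame has enough partitions to spread work across the executors, and cache it before fitting. The partition count, column names, and maxIter value are illustrative.

    from pyspark.ml.classification import GBTClassifier

    print(train_df.rdd.getNumPartitions())        # too few partitions => few busy nodes

    train_df = train_df.repartition(200).cache()  # e.g. 2-3x the total executor cores
    train_df.count()                              # materialize the cache

    gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=50)
    model = gbt.fit(train_df)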

How to split column of vectors into two columns?

≯℡__Kan透↙ submitted on 2019-12-10 15:32:11
Question: I use PySpark. Spark ML's Random Forest output DataFrame has a column "probability", which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector. I've tried the following:

    output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'. Any suggestions on how to transform a column of vectors into columns of its values?
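A hedged sketch of one common approach: withColumn needs a Column, so wrap the element access in a UDF that returns a DoubleType (on Spark 3.0+, the built-in pyspark.ml.functions.vector_to_array is another option). The variable output is the DataFrame from the question.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Build a UDF that extracts the i-th element of a vector as a double.
    get_elem = lambda i: udf(lambda v: float(v[i]), DoubleType())

    output2 = (output
               .withColumn("prob1", get_elem(0)("probability"))
               .withColumn("prob2", get_elem(1)("probability")))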

pyspark: getting the best model's parameters after a gridsearch is blank {}

拈花ヽ惹草 submitted on 2019-12-10 11:17:45
Question: Could someone help me extract the best-performing model's parameters from my grid search? It's a blank dictionary for some reason.

    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    train, test = df.randomSplit([0.66, 0.34], seed=12345)
    paramGrid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [1.0,])
        .addGrid(lr.maxIter, [3,])
        .build())
    evaluator =
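A hedged sketch of one way to pull out the winning hyper-parameters, assuming cv is the CrossValidator built from lr, paramGrid, and the evaluator (the question is truncated before that point). The tuned values live on the best model itself rather than on the CrossValidatorModel, which is one reason a top-level extractParamMap() can look empty.

    cv_model = cv.fit(train)

    best_lr = cv_model.bestModel        # or cv_model.bestModel.stages[-1] for a Pipeline
    # On recent PySpark versions the fitted model exposes the param getters directly;
    # on older 2.x releases the values may need to be read from the underlying Java object.
    print(best_lr.getRegParam())
    print(best_lr.getElasticNetParam())
    print(best_lr.getMaxIter())

    # Alternatively, pair every grid point with its average cross-validation metric.
    for params, metric in zip(cv.getEstimatorParamMaps(), cv_model.avgMetrics):
        print({p.name: v for p, v in params.items()}, metric)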

Handling NULL values in Spark StringIndexer

不想你离开。 submitted on 2019-12-10 10:08:40
Question: I have a dataset with some categorical string columns, and I want to represent them as double type. I used StringIndexer for this conversion and it works, but when I tried it on another dataset that has NULL values it gave a java.lang.NullPointerException and did not work. For better understanding, here is my code:

    for (col <- cols) {
      out_name = col ++ "_"
      var indexer = new StringIndexer().setInputCol(col).setOutputCol(out_name)
      var indexed = indexer.fit(df).transform(df)
      df = (indexed
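A hedged PySpark sketch of two common workarounds (the question itself is in Scala, but the idea carries over; column names are illustrative): either replace NULLs with a placeholder string before indexing, or, on newer Spark releases, tell the indexer to keep invalid values instead of failing.

    from pyspark.ml.feature import StringIndexer

    # Option 1: fill NULLs with an explicit placeholder so the indexer sees a real label.
    df_filled = df.na.fill({"category": "__MISSING__"})
    indexer = StringIndexer(inputCol="category", outputCol="category_idx")
    indexed = indexer.fit(df_filled).transform(df_filled)

    # Option 2 (newer Spark versions, where NULLs are treated as invalid values):
    # keep invalid values, which assigns them an extra index instead of throwing.
    indexer2 = StringIndexer(inputCol="category", outputCol="category_idx",
                             handleInvalid="keep")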