apache-spark-ml

Convert Sparse Vector to Dense Vector in Pyspark

只愿长相守 submitted on 2019-12-05 12:59:32
I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0 like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect()

I am getting an error like this: 16/12
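A minimal PySpark sketch of one common workaround, assuming the error comes from passing a SparseVector straight to Vectors.dense; converting through the vector's array form usually works (the countVectors name is taken from the question excerpt):

from pyspark.ml.linalg import DenseVector

frequencyVectors = countVectors.rdd.map(lambda row: row[1])
# SparseVector.toArray() returns a NumPy array, which DenseVector accepts directly
denseVectors = frequencyVectors.map(lambda v: DenseVector(v.toArray()))
denseVectors.collect()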

How can I train a random forest with a sparse matrix in Spark?

此生再无相见时 submitted on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &
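The excerpt uses sparklyr, but the same question comes up in the other Spark APIs. A hedged PySpark sketch of the general pattern, assuming the text is hashed into a sparse feature column: Spark ML tree ensembles consume a features column of SparseVectors as-is, so no densification is needed (the train_df name and parameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import RandomForestClassifier

# HashingTF emits SparseVector features; RandomForestClassifier accepts them directly
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=4096)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

model = Pipeline(stages=[tokenizer, hashingTF, rf]).fit(train_df)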

Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

◇◆丶佛笑我妖孽 submitted on 2019-12-05 06:14:45
Question: I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than R. I have done some research, and the only post I found related to this issue is this. It seems that Spark's Logistic Regression returns models that minimize a loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R can predict 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and if
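One thing worth checking, sketched below in PySpark, is the regularization setup: Spark's LogisticRegression minimizes a (possibly penalized) loss, so a non-zero regParam gives a regularized fit that is not comparable to R's unregularized glm. Setting regParam to 0.0 makes the optimizer target the plain negative log-likelihood, which approximates the maximum-likelihood fit (the data and column names here are illustrative):

from pyspark.ml.classification import LogisticRegression

# regParam=0.0 disables the elastic-net penalty, so the fit targets the
# unregularized negative log-likelihood, like R's glm(family = binomial)
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        regParam=0.0, maxIter=100)
lr_model = lr.fit(train_df)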

Field “features” does not exist. SparkML

…衆ロ難τιáo~ submitted on 2019-12-05 01:47:11
I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct datatypes for the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(df)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients:
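The error usually means the DataFrame handed to fit has no vector column named "features". A hedged sketch of the typical fix, written in PySpark for brevity (the Scala API mirrors it): read the CSV with a schema, assemble the feature columns into one vector with VectorAssembler, and point the estimator at the label column. The file path is taken from the question; the column handling is an assumption about its data:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Read the CSV as a typed DataFrame instead of an RDD of raw strings
df = spark.read.csv("hdfs:///ford/fordTrain.csv", header=True, inferSchema=True)

label_col = df.columns[0]        # first column is assumed to be the label
feature_cols = df.columns[1:]    # everything else becomes a feature

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

lr = LogisticRegression(labelCol=label_col, featuresCol="features",
                        maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(assembled)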

How to overwrite Spark ML model in PySpark?

大憨熊 submitted on 2019-12-04 23:13:44
Question:

from pyspark.ml.regression import RandomForestRegressor, RandomForestRegressionModel

rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)

When I first tried to save the model, these lines worked. But when I want to save the model into the same path again, it gives this error:

Py4JJavaError: An error occurred while calling o1695.save. : java.io.IOException: Path ./hdfsData
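A short sketch of the usual fix, assuming the goal is simply to replace the existing model directory: every spark.ml model exposes an MLWriter, and its overwrite mode deletes the existing path before saving (model.save(path) is just shorthand for model.write().save(path)):

# Overwrite whatever already exists at the path instead of failing with IOException
rf_model.write().overwrite().save(rf_model_path)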

Dealing with dynamic columns with VectorAssembler

不问归期 submitted on 2019-12-04 21:05:25
Using Spark's VectorAssembler, the columns to be assembled need to be defined up front. However, if the VectorAssembler is used in a pipeline where the previous steps modify the columns of the data frame, how can I specify the columns without hard-coding all the values manually? Since df.columns will not contain the right values when the VectorAssembler's constructor is called, I currently see no way to handle this other than splitting the pipeline, which is bad as well because CrossValidator will no longer work properly.

val vectorAssembler = new VectorAssembler()
  .setInputCols(df.columns
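One workaround, sketched in PySpark under the assumption that the earlier stages have fixed, declared output column names: build the assembler's input list from those stages' getOutputCol() values plus the original feature columns, so nothing depends on df.columns at construction time (stage and column names here are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

numeric_cols = ["age", "income"]
indexer = StringIndexer(inputCol="category", outputCol="category_idx")

# The assembler's inputs come from declared output columns, not from df.columns,
# so the list is correct even though category_idx does not exist in df yet
assembler = VectorAssembler(inputCols=numeric_cols + [indexer.getOutputCol()],
                            outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler])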

PCA in Spark MLlib and Spark ML

笑着哭i submitted on 2019-12-04 18:39:22
Question: Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as someone new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility. My question is very concrete and relates to PCA. The MLlib implementation seems to have a limitation on the number of columns: spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any
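For reference, a minimal PySpark sketch of the DataFrame-based spark.ml PCA, which operates on a vector column rather than a RowMatrix (the input column name and k are illustrative):

from pyspark.ml.feature import PCA

# Projects the "features" vector column onto the top k principal components
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df)
projected = pca_model.transform(df)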

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

二次信任 submitted on 2019-12-04 18:05:24
Suppose I have a Pipeline like this:

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator()
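A hedged PySpark sketch of one way to get precision and recall out of cross-validation: use MulticlassClassificationEvaluator with the metric of interest as the CrossValidator's evaluator, so avgMetrics holds that metric averaged over the folds for each parameter combination. The pipeline, param_grid, train_df and test_df names stand in for the Scala objects above:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="weightedPrecision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)

# One averaged weightedPrecision value per parameter combination in the grid
print(cv_model.avgMetrics)

# Recall (or f1 / accuracy) can be measured the same way on held-out data
recall_eval = MulticlassClassificationEvaluator(labelCol="label",
                                                predictionCol="prediction",
                                                metricName="weightedRecall")
print(recall_eval.evaluate(cv_model.transform(test_df)))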

How to use RandomForest in Spark Pipeline

对着背影说爱祢 submitted on 2019-12-04 13:16:26
Question: I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be new'ed by client code, so it seems it is not possible to use RandomForest in the Pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks.

Answer 1: However, the RandomForest model cannot be new by
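For context, a hedged PySpark sketch: in the DataFrame-based spark.ml API (unlike the old spark.mllib RandomForest object with its static train methods), RandomForestClassifier is an ordinary estimator that can be constructed directly and dropped into a Pipeline, so it also works with ParamGridBuilder and CrossValidator (column names, grid values, and train_df are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[rf])

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
best_model = cv.fit(train_df).bestModel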

Serve real-time predictions with trained Spark ML model [duplicate]

时间秒杀一切 submitted on 2019-12-04 12:06:16
Question: This question already has answers here: How to serve a Spark MLlib model? (4 answers). Closed 2 years ago.

We are currently testing a prediction engine based on Spark's implementation of LDA in Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA
(we are using the pyspark.ml package, not pyspark.mllib)

We were able to successfully train a model on a Spark cluster
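A minimal sketch of the loading-and-scoring half of the problem, assuming the trained pipeline was persisted with spark.ml's writer: the saved PipelineModel can be reloaded and applied to a one-row DataFrame per request. Whether that meets real-time latency requirements is the crux of the original question, so this shows only the scoring pattern, not a serving architecture (paths and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("lda-scoring").getOrCreate()

# Reload the persisted pipeline (e.g. tokenizer + CountVectorizer + LDA)
model = PipelineModel.load("hdfs:///models/lda_pipeline")

# Score a single incoming document as a one-row DataFrame; LDA's transform
# adds a "topicDistribution" vector column by default
request_df = spark.createDataFrame([("some new document text",)], ["text"])
topic_distribution = model.transform(request_df).select("topicDistribution").first()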