apache-spark-ml

Convert Sparse Vector to Dense Vector in Pyspark

只愿长相守 submitted on 2019-12-05 12:59:32
I have a sparse vector like this:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]

I am trying to convert this into a dense vector in PySpark 2.0.0 like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect()

I am getting an error like this: 16/12
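A minimal PySpark sketch of one common workaround, assuming the error comes from passing a SparseVector straight to Vectors.dense; converting through the vector's array form usually works (the countVectors name is taken from the question excerpt):

from pyspark.ml.linalg import DenseVector

frequencyVectors = countVectors.rdd.map(lambda row: row[1])
# SparseVector.toArray() returns a NumPy array, which DenseVector accepts directly
denseVectors = frequencyVectors.map(lambda v: DenseVector(v.toArray()))
denseVectors.collect()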

How can I train a random forest with a sparse matrix in Spark?

此生再无相见时 submitted on 2019-12-05 08:32:39
Consider this simple example that uses sparklyr:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
  text                  book                label
  <chr>                 <chr>               <int>
1 SENSE AND SENSIBILITY Sense & Sensibility     0
2 ""                    Sense & Sensibility     0
3 by Jane Austen        Sense & Sensibility     0
4 ""                    Sense & Sensibility     0
5 (1811)                Sense &
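The excerpt uses sparklyr, but the same question comes up in the other Spark APIs. A hedged PySpark sketch of the general pattern, assuming the text is hashed into a sparse feature column: Spark ML tree ensembles consume a features column of SparseVectors as-is, so no densification is needed (the train_df name and parameters are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import RandomForestClassifier

# HashingTF emits SparseVector features; RandomForestClassifier accepts them directly
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=4096)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

model = Pipeline(stages=[tokenizer, hashingTF, rf]).fit(train_df)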

Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

◇◆丶佛笑我妖孽 submitted on 2019-12-05 06:14:45
Question: I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than R. I have done some research, and the only post I found related to this issue is this. It seems that Spark's Logistic Regression returns models that minimize a loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R can predict 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and if
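One thing worth checking, sketched below in PySpark, is the regularization setup: Spark's LogisticRegression minimizes a (possibly penalized) loss, so a non-zero regParam gives a regularized fit that is not comparable to R's unregularized glm. Setting regParam to 0.0 makes the optimizer target the plain negative log-likelihood, which approximates the maximum-likelihood fit (the data and column names here are illustrative):

from pyspark.ml.classification import LogisticRegression

# regParam=0.0 disables the elastic-net penalty, so the fit targets the
# unregularized negative log-likelihood, like R's glm(family = binomial)
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        regParam=0.0, maxIter=100)
lr_model = lr.fit(train_df)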

Field “features” does not exist. SparkML

…衆ロ難τιáo~ submitted on 2019-12-05 01:47:11
I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct datatypes for the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true
val df = training.toDF

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(df)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients:
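The error usually means the DataFrame handed to fit has no vector column named "features". A hedged sketch of the typical fix, written in PySpark for brevity (the Scala API mirrors it): read the CSV with a schema, assemble the feature columns into one vector with VectorAssembler, and point the estimator at the label column. The file path is taken from the question; the column handling is an assumption about its data:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Read the CSV as a typed DataFrame instead of an RDD of raw strings
df = spark.read.csv("hdfs:///ford/fordTrain.csv", header=True, inferSchema=True)

label_col = df.columns[0]        # first column is assumed to be the label
feature_cols = df.columns[1:]    # everything else becomes a feature

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

lr = LogisticRegression(labelCol=label_col, featuresCol="features",
                        maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(assembled)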

How to overwrite Spark ML model in PySpark?

大憨熊 submitted on 2019-12-04 23:13:44
Question:

from pyspark.ml.regression import RandomForestRegressor, RandomForestRegressionModel

rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)

When I first tried to save the model, these lines worked. But when I want to save the model into the same path again, it gives this error:

Py4JJavaError: An error occurred while calling o1695.save. : java.io.IOException: Path ./hdfsData
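A short sketch of the usual fix, assuming the goal is simply to replace the existing model directory: every spark.ml model exposes an MLWriter, and its overwrite mode deletes the existing path before saving (model.save(path) is just shorthand for model.write().save(path)):

# Overwrite whatever already exists at the path instead of failing with IOException
rf_model.write().overwrite().save(rf_model_path)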

Dealing with dynamic columns with VectorAssembler

不问归期 submitted on 2019-12-04 21:05:25
Using Spark's VectorAssembler, the columns to be assembled need to be defined up front. However, if the VectorAssembler is used in a pipeline where the previous steps modify the columns of the data frame, how can I specify the columns without hard-coding all the values manually? Since df.columns will not contain the right values when the VectorAssembler's constructor is called, I currently see no way to handle this other than splitting the pipeline, which is bad as well because CrossValidator will no longer work properly.

val vectorAssembler = new VectorAssembler()
  .setInputCols(df.columns
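One workaround, sketched in PySpark under the assumption that the earlier stages have fixed, declared output column names: build the assembler's input list from those stages' getOutputCol() values plus the original feature columns, so nothing depends on df.columns at construction time (stage and column names here are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

numeric_cols = ["age", "income"]
indexer = StringIndexer(inputCol="category", outputCol="category_idx")

# The assembler's inputs come from declared output columns, not from df.columns,
# so the list is correct even though category_idx does not exist in df yet
assembler = VectorAssembler(inputCols=numeric_cols + [indexer.getOutputCol()],
                            outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler])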

PCA in Spark MLlib and Spark ML

笑着哭i submitted on 2019-12-04 18:39:22
Question: Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as someone new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility. My question is very concrete and relates to PCA. The MLlib implementation seems to have a limitation on the number of columns: spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any
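For reference, a minimal PySpark sketch of the DataFrame-based spark.ml PCA, which operates on a vector column rather than a RowMatrix (the input column name and k are illustrative):

from pyspark.ml.feature import PCA

# Projects the "features" vector column onto the top k principal components
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df)
projected = pca_model.transform(df)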

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

二次信任 submitted on 2019-12-04 18:05:24
Suppose I have a Pipeline like this:

val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator()
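A hedged PySpark sketch of one way to get precision and recall out of cross-validation: use MulticlassClassificationEvaluator with the metric of interest as the CrossValidator's evaluator, so avgMetrics holds that metric averaged over the folds for each parameter combination. The pipeline, param_grid, train_df and test_df names stand in for the Scala objects above:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="weightedPrecision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)

# One averaged weightedPrecision value per parameter combination in the grid
print(cv_model.avgMetrics)

# Recall (or f1 / accuracy) can be measured the same way on held-out data
recall_eval = MulticlassClassificationEvaluator(labelCol="label",
                                                predictionCol="prediction",
                                                metricName="weightedRecall")
print(recall_eval.evaluate(cv_model.transform(test_df)))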

How to use RandomForest in Spark Pipeline

对着背影说爱祢 submitted on 2019-12-04 13:16:26
Question: I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be new'ed by client code, so it seems it is not possible to use RandomForest in the Pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks.

Answer 1: However, the RandomForest model cannot be new by
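For context, a hedged PySpark sketch: in the DataFrame-based spark.ml API (unlike the old spark.mllib RandomForest object with its static train methods), RandomForestClassifier is an ordinary estimator that can be constructed directly and dropped into a Pipeline, so it also works with ParamGridBuilder and CrossValidator (column names, grid values, and train_df are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[rf])

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
best_model = cv.fit(train_df).bestModel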

Serve real-time predictions with trained Spark ML model [duplicate]

时间秒杀一切 submitted on 2019-12-04 12:06:16
Question: This question already has answers here: How to serve a Spark MLlib model? (4 answers). Closed 2 years ago.

We are currently testing a prediction engine based on Spark's implementation of LDA in Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA
(we are using the pyspark.ml package, not pyspark.mllib)

We were able to successfully train a model on a Spark cluster
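A minimal sketch of the loading-and-scoring half of the problem, assuming the trained pipeline was persisted with spark.ml's writer: the saved PipelineModel can be reloaded and applied to a one-row DataFrame per request. Whether that meets real-time latency requirements is the crux of the original question, so this shows only the scoring pattern, not a serving architecture (paths and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("lda-scoring").getOrCreate()

# Reload the persisted pipeline (e.g. tokenizer + CountVectorizer + LDA)
model = PipelineModel.load("hdfs:///models/lda_pipeline")

# Score a single incoming document as a one-row DataFrame; LDA's transform
# adds a "topicDistribution" vector column by default
request_df = spark.createDataFrame([("some new document text",)], ["text"])
topic_distribution = model.transform(request_df).select("topicDistribution").first()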