apache-spark-mllib

Spark KMeans clustering: get the number of samples assigned to a cluster

◇◆丶佛笑我妖孽 submitted on 2019-12-19 09:09:16
Question: I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it. Therefore I need to know the number of vectors assigned to each cluster after training (i.e. after KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all the training data.
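The model itself does not store cluster sizes, so the usual approach is exactly what the asker suspects: run predict over the training RDD and count the assignments. A minimal sketch (the sample vectors and the local master are placeholders, not part of the original question):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.{SparkConf, SparkContext}

    object ClusterSizes {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cluster-sizes").setMaster("local[*]"))

        // Toy stand-in for the asker's set of vectors.
        val data = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
        )).cache()

        val model = KMeans.train(data, 2, 20)

        // Assign every training vector to its nearest center and count per cluster id.
        val sizes = model.predict(data).countByValue()
        val (largestCluster, count) = sizes.maxBy(_._2)
        println(s"Cluster $largestCluster holds $count vectors; center = ${model.clusterCenters(largestCluster)}")

        sc.stop()
      }
    }

The predict call is a second pass over the data, which is cheap when the training set is cached.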

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

こ雲淡風輕ζ submitted on 2019-12-19 05:47:38
Question: I am relatively new to Spark and Scala. I am starting with the following DataFrame (a single column made out of a dense vector of Doubles):

    scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
    scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]

    scala> scaledDataOnly_pruned.show(5)
    +--------------------+
    |            features|
    +--------------------+
    |[-0.0948337274182...|
    |[-0.0948337274182...|
    |[-0.0948337274182...|
    |[-0.0948337274182...|
    |[-0.0948337274182...|
    +--------------------+
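A minimal sketch of the conversion, assuming a Spark 1.x DataFrame whose features column holds org.apache.spark.mllib.linalg.Vector values (the helper name is made up for illustration):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // Pull the "features" column back out of each Row as the mllib Vector it contains.
    def toVectorRDD(df: DataFrame): RDD[Vector] =
      df.select("features").rdd.map(_.getAs[Vector](0))

    // e.g. val featureRDD = toVectorRDD(scaledDataOnly_pruned)

In Spark 2.x the features column produced by spark.ml transformers is org.apache.spark.ml.linalg.Vector instead, so the import would change accordingly.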

How to initialize cluster centers for K-means in Spark MLlib?

杀马特。学长 韩版系。学妹 submitted on 2019-12-19 03:24:21
Question: Is there a way to initialize the cluster centers when running K-means in Spark MLlib? I tried the following:

    model = KMeans.train(
        sc.parallelize(data), 3, maxIterations=0,
        initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))

initialModel and setInitialModel are not present in spark-mllib_2.10.

Answer 1: An initial model can be set in Scala since Spark 1.5+ using setInitialModel, which takes a KMeansModel:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark ...
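A sketch of what that Scala path looks like end to end, assuming an existing SparkContext sc; the seed centers mirror the ones in the question, while the toy data is made up:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vectors

    // Hand-picked seed centers, one per desired cluster.
    val initialCenters = Array(
      Vectors.dense(-1000.0, -1000.0),
      Vectors.dense(5.0, 5.0),
      Vectors.dense(1000.0, 1000.0)
    )

    val data = sc.parallelize(Seq(
      Vectors.dense(-999.0, -1001.0),
      Vectors.dense(4.5, 5.5),
      Vectors.dense(1002.0, 998.0)
    ))

    val model = new KMeans()
      .setK(3)
      .setMaxIterations(10)
      .setInitialModel(new KMeansModel(initialCenters))
      .run(data)

    model.clusterCenters.foreach(println)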

Why doesn't spark.ml implement any of the spark.mllib algorithms?

冷暖自知 submitted on 2019-12-18 19:01:45
Question: Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. frequent pattern mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib does.
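One common workaround from that era (a sketch, assuming Spark 1.x and a DataFrame with label/features columns; the column names and helper are illustrative, not from the question) is to keep the data in DataFrames and drop down to the RDD-based API only where spark.ml lacks the algorithm:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.DataFrame

    // Convert a (label: Double, features: Vector) DataFrame into the RDD form
    // that the RDD-based spark.mllib NaiveBayes expects, then train on it.
    def trainNaiveBayes(df: DataFrame) = {
      val labeled = df.select("label", "features").rdd.map { row =>
        LabeledPoint(row.getDouble(0), row.getAs[Vector](1))
      }
      NaiveBayes.train(labeled, 1.0)
    }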

Run Spark as a Java web application

烂漫一生 submitted on 2019-12-18 17:08:33
Question: I have used Spark ML and was able to get reasonable prediction accuracy for my business problem. The data is not huge, and I was able to transform the input (basically a CSV file) using Stanford NLP and run Naive Bayes for prediction on my local machine. I want to run this prediction service as a simple Java main program, or along with a simple MVC web application. Currently I run my prediction using the spark-submit command. Instead, can I create a Spark context and DataFrames from my own application?
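In principle yes: Spark can be embedded by constructing the context inside your own program, provided the Spark jars are on the application's classpath. A minimal sketch (in Scala rather than Java, with a hypothetical input path and a local master):

    import org.apache.spark.{SparkConf, SparkContext}

    object PredictionService {
      def main(args: Array[String]): Unit = {
        // Build the context in-process instead of going through spark-submit.
        // local[*] runs everything inside this JVM; point setMaster at a cluster URL to scale out.
        val conf = new SparkConf()
          .setAppName("prediction-service")
          .setMaster("local[*]")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("data/input.csv")   // hypothetical path
        println(s"Loaded ${lines.count()} input rows")

        // ... feature extraction and the trained Naive Bayes model would be applied here ...

        sc.stop()
      }
    }

The same construction works from a servlet or MVC controller; the main caveat is that only one SparkContext can be active per JVM, so a web application would typically create it once at startup and reuse it.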

Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

↘锁芯ラ submitted on 2019-12-18 11:57:59
Question: I am trying to build random forest models by group (School_ID, more than 3,000 of them) on a large model-input CSV file using the Spark Scala API. Each group contains about 3000-4000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances. In R, I can construct models by group and save them to a list like this:

    library(dplyr); library(randomForest)
    Rf_model <- train %>%
      group_by(School_ID) %>%
      do(school = randomForest(formula = Rf_formula, data = ., importance = TRUE))
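A straightforward (if not the fastest) way to mirror that in Spark MLlib is to collect the distinct School_IDs and fit one RandomForest per key. A sketch under the assumption that the data has already been featurized into (School_ID, LabeledPoint) pairs; the hyperparameters are placeholders:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.rdd.RDD

    def trainPerSchool(data: RDD[(String, LabeledPoint)]): Map[String, RandomForestModel] = {
      val schoolIds = data.keys.distinct().collect()

      schoolIds.map { id =>
        // Train one distributed forest on this school's slice of the data.
        val school = data.filter(_._1 == id).values.cache()
        val model = RandomForest.trainClassifier(
          school,
          numClasses = 2,
          categoricalFeaturesInfo = Map[Int, Int](),
          numTrees = 100,
          featureSubsetStrategy = "auto",
          impurity = "gini",
          maxDepth = 5,
          maxBins = 32)
        school.unpersist()
        id -> model
      }.toMap
    }

Launching 3000+ distributed jobs sequentially carries a lot of scheduling overhead; since each group is only a few thousand rows, a common refinement is to group the data by School_ID and fit a single-machine learner inside each group on the executors instead.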

How can I create a TF-IDF for Text Classification using Spark?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-18 11:37:05
Question: I have a CSV file with the following format:

    product_id1,product_title1
    product_id2,product_title2
    product_id3,product_title3
    product_id4,product_title4
    product_id5,product_title5
    [...]

The product_idX is an integer and the product_titleX is a String, for example: 453478692, Apple iPhone 4 8Go. I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes classifier in MLlib. I am using Spark for Scala so far, using the tutorials I have found on the official page.
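With the RDD-based API, the usual recipe is HashingTF followed by IDF. A sketch assuming a Spark shell (so sc exists) and a hypothetical file path; the whitespace tokenization of the titles is deliberately simplistic:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Each line is "product_id,product_title"; split only on the first comma
    // so commas inside a title do not break the parse.
    val lines = sc.textFile("data/products.csv")
    val titleTokens: RDD[Seq[String]] = lines.map { line =>
      val Array(_, title) = line.split(",", 2)
      title.trim.toLowerCase.split("\\s+").toSeq
    }

    // Term frequencies via feature hashing (one pass) ...
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(titleTokens).cache()

    // ... then inverse document frequencies (a second pass) to rescale them.
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

The resulting tfidf vectors can then be zipped with class labels into LabeledPoints for NaiveBayes.train.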

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

六月ゝ 毕业季﹏ submitted on 2019-12-18 11:29:38
Question: I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD, so as to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ML algorithm libraries. I know it can be done, because the HashingTF ML library outputs a vector when given a features column of a DataFrame.

    temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
        StructField("label", DoubleType(), False),
        StructField("tokens", ArrayType(StringType()), ...
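In Scala the schema does not have to be spelled out, because the mllib Vector type carries its own SQL user-defined type. A sketch with made-up sample data, assuming Spark 1.x (in 2.x the org.apache.spark.ml.linalg package plays this role):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext sc
    import sqlContext.implicits._

    // Stand-in for the asker's temp_rdd of (label, SparseVector) pairs.
    val tempRdd: RDD[(String, Vector)] = sc.parallelize(Seq(
      ("a", Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
      ("b", Vectors.sparse(4, Array(1), Array(3.0)))
    ))

    // The VectorUDT annotation lets Spark SQL infer the schema, so toDF is enough.
    val df = tempRdd.toDF("label", "features")
    df.printSchema()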

Checkpointing in ALS (Spark Scala)

蹲街弑〆低调 submitted on 2019-12-18 08:49:35
Question: I just want to ask about the specifics of how to successfully use checkpointInterval in Spark, and what is meant by this comment in the code for ALS (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala): "If the checkpoint directory is not set in [[org.apache.spark.SparkContext]], this setting is ignored." How can we set the checkpoint directory? Can we use any HDFS-compatible directory for this? Is using setCheckpointInterval the correct way?
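A sketch of how the two settings fit together, assuming a Spark shell with sc available; the checkpoint path and ratings are placeholders, and any HDFS-compatible URI reachable by all executors should work:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // The directory must be set on the SparkContext, otherwise the interval below is ignored.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical path

    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)
    ))

    val model = new ALS()
      .setRank(10)
      .setIterations(30)
      .setCheckpointInterval(10)   // truncate the factor RDD lineage every 10 iterations
      .run(ratings)

Checkpointing mainly matters for long runs, where it keeps the RDD lineage from growing with the number of iterations.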

Difference between spark Vectors and scala immutable Vector?

蹲街弑〆低调 submitted on 2019-12-18 07:14:38
Question: I am writing a project for Spark 1.4 in Scala and am currently deciding between converting my initial input data into spark.mllib.linalg.Vector or scala.immutable.Vector, which I later want to work with in my algorithm. Could someone briefly explain the difference between the two, and in what situations one would be more useful than the other? Thank you.

Answer 1: spark.mllib.linalg.Vector is designed for linear algebra applications. MLlib provides two different implementations: DenseVector and SparseVector.
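A short sketch of the practical difference (the example values are made up):

    import org.apache.spark.mllib.linalg.Vectors

    // Spark's Vector is a fixed-length container of Doubles, built for linear algebra
    // and for feeding MLlib algorithms; it comes in dense and sparse flavours.
    val dense  = Vectors.dense(1.0, 0.0, 3.0)
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // Scala's immutable Vector is a general-purpose indexed collection: it can hold
    // any element type and supports map/filter/etc., but carries no numeric semantics.
    val generic = scala.collection.immutable.Vector(1.0, 0.0, 3.0)

    // Converting between the two when needed is a one-liner.
    val asSpark = Vectors.dense(generic.toArray)
    val asScala = dense.toArray.toVector

In short: use the Spark vectors wherever the data is handed to MLlib, and the Scala collection for ordinary program logic.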