apache-spark-mllib

How to prepare training data in MLlib

Submitted by 匆匆过客 on 2019-12-04 07:16:36
TL;DR: How do I use MLlib to train on my wiki data (text & category) for prediction against tweets? I'm having trouble figuring out how to convert my tokenized wiki data so that it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in the wiki data as my labels... I've only seen binary classification (it's one category or another)... is
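Multiclass labels are not a blocker here: both NaiveBayes and LogisticRegression in spark.ml handle more than two classes, as long as each category string is first indexed into a numeric label. A minimal sketch, assuming a `wikiData` DataFrame with `text` and `category` columns (all names hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}

// Map the many category strings to numeric labels 0.0, 1.0, 2.0, ...
val labelIndexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val nb = new NaiveBayes() // multiclass out of the box

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, idf, nb))
val model = pipeline.fit(wikiData)

// Tweets must flow through the *same* fitted transformers, so just transform
val predictions = model.transform(tweetData)
```

The key point is that the tweets are not re-fit: `model.transform` reuses the vocabulary hashing and IDF weights learned from the wiki data, which is what makes the predictions comparable.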

How to overwrite Spark ML model in PySpark?

Submitted by 有些话、适合烂在心里 on 2019-12-04 06:46:41
```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(labelCol="label", featuresCol="features",
                           numTrees=5, maxDepth=10, seed=42)
rf_model = rf.fit(train_df)
rf_model_path = "./hdfsData/" + "rfr_model"
rf_model.save(rf_model_path)
```

When I first tried to save the model, these lines worked. But when I wanted to save the model to the same path again, it gave this error:

```
Py4JJavaError: An error occurred while calling o1695.save.
: java.io.IOException: Path ./hdfsData/rfr_model already exists.
  Please use write.overwrite().save(path) to overwrite it.
```

Then I tried: rf
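The error message itself names the fix: go through the model's MLWriter and enable overwrite mode before saving. In PySpark that is `rf_model.write().overwrite().save(rf_model_path)`. A minimal sketch of the same pattern in Scala, mirroring the question's hyperparameters (`trainDF` is hypothetical):

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(5)
  .setMaxDepth(10)
  .setSeed(42L)
val rfModel = rf.fit(trainDF)

// write returns an MLWriter; overwrite() switches it into overwrite mode,
// so saving to an existing path replaces it instead of throwing IOException
rfModel.write.overwrite().save("./hdfsData/rfr_model")
```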

Spark MLlib: how to convert string categorical features into int for Rating to accept

Submitted by 被刻印的时光 ゝ on 2019-12-04 05:33:39
Question: I want to build a recommendation application using Spark MLlib and the ALS algorithm (collaborative filtering). My data set has the user and product features in string form, like:

```
[{"user":"StringName1", "product":"StringProduct1", "rating":1},
 {"user":"StringName2", "product":"StringProduct2", "rating":2},
 {"user":"StringName1", "product":"StringProduct2", "rating":3}, ...]
```

But the Rating method seems to accept only int values for both the user and product features. Does that mean I
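The usual workaround is to build string-to-int dictionaries once and translate every record before constructing the Ratings. A sketch assuming `raw` is an `RDD[(String, String, Double)]` of (user, product, rating) parsed from the JSON above (all names hypothetical):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assign each distinct user and product string a dense Int id
val userIds: Map[String, Int] =
  raw.map(_._1).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap().toMap
val productIds: Map[String, Int] =
  raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap().toMap

// Translate every record into the Int-based Rating that ALS expects
val ratings = raw.map { case (user, product, rating) =>
  Rating(userIds(user), productIds(product), rating)
}
val model = ALS.train(ratings, 10, 10) // rank = 10, 10 iterations
```

Keep `userIds`/`productIds` around (inverted) so recommendations can be mapped back to the original strings.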

SPARK, ML, Tuning, CrossValidator: access the metrics

Submitted by 大城市里の小女人 on 2019-12-04 03:58:46
In order to build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

```scala
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(10)

val cvModel = cv.fit(trainingSet)
```

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes. Is it possible to access the metrics calculated for the best model? Ideally, I would like to access the metrics of all models to see
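CrossValidatorModel exposes exactly this via `avgMetrics`. A short sketch, continuing from the question's `cvModel` and `paramGrid`:

```scala
// avgMetrics holds one value per ParamMap in the grid: the evaluator's
// metric (f1 by default for MulticlassClassificationEvaluator), averaged
// over the 10 folds, in the same order as paramGrid
val metrics = cvModel.avgMetrics

// Pair each parameter combination with its cross-validated metric
paramGrid.zip(metrics).foreach { case (params, metric) =>
  println(s"$metric <- $params")
}

// The combination CrossValidator picked is the best-scoring one
val (bestParams, bestMetric) = paramGrid.zip(metrics).maxBy(_._2)
```

Note that only the fold-averaged metric per parameter combination is retained; the per-fold metrics of every intermediate model are not kept by default.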

Are random seeds compatible between systems?

Submitted by 五迷三道 on 2019-12-04 03:31:38
Question: I made a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use PySpark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would it get the same results? Basically, do random seeds carry over between different systems?

Answer 1: Well, this is exactly the kind of question that could really do with some experiments & code snippets provided... Anyway, it seems that the general answer is a firm no: not
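The underlying reason is that a seed only pins down the stream of one particular RNG implementation: sklearn draws from NumPy's Mersenne Twister, while Spark uses its own generators, so the same integer seeds two unrelated streams (on top of differences in how each library consumes randomness during training). A tiny Scala sketch of the same effect within a single ecosystem:

```scala
// Two deterministic RNGs, seeded identically, still produce unrelated
// streams because they implement different algorithms. The same applies
// across NumPy's Mersenne Twister and Spark's internal generators.
val a = new java.util.Random(1234L)           // 48-bit linear congruential
val b = new java.util.SplittableRandom(1234L) // SplitMix64
println(a.nextDouble() == b.nextDouble())     // false
```

What a fixed seed does buy you is reproducibility across runs of the *same* library version on the same data partitioning, not portability between libraries.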

How to convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double] which is required by Spark MLlib

Submitted by 时间秒杀一切 on 2019-12-04 02:59:47
I am trying to implement KMeans using Apache Spark.

```scala
val data = sc.textFile(irisDatasetString)
val parsedData = data.map(_.split(',').map(_.toDouble)).cache()
val clusters = KMeans.train(parsedData, 3, numIterations = 20)
```

on which I get the following error:

```
error: overloaded method value train with alternatives:
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int)org.apache.spark.mllib.clustering
```
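Both overloads in the error want an `RDD[Vector]`, while `parsedData` is an `RDD[Array[Double]]`. Wrapping each parsed row in `Vectors.dense` fixes the mismatch; a sketch assuming, as in the snippet, that every field of the file is numeric:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Wrap each Array[Double] in a dense mllib Vector so the types line up
val data = sc.textFile(irisDatasetString)
val parsedData = data
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val clusters = KMeans.train(parsedData, 3, 20) // k = 3, maxIterations = 20
```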

How do I run the Spark decision tree with a categorical feature set using Scala?

Submitted by 喜夏-厌秋 on 2019-12-04 02:57:19
I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (double, vector), where the vector requires doubles.

```scala
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)), Vectors.dense(x.tail)))

// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
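```

The piece the question is missing is that categorical features do *not* need a special container: they still travel as Doubles inside the LabeledPoint vector, encoded as 0.0 … k-1, and `categoricalFeaturesInfo` is what tells the tree which feature indices to treat as unordered categories. A sketch reusing the question's names (the arities in the map are hypothetical):

```scala
import org.apache.spark.mllib.tree.DecisionTree

// feature index -> number of categories for that feature
val categoricalFeaturesInfo = Map(0 -> 3, 2 -> 5)

val model = DecisionTree.trainClassifier(
  LP,                           // the RDD[LabeledPoint] built above
  numClassesForClassification,
  categoricalFeaturesInfo,
  "gini",                       // impurity
  maxDepth,
  32)                           // maxBins: must be >= the largest arity
```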

From DataFrame to RDD[LabeledPoint]

Submitted by 喜你入骨 on 2019-12-03 23:21:03
I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF

val sql = new SQLContext(sc)

// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

// Convert the RDD to a dataframe
val schema = StructType(List(StructField("class
```
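Once the Tokenizer → HashingTF → IDF stages have run, the conversion to `RDD[LabeledPoint]` is a row-by-row extraction. A sketch assuming a `featurized` DataFrame holding a numeric "label" column and the IDF output in "features" (names hypothetical; in Spark 1.x, as in this question's SQLContext-era code, the pipeline vectors are already mllib Vectors):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Pull label and feature vector out of each Row
val labeledPoints = featurized.select("label", "features").rdd.map { row =>
  LabeledPoint(row.getDouble(0), row.getAs[Vector](1))
}
```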

Apache Spark MLlib: how to build labeled points for string features?

Submitted by 耗尽温柔 on 2019-12-03 22:52:12
I am trying to build a NaiveBayes classifier with Spark's MLlib which takes as input a set of documents. I'd like to use some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double,Double]]. Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]. I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLlib? I believe the
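A standard way to turn (String, weight) pairs into the Doubles LabeledPoint needs is feature hashing: each string feature is hashed to an index in a fixed-size vector, so no global dictionary has to be built by hand. A sketch assuming `docs` is an `RDD[(Double, Seq[String])]` of (label, string features) produced by the rest of the code (names hypothetical):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Hash authors, tags, keywords and categories into one shared vector space
val hashingTF = new HashingTF(1 << 18)

val training = docs.map { case (label, terms) =>
  LabeledPoint(label, hashingTF.transform(terms))
}
val model = NaiveBayes.train(training)
```

Hashing loses the ability to map indices back to strings; if that matters, building an explicit string-to-index dictionary is the alternative.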

Non-integer ids in Spark MLlib ALS

Submitted by 房东的猫 on 2019-12-03 18:18:25
Question: I'd like to use

```scala
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toFloat)
})

val model = ALS.train(ratings, rank, numIterations, alpha)
```

However, the user data I get is stored as Long. When cast to Int, it may produce errors. What can I do to solve the problem?

Answer 1: You can use one of the ML implementations which support Long labels. The RDD version is significantly less user friendly compared to the other implementations:

```scala
import org.apache
```
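Separately from the Long-capable implementations the answer goes on to show, a common workaround with the standard mllib ALS (whose Rating takes Int ids) is to re-index the Long ids into a dense Int range first. A sketch reusing `data`, `rank` and `numIterations` from the question:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assign each distinct Long user id a dense Int id; keep the map around
// (inverted) to translate recommendations back to the original ids
val userIndex = data.map(_.split(',')(0).toLong).distinct()
  .zipWithIndex()
  .mapValues(_.toInt)
  .collectAsMap()

val ratings = data.map(_.split(',')).map { case Array(user, item, rate) =>
  Rating(userIndex(user.toLong), item.toInt, rate.toDouble)
}
val model = ALS.train(ratings, rank, numIterations)
```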