apache-spark-mllib

How to prepare training data in MLlib

喜欢而已 submitted on 2019-12-21 13:48:12
Question: TL;DR: How do I use MLlib to train my wiki data (text & category) for prediction against tweets? I have trouble figuring out how to convert my tokenized wiki data so that it can be trained through either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in…
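
For context, a minimal sketch of the usual spark.ml text-classification pipeline in Scala. It assumes the wiki data sits in a DataFrame called wikiDF with "text" and "category" columns and the tweets in tweetsDF with a "text" column; those names and the feature dimension are assumptions, not details from the question:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}

    // map the string category to a numeric label, tokenize, hash to term frequencies, weight by IDF
    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("label")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf        = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
    val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val nb        = new NaiveBayes()   // TF-IDF features are non-negative, so NaiveBayes is applicable

    val pipeline = new Pipeline().setStages(Array(indexer, tokenizer, tf, idf, nb))
    val model    = pipeline.fit(wikiDF)

    // tweets must pass through the *same fitted* stages so the hashing space and IDF weights match
    val predictions = model.transform(tweetsDF)

A common source of "wrong predictions" in this setup is transforming the tweets with re-fitted feature stages instead of the single pipeline model fitted on the wiki data.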

Apache Spark ALS - how to perform live recommendations / fold in an anonymous user

送分小仙女□ submitted on 2019-12-21 13:42:31
Question: I am using Apache Spark (PySpark API for Python) ALS MLlib to develop a service that performs live recommendations for anonymous users (users not in the training set) on my site. In my use case I train the model on the user ratings in this way: from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating ratings = df.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) rank = 10 numIterations = 10 model = ALS.trainImplicit(ratings, rank, numIterations) Now, each time an…
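
MatrixFactorizationModel has no built-in fold-in for unseen users, so a common workaround is to approximate the anonymous user's latent vector from the factors of the items they touched. A sketch of that idea against the Scala API (the question itself uses PySpark, which mirrors it); the item ids are hypothetical and collect() is only feasible for small catalogues:

    // items the anonymous user interacted with during the session (hypothetical ids)
    val ratedIds = Set(10, 42, 7)

    // pull the learned item factors to the driver and average the vectors of the rated items
    val factors = model.productFeatures.collect().toMap
    val rank    = factors.head._2.length
    val userVec = Array.tabulate(rank) { i =>
      ratedIds.toSeq.map(id => factors(id)(i)).sum / ratedIds.size
    }

    // score every unseen item by its dot product with the approximated user vector
    val top10 = factors
      .filter { case (id, _) => !ratedIds.contains(id) }
      .map { case (id, f) => (id, f.zip(userVec).map { case (a, b) => a * b }.sum) }
      .toSeq
      .sortBy(-_._2)
      .take(10)

A least-squares solve against the item factor matrix gives a better fold-in than a plain average, but the averaging version already illustrates the approach.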

Spark MLlib - trainImplicit warning

匆匆过客 submitted on 2019-12-21 03:34:13
Question: I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same. All these warnings come from the ALS iterations, from flatMap and also from aggregate; for instance, the origin of the stage where the flatMap is showing these warnings (w/ Spark 1.3.0, but they are also…
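
For what it's worth, the warning concerns the serialized size of individual tasks rather than correctness. Splitting the ratings over more partitions and asking ALS for more blocks usually keeps each task under the 100 KB hint. A hedged Scala sketch, where ratings is the input RDD[Rating] and the partition/block counts are guesses that would need tuning:

    import org.apache.spark.mllib.recommendation.ALS

    // more input partitions and more ALS blocks -> smaller per-task payloads
    val partitionedRatings = ratings.repartition(200)
    val model = ALS.trainImplicit(
      partitionedRatings,
      10,     // rank
      10,     // iterations
      0.01,   // lambda (regularization)
      200,    // blocks: how finely users/products are split across tasks
      0.01)   // alpha (confidence weight for implicit feedback)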

Spark: regression model threshold and precision

妖精的绣舞 submitted on 2019-12-21 02:35:38
Question: I have a logistic regression model where I explicitly set the threshold to 0.5: model.setThreshold(0.5) I train the model and then I want to get basic stats -- precision, recall, etc. This is what I do when I evaluate the model: val metrics = new BinaryClassificationMetrics(predictionAndLabels) val precision = metrics.precisionByThreshold precision.foreach { case (t, p) => println(s"Threshold is: $t, Precision is: $p") } I get results with only 0.0 and 1.0 as values of the threshold, and 0.5 is…
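
This is the classic symptom of evaluating hard 0/1 predictions: once a threshold is set, predict() emits labels, so BinaryClassificationMetrics only ever sees the thresholds 0.0 and 1.0. A sketch of the usual fix, clearing the threshold so the metrics are computed over raw scores (assuming the test set is an RDD[LabeledPoint]):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.regression.LabeledPoint

    // with the threshold cleared, predict() returns the raw score instead of a hard 0/1 label
    model.clearThreshold()

    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      (model.predict(features), label)
    }

    val metrics = new BinaryClassificationMetrics(predictionAndLabels)
    metrics.precisionByThreshold.foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }

    // restore the 0.5 cut-off afterwards if hard predictions are needed again
    model.setThreshold(0.5)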

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

岁酱吖の submitted on 2019-12-20 20:16:46
Question: I am using Spark ML to run some ML experiments, and on a small dataset of 20MB (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. With scikit-learn, by comparison, it takes much less time. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark. Here it is: df = pd.read_csv(http://archive.ics.uci…
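
Most of the time in a setup like this goes into the grid search: every parameter combination is trained numFolds times, and deep or wide forests multiply that further. A hedged Scala sketch of the usual mitigations; assembledDF (a DataFrame already holding label and features columns), the partition count, grid size, and depth are illustrative, not values from the question:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // ~2 partitions per core plus caching keep every fold from re-reading and re-shuffling the data
    val train = assembledDF.repartition(96).cache()

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxDepth(10)

    // total work is roughly |grid| x numFolds x numTrees, so keep the grid deliberately small
    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .build()

    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new MulticlassClassificationEvaluator())
      .setNumFolds(3)

    val cvModel = cv.fit(train)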

What is rank in the ALS machine learning algorithm in Apache Spark MLlib

北战南征 submitted on 2019-12-20 09:56:48
Question: I wanted to try an example of the ALS machine learning algorithm. My code works fine; however, I do not understand the parameter rank used in the algorithm. I have the following code in Java: // Build the recommendation model using ALS int rank = 10; int numIterations = 10; MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01); I have read somewhere that it is the number of latent factors in the model. Suppose I have a dataset of (user, product, rating) that has 100…
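
rank is the number of latent factors, i.e. the length of the hidden vector learned for every user and every product; a predicted rating is approximately the dot product of the two vectors. A small sketch in Scala that makes the dimension visible (the question uses the Java API, which mirrors it; variable names are illustrative):

    import org.apache.spark.mllib.recommendation.ALS

    val rank = 10
    val numIterations = 10
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // every user and every product is represented by a latent vector of length `rank`
    val (someUser, userVec)    = model.userFeatures.first()
    val (someProduct, prodVec) = model.productFeatures.first()
    println(userVec.length)   // 10
    println(prodVec.length)   // 10

    // the model's estimate for (someUser, someProduct) is essentially this dot product
    val estimate = userVec.zip(prodVec).map { case (u, p) => u * p }.sum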

How to view Random Forest statistics in Spark (Scala)

浪子不回头ぞ submitted on 2019-12-20 06:06:37
Question: I have a RandomForestClassifierModel in Spark. Using .toDebugString() outputs the following: Tree 0 (weight 1.0): If (feature 0 in {1.0,2.0,3.0}) If (feature 3 in {2.0,3.0}) If (feature 8 <= 55.3) . . Else (feature 0 not in {1.0,2.0,3.0}) . . Tree 1 (weight 1.0): . . ...etc. I'd like to view the actual data as it goes through the model, something like: Tree 0 (weight 1.0): If (feature 0 in {1.0,2.0,3.0}) 60% If (feature 3 in {2.0,3.0}) 57% If (feature 8 <= 55.3) 22% . . Else (feature 0 not in {1…
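
The fitted model does not expose per-node sample fractions, so one workaround is to recompute them by filtering the training data with the same split conditions that toDebugString prints. A hedged Scala sketch, assuming the raw feature columns (here called feature0 and feature3) are still available on the training DataFrame before vector assembly; those names are hypothetical:

    import org.apache.spark.sql.functions.col

    val total = trainDF.count().toDouble

    // rows that take the root's "If (feature 0 in {1.0,2.0,3.0})" branch
    val rootIf = trainDF.filter(col("feature0").isin(1.0, 2.0, 3.0))
    println(f"root If branch: ${rootIf.count() / total * 100}%.1f%%")

    // and the fraction that continues down "If (feature 3 in {2.0,3.0})"
    val secondIf = rootIf.filter(col("feature3").isin(2.0, 3.0))
    println(f"second If branch: ${secondIf.count() / total * 100}%.1f%%")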

SparkR from RStudio - gives Error in invokeJava(isStatic = TRUE, className, methodName, …):

戏子无情 submitted on 2019-12-20 05:00:10
Question: I am using RStudio. After creating a session, if I try to create a DataFrame using R data, it gives an error. Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7") Sys.setenv(HADOOP_HOME = "E:/winutils") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"') library(SparkR) sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp")) localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c…

StackOverflowError when applying PySpark ALS's "recommendProductsForUsers" (although a cluster with >300GB RAM is available)

ぐ巨炮叔叔 submitted on 2019-12-20 03:43:24
Question: Looking for expertise to guide me on the issue below. Background: I'm trying to get going with a basic PySpark script inspired by this example. As deployment infrastructure I use a Google Cloud Dataproc cluster. The cornerstone of my code is the function "recommendProductsForUsers" documented here, which gives me back the top X products for all users in the model. Issue I run into: The ALS.train script runs smoothly and scales well on GCP (easily >1M customers). However, applying the predictions, i.e. using…
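
A StackOverflowError at this stage usually points to a very long RDD lineage rather than to memory pressure, so a common mitigation is to give Spark a checkpoint directory before training so ALS can periodically truncate the lineage of its factor RDDs. A sketch of the idea against the Scala API (the question uses PySpark, where sc.setCheckpointDir works the same way); the path is hypothetical:

    import org.apache.spark.mllib.recommendation.ALS

    // must be set *before* training; ALS checkpoints its intermediate factor RDDs here
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

    val model = ALS.trainImplicit(ratings, 10, 10)

    // top 10 products per user; for very large user bases, recommending in batches of users
    // (calling recommendProducts per chunk) also keeps each job smaller
    val recommendations = model.recommendProductsForUsers(10)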

Spark DataFrame handling empty String in OneHotEncoder

社会主义新天地 submitted on 2019-12-19 17:45:31
Question: I am importing a CSV file (using spark-csv) into a DataFrame which has empty string values. When the OneHotEncoder is applied, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this? I could reproduce the error in the example provided on the Spark ML page: val df = sqlContext.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), //<- original example has "a" here (4, "a"), (5, "c") )).toDF("id", "category") val…
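
The failure comes from the encoder's column metadata, which cannot use the empty string as a category name, so the simplest workaround is to map empty strings to a sentinel label (or drop those rows) before StringIndexer and OneHotEncoder see the data. A small Scala sketch; the sentinel name "EMPTY" is arbitrary:

    import org.apache.spark.sql.functions.{col, when}

    // replace "" with a placeholder category before indexing/encoding
    val cleaned = df.withColumn(
      "category",
      when(col("category") === "", "EMPTY").otherwise(col("category")))

    // alternatively, drop the offending rows instead of renaming the category
    val dropped = df.filter("category != ''")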