apache-spark-mllib

How to prepare training data in MLlib

喜欢而已 submitted on 2019-12-21 13:48:12
Question: TL;DR: How do I use MLlib to train my wiki data (text & category) for prediction against tweets? I have trouble figuring out how to convert my tokenized wiki data so that it can be trained through either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in…
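
For context, a minimal sketch of the usual spark.ml text-classification pipeline in Scala. It assumes the wiki data sits in a DataFrame called wikiDF with "text" and "category" columns and the tweets in tweetsDF with a "text" column; those names and the feature dimension are assumptions, not details from the question:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}

    // map the string category to a numeric label, tokenize, hash to term frequencies, weight by IDF
    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("label")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf        = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
    val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val nb        = new NaiveBayes()   // TF-IDF features are non-negative, so NaiveBayes is applicable

    val pipeline = new Pipeline().setStages(Array(indexer, tokenizer, tf, idf, nb))
    val model    = pipeline.fit(wikiDF)

    // tweets must pass through the *same fitted* stages so the hashing space and IDF weights match
    val predictions = model.transform(tweetsDF)

A common source of "wrong predictions" in this setup is transforming the tweets with re-fitted feature stages instead of the single pipeline model fitted on the wiki data.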

Apache Spark ALS - how to perform live recommendations / fold in an anonymous user

送分小仙女□ submitted on 2019-12-21 13:42:31
Question: I am using Apache Spark (PySpark API for Python) ALS MLlib to develop a service that performs live recommendations for anonymous users (users not in the training set) on my site. In my use case I train the model on the user ratings in this way: from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating ratings = df.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) rank = 10 numIterations = 10 model = ALS.trainImplicit(ratings, rank, numIterations) Now, each time an…
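
MatrixFactorizationModel has no built-in fold-in for unseen users, so a common workaround is to approximate the anonymous user's latent vector from the factors of the items they touched. A sketch of that idea against the Scala API (the question itself uses PySpark, which mirrors it); the item ids are hypothetical and collect() is only feasible for small catalogues:

    // items the anonymous user interacted with during the session (hypothetical ids)
    val ratedIds = Set(10, 42, 7)

    // pull the learned item factors to the driver and average the vectors of the rated items
    val factors = model.productFeatures.collect().toMap
    val rank    = factors.head._2.length
    val userVec = Array.tabulate(rank) { i =>
      ratedIds.toSeq.map(id => factors(id)(i)).sum / ratedIds.size
    }

    // score every unseen item by its dot product with the approximated user vector
    val top10 = factors
      .filter { case (id, _) => !ratedIds.contains(id) }
      .map { case (id, f) => (id, f.zip(userVec).map { case (a, b) => a * b }.sum) }
      .toSeq
      .sortBy(-_._2)
      .take(10)

A least-squares solve against the item factor matrix gives a better fold-in than a plain average, but the averaging version already illustrates the approach.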

Spark MLlib - trainImplicit warning

匆匆过客 submitted on 2019-12-21 03:34:13
Question: I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same. All these warnings come from the ALS iterations, from flatMap and also from aggregate; for instance, the origin of the stage where the flatMap is showing these warnings (w/ Spark 1.3.0, but they are also…
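
For what it's worth, the warning concerns the serialized size of individual tasks rather than correctness. Splitting the ratings over more partitions and asking ALS for more blocks usually keeps each task under the 100 KB hint. A hedged Scala sketch, where ratings is the input RDD[Rating] and the partition/block counts are guesses that would need tuning:

    import org.apache.spark.mllib.recommendation.ALS

    // more input partitions and more ALS blocks -> smaller per-task payloads
    val partitionedRatings = ratings.repartition(200)
    val model = ALS.trainImplicit(
      partitionedRatings,
      10,     // rank
      10,     // iterations
      0.01,   // lambda (regularization)
      200,    // blocks: how finely users/products are split across tasks
      0.01)   // alpha (confidence weight for implicit feedback)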

Spark: regression model threshold and precision

妖精的绣舞 submitted on 2019-12-21 02:35:38
Question: I have a logistic regression model where I explicitly set the threshold to 0.5: model.setThreshold(0.5) I train the model and then I want to get basic stats -- precision, recall, etc. This is what I do when I evaluate the model: val metrics = new BinaryClassificationMetrics(predictionAndLabels) val precision = metrics.precisionByThreshold precision.foreach { case (t, p) => println(s"Threshold is: $t, Precision is: $p") } I get results with only 0.0 and 1.0 as values of the threshold, and 0.5 is…
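
This is the classic symptom of evaluating hard 0/1 predictions: once a threshold is set, predict() emits labels, so BinaryClassificationMetrics only ever sees the thresholds 0.0 and 1.0. A sketch of the usual fix, clearing the threshold so the metrics are computed over raw scores (assuming the test set is an RDD[LabeledPoint]):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.regression.LabeledPoint

    // with the threshold cleared, predict() returns the raw score instead of a hard 0/1 label
    model.clearThreshold()

    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      (model.predict(features), label)
    }

    val metrics = new BinaryClassificationMetrics(predictionAndLabels)
    metrics.precisionByThreshold.foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }

    // restore the 0.5 cut-off afterwards if hard predictions are needed again
    model.setThreshold(0.5)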

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

岁酱吖の submitted on 2019-12-20 20:16:46
Question: I am using Spark ML to run some ML experiments, and on a small dataset of 20MB (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. With scikit-learn, by comparison, it takes much less time. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering if the problem lies within my code, since I am fairly new to Spark. Here it is: df = pd.read_csv(http://archive.ics.uci…
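
Most of the time in a setup like this goes into the grid search: every parameter combination is trained numFolds times, and deep or wide forests multiply that further. A hedged Scala sketch of the usual mitigations; assembledDF (a DataFrame already holding label and features columns), the partition count, grid size, and depth are illustrative, not values from the question:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // ~2 partitions per core plus caching keep every fold from re-reading and re-shuffling the data
    val train = assembledDF.repartition(96).cache()

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxDepth(10)

    // total work is roughly |grid| x numFolds x numTrees, so keep the grid deliberately small
    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .build()

    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new MulticlassClassificationEvaluator())
      .setNumFolds(3)

    val cvModel = cv.fit(train)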

What is rank in the ALS machine learning algorithm in Apache Spark MLlib

北战南征 submitted on 2019-12-20 09:56:48
Question: I wanted to try an example of the ALS machine learning algorithm. My code works fine; however, I do not understand the parameter rank used in the algorithm. I have the following code in Java: // Build the recommendation model using ALS int rank = 10; int numIterations = 10; MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01); I have read somewhere that it is the number of latent factors in the model. Suppose I have a dataset of (user, product, rating) that has 100…
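
rank is the number of latent factors, i.e. the length of the hidden vector learned for every user and every product; a predicted rating is approximately the dot product of the two vectors. A small sketch in Scala that makes the dimension visible (the question uses the Java API, which mirrors it; variable names are illustrative):

    import org.apache.spark.mllib.recommendation.ALS

    val rank = 10
    val numIterations = 10
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // every user and every product is represented by a latent vector of length `rank`
    val (someUser, userVec)    = model.userFeatures.first()
    val (someProduct, prodVec) = model.productFeatures.first()
    println(userVec.length)   // 10
    println(prodVec.length)   // 10

    // the model's estimate for (someUser, someProduct) is essentially this dot product
    val estimate = userVec.zip(prodVec).map { case (u, p) => u * p }.sum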

How to view Random Forest statistics in Spark (Scala)

浪子不回头ぞ submitted on 2019-12-20 06:06:37
Question: I have a RandomForestClassifierModel in Spark. Using .toDebugString() outputs the following: Tree 0 (weight 1.0): If (feature 0 in {1.0,2.0,3.0}) If (feature 3 in {2.0,3.0}) If (feature 8 <= 55.3) . . Else (feature 0 not in {1.0,2.0,3.0}) . . Tree 1 (weight 1.0): . . ...etc. I'd like to view the actual data as it goes through the model, something like: Tree 0 (weight 1.0): If (feature 0 in {1.0,2.0,3.0}) 60% If (feature 3 in {2.0,3.0}) 57% If (feature 8 <= 55.3) 22% . . Else (feature 0 not in {1…
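
The fitted model does not expose per-node sample fractions, so one workaround is to recompute them by filtering the training data with the same split conditions that toDebugString prints. A hedged Scala sketch, assuming the raw feature columns (here called feature0 and feature3) are still available on the training DataFrame before vector assembly; those names are hypothetical:

    import org.apache.spark.sql.functions.col

    val total = trainDF.count().toDouble

    // rows that take the root's "If (feature 0 in {1.0,2.0,3.0})" branch
    val rootIf = trainDF.filter(col("feature0").isin(1.0, 2.0, 3.0))
    println(f"root If branch: ${rootIf.count() / total * 100}%.1f%%")

    // and the fraction that continues down "If (feature 3 in {2.0,3.0})"
    val secondIf = rootIf.filter(col("feature3").isin(2.0, 3.0))
    println(f"second If branch: ${secondIf.count() / total * 100}%.1f%%")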

SparkR from RStudio - gives Error in invokeJava(isStatic = TRUE, className, methodName, …):

戏子无情 submitted on 2019-12-20 05:00:10
Question: I am using RStudio. After creating a session, if I try to create a DataFrame using R data, it gives an error. Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7") Sys.setenv(HADOOP_HOME = "E:/winutils") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"') library(SparkR) sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp")) localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c…

StackOverflowError when applying PySpark ALS's "recommendProductsForUsers" (although a cluster with >300GB RAM is available)

ぐ巨炮叔叔 submitted on 2019-12-20 03:43:24
Question: Looking for expertise to guide me on the issue below. Background: I'm trying to get going with a basic PySpark script inspired by this example. As deployment infrastructure I use a Google Cloud Dataproc cluster. The cornerstone of my code is the function "recommendProductsForUsers" documented here, which gives me back the top X products for all users in the model. Issue I run into: The ALS.train script runs smoothly and scales well on GCP (easily >1M customers). However, applying the predictions, i.e. using…
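
A StackOverflowError at this stage usually points to a very long RDD lineage rather than to memory pressure, so a common mitigation is to give Spark a checkpoint directory before training so ALS can periodically truncate the lineage of its factor RDDs. A sketch of the idea against the Scala API (the question uses PySpark, where sc.setCheckpointDir works the same way); the path is hypothetical:

    import org.apache.spark.mllib.recommendation.ALS

    // must be set *before* training; ALS checkpoints its intermediate factor RDDs here
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

    val model = ALS.trainImplicit(ratings, 10, 10)

    // top 10 products per user; for very large user bases, recommending in batches of users
    // (calling recommendProducts per chunk) also keeps each job smaller
    val recommendations = model.recommendProductsForUsers(10)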

Spark DataFrame handling empty String in OneHotEncoder

社会主义新天地 submitted on 2019-12-19 17:45:31
Question: I am importing a CSV file (using spark-csv) into a DataFrame which has empty string values. When the OneHotEncoder is applied, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this? I could reproduce the error in the example provided on the Spark ML page: val df = sqlContext.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), //<- original example has "a" here (4, "a"), (5, "c") )).toDF("id", "category") val…
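
The failure comes from the encoder's column metadata, which cannot use the empty string as a category name, so the simplest workaround is to map empty strings to a sentinel label (or drop those rows) before StringIndexer and OneHotEncoder see the data. A small Scala sketch; the sentinel name "EMPTY" is arbitrary:

    import org.apache.spark.sql.functions.{col, when}

    // replace "" with a placeholder category before indexing/encoding
    val cleaned = df.withColumn(
      "category",
      when(col("category") === "", "EMPTY").otherwise(col("category")))

    // alternatively, drop the offending rows instead of renaming the category
    val dropped = df.filter("category != ''")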