apache-spark-ml

How to prepare training data in mllib

喜欢而已 submitted on 2019-12-21 13:48:12
Question: TL;DR: How do I use mllib to train on my wiki data (text & category) for prediction against tweets? I'm having trouble figuring out how to convert my tokenized wiki data so it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in
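A minimal PySpark sketch of the kind of pipeline the question describes (Tokenizer → HashingTF → IDF → NaiveBayes). Column names, numFeatures, and the wiki_df/tweet_df DataFrames are illustrative assumptions, not the asker's actual code:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
    from pyspark.ml.classification import NaiveBayes

    # wiki_df is assumed to have columns "text" (article body) and "category" (string label)
    indexer = StringIndexer(inputCol="category", outputCol="label")
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20000)
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    nb = NaiveBayes(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[indexer, tokenizer, hashing_tf, idf, nb])
    model = pipeline.fit(wiki_df)             # train on the wiki data
    predictions = model.transform(tweet_df)   # tweet_df must provide the same "text" column

The key point is that the same fitted pipeline (vocabulary hashing, IDF weights, label index) must be reused on the tweets; refitting any stage on the tweets would make the feature spaces incompatible.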

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

岁酱吖の submitted on 2019-12-20 20:16:46
Question: I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset) with a Random Forest and a parameter grid, it takes 1 hour and 30 minutes to finish. With scikit-learn it takes far less time. In terms of environment, I was testing with 2 slaves, 15GB memory each, 24 cores. I assume it was not supposed to take that long, and I am wondering whether the problem lies in my code, since I am fairly new to Spark. Here it is: df = pd.read_csv(http://archive.ics.uci
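Runtime here is usually dominated by the number of models fitted (grid size × folds), not by the 20MB of data. A hedged sketch showing how that count adds up; the grid values and column names are assumptions:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    # A 3 x 3 grid with 5 folds trains 3 * 3 * 5 = 45 forests, so shrinking the
    # grid (or the fold count) cuts the wall-clock time almost linearly.
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50, 100])
            .addGrid(rf.maxDepth, [5, 10, 15])
            .build())

    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                        numFolds=5,
                        parallelism=4)  # parallelism is available from Spark 2.3 onward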

What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

£可爱£侵袭症+ submitted on 2019-12-20 12:23:37
Question: After I trained a LogisticRegressionModel, I transformed the test data DataFrame with it and got the prediction DataFrame. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction | probability | prediction? Answer 1: rawPrediction is typically the model's direct confidence calculation. From the Spark docs: Raw prediction for each possible label. The meaning of
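A small sketch of inspecting the three columns for a LogisticRegressionModel; the DataFrame names are assumptions. For logistic regression, rawPrediction holds the per-class margins, probability is those margins mapped through the logistic/softmax function so they sum to 1, and prediction is the index of the largest value:

    from pyspark.ml.classification import LogisticRegression

    lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
    pred = lr_model.transform(test_df)

    # rawPrediction: un-normalized per-class confidence (the margins for LR);
    # probability: the normalized version of rawPrediction; prediction: its argmax.
    pred.select("rawPrediction", "probability", "prediction").show(5, truncate=False)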

Caching intermediate results in Spark ML pipeline

好久不见. submitted on 2019-12-20 12:22:53
Question: Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining algorithm stages and hyper-parameter grid search. Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. The importance of this feature arises when the pipeline involves computation-intensive stages. For example, in my case I use a huge sparse matrix to perform multiple moving
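One common workaround, sketched here under assumed stage and DataFrame names, is to split the pipeline so the expensive feature stages are fitted and cached once, and only the cheap estimator stage is re-run during the grid search:

    from pyspark.ml import Pipeline

    # Fit the expensive, parameter-free feature stages once...
    feature_pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf])
    feature_model = feature_pipeline.fit(train_df)
    featurized = feature_model.transform(train_df).cache()  # keep the costly intermediate result in memory

    # ...then run the hyper-parameter search over the remaining estimator only,
    # so each grid point reuses the cached features instead of recomputing them.
    cv_model = cross_validator.fit(featurized)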

How to get classification probabilities from MultilayerPerceptronClassifier?

痴心易碎 submitted on 2019-12-20 01:43:38
Question: This seems most related to: How to get the probability per instance in classification models in spark.mllib. I'm doing a classification task with spark ml, building a MultilayerPerceptronClassifier. Once I build a model, I can get a predicted class for an input vector, but I can't get the probability for each output class. The question above indicates that NaiveBayesModel supports this functionality as of Spark 1.5.0 (via a predictProbabilities method). I would like to get at this
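The 1.x MultilayerPerceptronClassificationModel did not expose probabilities; from roughly Spark 2.3 onward the model is a probabilistic classifier and its transform output includes probability and rawPrediction columns. A sketch under that assumption, with illustrative layer sizes and column names:

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    mlp = MultilayerPerceptronClassifier(layers=[4, 8, 3], labelCol="label", featuresCol="features")
    model = mlp.fit(train_df)

    # In Spark 2.3+ transform emits "probability" alongside "prediction";
    # in 1.5/1.6 only "prediction" is available for this classifier.
    model.transform(test_df).select("probability", "prediction").show(5, truncate=False)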

Spark DataFrame handling empty String in OneHotEncoder

社会主义新天地 submitted on 2019-12-19 17:45:31
Question: I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error "requirement failed: Cannot have an empty string for name.". Is there a way I can get around this? I could reproduce the error with the example provided on the Spark ml page: val df = sqlContext.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), //<- original example has "a" here (4, "a"), (5, "c") )).toDF("id", "category") val
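One common workaround is to replace empty categories with an explicit placeholder before indexing and encoding, so the attribute name is never the empty string. A Python sketch of the same idea as the Scala example above; the placeholder value and column names are assumptions:

    from pyspark.sql import functions as F
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Map "" (and whitespace-only values) to a visible placeholder category.
    cleaned = df.withColumn(
        "category",
        F.when(F.trim(F.col("category")) == "", "__EMPTY__").otherwise(F.col("category")))

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")

    indexed = indexer.fit(cleaned).transform(cleaned)
    # Note: on Spark 3.x OneHotEncoder is an Estimator, so use encoder.fit(indexed).transform(indexed).
    encoded = encoder.transform(indexed)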

How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?

偶尔善良 submitted on 2019-12-19 10:17:11
Question: I'm trying to tune the hyper-parameters of a Spark (PySpark) ALS model with TrainValidationSplit. It works well, but I want to know which combination of hyper-parameters is best. How do I get the best params after evaluation? from pyspark.ml.recommendation import ALS from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder from pyspark.ml.evaluation import RegressionEvaluator df = sqlCtx.createDataFrame( [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
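A hedged sketch of one way to read the winning combination back out: TrainValidationSplitModel exposes bestModel and validationMetrics, and validationMetrics lines up index-for-index with the param grid. The variable names (model, param_grid) are assumptions matching the question's setup:

    # model = tvs.fit(df) as in the question; param_grid is the ParamGridBuilder output.
    metrics = model.validationMetrics
    best_idx = min(range(len(metrics)), key=lambda i: metrics[i])  # min for RMSE; use max for accuracy-style metrics
    best_params = param_grid[best_idx]
    print({param.name: value for param, value in best_params.items()})  # the winning hyper-parameters
    print(model.bestModel.rank)  # e.g. the chosen rank, read directly off the fitted best ALS model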

Why doesn't spark.ml implement any of the spark.mllib algorithms?

冷暖自知 submitted on 2019-12-18 19:01:45
Question: In the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods, only spark.mllib
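Until the DataFrame-based API caught up, the usual workaround was to drop from the DataFrame down to an RDD for the missing algorithm. A sketch under the assumption of a Spark 1.x/2.x setup with an array<string> column named "items":

    from pyspark.mllib.fpm import FPGrowth

    # Convert the DataFrame column of item arrays into the RDD the RDD-based API expects.
    transactions_rdd = transactions_df.select("items").rdd.map(lambda row: row.items)

    fp_model = FPGrowth.train(transactions_rdd, minSupport=0.2, numPartitions=10)
    for itemset in fp_model.freqItemsets().collect():
        print(itemset.items, itemset.freq)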

How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

六月ゝ 毕业季﹏ submitted on 2019-12-18 11:29:38
Question: I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it, in order to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the HashingTF ml library outputs a vector when given a features column of a DataFrame. temp_df = sqlContext.createDataFrame(temp_rdd, StructType([ StructField("label", DoubleType(), False), StructField("tokens", ArrayType(StringType()),
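The key is to declare the vector column with VectorUDT in the schema. A hedged sketch, assuming the RDD contains (String, SparseVector) tuples as described; on Spark 1.x the type lives in pyspark.mllib.linalg, on 2.x+ in pyspark.ml.linalg, and the vectors must come from the matching package:

    from pyspark.sql.types import StructType, StructField, StringType
    from pyspark.ml.linalg import VectorUDT  # use pyspark.mllib.linalg.VectorUDT on Spark 1.x

    schema = StructType([
        StructField("label", StringType(), False),
        StructField("features", VectorUDT(), False),
    ])

    # rdd is assumed to hold (String, SparseVector) tuples as in the question.
    df = sqlContext.createDataFrame(rdd, schema)
    df.printSchema()  # features: vector (nullable = false)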

Spark: OneHot encoder and storing Pipeline (feature dimension issue)

穿精又带淫゛_ submitted on 2019-12-18 09:24:43
Question: We have a pipeline (2.0.1) consisting of multiple feature transformation stages. Some of these stages are OneHot encoders. Idea: turn an integer-based category into n independent features. When training the pipeline model and using it to predict, everything works fine. However, storing the trained pipeline model and reloading it causes issues: the stored 'trained' OneHot encoder does not keep track of how many categories there are. Loading it now causes issues: when the loaded model is used to
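A sketch of the usual persistence pattern this question is working toward: fit the pipeline once on training data that contains every category, persist the fitted PipelineModel, and never refit the encoder stages on a new batch. The path and DataFrame names are assumptions:

    from pyspark.ml import Pipeline, PipelineModel

    # Fit once so the StringIndexer/OneHot stages see all categories,
    # then persist the *fitted* model rather than the unfitted Pipeline.
    pipeline_model = pipeline.fit(train_df)
    pipeline_model.write().overwrite().save("/models/onehot_pipeline")

    # Reloading restores the stages as fitted; apply them to new data without refitting,
    # so the feature dimension stays fixed regardless of which categories the batch contains.
    restored = PipelineModel.load("/models/onehot_pipeline")
    predictions = restored.transform(new_df)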