apache-spark-ml

How to combine n-grams into one vocabulary in Spark?

Submitted by 限于喜欢 on 2019-12-04 12:00:19

Question: Is there a built-in Spark feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary? Setting n=2 in NGram followed by an invocation of CountVectorizer results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

Answer 1: You can train separate NGram and CountVectorizer models and merge the resulting count vectors using VectorAssembler.

    from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler from
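A minimal sketch of that approach, assuming the input DataFrame already has a tokenized "words" column; the column names, minDF value and helper function are illustrative, not from the original answer:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler

    def build_ngram_pipeline(input_col="words", n_max=2):
        stages = []
        vector_cols = []
        for n in range(1, n_max + 1):
            ngram_col = "{0}_grams".format(n)
            count_col = "{0}_counts".format(n)
            # One NGram + CountVectorizer pair per n; n=1 reproduces the unigrams
            stages.append(NGram(n=n, inputCol=input_col, outputCol=ngram_col))
            stages.append(CountVectorizer(inputCol=ngram_col, outputCol=count_col, minDF=2.0))
            vector_cols.append(count_col)
        # Concatenate the per-n count vectors into one combined feature vector
        stages.append(VectorAssembler(inputCols=vector_cols, outputCol="features"))
        return Pipeline(stages=stages)

    # model = build_ngram_pipeline(n_max=3).fit(tokenized_df)
    # combined = model.transform(tokenized_df).select("features")

The combined vector is the concatenation of the per-n vocabularies rather than a single CountVectorizer dictionary, but it gives one feature space covering all n-gram orders.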

Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

Submitted by 痴心易碎 on 2019-12-04 11:36:43

I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:

    Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
        at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
        at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I give a
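This error typically shows up when the checkpoint directory is a node-local path rather than storage every executor can read, so each machine only sees the checkpoint partition files it wrote itself. A hedged sketch of a stateful word count that checkpoints to a shared filesystem; the HDFS URL, host and port below are placeholders, not taken from the original question:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatefulNetworkWordCount")
    ssc = StreamingContext(sc, batchDuration=10)

    # Checkpoint to storage reachable from every worker (HDFS, S3, NFS, ...),
    # not to a local path such as /tmp/checkpoint on the driver machine.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoint")

    def update_count(new_values, running_count):
        return sum(new_values) + (running_count or 0)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()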

How to prepare for training data in mllib

Submitted by 匆匆过客 on 2019-12-04 07:16:36

TL;DR: How do I use MLlib to train on my wiki data (text and category) so I can predict against tweets? I am having trouble figuring out how to convert my tokenized wiki data so that it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in the wiki data as my labels... I've only seen binary classification (it's one category or another)... is
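One way to get multiclass labels from string categories is to run them through StringIndexer before the classifier; both NaiveBayes and LogisticRegression in spark.ml accept more than two label values. A minimal sketch assuming the wiki DataFrame has "text" and "category" columns and the tweets have a "text" column; all column names and parameter values here are illustrative:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

    # Map category strings to numeric labels (0.0, 1.0, 2.0, ...) up front so the
    # fitted feature pipeline can later be applied to tweets that have no category.
    indexer = StringIndexer(inputCol="category", outputCol="label").fit(wiki_df)
    labeled = indexer.transform(wiki_df)

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1 << 18)
    idf = IDF(inputCol="raw_features", outputCol="features")
    nb = NaiveBayes(featuresCol="features", labelCol="label", smoothing=1.0)

    model = Pipeline(stages=[tokenizer, hashing_tf, idf, nb]).fit(labeled)
    predictions = model.transform(tweets_df)
    # indexer.labels maps each predicted index back to the original category string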

How to overwrite Spark ML model in PySpark?

Submitted by 有些话、适合烂在心里 on 2019-12-04 06:46:41

    from pyspark.ml.regression import RandomForestRegressor, RandomForestRegressionModel

    rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
    rf_model = rf.fit(train_df)
    rf_model_path = "./hdfsData/" + "rfr_model"
    rf_model.save(rf_model_path)

When I first tried to save the model, these lines worked. But when I wanted to save the model to the same path again, it gave this error:

    Py4JJavaError: An error occurred while calling o1695.save.
    : java.io.IOException: Path ./hdfsData/rfr_model already exists. Please use write.overwrite().save(path) to overwrite it.

Then I tried: rf
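The fix the error message points to is to go through the model's writer, which exposes an overwrite switch. A short sketch; only the path comes from the question, the rest is illustrative:

    # Overwrite an existing saved model instead of calling save() directly
    rf_model.write().overwrite().save(rf_model_path)

    # Loading it back later (assuming the same Spark version)
    from pyspark.ml.regression import RandomForestRegressionModel
    loaded_model = RandomForestRegressionModel.load(rf_model_path)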

SPARK, ML, Tuning, CrossValidator: access the metrics

Submitted by 大城市里の小女人 on 2019-12-04 03:58:46

To build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters for my pipeline:

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new MulticlassClassificationEvaluator)
      .setNumFolds(10)
    val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes. Is it possible to access the metrics calculated for the best model? Ideally, I would like to access the metrics of all models to see
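The fitted CrossValidatorModel keeps the metric averaged over the folds for each parameter combination, in the same order as the parameter grid. The question is in Scala, but the sketch below reads the same fields from PySpark, assuming the fitted model and the grid are called cv_model and param_grid; per-fold metrics, and metrics other than the evaluator's single one, are not retained and would have to be computed manually:

    # One averaged metric per ParamMap, aligned with the parameter grid
    for params, metric in zip(param_grid, cv_model.avgMetrics):
        readable = {p.name: v for p, v in params.items()}
        print(metric, readable)

    # The best combination is the one with the best averaged metric
    best_params, best_metric = max(zip(param_grid, cv_model.avgMetrics),
                                   key=lambda pair: pair[1])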

Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

Submitted by ╄→гoц情女王★ on 2019-12-03 22:13:57

I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than in R. I have done some research, and the only post I found related to this issue is this one. It seems that Spark's logistic regression returns models that minimize a loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering whether I did something wrong in the setup and whether there is a way to improve the predictions. Below are my Spark code and R code. Spark code partial model
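With no penalty (regParam = 0), an intercept, and enough iterations to converge, spark.ml's logistic regression solves essentially the same maximum-likelihood problem as R's glm(..., family = binomial), so a gap this large usually points to a setup difference (regularization, missing intercept, early stopping, a different probability threshold, or a different train/test split) rather than to the algorithm itself. A hedged sketch of settings intended to approximate the unpenalized fit; the parameter values are illustrative:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(
        featuresCol="features",
        labelCol="label",
        regParam=0.0,          # no L1/L2 penalty
        elasticNetParam=0.0,   # irrelevant when regParam is 0, kept explicit
        fitIntercept=True,
        maxIter=200,
        tol=1e-8,
    )
    lr_model = lr.fit(train_df)
    print(lr_model.intercept, lr_model.coefficients)

If accuracy is compared on hard labels, also check that both sides use the same 0.5 probability cutoff and the same evaluation data.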

Pyspark - Get all parameters of models created with ParamGridBuilder

Submitted by 走远了吗. on 2019-12-03 20:19:28

Question: I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for a single parameter, and then (I guess) takes the Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

    rdc = RandomForestClassifier()
    pipeline = Pipeline(stages=STAGES + [rdc])
    paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20]) .addGrid(rdc
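Once the grid has been evaluated through a CrossValidator (or TrainValidationSplit), each tested combination can be read back together with its score. The sketch below assumes such a cross-validator is built around the pipeline and paramGrid from the question; the evaluator, fold count and train_df are illustrative:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=3,
    )
    cv_model = cv.fit(train_df)

    # One averaged score per grid point, in the same order as paramGrid
    for params, metric in zip(paramGrid, cv_model.avgMetrics):
        print({p.name: v for p, v in params.items()}, metric)

    # Parameters of the winning RandomForest stage; on some older PySpark
    # versions the tuned values are only visible on the underlying Java object.
    best_rf = cv_model.bestModel.stages[-1]
    print(best_rf.extractParamMap())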

Cannot run RandomForestClassifier from spark ML on a simple example

Submitted by 两盒软妹~` on 2019-12-03 17:04:46

I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide. Here is the code:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.Row

    // Prepare training data from a list of (label, features) tuples.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense
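With this toy dataset, RandomForestClassifier usually fails because the label column lacks the class-count metadata that tree classifiers require (LogisticRegression does not need it); indexing the labels with StringIndexer adds it. The original question is in Scala; the sketch below shows the same idea in PySpark, using the training rows from the Spark ML guide example and assuming a SparkSession named spark:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.linalg import Vectors

    training = spark.createDataFrame([
        (1.0, Vectors.dense(0.0, 1.1, 0.1)),
        (0.0, Vectors.dense(2.0, 1.0, -1.0)),
        (0.0, Vectors.dense(2.0, 1.3, 1.0)),
        (1.0, Vectors.dense(0.0, 1.2, -0.5)),
    ], ["label", "features"])

    # StringIndexer attaches the nominal-label metadata RandomForestClassifier expects
    indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=10)
    model = Pipeline(stages=[indexer, rf]).fit(training)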

PCA in Spark MLlib and Spark ML

Submitted by 浪子不回头ぞ on 2019-12-03 11:56:43

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (as a person new to the whole Spark ecosystem) Spark ML is the way to go, and MLlib is still around mostly for backward compatibility. My question is very concrete and relates to PCA. In the MLlib implementation there seems to be a limitation on the number of columns: "spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors." Also, if you look at the Java code example, there is also this: "The number of columns should be
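For comparison, the DataFrame-based spark.ml API exposes PCA as a feature transformer. A minimal sketch, close to the official feature-extraction example; the column names and k are illustrative, and a SparkSession named spark is assumed:

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([
        (Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),),
        (Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),),
        (Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    ], ["features"])

    # Project the 5-dimensional vectors onto their top 3 principal components
    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(df)
    pca_model.transform(df).select("pca_features").show(truncate=False)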

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

Submitted by 吃可爱长大的小学妹 on 2019-12-03 08:17:14

I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. With scikit-learn it takes much, much less time. In terms of environment, I was testing with 2 slaves, 15GB of memory each, and 24 cores. I assume it was not supposed to take that long, and I am wondering whether the problem lies in my code, since I am fairly new to Spark. Here it is:

    df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
    dataframe = sqlContext.createDataFrame
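Grid search multiplies the cost: every combination in the grid is trained once per cross-validation fold, and deep trees are by far the most expensive part of Spark's Random Forest. Two things that usually help are repartitioning and caching the training DataFrame so every executor core has work, and keeping the grid and maxDepth small while exploring. A hedged sketch, assuming a Pipeline named pipeline whose last stage is a RandomForestClassifier named rf (neither is visible in the truncated snippet):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Spread the small dataset over the cluster and keep it in memory;
    # a single input partition would leave most executor cores idle.
    train_df = dataframe.repartition(48).cache()  # match roughly the total core count
    train_df.count()                              # materialize the cache up front

    paramGrid = (ParamGridBuilder()
                 .addGrid(rf.maxDepth, [5, 10])
                 .addGrid(rf.numTrees, [20, 50])
                 .build())

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
        numFolds=3,   # 3 folds x 4 grid points = 12 full training runs
    )
    cv_model = cv.fit(train_df)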