apache-spark-ml

How to combine n-grams into one vocabulary in Spark?

Submitted by 限于喜欢 on 2019-12-04 12:00:19

Question: Is there a built-in Spark feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary? Setting n=2 in NGram followed by an invocation of CountVectorizer results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

Answer 1: You can train separate NGram and CountVectorizer models and merge the resulting count vectors using VectorAssembler.

    from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler from
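A minimal sketch of that approach, assuming the input DataFrame already has a tokenized "words" column; the column names, minDF value and helper function are illustrative, not from the original answer:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler

    def build_ngram_pipeline(input_col="words", n_max=2):
        stages = []
        vector_cols = []
        for n in range(1, n_max + 1):
            ngram_col = "{0}_grams".format(n)
            count_col = "{0}_counts".format(n)
            # One NGram + CountVectorizer pair per n; n=1 reproduces the unigrams
            stages.append(NGram(n=n, inputCol=input_col, outputCol=ngram_col))
            stages.append(CountVectorizer(inputCol=ngram_col, outputCol=count_col, minDF=2.0))
            vector_cols.append(count_col)
        # Concatenate the per-n count vectors into one combined feature vector
        stages.append(VectorAssembler(inputCols=vector_cols, outputCol="features"))
        return Pipeline(stages=stages)

    # model = build_ngram_pipeline(n_max=3).fit(tokenized_df)
    # combined = model.transform(tokenized_df).select("features")

The combined vector is the concatenation of the per-n vocabularies rather than a single CountVectorizer dictionary, but it gives one feature space covering all n-gram orders.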

Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

Submitted by 痴心易碎 on 2019-12-04 11:36:43

I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:

    Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
        at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
        at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I give a
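This error typically shows up when the checkpoint directory is a node-local path rather than storage every executor can read, so each machine only sees the checkpoint partition files it wrote itself. A hedged sketch of a stateful word count that checkpoints to a shared filesystem; the HDFS URL, host and port below are placeholders, not taken from the original question:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatefulNetworkWordCount")
    ssc = StreamingContext(sc, batchDuration=10)

    # Checkpoint to storage reachable from every worker (HDFS, S3, NFS, ...),
    # not to a local path such as /tmp/checkpoint on the driver machine.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoint")

    def update_count(new_values, running_count):
        return sum(new_values) + (running_count or 0)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()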

How to prepare for training data in mllib

Submitted by 匆匆过客 on 2019-12-04 07:16:36

TL;DR: How do I use MLlib to train on my wiki data (text and category) so I can predict against tweets? I am having trouble figuring out how to convert my tokenized wiki data so that it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried: *Note that I would like to use the many categories in the wiki data as my labels... I've only seen binary classification (it's one category or another)... is
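One way to get multiclass labels from string categories is to run them through StringIndexer before the classifier; both NaiveBayes and LogisticRegression in spark.ml accept more than two label values. A minimal sketch assuming the wiki DataFrame has "text" and "category" columns and the tweets have a "text" column; all column names and parameter values here are illustrative:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

    # Map category strings to numeric labels (0.0, 1.0, 2.0, ...) up front so the
    # fitted feature pipeline can later be applied to tweets that have no category.
    indexer = StringIndexer(inputCol="category", outputCol="label").fit(wiki_df)
    labeled = indexer.transform(wiki_df)

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1 << 18)
    idf = IDF(inputCol="raw_features", outputCol="features")
    nb = NaiveBayes(featuresCol="features", labelCol="label", smoothing=1.0)

    model = Pipeline(stages=[tokenizer, hashing_tf, idf, nb]).fit(labeled)
    predictions = model.transform(tweets_df)
    # indexer.labels maps each predicted index back to the original category string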

How to overwrite Spark ML model in PySpark?

Submitted by 有些话、适合烂在心里 on 2019-12-04 06:46:41

    from pyspark.ml.regression import RandomForestRegressor, RandomForestRegressionModel

    rf = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=5, maxDepth=10, seed=42)
    rf_model = rf.fit(train_df)
    rf_model_path = "./hdfsData/" + "rfr_model"
    rf_model.save(rf_model_path)

When I first tried to save the model, these lines worked. But when I wanted to save the model to the same path again, it gave this error:

    Py4JJavaError: An error occurred while calling o1695.save.
    : java.io.IOException: Path ./hdfsData/rfr_model already exists. Please use write.overwrite().save(path) to overwrite it.

Then I tried: rf
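The fix the error message points to is to go through the model's writer, which exposes an overwrite switch. A short sketch; only the path comes from the question, the rest is illustrative:

    # Overwrite an existing saved model instead of calling save() directly
    rf_model.write().overwrite().save(rf_model_path)

    # Loading it back later (assuming the same Spark version)
    from pyspark.ml.regression import RandomForestRegressionModel
    loaded_model = RandomForestRegressionModel.load(rf_model_path)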

SPARK, ML, Tuning, CrossValidator: access the metrics

Submitted by 大城市里の小女人 on 2019-12-04 03:58:46

To build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters for my pipeline:

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new MulticlassClassificationEvaluator)
      .setNumFolds(10)
    val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes. Is it possible to access the metrics calculated for the best model? Ideally, I would like to access the metrics of all models to see
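The fitted CrossValidatorModel keeps the metric averaged over the folds for each parameter combination, in the same order as the parameter grid. The question is in Scala, but the sketch below reads the same fields from PySpark, assuming the fitted model and the grid are called cv_model and param_grid; per-fold metrics, and metrics other than the evaluator's single one, are not retained and would have to be computed manually:

    # One averaged metric per ParamMap, aligned with the parameter grid
    for params, metric in zip(param_grid, cv_model.avgMetrics):
        readable = {p.name: v for p, v in params.items()}
        print(metric, readable)

    # The best combination is the one with the best averaged metric
    best_params, best_metric = max(zip(param_grid, cv_model.avgMetrics),
                                   key=lambda pair: pair[1])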

Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

Submitted by ╄→гoц情女王★ on 2019-12-03 22:13:57

I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than in R. I have done some research, and the only post I found related to this issue is this one. It seems that Spark's logistic regression returns models that minimize a loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering whether I did something wrong in the setup and whether there is a way to improve the predictions. Below are my Spark code and R code. Spark code partial model
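With no penalty (regParam = 0), an intercept, and enough iterations to converge, spark.ml's logistic regression solves essentially the same maximum-likelihood problem as R's glm(..., family = binomial), so a gap this large usually points to a setup difference (regularization, missing intercept, early stopping, a different probability threshold, or a different train/test split) rather than to the algorithm itself. A hedged sketch of settings intended to approximate the unpenalized fit; the parameter values are illustrative:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(
        featuresCol="features",
        labelCol="label",
        regParam=0.0,          # no L1/L2 penalty
        elasticNetParam=0.0,   # irrelevant when regParam is 0, kept explicit
        fitIntercept=True,
        maxIter=200,
        tol=1e-8,
    )
    lr_model = lr.fit(train_df)
    print(lr_model.intercept, lr_model.coefficients)

If accuracy is compared on hard labels, also check that both sides use the same 0.5 probability cutoff and the same evaluation data.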

Pyspark - Get all parameters of models created with ParamGridBuilder

Submitted by 走远了吗. on 2019-12-03 20:19:28

Question: I'm using PySpark 2.0 for a Kaggle competition. I'd like to know how a model (RandomForest) behaves depending on different parameters. ParamGridBuilder() lets you specify different values for a single parameter, and then (I guess) takes the Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

    rdc = RandomForestClassifier()
    pipeline = Pipeline(stages=STAGES + [rdc])
    paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20]) .addGrid(rdc
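Once the grid has been evaluated through a CrossValidator (or TrainValidationSplit), each tested combination can be read back together with its score. The sketch below assumes such a cross-validator is built around the pipeline and paramGrid from the question; the evaluator, fold count and train_df are illustrative:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=3,
    )
    cv_model = cv.fit(train_df)

    # One averaged score per grid point, in the same order as paramGrid
    for params, metric in zip(paramGrid, cv_model.avgMetrics):
        print({p.name: v for p, v in params.items()}, metric)

    # Parameters of the winning RandomForest stage; on some older PySpark
    # versions the tuned values are only visible on the underlying Java object.
    best_rf = cv_model.bestModel.stages[-1]
    print(best_rf.extractParamMap())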

Cannot run RandomForestClassifier from spark ML on a simple example

Submitted by 两盒软妹~` on 2019-12-03 17:04:46

I have tried to run the experimental RandomForestClassifier from the spark.ml package (version 1.5.2). The dataset I used is from the LogisticRegression example in the Spark ML guide. Here is the code:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.Row

    // Prepare training data from a list of (label, features) tuples.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense
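With this toy dataset, RandomForestClassifier usually fails because the label column lacks the class-count metadata that tree classifiers require (LogisticRegression does not need it); indexing the labels with StringIndexer adds it. The original question is in Scala; the sketch below shows the same idea in PySpark, using the training rows from the Spark ML guide example and assuming a SparkSession named spark:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.linalg import Vectors

    training = spark.createDataFrame([
        (1.0, Vectors.dense(0.0, 1.1, 0.1)),
        (0.0, Vectors.dense(2.0, 1.0, -1.0)),
        (0.0, Vectors.dense(2.0, 1.3, 1.0)),
        (1.0, Vectors.dense(0.0, 1.2, -0.5)),
    ], ["label", "features"])

    # StringIndexer attaches the nominal-label metadata RandomForestClassifier expects
    indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=10)
    model = Pipeline(stages=[indexer, rf]).fit(training)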

PCA in Spark MLlib and Spark ML

Submitted by 浪子不回头ぞ on 2019-12-03 11:56:43

Spark now has two machine learning libraries: Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (as a person new to the whole Spark ecosystem) Spark ML is the way to go, and MLlib is still around mostly for backward compatibility. My question is very concrete and relates to PCA. In the MLlib implementation there seems to be a limitation on the number of columns: "spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors." Also, if you look at the Java code example, there is also this: "The number of columns should be
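For comparison, the DataFrame-based spark.ml API exposes PCA as a feature transformer. A minimal sketch, close to the official feature-extraction example; the column names and k are illustrative, and a SparkSession named spark is assumed:

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([
        (Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),),
        (Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),),
        (Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    ], ["features"])

    # Project the 5-dimensional vectors onto their top 3 principal components
    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(df)
    pca_model.transform(df).select("pca_features").show(truncate=False)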

Spark ML Pipeline with RandomForest takes too long on 20MB dataset

Submitted by 吃可爱长大的小学妹 on 2019-12-03 08:17:14

I am using Spark ML to run some ML experiments, and on a small 20MB dataset (the Poker dataset), a Random Forest with a parameter grid takes 1 hour and 30 minutes to finish. With scikit-learn it takes much, much less time. In terms of environment, I was testing with 2 slaves, 15GB of memory each, and 24 cores. I assume it was not supposed to take that long, and I am wondering whether the problem lies in my code, since I am fairly new to Spark. Here it is:

    df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data")
    dataframe = sqlContext.createDataFrame
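Grid search multiplies the cost: every combination in the grid is trained once per cross-validation fold, and deep trees are by far the most expensive part of Spark's Random Forest. Two things that usually help are repartitioning and caching the training DataFrame so every executor core has work, and keeping the grid and maxDepth small while exploring. A hedged sketch, assuming a Pipeline named pipeline whose last stage is a RandomForestClassifier named rf (neither is visible in the truncated snippet):

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Spread the small dataset over the cluster and keep it in memory;
    # a single input partition would leave most executor cores idle.
    train_df = dataframe.repartition(48).cache()  # match roughly the total core count
    train_df.count()                              # materialize the cache up front

    paramGrid = (ParamGridBuilder()
                 .addGrid(rf.maxDepth, [5, 10])
                 .addGrid(rf.numTrees, [20, 50])
                 .build())

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=MulticlassClassificationEvaluator(labelCol="label"),
        numFolds=3,   # 3 folds x 4 grid points = 12 full training runs
    )
    cv_model = cv.fit(train_df)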