apache-spark-mllib

How to access parameters of the underlying model in ML Pipeline?

会有一股神秘感。 Submitted on 2020-05-30 03:22:06
Question: I have a DataFrame that is processed with LinearRegression. If I do it directly, like below, I can display the details of the model:

val lr = new LinearRegression()
val lrModel = lr.fit(df)
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_b22a7bb88404

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
Coefficients: [0.9705748115939526] Intercept: 0.31041486689532866

However, if I use it inside a pipeline (like in the simplified example …
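
The usual way around this is to pull the fitted stage back out of the resulting PipelineModel. Below is a minimal sketch of that idea in PySpark (the Scala API exposes a matching stages member); the DataFrame df and its column names are assumptions carried over for illustration.

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Fit the regression as a pipeline stage instead of calling fit() directly.
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline_model = Pipeline(stages=[lr]).fit(df)

# The fitted LinearRegressionModel is an element of PipelineModel.stages,
# so its coefficients and intercept remain accessible after the pipeline fit.
lr_model = pipeline_model.stages[-1]
print(f"Coefficients: {lr_model.coefficients} Intercept: {lr_model.intercept}")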

PySpark MLlib Random Forest classifier repeatability issue

℡╲_俬逩灬. Submitted on 2020-05-16 01:31:21
Question: I am running into a situation where I have no clue what's going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page: https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html. This seed parameter is the random seed for bootstrapping and choosing feature subsets. Now, I verified the models and they are absolutely …
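
For reference, a minimal sketch of pinning the seed on the DataFrame-based API (pyspark.ml); the column names and numTrees value are illustrative assumptions, and identical results additionally assume the training data is partitioned the same way on every run.

from pyspark.ml.classification import RandomForestClassifier

# seed fixes the randomness used for bootstrapping and feature-subset selection,
# so refitting on the same, identically partitioned data should give the same forest.
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, seed=42)
rf_model = rf.fit(train_df)  # train_df is assumed to already exist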

Any way to access methods from individual stages in PySpark PipelineModel?

て烟熏妆下的殇ゞ Submitted on 2020-05-13 05:15:04
Question: I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API):

def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not need data
    and will not actually do any fitting until invoked by the caller.
    Args:
        minTokenLength:
        minDF: minimum number of documents a word is present in within the corpus
        minTF: minimum number of times a word is found in a document
        numTopics:
        seed: …
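
A minimal sketch of reaching a stage-specific method such as describeTopics() through PipelineModel.stages; it assumes the LDA estimator is the last stage of the pipeline returned by create_lda_pipeline and that corpus_df already exists.

# Fit the pipeline, then index into its fitted stages to reach the LDAModel.
pipeline = create_lda_pipeline(numTopics=10)
pipeline_model = pipeline.fit(corpus_df)

lda_model = pipeline_model.stages[-1]  # assumed: LDA is the last stage
lda_model.describeTopics(maxTermsPerTopic=10).show(truncate=False)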

PySpark: Output of OneHotEncoder looks odd [duplicate]

こ雲淡風輕ζ Submitted on 2020-03-25 18:23:16
Question: This question already has an answer here: Spark ML VectorAssembler returns strange output (1 answer). Closed 2 years ago. The Spark documentation contains a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model …
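
The "odd" values are SparseVector instances, which Spark prints as (size, [indices], [values]) rather than a dense 0/1 row. A minimal sketch of making that visible, assuming the indexed DataFrame and column names from the documentation example and the Spark 3 API (where OneHotEncoder is an estimator):

from pyspark.ml.feature import OneHotEncoder

# Spark 3 API: OneHotEncoder is an estimator, hence the fit() step.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.fit(indexed).transform(indexed)

# Each categoryVec is a SparseVector; toArray() shows the familiar dense form.
# Note the highest index is dropped by default, so that category is all zeros.
for row in encoded.select("category", "categoryVec").collect():
    print(row.category, row.categoryVec.toArray())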

apply OneHotEncoder for several categorical columns in Spark MLlib

时光毁灭记忆、已成空白 Submitted on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer, I get an error:

stringIndexer = StringIndexer(
    inputCol = ['a', 'b', 'c', 'd'],
    outputCol = ['a_index', 'b_index', 'c_index', 'd_index']
)
model = stringIndexer.fit(Data)

An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
    at org.apache.spark.ml.feature.StringIndexer.fit …
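
The error comes from passing lists where StringIndexer's inputCol and outputCol expect a single column name each. A minimal sketch of the usual workaround, one indexer and one encoder per column chained in a Pipeline (column names follow the question; newer Spark releases also offer multi-column variants):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = ['a', 'b', 'c', 'd']

# One StringIndexer and one OneHotEncoder per categorical column.
indexers = [StringIndexer(inputCol=c, outputCol=c + '_index')
            for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=c + '_index', outputCol=c + '_vec')
            for c in categorical_cols]

pipeline = Pipeline(stages=indexers + encoders)
model = pipeline.fit(Data)  # Data is the DataFrame from the question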

How to use XGBoost in a PySpark Pipeline

匆匆过客 Submitted on 2020-02-19 14:26:30
Question: I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the pipeline API. How can I do something like the following in PySpark?

from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

It is convenient to use the pipeline API, so can anybody give some advice?
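
The scikit-learn XGBClassifier is not a Spark ML Estimator, so it cannot be a Pipeline stage. One option, sketched below under the assumption that a recent XGBoost release (1.7+) with its built-in Spark support is installed, is the SparkXGBClassifier estimator; the feature columns are illustrative.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # shipped with XGBoost 1.7+

# Assemble raw columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')

# SparkXGBClassifier implements the Estimator interface, so it can sit in a Pipeline.
xgb = SparkXGBClassifier(features_col='features', label_col='label')

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)  # train_df is assumed to already exist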

Spark ML setParallelism for CrossValidator

拟墨画扇 Submitted on 2020-02-02 14:30:10
Question: I am trying to set up cross-validation using Spark ML, but I am getting a run-time error saying "value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator". I am currently following the Spark tutorial page. I am new to this, so any help is appreciated. Below is my code snippet:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache …
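
setParallelism was only added to CrossValidator in Spark 2.3, so this message usually means an older Spark version is being built against. A minimal PySpark sketch of the same knob (the Scala setter is equivalent), with a placeholder estimator and parameter grid:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# parallelism controls how many models are evaluated concurrently;
# the parameter (and setParallelism) exists only in Spark 2.3 and later.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=2)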

How to calculate TF-IDF on grouped spark dataframe in scala?

拈花ヽ惹草 Submitted on 2020-01-24 17:32:09
Question: I have used the Spark API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) for calculating TF-IDF on a DataFrame. What I am unable to do is compute it on grouped data, using DataFrame groupBy, calculating TF-IDF for each group and getting a single DataFrame as the result. For example, for the input:

id | category       | texts
0  | smallLetters   | Array("a", "b", "c")
1  | smallLetters   | Array("a", "b", "b", "c", "a")
2  | capitalLetters | Array("A", "B", "C")
3  | capitalLetters | Array("A", "B", "B", …
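
There is no built-in grouped TF-IDF, so one workaround, sketched here in PySpark under the assumption that df holds the id/category/texts columns shown above, is to fit a separate IDF model on each category's subset and union the results:

from functools import reduce
from pyspark.ml.feature import HashingTF, IDF

def tfidf_for_group(group_df):
    # Term frequencies, then an IDF model fitted only on this group's documents.
    tf = HashingTF(inputCol='texts', outputCol='tf', numFeatures=1 << 10)
    tf_df = tf.transform(group_df)
    idf_model = IDF(inputCol='tf', outputCol='tfidf').fit(tf_df)
    return idf_model.transform(tf_df)

categories = [r[0] for r in df.select('category').distinct().collect()]
per_group = [tfidf_for_group(df.filter(df.category == c)) for c in categories]
result = reduce(lambda a, b: a.unionByName(b), per_group)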