apache-spark-mllib

How to access parameters of the underlying model in ML Pipeline?

会有一股神秘感。 Submitted on 2020-05-30 03:22:06
Question: I have a DataFrame that is processed with LinearRegression. If I do it directly, like below, I can display the details of the model:

val lr = new LinearRegression()
val lrModel = lr.fit(df)
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_b22a7bb88404

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
Coefficients: [0.9705748115939526] Intercept: 0.31041486689532866

However, if I use it inside a pipeline (like in the simplified example …
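
The usual way around this is to pull the fitted stage back out of the resulting PipelineModel. Below is a minimal sketch of that idea in PySpark (the Scala API exposes a matching stages member); the DataFrame df and its column names are assumptions carried over for illustration.

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Fit the regression as a pipeline stage instead of calling fit() directly.
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline_model = Pipeline(stages=[lr]).fit(df)

# The fitted LinearRegressionModel is an element of PipelineModel.stages,
# so its coefficients and intercept remain accessible after the pipeline fit.
lr_model = pipeline_model.stages[-1]
print(f"Coefficients: {lr_model.coefficients} Intercept: {lr_model.intercept}")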

PySpark MLlib Random Forest classifier repeatability issue

℡╲_俬逩灬. Submitted on 2020-05-16 01:31:21
Question: I am running into a situation where I have no clue what's going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page: https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html. This seed parameter is the random seed for bootstrapping and choosing feature subsets. Now, I verified the models and they are absolutely …
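
For reference, a minimal sketch of pinning the seed on the DataFrame-based API (pyspark.ml); the column names and numTrees value are illustrative assumptions, and identical results additionally assume the training data is partitioned the same way on every run.

from pyspark.ml.classification import RandomForestClassifier

# seed fixes the randomness used for bootstrapping and feature-subset selection,
# so refitting on the same, identically partitioned data should give the same forest.
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, seed=42)
rf_model = rf.fit(train_df)  # train_df is assumed to already exist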

Any way to access methods from individual stages in PySpark PipelineModel?

て烟熏妆下的殇ゞ Submitted on 2020-05-13 05:15:04
Question: I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API):

def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not need data
    and will not actually do any fitting until invoked by the caller.
    Args:
        minTokenLength:
        minDF: minimum number of documents a word is present in within the corpus
        minTF: minimum number of times a word is found in a document
        numTopics:
        seed: …
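
A minimal sketch of reaching a stage-specific method such as describeTopics() through PipelineModel.stages; it assumes the LDA estimator is the last stage of the pipeline returned by create_lda_pipeline and that corpus_df already exists.

# Fit the pipeline, then index into its fitted stages to reach the LDAModel.
pipeline = create_lda_pipeline(numTopics=10)
pipeline_model = pipeline.fit(corpus_df)

lda_model = pipeline_model.stages[-1]  # assumed: LDA is the last stage
lda_model.describeTopics(maxTermsPerTopic=10).show(truncate=False)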

PySpark: Output of OneHotEncoder looks odd [duplicate]

こ雲淡風輕ζ Submitted on 2020-03-25 18:23:16
Question: This question already has an answer here: Spark ML VectorAssembler returns strange output (1 answer). Closed 2 years ago. The Spark documentation contains a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model …
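
The "odd" values are SparseVector instances, which Spark prints as (size, [indices], [values]) rather than a dense 0/1 row. A minimal sketch of making that visible, assuming the indexed DataFrame and column names from the documentation example and the Spark 3 API (where OneHotEncoder is an estimator):

from pyspark.ml.feature import OneHotEncoder

# Spark 3 API: OneHotEncoder is an estimator, hence the fit() step.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.fit(indexed).transform(indexed)

# Each categoryVec is a SparseVector; toArray() shows the familiar dense form.
# Note the highest index is dropped by default, so that category is all zeros.
for row in encoded.select("category", "categoryVec").collect():
    print(row.category, row.categoryVec.toArray())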

apply OneHotEncoder for several categorical columns in Spark MLlib

时光毁灭记忆、已成空白 Submitted on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply the StringIndexer, I get an error:

stringIndexer = StringIndexer(
    inputCol = ['a', 'b', 'c', 'd'],
    outputCol = ['a_index', 'b_index', 'c_index', 'd_index']
)
model = stringIndexer.fit(Data)

An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
    at org.apache.spark.ml.feature.StringIndexer.fit …
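
The error comes from passing lists where StringIndexer's inputCol and outputCol expect a single column name each. A minimal sketch of the usual workaround, one indexer and one encoder per column chained in a Pipeline (column names follow the question; newer Spark releases also offer multi-column variants):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = ['a', 'b', 'c', 'd']

# One StringIndexer and one OneHotEncoder per categorical column.
indexers = [StringIndexer(inputCol=c, outputCol=c + '_index')
            for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=c + '_index', outputCol=c + '_vec')
            for c in categorical_cols]

pipeline = Pipeline(stages=indexers + encoders)
model = pipeline.fit(Data)  # Data is the DataFrame from the question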

How to use XGBoost in a PySpark Pipeline

匆匆过客 Submitted on 2020-02-19 14:26:30
Question: I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the pipeline API. How can I do something like the following in PySpark?

from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

It is convenient to use the pipeline API, so can anybody give some advice?
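
The scikit-learn XGBClassifier is not a Spark ML Estimator, so it cannot be a Pipeline stage. One option, sketched below under the assumption that a recent XGBoost release (1.7+) with its built-in Spark support is installed, is the SparkXGBClassifier estimator; the feature columns are illustrative.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # shipped with XGBoost 1.7+

# Assemble raw columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')

# SparkXGBClassifier implements the Estimator interface, so it can sit in a Pipeline.
xgb = SparkXGBClassifier(features_col='features', label_col='label')

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)  # train_df is assumed to already exist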

Spark ML setParallelism for CrossValidator

拟墨画扇 Submitted on 2020-02-02 14:30:10
Question: I am trying to set up cross-validation using Spark ML, but I am getting a run-time error saying "value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator". I am currently following the Spark tutorial page. I am new to this, so any help is appreciated. Below is my code snippet:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache …
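
setParallelism was only added to CrossValidator in Spark 2.3, so this message usually means an older Spark version is being built against. A minimal PySpark sketch of the same knob (the Scala setter is equivalent), with a placeholder estimator and parameter grid:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# parallelism controls how many models are evaluated concurrently;
# the parameter (and setParallelism) exists only in Spark 2.3 and later.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=2)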

How to calculate TF-IDF on grouped spark dataframe in scala?

拈花ヽ惹草 Submitted on 2020-01-24 17:32:09
Question: I have used the Spark API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) for calculating TF-IDF on a DataFrame. What I am unable to do is compute it on grouped data, using DataFrame groupBy, calculating TF-IDF for each group and getting a single DataFrame as the result. For example, for the input:

id | category       | texts
0  | smallLetters   | Array("a", "b", "c")
1  | smallLetters   | Array("a", "b", "b", "c", "a")
2  | capitalLetters | Array("A", "B", "C")
3  | capitalLetters | Array("A", "B", "B", …
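
There is no built-in grouped TF-IDF, so one workaround, sketched here in PySpark under the assumption that df holds the id/category/texts columns shown above, is to fit a separate IDF model on each category's subset and union the results:

from functools import reduce
from pyspark.ml.feature import HashingTF, IDF

def tfidf_for_group(group_df):
    # Term frequencies, then an IDF model fitted only on this group's documents.
    tf = HashingTF(inputCol='texts', outputCol='tf', numFeatures=1 << 10)
    tf_df = tf.transform(group_df)
    idf_model = IDF(inputCol='tf', outputCol='tfidf').fit(tf_df)
    return idf_model.transform(tf_df)

categories = [r[0] for r in df.select('category').distinct().collect()]
per_group = [tfidf_for_group(df.filter(df.category == c)) for c in categories]
result = reduce(lambda a, b: a.unionByName(b), per_group)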