apache-spark-ml

Interpreting coefficientMatrix, interceptVector and Confusion matrix on multinomial logistic regression

…衆ロ難τιáo~ submitted on 2020-06-13 08:11:10
Question: Can anyone explain how to interpret the coefficientMatrix, interceptVector and confusion matrix of a multinomial logistic regression? According to the Spark documentation: Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length…
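For concreteness, a minimal PySpark sketch of where these two attributes live, assuming a hypothetical training DataFrame named training with the usual "label" and "features" columns (coefficientMatrix and interceptVector are the documented LogisticRegressionModel attributes):

    from pyspark.ml.classification import LogisticRegression

    # Fit a multinomial (softmax) logistic regression on a hypothetical DataFrame
    # `training` that has "label" and "features" columns.
    lr = LogisticRegression(family="multinomial", maxIter=100)
    lr_model = lr.fit(training)

    # K x J matrix: one row of J feature coefficients per outcome class.
    print(lr_model.coefficientMatrix)

    # Length-K vector: one intercept per outcome class (when fitIntercept=True).
    print(lr_model.interceptVector)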

Any way to access methods from individual stages in PySpark PipelineModel?

て烟熏妆下的殇ゞ submitted on 2020-05-13 05:15:04
Question: I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API): def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'): """ Create a pipeline for running an LDA model on a corpus. This function does not need data and will not actually do any fitting until invoked by the caller. Args: minTokenLength: minDF: minimum number of documents a word must appear in within the corpus minTF: minimum number of times a word must appear in a document numTopics: seed:…
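A minimal sketch of reaching an individual fitted stage, assuming the pipeline returned by create_lda_pipeline has been fitted into a PipelineModel called pipeline_model and that the LDA stage comes last (both names are hypothetical):

    # pipeline_model = create_lda_pipeline(numTopics=10).fit(corpus_df)
    # PipelineModel.stages is a list of the fitted stages, in pipeline order.
    lda_model = pipeline_model.stages[-1]   # assumes the LDAModel is the last stage

    # Methods of the underlying LDAModel are then available directly.
    lda_model.describeTopics(maxTermsPerTopic=10).show(truncate=False)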

apply OneHotEncoder for several categorical columns in Spark MLlib

时光毁灭记忆、已成空白 submitted on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol = ['a_index', 'b_index','c_index','d_index'] ) model = stringIndexer.fit(Data) An error occurred while calling o328.fit. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.spark.ml.feature.StringIndexer.fit…
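The ClassCastException arises because inputCol/outputCol each expect a single column name, not a list. A sketch of one common workaround - one StringIndexer per column chained in a Pipeline, followed by a single OneHotEncoder - assuming Spark 3.0+ for the encoder's inputCols/outputCols parameters (on Spark 2.3/2.4 the equivalent class is OneHotEncoderEstimator); Data is the question's DataFrame:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    cat_cols = ['a', 'b', 'c', 'd']

    # One StringIndexer per column: inputCol/outputCol take single strings.
    indexers = [StringIndexer(inputCol=c, outputCol=c + '_index') for c in cat_cols]

    # Spark 3.0+ OneHotEncoder accepts lists of columns.
    encoder = OneHotEncoder(
        inputCols=[c + '_index' for c in cat_cols],
        outputCols=[c + '_vec' for c in cat_cols],
    )

    encoded = Pipeline(stages=indexers + [encoder]).fit(Data).transform(Data)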

How to use XGBoost in a PySpark Pipeline

匆匆过客 submitted on 2020-02-19 14:26:30
Question: I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the pipeline API. How can I use PySpark like this: from xgboost import XGBClassifier ... model = XGBClassifier() model.fit(X_train, y_train) pipeline = Pipeline(stages=[..., model, ...]) ... It is convenient to use the pipeline API, so can anybody give some advice?
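One option the question does not mention is the Spark estimator that ships with recent XGBoost releases (xgboost >= 1.7); it implements the pyspark.ml Estimator interface, so it can sit inside a Pipeline. A sketch under that assumption, with hypothetical column names and DataFrames:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from xgboost.spark import SparkXGBClassifier  # requires xgboost >= 1.7

    # Hypothetical list of numeric input columns assembled into a vector.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # A pyspark.ml-compatible estimator, usable as a stage like LogisticRegression.
    xgb = SparkXGBClassifier(features_col="features", label_col="label")

    model = Pipeline(stages=[assembler, xgb]).fit(train_df)
    predictions = model.transform(test_df)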

Spark Java IllegalArgumentException at org.apache.xbean.asm5.ClassReader

那年仲夏 submitted on 2020-01-24 03:30:30
Question: I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData): Exception in thread "main" java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) at org.apache…

Explode sparse features vector into separate columns

陌路散爱 submitted on 2020-01-23 12:34:50
Question: In my Spark DataFrame I have a column which contains the output of a CountVectorizer transformation - it is in sparse vector format. What I am trying to do is 'explode' this column into a dense vector and then into its component rows (so that it can be used for scoring by an external model). I know there are 40 features in the column, hence, following this example, I have tried: import org.apache.spark.sql.functions.udf import org.apache.spark.mllib.linalg.Vector // convert sparse vector…
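The question's code is Scala; for reference, a PySpark sketch of the same idea, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array) and a hypothetical DataFrame df whose column "features" holds the 40-element sparse vectors:

    from pyspark.ml.functions import vector_to_array  # Spark 3.0+
    from pyspark.sql import functions as F

    n_features = 40

    # Convert the (sparse) vector column to a plain array column, then pull
    # each element out into its own column.
    dense = df.withColumn("feature_array", vector_to_array("features"))
    exploded = dense.select(
        "*",
        *[F.col("feature_array")[i].alias("f_{}".format(i)) for i in range(n_features)]
    )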

Getting the leaf probabilities of a tree model in spark

北战南征 submitted on 2020-01-21 11:08:09
Question: I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifier) in such a way that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of RandomForestClassifier, the string just shows the predicted class for each tree, without the relative probabilities. So, if you average the predictions over all the trees, you get a wrong result. An example: we have a DecisionTree represented in this way:…
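A sketch of one way to obtain per-tree class distributions rather than the majority votes printed by toDebugString, assuming Spark 3.0+ (where PySpark classification models expose predictProbability) and a hypothetical fitted RandomForestClassificationModel named rf_model:

    import numpy as np
    from pyspark.ml.linalg import Vectors

    # Each element of rf_model.trees is a DecisionTreeClassificationModel; in
    # Spark 3.0+ predictProbability returns the leaf's class distribution.
    x = Vectors.dense([0.1, 2.0, 0.5])  # a single hypothetical feature vector

    per_tree = [tree.predictProbability(x).toArray() for tree in rf_model.trees]
    forest_probability = np.mean(per_tree, axis=0)  # average over the trees
    print(forest_probability)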

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

≯℡__Kan透↙ submitted on 2020-01-16 06:54:00
Question: I'm following the documentation example "Example: Estimator, Transformer, and Param" and I get this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at SimpleApp$.main(hw.scala:75) Line 75 is the call to sqlContext.createDataFrame(): import java.util.Random import org.apache.log4j.Logger import org…