apache-spark-ml

Interpreting coefficientMatrix, interceptVector and Confusion matrix on multinomial logistic regression

…衆ロ難τιáo~ submitted on 2020-06-13 08:11:10
Question: Can anyone explain how to interpret the coefficientMatrix, interceptVector and confusion matrix of a multinomial logistic regression? According to the Spark documentation: Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length…
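For concreteness, a minimal PySpark sketch of where these two attributes live, assuming a hypothetical training DataFrame named training with the usual "label" and "features" columns (coefficientMatrix and interceptVector are the documented LogisticRegressionModel attributes):

    from pyspark.ml.classification import LogisticRegression

    # Fit a multinomial (softmax) logistic regression on a hypothetical DataFrame
    # `training` that has "label" and "features" columns.
    lr = LogisticRegression(family="multinomial", maxIter=100)
    lr_model = lr.fit(training)

    # K x J matrix: one row of J feature coefficients per outcome class.
    print(lr_model.coefficientMatrix)

    # Length-K vector: one intercept per outcome class (when fitIntercept=True).
    print(lr_model.interceptVector)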

Any way to access methods from individual stages in PySpark PipelineModel?

て烟熏妆下的殇ゞ submitted on 2020-05-13 05:15:04
Question: I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API): def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'): """ Create a pipeline for running an LDA model on a corpus. This function does not need data and will not actually do any fitting until invoked by the caller. Args: minTokenLength: minDF: minimum number of documents a word must appear in within the corpus minTF: minimum number of times a word must appear in a document numTopics: seed:…
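A minimal sketch of reaching an individual fitted stage, assuming the pipeline returned by create_lda_pipeline has been fitted into a PipelineModel called pipeline_model and that the LDA stage comes last (both names are hypothetical):

    # pipeline_model = create_lda_pipeline(numTopics=10).fit(corpus_df)
    # PipelineModel.stages is a list of the fitted stages, in pipeline order.
    lda_model = pipeline_model.stages[-1]   # assumes the LDAModel is the last stage

    # Methods of the underlying LDAModel are then available directly.
    lda_model.describeTopics(maxTermsPerTopic=10).show(truncate=False)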

apply OneHotEncoder for several categorical columns in Spark MLlib

时光毁灭记忆、已成空白 submitted on 2020-03-17 09:03:41
Question: I have several categorical features and would like to transform them all using OneHotEncoder. However, when I try to apply StringIndexer, I get an error: stringIndexer = StringIndexer( inputCol = ['a', 'b','c','d'], outputCol = ['a_index', 'b_index','c_index','d_index'] ) model = stringIndexer.fit(Data) An error occurred while calling o328.fit. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.spark.ml.feature.StringIndexer.fit…
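The ClassCastException arises because inputCol/outputCol each expect a single column name, not a list. A sketch of one common workaround - one StringIndexer per column chained in a Pipeline, followed by a single OneHotEncoder - assuming Spark 3.0+ for the encoder's inputCols/outputCols parameters (on Spark 2.3/2.4 the equivalent class is OneHotEncoderEstimator); Data is the question's DataFrame:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    cat_cols = ['a', 'b', 'c', 'd']

    # One StringIndexer per column: inputCol/outputCol take single strings.
    indexers = [StringIndexer(inputCol=c, outputCol=c + '_index') for c in cat_cols]

    # Spark 3.0+ OneHotEncoder accepts lists of columns.
    encoder = OneHotEncoder(
        inputCols=[c + '_index' for c in cat_cols],
        outputCols=[c + '_vec' for c in cat_cols],
    )

    encoded = Pipeline(stages=indexers + [encoder]).fit(Data).transform(Data)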

How to use XGBoost in a PySpark Pipeline

匆匆过客 submitted on 2020-02-19 14:26:30
Question: I want to update my PySpark code. In PySpark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the pipeline API. How can I use PySpark like this: from xgboost import XGBClassifier ... model = XGBClassifier() model.fit(X_train, y_train) pipeline = Pipeline(stages=[..., model, ...]) ... It is convenient to use the pipeline API, so can anybody give some advice?
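One option the question does not mention is the Spark estimator that ships with recent XGBoost releases (xgboost >= 1.7); it implements the pyspark.ml Estimator interface, so it can sit inside a Pipeline. A sketch under that assumption, with hypothetical column names and DataFrames:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from xgboost.spark import SparkXGBClassifier  # requires xgboost >= 1.7

    # Hypothetical list of numeric input columns assembled into a vector.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # A pyspark.ml-compatible estimator, usable as a stage like LogisticRegression.
    xgb = SparkXGBClassifier(features_col="features", label_col="label")

    model = Pipeline(stages=[assembler, xgb]).fit(train_df)
    predictions = model.transform(test_df)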

Spark Java IllegalArgumentException at org.apache.xbean.asm5.ClassReader

那年仲夏 submitted on 2020-01-24 03:30:30
Question: I'm trying to use Spark 2.3.1 with Java. I followed the examples in the documentation but keep getting a poorly described exception when calling .fit(trainingData): Exception in thread "main" java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) at org.apache…

Explode sparse features vector into separate columns

陌路散爱 submitted on 2020-01-23 12:34:50
Question: In my Spark DataFrame I have a column which contains the output of a CountVectorizer transformation - it is in sparse vector format. What I am trying to do is 'explode' this column into a dense vector and then into its component rows (so that it can be used for scoring by an external model). I know there are 40 features in the column, hence, following this example, I have tried: import org.apache.spark.sql.functions.udf import org.apache.spark.mllib.linalg.Vector // convert sparse vector…
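The question's code is Scala; for reference, a PySpark sketch of the same idea, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array) and a hypothetical DataFrame df whose column "features" holds the 40-element sparse vectors:

    from pyspark.ml.functions import vector_to_array  # Spark 3.0+
    from pyspark.sql import functions as F

    n_features = 40

    # Convert the (sparse) vector column to a plain array column, then pull
    # each element out into its own column.
    dense = df.withColumn("feature_array", vector_to_array("features"))
    exploded = dense.select(
        "*",
        *[F.col("feature_array")[i].alias("f_{}".format(i)) for i in range(n_features)]
    )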

Getting the leaf probabilities of a tree model in spark

北战南征 submitted on 2020-01-21 11:08:09
Question: I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifier) in such a way that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of RandomForestClassifier, the string just shows the predicted class for each tree, without the relative probabilities. So, if you average the predictions over all the trees, you get a wrong result. An example: we have a DecisionTree represented in this way:…
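A sketch of one way to obtain per-tree class distributions rather than the majority votes printed by toDebugString, assuming Spark 3.0+ (where PySpark classification models expose predictProbability) and a hypothetical fitted RandomForestClassificationModel named rf_model:

    import numpy as np
    from pyspark.ml.linalg import Vectors

    # Each element of rf_model.trees is a DecisionTreeClassificationModel; in
    # Spark 3.0+ predictProbability returns the leaf's class distribution.
    x = Vectors.dense([0.1, 2.0, 0.5])  # a single hypothetical feature vector

    per_tree = [tree.predictProbability(x).toArray() for tree in rf_model.trees]
    forest_probability = np.mean(per_tree, axis=0)  # average over the trees
    print(forest_probability)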

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

≯℡__Kan透↙ submitted on 2020-01-16 06:54:00
Question: I'm following the documentation example "Example: Estimator, Transformer, and Param" and I get this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at SimpleApp$.main(hw.scala:75) Line 75 is the call to sqlContext.createDataFrame(): import java.util.Random import org.apache.log4j.Logger import org…