apache-spark-ml

How to set parameters for a custom PySpark Transformer once it's a stage in a fitted ML Pipeline?

☆樱花仙子☆ posted on 2019-12-01 12:14:52
Question: I've written a custom ML Pipeline Estimator and Transformer for my own Python algorithm by following the pattern shown here. However, in that example all the parameters needed by _transform() were conveniently passed into the Model/Transformer by the estimator's _fit() method. But my transformer has several parameters that control the way the transform is applied. These parameters are specific to the transformer, so it would feel odd to pass them into the estimator in advance along with the
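
A minimal PySpark sketch of one way to approach this, assuming the transformer exposes its own Param and that the fitted stage can be reached through PipelineModel.stages; the ScaleBy transformer, its factor param, and the column names are invented for illustration, not taken from the question:

    from pyspark.ml import Pipeline, Transformer
    from pyspark.ml.param import Param
    from pyspark.sql import SparkSession, functions as F

    class ScaleBy(Transformer):
        """Toy transformer whose _transform() is controlled by its own param."""
        def __init__(self, factor=1.0):
            super().__init__()
            self.factor = Param(self, "factor", "multiplier applied to the value column")
            self._set(factor=factor)

        def setFactor(self, value):
            return self._set(factor=value)

        def _transform(self, df):
            return df.withColumn("scaled", F.col("value") * self.getOrDefault(self.factor))

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,)], ["value"])

    model = Pipeline(stages=[ScaleBy()]).fit(df)  # a transformer-only pipeline still fits
    stage = model.stages[0]                       # pull the stage back out of the fitted model
    stage.setFactor(10.0)                         # adjust its param before transforming
    model.transform(df).show()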

LinearRegression scala.MatchError:

妖精的绣舞 posted on 2019-12-01 11:21:13
I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0:

    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .addGrid(lr.fitIntercept)
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

The error is:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost): scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Full code: The
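
Without the full code one can only guess, but a common cause of this MatchError is that the features column is not an ml.linalg Vector (for example a hand-built struct, a plain array, or an old mllib vector). A hedged PySpark sketch of the same grid search where VectorAssembler produces a proper Vector column first; the toy columns and values are invented:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    raw = spark.createDataFrame([
        (280000.0, 2400.0, 9373.0, 3.0),
        (150000.0, 1200.0, 5000.0, 2.0),
        (310000.0, 2600.0, 8800.0, 4.0),
        (210000.0, 1800.0, 6400.0, 3.0),
        (175000.0, 1500.0, 5200.0, 2.0),
        (260000.0, 2200.0, 7600.0, 3.0),
    ], ["price", "sqft", "lot", "rooms"])

    # VectorAssembler emits the VectorUDT column that LinearRegression expects,
    # which avoids the GenericRowWithSchema mismatch on the features column.
    assembler = VectorAssembler(inputCols=["sqft", "lot", "rooms"], outputCol="features")
    train = assembler.transform(raw).withColumnRenamed("price", "label")

    lr = LinearRegression()
    paramGrid = (ParamGridBuilder()
                 .addGrid(lr.regParam, [0.1, 0.01])
                 .addGrid(lr.fitIntercept, [False, True])  # PySpark needs explicit values here
                 .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
                 .build())
    cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
                        evaluator=RegressionEvaluator(), numFolds=2)
    cvModel = cv.fit(train)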

Extract results from CrossValidator with paramGrid in pySpark

大城市里の小女人 posted on 2019-12-01 09:31:05
I train a Random Forest with pySpark. I want a CSV with the results, one row per point in the grid. My code is:

    estimator = RandomForestRegressor()
    evaluator = RegressionEvaluator()
    paramGrid = ParamGridBuilder() \
        .addGrid(estimator.numTrees, [2, 3]) \
        .addGrid(estimator.maxDepth, [2, 3]) \
        .addGrid(estimator.impurity, ['variance']) \
        .addGrid(estimator.featureSubsetStrategy, ['sqrt']) \
        .build()
    pipeline = Pipeline(stages=[estimator])
    crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                              evaluator=evaluator, numFolds=3)
    cvModel = crossval.fit(result)

So I want a CSV like: numTrees |
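
One possible approach, sketched under the assumption that paramGrid and cvModel are the objects built above: CrossValidatorModel.avgMetrics holds one averaged metric per grid point, in the same order the grid was built, so it can be zipped with the grid and written out (the cv_results.csv file name is arbitrary):

    import csv

    rows = []
    for params, metric in zip(paramGrid, cvModel.avgMetrics):
        row = {p.name: v for p, v in params.items()}  # e.g. {'numTrees': 2, 'maxDepth': 3, ...}
        row["avgMetric"] = metric
        rows.append(row)

    with open("cv_results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)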

How to understand the libsvm format type in Spark MLlib?

依然范特西╮ posted on 2019-12-01 09:30:40
I am new to Spark MLlib. While reading the Binomial logistic regression example, I don't understand the "libsvm" format type. (Binomial logistic regression) The text looks like:

    0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271
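
For context: each libsvm line is "label index1:value1 index2:value2 ...", where only the non-zero features are listed (a sparse encoding) and indices are one-based in the file; Spark's loader turns every line into a (label, sparse feature vector) row with zero-based indices. A small PySpark sketch, assuming the sample file shipped with the Spark distribution:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Path as used in the Spark examples; adjust to wherever the file actually lives.
    df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    df.printSchema()            # label: double, features: vector
    df.show(1, truncate=False)  # the features column shows up as a SparseVector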

Making the features of test data the same as train data after feature selection in Spark

流过昼夜 posted on 2019-12-01 09:21:24
Question: I'm working in Scala. I have a big question: ChiSqSelector seems to reduce dimensionality successfully, but I can't identify which features were dropped and which remained. How can I know which features were dropped?

    [WrappedArray(a, b, c),(5,[1,2,3],[1,1,1]),(2,[0],[1])]
    [WrappedArray(b, d, e),(5,[0,2,4],[1,1,2]),(2,[1],[2])]
    [WrappedArray(a, c, d),(5,[0,1,3],[1,1,1]),(2,[0],[1])]

PS: when I wanted to make the test data the same as the feature-selected train data, I found that I don't know how to do that in
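
The question is in Scala, but here is a hedged PySpark sketch of the two relevant calls, with placeholder DataFrames train_df/test_df that are assumed to have "features" and "label" columns: the fitted ChiSqSelectorModel exposes selectedFeatures (the indices that survive), and reusing that same fitted model on the test set keeps the test features identical to the train features:

    from pyspark.ml.feature import ChiSqSelector

    selector = ChiSqSelector(numTopFeatures=3, featuresCol="features",
                             labelCol="label", outputCol="selectedFeatures")
    selectorModel = selector.fit(train_df)

    # Indices into the original feature vector that were kept; anything absent was dropped.
    print(selectorModel.selectedFeatures)

    # Apply the same fitted model to both sets so test gets exactly the same features.
    train_selected = selectorModel.transform(train_df)
    test_selected = selectorModel.transform(test_df)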

Join two Spark mllib pipelines together

匆匆过客 posted on 2019-12-01 08:48:49
I have two separate DataFrames, each with several differing processing stages that I handle with mllib transformers in a pipeline. I now want to join these two pipelines together, keeping the features (columns) from each DataFrame. Scikit-learn has the FeatureUnion class for handling this, and I can't seem to find an equivalent for mllib. I could add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and joins it in the transform method, but that seems messy. Pipeline or PipelineModel are valid PipelineStages,
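
There is no built-in FeatureUnion in Spark ML, but since a Pipeline is itself a PipelineStage, one hedged sketch (assuming both pipelines can run over a single, already-joined DataFrame df, and that they write their outputs to placeholder columns featuresA/featuresB) is to nest them in an outer pipeline and assemble the two columns at the end:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    union = Pipeline(stages=[
        pipelineA,   # a Pipeline/PipelineModel is a valid PipelineStage, so it can be nested
        pipelineB,
        VectorAssembler(inputCols=["featuresA", "featuresB"], outputCol="features"),
    ])
    model = union.fit(df)
    result = model.transform(df)

If the two DataFrames really are separate, they still need a key-based join before (or instead of) this, which is roughly what the custom joining transformer described in the question would do.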

Column name with a dot in Spark

江枫思渺然 posted on 2019-11-30 17:46:18
I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their names, as in the following dataset:

    "col0.1","col1.2","col2.3","col3.4"
    1,2,3,4
    10,12,15,3
    1,12,10,5

This is what I'm doing:

    val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("C:/Users/mhattabi/Desktop/donnee/test.txt")
    val column = df.columns.map(c => s"`${c}`")
    val rows = new VectorAssembler().setInputCols(column).setOutputCol("vs")
      .transform(df)
      .select("vs")
      .rdd
    val data = rows.map(_.getAs[org.apache.spark.ml
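
The question is in Scala; as a hedged PySpark sketch of one common workaround, renaming the columns to strip the dots before running VectorAssembler sidesteps the backtick-quoting problem entirely (the path is taken from the question, everything else is illustrative):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format("csv")
          .options(header="true", inferSchema="true")
          .load("C:/Users/mhattabi/Desktop/donnee/test.txt"))

    # Replace '.' in every column name so nothing downstream needs backtick quoting.
    renamed = df.toDF(*[c.replace(".", "_") for c in df.columns])

    assembled = (VectorAssembler(inputCols=renamed.columns, outputCol="vs")
                 .transform(renamed))
    rdd_of_vectors = assembled.select("vs").rdd.map(lambda row: row["vs"])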