Question
With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it with a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark, since it seems that one needs a SparkContext to build a DataFrame? Am I missing something?
Answer 1:
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. A DataFrame lives inside an SQLContext, which itself lives inside a SparkContext. You could perhaps work around that somehow, but in the end the coupling between DataFrames and the SparkContext is by design.
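For illustration (the column name, app name, and local master below are made up for this sketch), even a trivial one-row DataFrame has to be created through a SparkSession, and SparkSession.getOrCreate() starts a SparkContext under the hood:

import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Even a "local" one-row DataFrame needs a SparkSession,
// which in turn owns a (local) SparkContext.
SparkSession session = SparkSession.builder()
    .master("local[*]")
    .appName("df-demo")
    .getOrCreate();
StructType schema = new StructType(new StructField[]{
    new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
});
Dataset<Row> df = session.createDataFrame(
    Collections.singletonList(RowFactory.create(1.0)), schema);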
Answer 2:
Here is my solution for using Spark models outside of a Spark context (using PMML):
- First, create the model with a pipeline like this:
import java.io.FileOutputStream;
import java.util.Properties;
import javax.xml.transform.stream.StreamResult;
import org.apache.spark.SparkConf;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.dmg.pmml.PMML;
import org.jpmml.model.JAXBUtil;
import org.jpmml.sparkml.ConverterUtil;

SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();

// Read the training data over JDBC (Impala in this example);
// vKey and password are defined elsewhere.
Properties dbProperties = new Properties();
dbProperties.setProperty("user", vKey);
dbProperties.setProperty("password", password);
dbProperties.setProperty("AuthMech", "3");
dbProperties.setProperty("source", "jdbc");
dbProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl, tableName, dbProperties);

// Index the categorical column, then assemble the feature vector from the indexed output.
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
String[] inputCols = {"indexed_column1"};
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");

GBTRegressor gbt = new GBTRegressor()
    .setMaxIter(20)
    .setMaxDepth(2)
    .setMaxBins(204)
    .setLabelCol("faktor");

// The pipeline fits and applies the indexer itself, so no separate indexer.fit(data) is needed.
PipelineStage[] stages = {indexer, assembler, gbt};
Pipeline pipeline = new Pipeline().setStages(stages);
PipelineModel pmodel = pipeline.fit(data);

// Export the fitted pipeline to PMML (ConverterUtil is from the JPMML-SparkML library).
PMML pmml = ConverterUtil.toPMML(data.schema(), pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml, new StreamResult(fos));
fos.close();
Then use the PMML file for predictions (locally, without a Spark context; the evaluator is applied to a Map of arguments rather than to a DataFrame):
// FieldName is from org.dmg.pmml; the evaluator classes are from org.jpmml.evaluator.
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);

// Build the argument map: one prepared value per input field of the model.
Map<FieldName, FieldValue> args = new HashMap<FieldName, FieldValue>();
InputField inputField = evaluator.getInputFields().get(0);
args.put(inputField.getName(), inputField.prepare("1.0"));

Map<FieldName, ?> result = evaluator.evaluate(args);
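The original answer stops at the result map. As a small follow-up sketch (assuming the same JPMML-Evaluator API as above), the predicted value can be read back out via the evaluator's target field, with EvaluatorUtil.decode unwrapping JPMML's internal value wrapper:

// Sketch: extract the prediction for the (single) target field.
TargetField targetField = evaluator.getTargetFields().get(0);
Object prediction = EvaluatorUtil.decode(result.get(targetField.getName()));
System.out.println("prediction = " + prediction); // a Double for a regression model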
Source: https://stackoverflow.com/questions/36001684/how-to-eval-spark-ml-model-without-dataframes-sparkcontext