Question
With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it with a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark, since it seems that one needs a SparkContext to build a DataFrame? Am I missing something?
Answer 1:
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. A DataFrame lives inside an SQLContext, which itself lives inside a SparkContext. You could perhaps work around that somehow, but in the end the coupling between DataFrames and the SparkContext is by design.
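For illustration (the column name, app name, and local master below are made up for this sketch), even a trivial one-row DataFrame has to be created through a SparkSession, and SparkSession.getOrCreate() starts a SparkContext under the hood:

import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Even a "local" one-row DataFrame needs a SparkSession,
// which in turn owns a (local) SparkContext.
SparkSession session = SparkSession.builder()
    .master("local[*]")
    .appName("df-demo")
    .getOrCreate();
StructType schema = new StructType(new StructField[]{
    new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
});
Dataset<Row> df = session.createDataFrame(
    Collections.singletonList(RowFactory.create(1.0)), schema);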
Answer 2:
Here is my solution for using Spark models outside of a Spark context (using PMML):
- First, create the model with a pipeline like this:
import java.io.FileOutputStream;
import java.util.Properties;
import javax.xml.transform.stream.StreamResult;
import org.apache.spark.SparkConf;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.dmg.pmml.PMML;
import org.jpmml.model.JAXBUtil;
import org.jpmml.sparkml.ConverterUtil;

SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();

// Read the training data over JDBC (Impala in this example);
// vKey and password are defined elsewhere.
Properties dbProperties = new Properties();
dbProperties.setProperty("user", vKey);
dbProperties.setProperty("password", password);
dbProperties.setProperty("AuthMech", "3");
dbProperties.setProperty("source", "jdbc");
dbProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl, tableName, dbProperties);

// Index the categorical column, then assemble the feature vector from the indexed output.
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
String[] inputCols = {"indexed_column1"};
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");

GBTRegressor gbt = new GBTRegressor()
    .setMaxIter(20)
    .setMaxDepth(2)
    .setMaxBins(204)
    .setLabelCol("faktor");

// The pipeline fits and applies the indexer itself, so no separate indexer.fit(data) is needed.
PipelineStage[] stages = {indexer, assembler, gbt};
Pipeline pipeline = new Pipeline().setStages(stages);
PipelineModel pmodel = pipeline.fit(data);

// Export the fitted pipeline to PMML (ConverterUtil is from the JPMML-SparkML library).
PMML pmml = ConverterUtil.toPMML(data.schema(), pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml, new StreamResult(fos));
fos.close();
Then use the PMML file for predictions (locally, without a Spark context; the evaluator is applied to a Map of arguments rather than to a DataFrame):
// FieldName is from org.dmg.pmml; the evaluator classes are from org.jpmml.evaluator.
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);

// Build the argument map: one prepared value per input field of the model.
Map<FieldName, FieldValue> args = new HashMap<FieldName, FieldValue>();
InputField inputField = evaluator.getInputFields().get(0);
args.put(inputField.getName(), inputField.prepare("1.0"));

Map<FieldName, ?> result = evaluator.evaluate(args);
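The original answer stops at the result map. As a small follow-up sketch (assuming the same JPMML-Evaluator API as above), the predicted value can be read back out via the evaluator's target field, with EvaluatorUtil.decode unwrapping JPMML's internal value wrapper:

// Sketch: extract the prediction for the (single) target field.
TargetField targetField = evaluator.getTargetFields().get(0);
Object prediction = EvaluatorUtil.decode(result.get(targetField.getName()));
System.out.println("prediction = " + prediction); // a Double for a regression model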
Source: https://stackoverflow.com/questions/36001684/how-to-eval-spark-ml-model-without-dataframes-sparkcontext