How to train an ML model in sparklyr and predict new values on another dataframe?

Submitted anonymously (unverified) on 2019-12-03 01:12:01

Question:

Consider the following example

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive Bayes example where class identifies documents falling into the China category.

I am able to run a Naive Bayes classifier in sparklyr by doing the following:

dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4))

which outputs:

NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes:  int 2
  num_features:  int 6
  pi:  num [1:2] -1.179 -0.368
  theta:  num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
  thresholds:  num [1:2] 0.2 0.4

However, I have two major questions:

  1. How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?

  2. Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?

Test data:

dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)

> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text
  <chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff

Thanks!

Answer 1:

How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?

In general (some models do provide a form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely into the native Pipeline API.

Background:

Spark ML Pipelines are primarily built from two types of objects:

  • Transformers - objects that provide a transform method, mapping a DataFrame to an updated DataFrame.

    You can apply a Transformer with the ml_transform method.

  • Estimators - objects that provide a fit method, mapping a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel.

    You can fit an Estimator in sparklyr using the ml_fit method (see the sketch after this list).
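For example, a minimal sketch of the fit / transform cycle on individual stages (assuming the sc connection and the dtrain_spark table from the question):

# a Transformer: bound to the connection, it is a standalone pipeline stage
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
tokenized <- ml_transform(tokenizer, dtrain_spark)

# an Estimator: ml_fit returns the corresponding Transformer (the *Model),
# which can then be used to transform data
vectorizer <- ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab")
vectorizer_model <- ml_fit(vectorizer, tokenized)
vectorized <- ml_transform(vectorizer_model, tokenized)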

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which compute different metrics on the transformed data, based on the columns generated by a model (usually a probability or raw prediction column).

You can apply an Evaluator using the ml_evaluate method.

Related components include cross validators and train-validation splits, which can be used for parameter tuning.

Examples:

sparklyr PipelineStages can be evaluated eagerly (as in your own code), by passing data directly, or lazily, by passing a spark_connection instance and calling the aforementioned methods (ml_fit, ml_transform, etc.) later.

This means you can define a Pipeline as follows:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc,
                 label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4),
                 uid = "nb")
)

Fit the PipelineModel:

model <- ml_fit(pipeline, dtrain_spark) 

Transform, and apply one of the available Evaluators:

ml_transform(model, dtrain_spark) %>%
  ml_binary_classification_evaluator(
    label_col = "class", raw_prediction_col = "rpcol",
    metric_name = "areaUnderROC")
[1] 1 

or

evaluator <- ml_multiclass_classification_evaluator(
    sc,
    label_col = "class", prediction_col = "pcol",
    metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
[1] 1 

Even more importantly, how can I use this trained model to predict new values, say, in the following spark test dataframe?

Use either ml_transform or ml_predict (the latter is a convenience wrapper that applies further transformations to the output):

ml_transform(model, dtest_spark) 
# Source:   table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1
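For comparison, a minimal sketch with ml_predict on the same fitted PipelineModel (selecting text and pcol here is just for illustration):

ml_predict(model, dtest_spark) %>%
  select(text, pcol)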

Cross validation:

There is not enough data in this example, but you can cross-validate and tune hyperparameters as shown below:

# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),  # Note that the name matches the UID
  evaluator = evaluator)
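A cross validator is itself an Estimator, so it can also be built lazily against the connection, fitted with ml_fit, and its tuning results inspected with ml_validation_metrics (a minimal sketch; num_folds = 3 is an illustrative choice, not part of the original answer):

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = list(nb = list(smoothing = list(0.8, 1.0))),
  evaluator = evaluator,
  num_folds = 3)

cv_model <- ml_fit(cv, dtrain_spark)
ml_validation_metrics(cv_model)  # one row of the chosen metric per parameter combination

ml_train_validation_split can be used in the same way when a single split is preferable to full k-fold cross validation.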

Notes:

  • Please keep in mind that Spark's multinomial Naive Bayes implementation considers only binary features (0 or not 0).
  • If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using the standardized (default) column names (a short sketch follows this list):

    • label for the dependent variable.
    • features for the assembled independent variables.
    • rawPrediction, prediction, probability for the raw prediction, prediction, and probability columns, respectively.
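A minimal sketch of the same pipeline using the default names (renaming class to label is an illustrative assumption, not part of the original data):

# rename the label column so the Spark ML defaults line up
dtrain_default <- dtrain_spark %>% rename(label = class)

pipeline_default <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  # no explicit column mapping needed: label / features / prediction /
  # rawPrediction / probability are the defaults
  ml_naive_bayes(sc, model_type = "multinomial")
)

model_default <- ml_fit(pipeline_default, dtrain_default)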

