Consider the following example:

```r
dtrain <- data_frame(
  text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
           "Chinese Macao", "Tokyo Japan Chinese"),
  doc_id = 1:4,
  class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
```

```
> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
```
Here I have the classic Naive Bayes example where class
identifies documents falling into the China
category.
I am able to run a Naive Bayes classifier in sparklyr by doing the following:
```r
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(
    label_col = "class",
    features_col = "myvocab",
    prediction_col = "pcol",
    probability_col = "prcol",
    raw_prediction_col = "rpcol",
    model_type = "multinomial",
    smoothing = 0.6,
    thresholds = c(0.2, 0.4))
```
which outputs:
```
NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes:  int 2
  num_features:  int 6
  pi:  num [1:2] -1.179 -0.368
  theta:  num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
  thresholds:  num [1:2] 0.2 0.4
```
However, I have two major questions:
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
Even more importantly, how can I use this trained model to predict new values, say, in the following Spark test data frame?
Test data:
```r
dtest <- data_frame(
  text = c("Chinese Chinese Chinese Tokyo Japan", "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)
```

```
> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text
  <chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff
```
Thanks!
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
In general (a few models do provide some form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely into the native Pipeline API.
Background:
Spark ML Pipelines are primarily built from two types of objects:

- Transformers: objects that provide a transform method, which maps a DataFrame to an updated DataFrame. You can apply a Transformer with the ml_transform method.
- Estimators: objects that provide a fit method, which maps a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel. You can fit an Estimator in sparklyr using the ml_fit method (see the sketch after this list).

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which compute different metrics on transformed data based on the columns generated by a model (usually a probability or raw prediction column). You can apply an Evaluator using the ml_evaluate method.

Related components include cross-validators and train-validation splits, which can be used for parameter tuning.
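To make the Transformer / Estimator distinction concrete, here is a minimal sketch (assuming the sc connection and dtrain_spark from the question):

```r
# A Tokenizer is a plain Transformer: no fitting step is required,
# so it can be applied to data directly with ml_transform.
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
tokenized <- ml_transform(tokenizer, dtrain_spark)

# A CountVectorizer is an Estimator: ml_fit learns the vocabulary and
# returns a CountVectorizerModel (a Transformer), which can then transform.
cv <- ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab")
cv_model <- ml_fit(cv, tokenized)
vectorized <- ml_transform(cv_model, tokenized)
```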
Examples:
sparklyr PipelineStages can be evaluated eagerly (as in your own code), by passing data directly, or lazily, by passing a spark_connection instance and calling the aforementioned methods (ml_fit, ml_transform, etc.). For example:
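A sketch of the two calling conventions (dtrain_spark as defined in the question):

```r
# Eager: passing a Spark DataFrame applies the stage immediately
# and returns the transformed data.
ft_tokenizer(dtrain_spark, input_col = "text", output_col = "tokens")

# Lazy: passing the connection returns the PipelineStage itself,
# to be composed into a Pipeline and fitted later.
ft_tokenizer(sc, input_col = "text", output_col = "tokens")
```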
This means you can define a Pipeline as follows:
```r
pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(
    sc,
    label_col = "class",
    features_col = "myvocab",
    prediction_col = "pcol",
    probability_col = "prcol",
    raw_prediction_col = "rpcol",
    model_type = "multinomial",
    smoothing = 0.6,
    thresholds = c(0.2, 0.4),
    uid = "nb"))
```
Fit the Pipeline to obtain a PipelineModel:
```r
model <- ml_fit(pipeline, dtrain_spark)
```
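If you want to inspect an individual fitted stage (for example the fitted Naive Bayes model, identified by the "nb" uid above), ml_stage can pull it out of the PipelineModel. A sketch; the exact fields exposed on the stage object may vary across sparklyr versions:

```r
# Extract the fitted NaiveBayesModel stage by its uid
nb_model <- ml_stage(model, "nb")

# Fitted parameters (log priors and log conditional probabilities)
nb_model$pi
nb_model$theta
```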
Transform, and apply one of the available Evaluators:
```r
ml_transform(model, dtrain_spark) %>%
  ml_binary_classification_evaluator(
    label_col = "class",
    raw_prediction_col = "rpcol",
    metric_name = "areaUnderROC")
```

```
[1] 1
```
or
```r
evaluator <- ml_multiclass_classification_evaluator(
  sc,
  label_col = "class",
  prediction_col = "pcol",
  metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
```

```
[1] 1
```
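Since the question asks about accuracy specifically: MulticlassClassificationEvaluator also supports "accuracy" as a metric name (a sketch, assuming Spark >= 2.0):

```r
# In-sample accuracy, computed from the prediction column
ml_transform(model, dtrain_spark) %>%
  ml_multiclass_classification_evaluator(
    label_col = "class",
    prediction_col = "pcol",
    metric_name = "accuracy")
```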
Even more importantly, how can I use this trained model to predict new values, say, in the following Spark test data frame?
Use either ml_transform or ml_predict (the latter is a convenience wrapper, which applies further transformations to the output):
```r
ml_transform(model, dtest_spark)
```

```
# Source:   table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1
```
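With ml_predict the call is analogous (a sketch; the exact set of additional output columns depends on the sparklyr version):

```r
# ml_predict applies the fitted pipeline and post-processes the output
ml_predict(model, dtest_spark) %>%
  select(text, pcol)
```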
Cross validation:
There is not enough data in this example, but you can cross-validate and tune hyperparameters as shown below:
```r
# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),  # note that the name matches the uid
  evaluator = evaluator)
```
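Passing data eagerly, as above, returns a fitted CrossValidatorModel, whose tuning results you can then inspect. A sketch, assuming ml_validation_metrics is available (sparklyr >= 0.7):

```r
cv_model <- ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),
  evaluator = evaluator)

# Average metric for each candidate parameter map
ml_validation_metrics(cv_model)

# Best fitted PipelineModel, usable with ml_transform / ml_predict
cv_model$best_model
```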
Notes: