How to implement Stanford CoreNLP wrapper for Apache Spark using sparklyr?

廉价感情. Submitted 2019-12-05 08:14:28
user6910411

com.databricks.spark.corenlp.functions is an object, not a class, so trying to call its constructor is meaningless. That is essentially what the error message says:

Error: java.lang.Exception: No matched constructor found for class com.databricks.spark.corenlp.functions

Instead you should access defined functions using invoke_static, for example:

invoke_static(sc, "com.databricks.spark.corenlp.functions", "cleanxml")
<jobj[15]>
org.apache.spark.sql.expressions.UserDefinedFunction
UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

With example data borrowed from the official README:

df <- copy_to(sc, tibble(
  id = 1,
  text = "<xml>Stanford University is located in California. It is a great university.</xml>"
))

you can define a wrapper like this:

sdf_cleanxml <- function(df, input_col, output_col) {
  sc <- df$src$con
  # Fetch the cleanxml UDF from the companion object
  clean_xml <- invoke_static(sc, "com.databricks.spark.corenlp.functions", "cleanxml")
  # Build a Column expression equivalent to cleanxml(col(input_col))
  arg <- list(invoke_static(sc, "org.apache.spark.sql.functions", "col", input_col))
  expr <- invoke(clean_xml, "apply", arg)
  df %>%
    spark_dataframe() %>%
    invoke("withColumn", output_col, expr) %>%
    sdf_register()
}

and invoke it as follows:

sdf_cleanxml(df, "text", "text_clean")
# Source: spark<?> [?? x 3]
    id text                                 text_clean                         
  <dbl> <chr>                                <chr>                              
1     1 <xml>Stanford University is located… Stanford University is located in …
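The same pattern generalizes to any of the functions exported by the object, so instead of hand-writing one wrapper per function you could use a small factory. This is only a sketch; sdf_corenlp_wrapper is a hypothetical helper, not part of sparklyr:

```r
# Hypothetical factory: given the name of a function exported by
# com.databricks.spark.corenlp.functions (e.g. "tokenize", "ner"),
# return an sdf_*-style wrapper like sdf_cleanxml above.
sdf_corenlp_wrapper <- function(fun_name) {
  function(df, input_col, output_col) {
    sc <- df$src$con
    # Fetch the UDF from the companion object
    udf <- sparklyr::invoke_static(
      sc, "com.databricks.spark.corenlp.functions", fun_name
    )
    # Build a Column expression equivalent to fun_name(col(input_col))
    arg <- list(sparklyr::invoke_static(
      sc, "org.apache.spark.sql.functions", "col", input_col
    ))
    expr <- sparklyr::invoke(udf, "apply", arg)
    sdf <- sparklyr::spark_dataframe(df)
    sparklyr::sdf_register(
      sparklyr::invoke(sdf, "withColumn", output_col, expr)
    )
  }
}

sdf_tokenize <- sdf_corenlp_wrapper("tokenize")
```

after which sdf_tokenize(df, "text_clean", "words") would behave analogously to sdf_cleanxml.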

In practice, though, it might be simpler to just register the required functions:

register_core_nlp <- function(sc) {
  funs <- c(
    "cleanxml", "tokenize", "ssplit", "pos", "lemma", "ner", "depparse",
    "coref", "natlog", "openie", "sentiment"
  )
  udf_registration <- sparklyr::invoke(sparklyr::spark_session(sc), "udf")
  for (fun in funs) {
    # Register each UDF under its own name so it is visible to Spark SQL
    sparklyr::invoke(
      udf_registration, "register", fun,
      sparklyr::invoke_static(sc, "com.databricks.spark.corenlp.functions", fun)
    )
  }
}

register_core_nlp(sc)

and let SQL translation do the rest:

df %>% 
  transmute(doc = cleanxml(text)) %>%
  transmute(sen = explode(ssplit(doc))) %>%
  mutate(words = tokenize(sen), ner_tags = ner(sen), sentiment = sentiment(sen))
# Source: spark<?> [?? x 4]
  sen                                            words      ner_tags   sentiment
  <chr>                                          <list>     <list>         <int>
1 Stanford University is located in California . <list [7]> <list [7]>         1
2 It is a great university .                     <list [6]> <list [6]>         4
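All of the above assumes that spark-corenlp and the Stanford CoreNLP models are on the classpath when the connection is opened. One way to arrange that with sparklyr is via the sparklyr.shell.* options, which map to spark-submit flags; the coordinates and path below are illustrative, so match them to your Spark and Scala versions:

```r
library(sparklyr)

config <- spark_config()
# Illustrative package coordinates; pick the spark-corenlp build that
# matches your Spark/Scala versions (passed to spark-submit as --packages).
config$sparklyr.shell.packages <- c(
  "databricks:spark-corenlp:0.4.0-spark2.4-scala2.11",
  "edu.stanford.nlp:stanford-corenlp:3.9.2"
)
# The CoreNLP models ship as a separate (large) jar; point sparklyr at a
# local copy (passed to spark-submit as --jars). The path is a placeholder.
config$sparklyr.shell.jars <- "/path/to/stanford-corenlp-models.jar"

sc <- spark_connect(master = "local", config = config)
```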