Question
I have a Spark dataframe with a column of character strings formatted as 20/01/2000 (day/month/year).
I'm trying to change it to a date format, so I can use the functions here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions to get only the data I want (for example, to extract months and days).
But it seems the functions only work with other date formats, such as 1970-01-30.
An example:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "spark://XXXX")
df <- data.frame(date = c("20/10/2010", "19/11/2010"))
df_tbl <- copy_to(sc, df, "df")
If I want to extract only the month in a new column:
df_tbl <- df_tbl %>% mutate(month = month(date))
I get:
> df_tbl %>% glimpse()
Observations: 2
Variables: 2
$ date <chr> "20/10/2010", "19/11/2010"
$ month <int> NA, NA
Since R's as.Date() function doesn't work here, I'd have to use another tool.
Any clues?
Answer 1:
As already figured out, this fails because 19/11/2010
is not an accepted date format. In Spark 2.2 or later you can:
df_tbl %>% mutate(month = month(to_date(date, "dd/MM/yyyy")))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date month
# <chr> <int>
# 1 20/10/2010 10
# 2 19/11/2010 11
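The same approach covers the other fields the question mentions; a minimal sketch (assuming Spark 2.2+ and the df_tbl from above; the column name parsed is just illustrative) that keeps the parsed date and extracts month and day in one pass:
df_tbl %>%
  mutate(parsed = to_date(date, "dd/MM/yyyy"),  # proper Spark date column
         month  = month(parsed),                # 10, 11
         day    = dayofmonth(parsed))           # 20, 19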
In 2.1 or before:
df_tbl %>%
  mutate(month = month(from_unixtime(unix_timestamp(date, "dd/MM/yyyy"))))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date month
# <chr> <int>
# 1 20/10/2010 10
# 2 19/11/2010 11
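The same extension works in 2.1; a hedged sketch that materializes the parsed timestamp once, so the format string isn't repeated for every extracted field (parsed is again just an illustrative name):
df_tbl %>%
  mutate(parsed = from_unixtime(unix_timestamp(date, "dd/MM/yyyy"))) %>%  # "yyyy-MM-dd HH:mm:ss" string
  mutate(month = month(parsed),
         day   = dayofmonth(parsed))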
And for formatting alone:
df_tbl %>%
  mutate(formatted = from_unixtime(
    unix_timestamp(date, "dd/MM/yyyy"), "dd-MM-yyyy"))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date formatted
# <chr> <chr>
# 1 20/10/2010 20-10-2010
# 2 19/11/2010 19-11-2010
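If you want to check what Hive SQL dplyr actually generates for any of these pipelines, show_query() prints the translated query without executing it:
df_tbl %>%
  mutate(month = month(to_date(date, "dd/MM/yyyy"))) %>%
  show_query()
# prints the SELECT with the translated month(to_date(...)) expression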
Answer 2:
sparklyr doesn't yet support a date column type.
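Given that limitation, one workaround (only sensible for data small enough to fit in memory) is to keep the column as a string in Spark and convert after collecting into R; a minimal sketch using the df_tbl from the question:
local_df <- df_tbl %>% collect()                             # pull rows into an R data frame
local_df$date <- as.Date(local_df$date, format = "%d/%m/%Y") # now a plain R Date column
format(local_df$date, "%m")                                  # "10" "11"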
Answer 3:
You may be able to use Hive-defined functions (Spark SQL is based on Hive) to accomplish this; see: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
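sparklyr passes function calls it doesn't recognize through to Spark SQL unchanged, so Hive UDFs such as date_format can be used directly inside mutate(); a short sketch building on the answers above:
df_tbl %>%
  mutate(month_name = date_format(
    from_unixtime(unix_timestamp(date, "dd/MM/yyyy")), "MMMM"))  # "October", "November"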
Source: https://stackoverflow.com/questions/45492203/sparklyr-changing-date-format-in-spark