Sparklyr - Changing date format in Spark

Submitted by 允我心安 on 2019-12-11 01:28:56

Question


I have a Spark dataframe with a character column that holds dates such as 20/01/2000 (day/month/year).

I'm trying to convert it to a date type so that I can use the functions listed here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions to get only the data I want (for example, to extract months and days).

But it seems those functions only work with other date formats, such as 1970-01-30.

An example:

sc <- spark_connect(master = "spark://XXXX")
df <- data.frame(date = c("20/10/2010", "19/11/2010"))
df_tbl <- copy_to(sc, df, "df")

If I want to extract only the month in a new column:

df_tbl <- df_tbl %>% mutate(month = month(date))

I get:

> df_tbl %>% glimpse()
Observations: 2
Variables: 2
$ date  <chr> "20/10/2010", "19/11/2010"
$ month <int> NA, NA

Since R's as.Date() function doesn't work here, I'd have to use another tool.

Any clues?


Answer 1:


As you already figured out, this fails because 19/11/2010 is not an accepted date format. In Spark 2.2 or later you can pass the format to to_date:

df_tbl %>% mutate(month = month(to_date(date, "dd/MM/yyyy")))

# # Source:   lazy query [?? x 2]
# # Database: spark_connection
#   date       month
#   <chr>      <int>
# 1 20/10/2010    10
# 2 19/11/2010    11
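
If several date parts are needed, one option is to parse the string once into a date column and derive the parts from it. The following is only a sketch and not part of the original answer: it assumes Spark 2.2 or later, a local Spark installation reachable with master = "local", and the names parsed, day, year and parts are made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; replace with your own master URL.
sc <- spark_connect(master = "local")
df <- data.frame(date = c("20/10/2010", "19/11/2010"), stringsAsFactors = FALSE)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

parts <- df_tbl %>%
  # Parse the dd/MM/yyyy string once (to_date with a format needs Spark 2.2+).
  mutate(parsed = to_date(date, "dd/MM/yyyy")) %>%
  # dayofmonth(), month() and year() are evaluated by Spark SQL, not by R.
  mutate(day = dayofmonth(parsed), month = month(parsed), year = year(parsed)) %>%
  select(date, day, month, year) %>%
  collect()

spark_disconnect(sc)
```

Collecting only the integer parts keeps the result independent of how the date column itself is brought back into R.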

In Spark 2.1 or earlier:

df_tbl %>% 
  mutate(month = month(from_unixtime(unix_timestamp(date, "dd/MM/yyyy"))))

# # Source:   lazy query [?? x 2]
# # Database: spark_connection
#   date       month
#   <chr>      <int>
# 1 20/10/2010    10
# 2 19/11/2010    11

And to change only the format of the string:

df_tbl %>%
  mutate(formatted = from_unixtime(
    unix_timestamp(date, "dd/MM/yyyy"), "dd-MM-yyyy"))

# # Source:   lazy query [?? x 2]
# # Database: spark_connection
#   date       formatted 
#   <chr>      <chr>     
# 1 20/10/2010 20-10-2010
# 2 19/11/2010 19-11-2010
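
On Spark 2.2 or later, the same reformatting can probably also be written with to_date and Spark SQL's date_format, without going through a Unix timestamp. This is a sketch under that assumption, not the answer's own method; the local master and the name reformatted are assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; replace with your own master URL.
sc <- spark_connect(master = "local")
df <- data.frame(date = c("20/10/2010", "19/11/2010"), stringsAsFactors = FALSE)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

reformatted <- df_tbl %>%
  # Parse with the input pattern, then render with the output pattern.
  mutate(formatted = date_format(to_date(date, "dd/MM/yyyy"), "dd-MM-yyyy")) %>%
  collect()

spark_disconnect(sc)
```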



Answer 2:


sparklyr doesn't support the date column type yet.




Answer 3:


You may be able to use Hive-defined functions (Spark SQL is based on Hive) to accomplish this. Please see: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
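
To illustrate the idea: functions that dplyr does not recognise are handed to Spark SQL unchanged, so Hive date functions can be called directly inside mutate(). The sketch below is an assumption-laden example rather than anything from the linked guide: it uses last_day and datediff as sample Hive functions, assumes Spark 2.2 or later for to_date with a format, assumes a local Spark installation, and the names parsed, month_end, days_left, q and res are invented for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; replace with your own master URL.
sc <- spark_connect(master = "local")
df <- data.frame(date = c("20/10/2010", "19/11/2010"), stringsAsFactors = FALSE)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

q <- df_tbl %>%
  mutate(parsed = to_date(date, "dd/MM/yyyy")) %>%
  # last_day() and datediff() are Hive/Spark SQL functions, not R functions.
  mutate(month_end = last_day(parsed),
         days_left = datediff(last_day(parsed), parsed))

# Print the SQL that is sent to Spark; the Hive calls appear in it as written.
show_query(q)

res <- q %>%
  select(date, days_left) %>%
  collect()

spark_disconnect(sc)
```

Calling show_query() before collect() is a cheap way to check which function names reach Spark as-is.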



Source: https://stackoverflow.com/questions/45492203/sparklyr-changing-date-format-in-spark
