Question
I have a Spark dataframe with a column of character strings formatted as 20/01/2000 (day/month/year).
I'm trying to change it to a date format, so I can use the functions here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions to get only the data I want (for example, to extract months and days).
But it seems the functions only work with other date formats, such as 1970-01-30.
An example:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "spark://XXXX")
df <- data.frame(date = c("20/10/2010", "19/11/2010"))
df_tbl <- copy_to(sc, df, "df")
If I want to extract only the month in a new column:
df_tbl <- df_tbl %>% mutate(month = month(date))
I get:
> df_tbl %>% glimpse()
Observations: 2
Variables: 2
$ date <chr> "20/10/2010", "19/11/2010"
$ month <int> NA, NA
Since R's as.Date() function doesn't work here, I'd have to use another tool.
Any clues?
Answer 1:
As already figured out, this fails because 19/11/2010
is not an accepted date format. In Spark 2.2 or later you can:
df_tbl %>% mutate(month = month(to_date(date, "dd/MM/yyyy")))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date month
# <chr> <int>
# 1 20/10/2010 10
# 2 19/11/2010 11
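The same approach covers the other fields the question mentions; a minimal sketch (assuming Spark 2.2+ and the df_tbl from above; the column name parsed is just illustrative) that keeps the parsed date and extracts month and day in one pass:
df_tbl %>%
  mutate(parsed = to_date(date, "dd/MM/yyyy"),  # proper Spark date column
         month  = month(parsed),                # 10, 11
         day    = dayofmonth(parsed))           # 20, 19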
In 2.1 or before:
df_tbl %>%
  mutate(month = month(from_unixtime(unix_timestamp(date, "dd/MM/yyyy"))))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date month
# <chr> <int>
# 1 20/10/2010 10
# 2 19/11/2010 11
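The same extension works in 2.1; a hedged sketch that materializes the parsed timestamp once, so the format string isn't repeated for every extracted field (parsed is again just an illustrative name):
df_tbl %>%
  mutate(parsed = from_unixtime(unix_timestamp(date, "dd/MM/yyyy"))) %>%  # "yyyy-MM-dd HH:mm:ss" string
  mutate(month = month(parsed),
         day   = dayofmonth(parsed))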
And for formatting alone:
df_tbl %>%
  mutate(formatted = from_unixtime(
    unix_timestamp(date, "dd/MM/yyyy"), "dd-MM-yyyy"))
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# date formatted
# <chr> <chr>
# 1 20/10/2010 20-10-2010
# 2 19/11/2010 19-11-2010
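If you want to check what Hive SQL dplyr actually generates for any of these pipelines, show_query() prints the translated query without executing it:
df_tbl %>%
  mutate(month = month(to_date(date, "dd/MM/yyyy"))) %>%
  show_query()
# prints the SELECT with the translated month(to_date(...)) expression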
Answer 2:
sparklyr doesn't yet support a date column type.
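Given that limitation, one workaround (only sensible for data small enough to fit in memory) is to keep the column as a string in Spark and convert after collecting into R; a minimal sketch using the df_tbl from the question:
local_df <- df_tbl %>% collect()                             # pull rows into an R data frame
local_df$date <- as.Date(local_df$date, format = "%d/%m/%Y") # now a plain R Date column
format(local_df$date, "%m")                                  # "10" "11"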
Answer 3:
You may be able to use Hive-defined functions (Spark SQL is based on Hive) to accomplish this; see: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
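sparklyr passes function calls it doesn't recognize through to Spark SQL unchanged, so Hive UDFs such as date_format can be used directly inside mutate(); a short sketch building on the answers above:
df_tbl %>%
  mutate(month_name = date_format(
    from_unixtime(unix_timestamp(date, "dd/MM/yyyy")), "MMMM"))  # "October", "November"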
Source: https://stackoverflow.com/questions/45492203/sparklyr-changing-date-format-in-spark