sparklyr

Converting string/chr to date using sparklyr

Submitted by 别说谁变了你拦得住时间么 on 2019-12-11 07:06:48
Question: I've brought a table into Hue which has a column of dates, and I'm trying to work with it using sparklyr in RStudio. I'd like to convert a character column into a date column like so:

Weather_data = mutate(Weather_data, date2 = as.Date(date, "%m/%d/%Y"))

This runs fine, but when I check with head(Weather_data) the result is not what I expect. How do I properly convert the chr column to dates? Thanks!

Answer 1: The problem is that sparklyr doesn't correctly support Spark DateType. It is possible to parse dates and correct the format, but …
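A sketch of one possible approach (not necessarily the rest of the answer above): since as.Date() is not translated for remote tables, parse the string with Hive SQL functions, which sparklyr passes through to Spark untouched. The local connection and the tiny sample frame are only for illustration.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    weather_tbl <- copy_to(sc, data.frame(date = c("01/15/2000", "03/02/2001")), "weather")

    # unix_timestamp() parses the m/d/Y string, from_unixtime() renders it as
    # yyyy-MM-dd HH:mm:ss, and to_date() keeps only the date part.
    weather_tbl %>%
      mutate(date2 = to_date(from_unixtime(unix_timestamp(date, "MM/dd/yyyy"))))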

How to use spark_apply to change NaN values?

Submitted by て烟熏妆下的殇ゞ on 2019-12-11 06:59:48
Question: After using sdf_pivot I was left with a huge number of NaN values, so in order to proceed with my analysis I need to replace the NaN with 0. I tried this:

data <- data %>% spark_apply(function(e) ifelse(is.nan(e), 0, e))

which generates the following error:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : cannot open file 'C:\.........\file18dc5a1c212e_spark.log': Permission denied

I'm using Spark 2.2.0 and the latest version of …
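A sketch of two alternatives that avoid the spark_apply round trip entirely; the local connection and toy data are assumptions for illustration, and whether the missing entries surface as NULL or NaN depends on how the pivot was built.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    sdf <- copy_to(sc, data.frame(id = c(1, 2), value = c(NA, 3)), "sdf")

    # Option 1: per-column replacement, translated to Spark SQL.
    sdf %>% mutate(value = ifelse(is.na(value), 0, value))

    # Option 2: fill every numeric column at once via Spark's DataFrameNaFunctions,
    # which also covers NaN values in double columns.
    sdf %>%
      spark_dataframe() %>%
      invoke("na") %>%
      invoke("fill", 0) %>%
      sdf_register("sdf_filled")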

How to refer to a Spark DataFrame by name in sparklyr and assign it to a variable?

Submitted by 馋奶兔 on 2019-12-11 06:55:16
Question: Say I ran the following code, forgot to assign the Spark DataFrame iris to a variable in R, and can't use .Last.value to recover it because I ran some other code right after copying the data to Spark.

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, iris)
2 + 2  # ran some other code, so can't use .Last.value

How do I assign the Spark DataFrame "iris" to a variable in R called iris_tbl?

Answer 1: copy_to provides an additional name argument. By default it is set to …
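A short sketch of recovering the reference by name with dplyr::tbl(); the local connection is just for illustration.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    copy_to(sc, iris)        # registered under the default name "iris"
    2 + 2                    # .Last.value is gone

    # Any table registered with the connection can be re-referenced by name.
    iris_tbl <- tbl(sc, "iris")

    src_tbls(sc)             # lists every table currently registered on sc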

How to remove '\' from a string in sparklyr

Submitted by 前提是你 on 2019-12-11 03:47:46
Question: I am using sparklyr and have a Spark DataFrame with a column word that contains words, some of which contain special characters that I want to remove. I was successful using regexp_replace with \\\\ before each special character, like this:

words.sdf <- words.sdf %>%
  mutate(word = regexp_replace(word, '\\\\(', '')) %>%
  mutate(word = regexp_replace(word, '\\\\)', '')) %>%
  mutate(word = regexp_replace(word, '\\\\+', '')) %>%
  mutate(word = regexp_replace(word, '\\\\?', '')) %>%
  mutate(word = …
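A sketch of collapsing the repeated calls into a single pattern, with a toy table standing in for words.sdf: inside a regex character class the metacharacters are literal, so one regexp_replace() suffices and no quadruple backslashes are needed.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    words.sdf <- copy_to(sc, data.frame(word = c("a(b)", "c+d?", "plain")), "words")

    # '[()+?]' matches any of the four characters; inside [...] they need no escaping.
    words.sdf <- words.sdf %>%
      mutate(word = regexp_replace(word, '[()+?]', ''))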

Sparklyr split string (to string)

Submitted by 柔情痞子 on 2019-12-11 02:40:19
Question: I am trying to split a string in sparklyr and then use it for joins/filtering. I tried the suggested approach of tokenizing the string and then separating it into new columns. Here is a reproducible example (note that I have to convert my NA, which turns into the string "NA" after copy_to, back to a real NA — is there a way to avoid having to do that?):

x <- data.frame(Id = c(1, 2, 3, 4), A = c('A-B', 'A-C', 'A-D', NA))
df <- copy_to(sc, x, 'df')
df %>%
  mutate(A = ifelse(A == 'NA', NA, A)) %>%
  ft_regex_tokenizer(input.col = "A", output.col = "B…
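A sketch of one possible route that skips the tokenizer, assuming the same toy data: Hive's split() builds an array column and sdf_separate_column() spreads it into ordinary columns (function availability and argument names may vary across sparklyr versions).

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    x  <- data.frame(Id = c(1, 2, 3, 4), A = c('A-B', 'A-C', 'A-D', NA))
    df <- copy_to(sc, x, 'df', overwrite = TRUE)

    df %>%
      mutate(A = ifelse(A == 'NA', NA, A)) %>%   # copy_to stored the missing value as the string "NA"
      mutate(B = split(A, '-')) %>%              # Hive split() -> array<string> column
      sdf_separate_column("B", into = c("B1", "B2"))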

Sparklyr - Changing date format in Spark

Submitted by 允我心安 on 2019-12-11 01:28:56
Question: I have a Spark DataFrame with a column of characters such as 20/01/2000 (day/month/year). I'm trying to change it to a date format so that I can use the functions listed at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions to extract only the parts I want (months and days, for example). But the functions only seem to work when the dates are in other formats, such as 1970-01-30. An example:

sc <- spark_connect(master = "spark://XXXX") …
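A sketch of the unix_timestamp/to_date pattern adapted to the day/month/year layout; the local connection and sample data are placeholders for the cluster in the question.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    dates_tbl <- copy_to(sc, data.frame(d = c("20/01/2000", "05/11/1999")), "dates")

    # Once d2 is a real DateType column, the Hive date functions from the linked
    # manual (month(), day(), ...) behave as expected.
    dates_tbl %>%
      mutate(d2 = to_date(from_unixtime(unix_timestamp(d, "dd/MM/yyyy")))) %>%
      mutate(m = month(d2), dd = day(d2))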

How to use spark_apply_bundle

Submitted by 这一生的挚爱 on 2019-12-11 00:59:10
Question: I am trying to use spark_apply_bundle to limit the number of packages and the amount of data transferred to the worker nodes on a YARN-managed cluster. As mentioned here, I must pass the path of the tarball to spark_apply as the packages argument, and I must also make it available via "sparklyr.shell.files" in the Spark config. My questions are:

Can the path to the tarball be relative to the project's working directory? If not, should it be stored on HDFS or somewhere else?
What should be passed to …
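A sketch of how the pieces can fit together, under the assumption of a YARN client connection; the same bundle path returned by spark_apply_bundle() is reused in the config and in the packages argument.

    library(sparklyr)

    # spark_apply_bundle() tars up the local R packages and returns the tarball path.
    bundle_path <- spark_apply_bundle()

    config <- spark_config()
    # Ship the tarball with the application; on a YARN cluster an hdfs:// URI
    # pointing at a pre-uploaded copy can also be used here.
    config[["sparklyr.shell.files"]] <- bundle_path

    sc <- spark_connect(master = "yarn-client", config = config)

    result <- spark_apply(
      sdf_len(sc, 10),
      function(df) df,
      packages = bundle_path   # point spark_apply at the same bundle
    )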

sparklyr - Including null values in an Apache Spark Join

Submitted by 送分小仙女□ on 2019-12-10 23:08:38
Question: The question "Including null values in an Apache Spark Join" has answers for Scala, PySpark and SparkR, but not for sparklyr. I've been unable to figure out how to have inner_join in sparklyr treat null values in a join column as equal. Does anyone know how this can be done in sparklyr?

Answer 1: You can invoke an implicit cross join:

#' Return a Cartesian product of Spark tables
#'
#' @param df1 tbl_spark
#' @param df2 tbl_spark
#' @param explicit logical If TRUE use crossJoin otherwise
#' join …
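Apart from the cross-join route the answer starts to describe, Spark SQL's null-safe equality operator <=> can express the join directly; a minimal sketch assuming two toy tables.

    library(sparklyr)
    library(dplyr)
    library(DBI)

    sc <- spark_connect(master = "local")
    copy_to(sc, data.frame(k = c("a", NA), v1 = c(1, 2)), "d1")
    copy_to(sc, data.frame(k = c("a", NA), v2 = c(3, 4)), "d2")

    # <=> treats NULL <=> NULL as TRUE, so the NA rows match each other.
    dbGetQuery(sc, "SELECT * FROM d1 JOIN d2 ON d1.k <=> d2.k")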

Unnest (separate) multiple column values into new rows using Sparklyr

Submitted by 懵懂的女人 on 2019-12-10 07:45:14
Question: I am trying to split column values separated by commas into new rows based on IDs. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.

id <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
name <- c("A,B,C", "B,F", "C", "D,R,P", "E", "A,Q,W", "B,J", "C", "D,M", "E,X", "F,E")
value <- c("1,2,3", "2,4,43,2", "3,1,2,3", "1", "1,2", "26,6,7", "3,3,4", "1", "1,12", "2,3,3", "3")
dt <- data.frame(id, name, value)

R solution:

separate_rows(dt, name, sep = ",") %>%
  separate_rows(value, sep = ",…
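For a single column, a sketch using Spark SQL's split() and explode(), both of which sparklyr passes through inside mutate(); unnesting name and value in lockstep would additionally need something like posexplode() or arrays_zip(), which is left out here.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    dt <- data.frame(id = c(1, 1), name = c("A,B,C", "B,F"), value = c("1,2,3", "2,4"),
                     stringsAsFactors = FALSE)
    dt_tbl <- copy_to(sc, dt, "dt", overwrite = TRUE)

    # split() builds an array column; explode() generates one row per element.
    dt_tbl %>%
      mutate(name = explode(split(name, ",")))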

How do I configure driver memory when running Spark in local mode via Sparklyr?

Submitted by 泪湿孤枕 on 2019-12-10 00:19:58
Question: I am using sparklyr to run a Spark application in local mode on a virtual machine with 244GB of RAM. In my code I use spark_read_csv() to read ~50MB of csvs from one folder and then ~1.5GB of csvs from a second folder. My issue is that the application throws an error when trying to read the second folder. As I understand it, the problem is that the default RAM available to the driver JVM is 512MB, which is too small for the second folder (in local mode all operations are run within the driver JVM …
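A sketch of raising the driver memory before connecting; the 32g figure and the folder path are placeholders.

    library(sparklyr)

    config <- spark_config()
    # In local mode everything runs inside the driver JVM, so this is the setting
    # that lifts the 512 MB default.
    config[["sparklyr.shell.driver-memory"]] <- "32g"

    sc <- spark_connect(master = "local", config = config)

    big_tbl <- spark_read_csv(sc, name = "big_csvs",
                              path = "path/to/second_folder/*.csv",   # hypothetical path
                              memory = FALSE)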