sparklyr

Sparklyr: how to explode a list column into their own columns in Spark table?

Submitted by 那年仲夏 on 2019-12-09 01:38:19
Question: My question is similar to the one here, but I'm having trouble implementing the answer and I cannot comment in that thread. I have a large CSV file containing nested data: two columns separated by whitespace (say the first column is Y and the second is X), where column X is itself a comma-separated list of values:

21.66 2.643227,1.2698358,2.6338573,1.8812188,3.8708665,...
35.15 3.422151,-0.59515584,2.4994135,-0.19701914,4.0771823,...
15.22 2.8302398,1.9080592,-0.68780196,3
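A minimal sketch of one way to split the comma-separated X column into separate columns, assuming the file has already been read into a Spark table with columns Y and X, and assuming X always holds the same number of values (5 here, purely for illustration); the file path and table name are placeholders:

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
tbl <- spark_read_csv(sc, name = "raw", path = "data.csv", delimiter = " ",
                      header = FALSE, infer_schema = FALSE,
                      columns = c(Y = "double", X = "character"))

exploded <- tbl %>%
  # Spark SQL's split() turns the string into an array column
  mutate(X_arr = split(X, ",")) %>%
  # sdf_separate_column() expands the array into one column per element
  sdf_separate_column("X_arr", into = paste0("X", 1:5)) %>%
  select(-X, -X_arr)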

How can I use spark_apply() to generate combinations using combn()

Submitted by 时光怂恿深爱的人放手 on 2019-12-08 06:11:38
Question: I would like to use Spark to generate the output of the combn() function for a relatively large list of inputs (around 200) and for varying values of m (2-5), but I am having trouble including this in spark_apply(). A minimal working example of my current approach (based on this):

names_df <- data.frame(name = c("Alice", "Bob", "Cat"), types = c("Human", "Human", "Animal"))
combn(names_df$name, 2)

name_tbl <- sdf_copy_to(sc = sc, x = names_df, name = "name_table")
name_tbl %>% select(name) %>% spark_apply
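A minimal sketch of one possible approach, assuming the whole name list fits in a single partition (hence sdf_repartition(1)) so that combn() sees every name at once; the combination size m is shipped to the workers through spark_apply()'s context argument, and m = 2 is an assumption for illustration:

library(sparklyr)
library(dplyr)

pairs <- name_tbl %>%
  select(name) %>%
  sdf_repartition(1) %>%
  spark_apply(
    function(df, ctx) {
      # combn() returns a matrix; transpose it into a data frame of combinations
      as.data.frame(t(combn(as.character(df$name), ctx$m)),
                    stringsAsFactors = FALSE)
    },
    context = list(m = 2),
    columns = c("name1", "name2")
  )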

Sparklyr's spark_apply function seems to run on single executor and fails on moderately-large dataset

Submitted by こ雲淡風輕ζ on 2019-12-08 02:55:57
Question: I am trying to use spark_apply to run the R function below on a Spark table. This works fine if my input table is small (e.g. 5,000 rows), but when the table is moderately large (e.g. 5,000,000 rows) it throws an error after about 30 minutes:

sparklyr worker rscript failure, check worker logs for details

Looking at the Spark UI shows that only a single task is ever created, and a single executor is applied to that task. Can anyone advise on why this function fails for the 5-million-row dataset? Could the problem be that a single executor is being made to do all the work, and failing?
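A minimal sketch of one common remedy, assuming the single task is caused by the input arriving in a single partition: explicitly repartition the table before calling spark_apply() so the work is spread across executors. The partition count (100) and the table name input_tbl are assumptions for illustration:

library(sparklyr)
library(dplyr)

result <- input_tbl %>%
  sdf_repartition(partitions = 100) %>%
  spark_apply(function(df) {
    # the per-partition R function from the question would go here
    df
  })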

Importing cassandra table into spark via sparklyr - possible to select only some columns?

Submitted by て烟熏妆下的殇ゞ on 2019-12-07 15:44:22
Question: I've been working with sparklyr to bring large Cassandra tables into Spark, register them with R, and run dplyr operations on them. I have been successfully importing Cassandra tables with code that looks like this:

# import cassandra table into spark
cass_df <- sparklyr:::spark_data_read_generic(
  sc,
  "org.apache.spark.sql.cassandra",
  "format",
  list(keyspace = "cass_keyspace", table = "cass_table")
) %>%
  invoke("load")

# register table in R
cass_tbl <- sparklyr:::spark_partition
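A minimal sketch of one way to restrict the columns, assuming cass_df has been loaded as above: register the DataFrame with sdf_register() and prune with dplyr::select(), relying on Spark's Catalyst optimizer to push the column pruning down to the Cassandra source. The column names col_a and col_b are hypothetical:

library(sparklyr)
library(dplyr)

cass_tbl <- sdf_register(cass_df, name = "cass_table")

subset_tbl <- cass_tbl %>%
  select(col_a, col_b) %>%    # only these columns are requested from Cassandra
  compute("cass_subset")      # optionally materialize the pruned table in Spark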

sparklyr: create new column with mutate function

Submitted by 淺唱寂寞╮ on 2019-12-07 14:58:52
Question: I would be very surprised if this kind of problem cannot be solved with sparklyr:

iris_tbl <- copy_to(sc, aDataFrame)

# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
  ...
  aDataFrame %>% mutate(newValue = gsub("-", "", d))
  ...
}

I receive this error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database
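A minimal sketch of two workarounds, assuming (as in the snippet above) that the mutate runs on the Spark table: gsub() is an R function with no Spark SQL translation, so either do the substitution locally in R and inject the result as a literal with !!, or use Spark SQL's own regexp_replace() when the target is a Spark column. The column name date_col in the second option is hypothetical:

library(sparklyr)
library(dplyr)

# Option 1: d never depends on a Spark column, so compute the value in R
for (d in date_vector) {
  d_clean  <- gsub("-", "", d)                        # evaluated locally in R
  iris_tbl <- iris_tbl %>% mutate(newValue = !!d_clean)  # injected as a literal
}

# Option 2: substitute inside a Spark column; regexp_replace() is passed
# straight through to Spark SQL by sparklyr
iris_tbl %>% mutate(newValue = regexp_replace(date_col, "-", ""))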

Specifying col type in Sparklyr (spark_read_csv)

Submitted by 放肆的年华 on 2019-12-07 08:30:54
Question: I am reading a CSV into Spark using sparklyr:

schema <- structType(structField("TransTime", "array<timestamp>", TRUE),
                     structField("TransDay", "Date", TRUE))
spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema = schema)

but I get:

Error: could not find function "structType"

How do I specify column types using spark_read_csv? Thanks in advance.

Answer 1: The structType function comes from Scala's Spark API; in sparklyr you specify the data types by passing them in the "columns" argument as a
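A minimal sketch of the columns-based approach, assuming a flat timestamp and date column rather than the array type shown above; reading both fields as character and casting afterwards with Spark SQL's to_timestamp()/to_date() (Spark 2.2+) is a conservative route, and the table name and path are placeholders:

library(sparklyr)
library(dplyr)

df <- spark_read_csv(
  sc,
  name         = "transactions",
  path         = "path",
  infer_schema = FALSE,
  columns      = c(TransTime = "character", TransDay = "character")
)

df <- df %>%
  mutate(
    TransTime = to_timestamp(TransTime),  # Spark SQL functions, passed through
    TransDay  = to_date(TransDay)
  )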

change string in DF using hive command and mutate with sparklyr

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-06 16:28:32
Using the Hive command regexp_extract, I am trying to change strings like 201703170455 into 2017-03-17:04:55, and strings like 2017031704555675 into 2017-03-17:04:55.0010. I am doing this in sparklyr, trying to use code that works with gsub in R:

newdf <- df %>% mutate(Time1 = regexp_extract(Time, "(....)(..)(..)(..)(..)", "\\1-\\2-\\3:\\4:\\5"))

and this code:

newdf <- df %>% mutate(TimeTrans = regexp_extract("(....)(..)(..)(..)(..)(....)", "\\1-\\2-\\3:\\4:\\5.\\6"))

but neither works at all. Any suggestions on how to do this with regexp_extract?

Apache Spark uses Java regular expression
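A minimal sketch for the first transformation above: because Spark uses Java regular expressions, capture-group references in the replacement string are written $1, $2, ... rather than \\1, and the substitution is done with regexp_replace(), which sparklyr passes straight through to Spark SQL; the 16-digit variant would follow the same pattern with an extra group:

library(dplyr)

newdf <- df %>%
  mutate(
    # 201703170455 -> 2017-03-17:04:55
    Time1 = regexp_replace(Time, "^(.{4})(.{2})(.{2})(.{2})(.{2})$",
                           "$1-$2-$3:$4:$5")
  )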

How to pass variables to functions called in spark_apply()?

Submitted by 痴心易碎 on 2019-12-06 13:57:07
Question: I would like to be able to pass extra variables to functions that are called by spark_apply in sparklyr. For example:

# setup
library(sparklyr)
sc <- spark_connect(master = 'local', packages = TRUE)
iris2 <- iris[, 1:(ncol(iris) - 1)]
df1 <- sdf_copy_to(sc, iris2, repartition = 5, overwrite = T)

# This works fine
res <- spark_apply(df1, function(x) kmeans(x, 3)$centers)

# This does not
k <- 3
res <- spark_apply(df1, function(x) kmeans(x, k)$centers)

As an ugly workaround, I can do what I want by
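A minimal sketch using spark_apply()'s context argument, which serializes extra R objects and ships them to the workers; when the closure takes a second parameter, it receives that context:

res <- spark_apply(
  df1,
  function(df, ctx) kmeans(df, ctx$k)$centers,  # k arrives via the context list
  context = list(k = 3)
)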

How to aggregate Data by 3 minutes timestamps in sparklyr?

Submitted by 北城余情 on 2019-12-06 11:24:30
Question: I am using sparklyr for some quick analysis and am having some issues working with timestamps. I have two different data frames: one with rows at a 1-minute interval and another at a 3-minute interval.

First dataset (1-minute interval):

id timefrom              timeto                value
10 "2017-06-06 10:30:00" "2017-06-06 10:31:00" 50
10 "2017-06-06 10:31:00" "2017-06-06 10:32:00" 80
10 "2017-06-06 10:32:00" "2017-06-06 10:33:00" 20
22 "2017-06-06 10:33:00" "2017-06-06 10:34:00" 30
22 "2017-06-06 10:34:00" "2017-06-06
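A minimal sketch of one way to bucket the 1-minute rows into 3-minute windows, assuming the first table is df1 and that summing value per id and window start is the aggregation wanted; unix_timestamp() and from_unixtime() are Spark SQL functions that sparklyr passes through:

library(sparklyr)
library(dplyr)

agg <- df1 %>%
  # floor the epoch seconds to the nearest 180-second (3-minute) boundary
  mutate(bucket = floor(unix_timestamp(timefrom) / 180) * 180) %>%
  mutate(window_start = from_unixtime(bucket)) %>%
  group_by(id, window_start) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  arrange(id, window_start)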
