sparklyr

Permission Denied - \tmp\hive in sparklyr

Submitted by 为君一笑 on 2019-12-12 15:49:19

Question: I am trying to copy an R data frame to Spark 2.0.1 using the copy_to function, but it fails with:

The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-

I ran winutils.exe to change the permissions, but I still get the same permissions exception:

%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive

I tried other variants of the command, such as:

%HADOOP_HOME%\bin\winutils.exe chmod 777 C:\tmp\hive
%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
%HADOOP_HOME%\bin
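A minimal R sketch of the usual Windows workaround, assuming winutils.exe matches the Hadoop build Spark expects, that R runs from an elevated session, and that C:\hadoop is a placeholder install path; the -R flag and the ls check are additions not shown in the question:

# Placeholder path: point HADOOP_HOME at the folder that contains bin\winutils.exe
Sys.setenv(HADOOP_HOME = "C:\\hadoop")

# Create the scratch dir on the drive the Spark session runs from
dir.create("C:\\tmp\\hive", recursive = TRUE, showWarnings = FALSE)

winutils <- file.path(Sys.getenv("HADOOP_HOME"), "bin", "winutils.exe")

# Grant full permissions recursively, then list them to verify the change took effect
system2(winutils, c("chmod", "-R", "777", "C:\\tmp\\hive"))
system2(winutils, c("ls", "C:\\tmp\\hive"))

Restarting the Spark connection afterwards lets the new session pick up the corrected permissions.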

Importing multiple files in sparklyr

Submitted by 做~自己de王妃 on 2019-12-12 08:56:00

Question: I'm very new to sparklyr and Spark, so please let me know if this is not the "Spark" way to do this. My problem: I have 50+ .txt files of around 300 MB each, all in the same folder, call it x, that I need to import into sparklyr, preferably as one table. I can read them individually like:

spark_read_csv(path=x, sc=sc, name="mydata", delimiter = "|", header=FALSE)

If I were to import them all outside of sparklyr, I would probably create a list with the file names, call it filelist, and then import
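A minimal sketch of one common approach, assuming the folder x contains only the pipe-delimited .txt files and that all files share the same columns: Spark's CSV reader accepts a directory or wildcard path, so a single spark_read_csv() call can load everything into one table.

library(sparklyr)

sc <- spark_connect(master = "local")

# x is the folder path from the question, e.g. "C:/data/txtfiles" (placeholder)
mydata_tbl <- spark_read_csv(
  sc,
  name      = "mydata",
  path      = file.path(x, "*.txt"),   # the wildcard is expanded by Spark, not by R
  delimiter = "|",
  header    = FALSE,
  memory    = FALSE                    # avoid caching ~15 GB up front
)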

dplyr to replace all variables which match a specific string

Submitted by 北城以北 on 2019-12-12 04:06:20

Question: Is there a dplyr equivalent of this? I'm after a 'replace all' that replaces every value matching the string xxx with NA:

is.na(df) <- df=="xxx"

I want to run this as a sparklyr command, piping from R to the Spark data frame with tbl(sc,"df") %>%, but tacking the script above onto the pipe doesn't work.

Answer 1: Replace "XXX" with the string you want to look for:

#Using dplyr piping
library(dplyr)
df[] = df %>% lapply(., function(x) ifelse(grepl("XXX", x), NA, x))

#Using only the base package
df[] = lapply(df, function(x
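The answer above operates on a local data frame. A sketch of the same idea on the Spark side, assuming the columns are strings and that the whole value "xxx" (not a substring) should become NA; mutate_all() is translated to Spark SQL by sparklyr, so nothing is pulled into R:

library(dplyr)
library(sparklyr)

df_spark <- tbl(sc, "df")

# Every cell equal to "xxx" becomes NULL (NA after collect); runs entirely in Spark
df_clean <- df_spark %>%
  mutate_all(~ ifelse(. == "xxx", NA, .))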

Concat_ws() function in Sparklyr is missing

Submitted by ♀尐吖头ヾ on 2019-12-11 20:31:34

Question: I am following a tutorial on web (Adobe) analytics in which I want to build a Markov chain model (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/). The example uses the function concat_ws (supposedly from library(sparklyr)), but the function does not seem to exist: after installing the package and loading the library, I get an error that the function does not exist... Comment author of
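A sketch of why this usually works anyway: concat_ws() is a Spark SQL function, not an R function exported by sparklyr, so it is only available inside dplyr verbs on a Spark table, where sparklyr passes unrecognised function names straight through to Spark SQL. The table and column names below are hypothetical:

library(dplyr)
library(sparklyr)

visits_tbl <- tbl(sc, "visits")   # hypothetical Spark table

# concat_ws() is resolved by Spark SQL at query time, so it must run on a tbl_spark
visits_tbl %>%
  mutate(step = concat_ws(" > ", channel, page))

Calling concat_ws() on an ordinary R data frame fails, because no R function of that name exists.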

sparklyr spark_read_parquet Reading String Fields as Lists

Submitted by 北城以北 on 2019-12-11 17:01:00

Question: I have a number of Hive files in parquet format that contain both string and double columns. I can read most of them into a Spark data frame with sparklyr using the syntax below:

spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

However, I have one file where all of the string values get converted into unrecognizable lists that look like this when collected into an R data frame and printed:

s_df <- spark_read_parquet(sc, name = "s_df", path = "hdfs:/
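A diagnostic sketch rather than the thread's answer: comparing the Spark-side schema of the problem file with a working one often shows that the "string" columns were actually written as binary or nested types, which can then be cast explicitly. The column name below is hypothetical:

library(dplyr)
library(sparklyr)

# How did Spark actually type each column?
sdf_schema(s_df)

# If a column arrives as binary rather than string, cast it before collecting
s_df_fixed <- s_df %>%
  mutate(customer_name = as.character(customer_name))

head(collect(s_df_fixed))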

How to unpersist in Sparklyr?

Submitted by 我的梦境 on 2019-12-11 16:06:26

Question: I am using sparklyr for a project and have understood that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):

data_frame <- sdf_persist(data_frame)

Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some of them. However, I cannot seem to find the function to do this in sparklyr. Note that I have tried:

dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data
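A minimal sketch of one way to release the cached data without dropping the table: reach the underlying Java DataFrame with spark_dataframe() and call its unpersist method through invoke(). The blocking argument is an assumption; FALSE returns immediately instead of waiting for the blocks to be freed:

library(sparklyr)

data_frame %>%
  spark_dataframe() %>%     # the Java object behind the tbl_spark
  invoke("unpersist", FALSE)

For data cached under a table name, sparklyr also provides tbl_uncache(sc, "table_name").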

Changing nested column names using SparklyR in R

Submitted by 不问归期 on 2019-12-11 15:57:03

Question: I have referred to all the links mentioned here: 1) Link-1 2) Link-2 3) Link-3 4) Link-4. The following R code was written using the sparklyr package. It reads a huge JSON file and creates the database schema:

sc <- spark_connect(master = "local", config = conf, version = '2.2.0') # Connection
sample_tbl <- spark_read_json(sc, name="example", path="example.json", header = TRUE, memory = FALSE, overwrite = TRUE) # reads JSON file
sample_tbl <- sdf_schema_viewer(sample_tbl) # to create db schema
df <-
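A sketch of one common workaround, not the thread's answer: Spark cannot easily rename a field inside a struct in place, so the nested fields are selected with dot notation and aliased, which flattens them under new top-level names. The struct and field names below are hypothetical; "example" is the table name registered by spark_read_json() above:

library(sparklyr)

renamed_tbl <- sdf_sql(sc, "
  SELECT payload.user.id   AS user_id,
         payload.user.name AS user_name
  FROM example
")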

“No suitable driver” error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks

Submitted by 馋奶兔 on 2019-12-11 15:54:00

Question: I am trying to connect to an Azure SQL DB from a Databricks notebook using the sparklyr::spark_read_jdbc function. I am an analyst with no computer science background (beyond R and SQL) and no previous experience with Spark or JDBC (I have previously used local instances of R to connect to the same SQL database via ODBC), so I apologise if I've misunderstood something vital. My code is:

sc <- spark_connect(method = "databricks")
library(sparklyr)
library(dplyr)
config <- spark_config()
db_tbl <-
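"No suitable driver" usually means the JDBC driver class was never specified or its library is not attached to the cluster. A minimal sketch, assuming the Microsoft SQL Server JDBC connector is installed on the Databricks cluster; the server, database, table, and credentials below are placeholders:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

db_tbl <- spark_read_jdbc(
  sc,
  name = "my_table",
  options = list(
    url      = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    dbtable  = "dbo.my_table",
    user     = "my_user",
    password = "my_password",
    # naming the driver class explicitly is what avoids "No suitable driver"
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  )
)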

spark_apply error specifying column names

Submitted by ε祈祈猫儿з on 2019-12-11 15:07:55

Question: I am running sparklyr in local mode from RStudio on Windows 10:

spark_version <- "2.1.0"
sc <- spark_connect(master = "local", version = spark_version)
df <- data.frame(id = c(1, 1, 2, 2), county_code = c(1, 20, 321, 2))
sprintf("%03d", as.numeric(df$county_code))
df_tbl = copy_to(sc, df, "df_tbl", overwrite = TRUE)
df_tbl %>% summarise(sum = sum(county_code)) %>% collect()  ## this works

## this does not:
df_tbl %>% spark_apply(function(e) data.frame(sprintf("%03d",as.numeric(e$county_code), e)
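A sketch of a likely fix rather than the accepted answer: in the failing call the closing parenthesis of sprintf() sits too late, so e ends up as an extra argument to sprintf() instead of becoming the remaining columns of the data.frame. Naming the new column and closing sprintf() first lets spark_apply() infer sensible column names:

library(dplyr)
library(sparklyr)

df_tbl %>%
  spark_apply(function(e) {
    data.frame(
      county_code_padded = sprintf("%03d", as.numeric(e$county_code)),  # new, named column
      e,                                                                # keep id and county_code
      stringsAsFactors = FALSE
    )
  })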

Find missing rows by timestamp + ID with sparklyr

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-11 08:46:52

Question: I am trying to find missing timestamps. There are a lot of solutions to that problem on its own, but I also want to find "where" a timestamp is missing, per ID. For example, the test dataset would look like this:

elemuid  timestamp
1232     2018-02-10 23:00:00
1232     2018-02-10 23:01:00
1232     2018-02-10 22:58:00
1674     2018-02-10 22:40:00
1674     2018-02-10 22:39:00
1674     2018-02-10 22:37:00
1674     2018-02-10 22:35:00

And the solution should look like:

elemuid  timestamp
1232     2018-02-10 22:59:00
1674     2018-02-10
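A sketch of one Spark-side approach, assuming Spark 2.4+ (where the SQL sequence() function accepts timestamps) and a one-minute grid; "events" is a hypothetical name for the registered table holding elemuid and timestamp. The full minute grid per ID is generated from its first and last timestamp, then an anti-join keeps only the rows that never occur in the data:

library(dplyr)
library(sparklyr)

observed <- tbl(sc, "events")

# Every minute between each ID's first and last observation
full_grid <- sdf_sql(sc, "
  SELECT elemuid,
         explode(sequence(start_ts, end_ts, interval 1 minute)) AS timestamp
  FROM (SELECT elemuid,
               min(timestamp) AS start_ts,
               max(timestamp) AS end_ts
        FROM events
        GROUP BY elemuid)
")

# Keep only the minutes that never appear in the data
missing <- full_grid %>%
  anti_join(observed, by = c("elemuid", "timestamp"))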