sparklyr

Permission Denied - \tmp\hive in sparklyr

Submitted by 为君一笑 on 2019-12-12 15:49:19

Question: I am trying to copy an R data frame to Spark 2.0.1 using the copy_to function, but it fails with:

The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-

I ran winutils.exe to change the permissions, but I still get the same permissions exception:

%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive

I tried other variants of the command, such as:

%HADOOP_HOME%\bin\winutils.exe chmod 777 C:\tmp\hive
%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
%HADOOP_HOME%\bin
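A minimal R sketch of the usual Windows workaround, assuming winutils.exe matches the Hadoop build Spark expects, that R runs from an elevated session, and that C:\hadoop is a placeholder install path; the -R flag and the ls check are additions not shown in the question:

# Placeholder path: point HADOOP_HOME at the folder that contains bin\winutils.exe
Sys.setenv(HADOOP_HOME = "C:\\hadoop")

# Create the scratch dir on the drive the Spark session runs from
dir.create("C:\\tmp\\hive", recursive = TRUE, showWarnings = FALSE)

winutils <- file.path(Sys.getenv("HADOOP_HOME"), "bin", "winutils.exe")

# Grant full permissions recursively, then list them to verify the change took effect
system2(winutils, c("chmod", "-R", "777", "C:\\tmp\\hive"))
system2(winutils, c("ls", "C:\\tmp\\hive"))

Restarting the Spark connection afterwards lets the new session pick up the corrected permissions.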

Importing multiple files in sparklyr

Submitted by 做~自己de王妃 on 2019-12-12 08:56:00

Question: I'm very new to sparklyr and Spark, so please let me know if this is not the "Spark" way to do this. My problem: I have 50+ .txt files of around 300 MB each, all in the same folder, call it x, that I need to import into sparklyr, preferably as one table. I can read them individually like:

spark_read_csv(path=x, sc=sc, name="mydata", delimiter = "|", header=FALSE)

If I were to import them all outside of sparklyr, I would probably create a list with the file names, call it filelist, and then import
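A minimal sketch of one common approach, assuming the folder x contains only the pipe-delimited .txt files and that all files share the same columns: Spark's CSV reader accepts a directory or wildcard path, so a single spark_read_csv() call can load everything into one table.

library(sparklyr)

sc <- spark_connect(master = "local")

# x is the folder path from the question, e.g. "C:/data/txtfiles" (placeholder)
mydata_tbl <- spark_read_csv(
  sc,
  name      = "mydata",
  path      = file.path(x, "*.txt"),   # the wildcard is expanded by Spark, not by R
  delimiter = "|",
  header    = FALSE,
  memory    = FALSE                    # avoid caching ~15 GB up front
)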

dplyr to replace all variables which match a specific string

Submitted by 北城以北 on 2019-12-12 04:06:20

Question: Is there a dplyr equivalent of this? I'm after a 'replace all' that replaces every value matching the string xxx with NA:

is.na(df) <- df=="xxx"

I want to run this as a sparklyr command, piping from R to the Spark data frame with tbl(sc,"df") %>%, but tacking the script above onto the pipe doesn't work.

Answer 1: Replace "XXX" with the string you want to look for:

#Using dplyr piping
library(dplyr)
df[] = df %>% lapply(., function(x) ifelse(grepl("XXX", x), NA, x))

#Using only the base package
df[] = lapply(df, function(x
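The answer above operates on a local data frame. A sketch of the same idea on the Spark side, assuming the columns are strings and that the whole value "xxx" (not a substring) should become NA; mutate_all() is translated to Spark SQL by sparklyr, so nothing is pulled into R:

library(dplyr)
library(sparklyr)

df_spark <- tbl(sc, "df")

# Every cell equal to "xxx" becomes NULL (NA after collect); runs entirely in Spark
df_clean <- df_spark %>%
  mutate_all(~ ifelse(. == "xxx", NA, .))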

Concat_ws() function in Sparklyr is missing

Submitted by ♀尐吖头ヾ on 2019-12-11 20:31:34

Question: I am following a tutorial on web (Adobe) analytics in which I want to build a Markov chain model (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/). The example uses the function concat_ws (supposedly from library(sparklyr)), but the function does not seem to exist: after installing the package and loading the library, I get an error that the function does not exist... Comment author of
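A sketch of why this usually works anyway: concat_ws() is a Spark SQL function, not an R function exported by sparklyr, so it is only available inside dplyr verbs on a Spark table, where sparklyr passes unrecognised function names straight through to Spark SQL. The table and column names below are hypothetical:

library(dplyr)
library(sparklyr)

visits_tbl <- tbl(sc, "visits")   # hypothetical Spark table

# concat_ws() is resolved by Spark SQL at query time, so it must run on a tbl_spark
visits_tbl %>%
  mutate(step = concat_ws(" > ", channel, page))

Calling concat_ws() on an ordinary R data frame fails, because no R function of that name exists.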

sparklyr spark_read_parquet Reading String Fields as Lists

Submitted by 北城以北 on 2019-12-11 17:01:00

Question: I have a number of Hive files in parquet format that contain both string and double columns. I can read most of them into a Spark data frame with sparklyr using the syntax below:

spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

However, I have one file where all of the string values get converted into unrecognizable lists that look like this when collected into an R data frame and printed:

s_df <- spark_read_parquet(sc, name = "s_df", path = "hdfs:/
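A diagnostic sketch rather than the thread's answer: comparing the Spark-side schema of the problem file with a working one often shows that the "string" columns were actually written as binary or nested types, which can then be cast explicitly. The column name below is hypothetical:

library(dplyr)
library(sparklyr)

# How did Spark actually type each column?
sdf_schema(s_df)

# If a column arrives as binary rather than string, cast it before collecting
s_df_fixed <- s_df %>%
  mutate(customer_name = as.character(customer_name))

head(collect(s_df_fixed))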

How to unpersist in Sparklyr?

Submitted by 我的梦境 on 2019-12-11 16:06:26

Question: I am using sparklyr for a project and have understood that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):

data_frame <- sdf_persist(data_frame)

Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some of them. However, I cannot seem to find the function to do this in sparklyr. Note that I have tried:

dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data
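A minimal sketch of one way to release the cached data without dropping the table: reach the underlying Java DataFrame with spark_dataframe() and call its unpersist method through invoke(). The blocking argument is an assumption; FALSE returns immediately instead of waiting for the blocks to be freed:

library(sparklyr)

data_frame %>%
  spark_dataframe() %>%     # the Java object behind the tbl_spark
  invoke("unpersist", FALSE)

For data cached under a table name, sparklyr also provides tbl_uncache(sc, "table_name").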

Changing nested column names using SparklyR in R

Submitted by 不问归期 on 2019-12-11 15:57:03

Question: I have referred to all the links mentioned here: 1) Link-1 2) Link-2 3) Link-3 4) Link-4. The following R code was written using the sparklyr package. It reads a huge JSON file and creates the database schema:

sc <- spark_connect(master = "local", config = conf, version = '2.2.0') # Connection
sample_tbl <- spark_read_json(sc, name="example", path="example.json", header = TRUE, memory = FALSE, overwrite = TRUE) # reads JSON file
sample_tbl <- sdf_schema_viewer(sample_tbl) # to create db schema
df <-
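A sketch of one common workaround, not the thread's answer: Spark cannot easily rename a field inside a struct in place, so the nested fields are selected with dot notation and aliased, which flattens them under new top-level names. The struct and field names below are hypothetical; "example" is the table name registered by spark_read_json() above:

library(sparklyr)

renamed_tbl <- sdf_sql(sc, "
  SELECT payload.user.id   AS user_id,
         payload.user.name AS user_name
  FROM example
")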

“No suitable driver” error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks

Submitted by 馋奶兔 on 2019-12-11 15:54:00

Question: I am trying to connect to an Azure SQL DB from a Databricks notebook using the sparklyr::spark_read_jdbc function. I am an analyst with no computer science background (beyond R and SQL) and no previous experience with Spark or JDBC (I have previously used local instances of R to connect to the same SQL database via ODBC), so I apologise if I've misunderstood something vital. My code is:

sc <- spark_connect(method = "databricks")
library(sparklyr)
library(dplyr)
config <- spark_config()
db_tbl <-
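"No suitable driver" usually means the JDBC driver class was never specified or its library is not attached to the cluster. A minimal sketch, assuming the Microsoft SQL Server JDBC connector is installed on the Databricks cluster; the server, database, table, and credentials below are placeholders:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

db_tbl <- spark_read_jdbc(
  sc,
  name = "my_table",
  options = list(
    url      = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    dbtable  = "dbo.my_table",
    user     = "my_user",
    password = "my_password",
    # naming the driver class explicitly is what avoids "No suitable driver"
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  )
)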

spark_apply error specifying column names

Submitted by ε祈祈猫儿з on 2019-12-11 15:07:55

Question: I am running sparklyr in local mode from RStudio on Windows 10:

spark_version <- "2.1.0"
sc <- spark_connect(master = "local", version = spark_version)
df <- data.frame(id = c(1, 1, 2, 2), county_code = c(1, 20, 321, 2))
sprintf("%03d", as.numeric(df$county_code))
df_tbl = copy_to(sc, df, "df_tbl", overwrite = TRUE)
df_tbl %>% summarise(sum = sum(county_code)) %>% collect()  ## this works

## this does not:
df_tbl %>% spark_apply(function(e) data.frame(sprintf("%03d",as.numeric(e$county_code), e)
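A sketch of a likely fix rather than the accepted answer: in the failing call the closing parenthesis of sprintf() sits too late, so e ends up as an extra argument to sprintf() instead of becoming the remaining columns of the data.frame. Naming the new column and closing sprintf() first lets spark_apply() infer sensible column names:

library(dplyr)
library(sparklyr)

df_tbl %>%
  spark_apply(function(e) {
    data.frame(
      county_code_padded = sprintf("%03d", as.numeric(e$county_code)),  # new, named column
      e,                                                                # keep id and county_code
      stringsAsFactors = FALSE
    )
  })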

Find missing rows by timestamp + ID with sparklyr

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-11 08:46:52

Question: I am trying to find missing timestamps. There are a lot of solutions to that problem on its own, but I also want to find "where" a timestamp is missing, per ID. For example, the test dataset would look like this:

elemuid  timestamp
1232     2018-02-10 23:00:00
1232     2018-02-10 23:01:00
1232     2018-02-10 22:58:00
1674     2018-02-10 22:40:00
1674     2018-02-10 22:39:00
1674     2018-02-10 22:37:00
1674     2018-02-10 22:35:00

And the solution should look like:

elemuid  timestamp
1232     2018-02-10 22:59:00
1674     2018-02-10
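A sketch of one Spark-side approach, assuming Spark 2.4+ (where the SQL sequence() function accepts timestamps) and a one-minute grid; "events" is a hypothetical name for the registered table holding elemuid and timestamp. The full minute grid per ID is generated from its first and last timestamp, then an anti-join keeps only the rows that never occur in the data:

library(dplyr)
library(sparklyr)

observed <- tbl(sc, "events")

# Every minute between each ID's first and last observation
full_grid <- sdf_sql(sc, "
  SELECT elemuid,
         explode(sequence(start_ts, end_ts, interval 1 minute)) AS timestamp
  FROM (SELECT elemuid,
               min(timestamp) AS start_ts,
               max(timestamp) AS end_ts
        FROM events
        GROUP BY elemuid)
")

# Keep only the minutes that never appear in the data
missing <- full_grid %>%
  anti_join(observed, by = c("elemuid", "timestamp"))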