sparklyr

What is the difference between a dataframe created using SparkR and one created using sparklyr?

可紊 · Submitted on 2021-02-11 12:32:33
Question: I am reading a Parquet file in Azure Databricks in two ways: with SparkR via read.parquet(), and with sparklyr via spark_read_parquet(). The two resulting dataframes are different. Is there any way to convert a SparkR dataframe into a sparklyr dataframe, and vice versa?

Answer 1: sparklyr creates a tbl_spark, which is essentially just a lazy query written in Spark SQL. SparkR creates a SparkDataFrame, which is more of a collection of data that is organized using a plan. In the same way that you can't use a tbl as a normal data.frame
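Since both packages talk to the same Spark session, one hedged workaround (a sketch, assuming SparkR and sparklyr are attached to the same cluster; the view names my_view and my_view2 are placeholders) is to round-trip through a temporary view:

```r
library(sparklyr)
library(dplyr)

# SparkR -> sparklyr: expose the SparkDataFrame as a temp view,
# then pick it up lazily from the sparklyr side (sc is the sparklyr connection).
SparkR::createOrReplaceTempView(sparkr_df, "my_view")
sdf <- dplyr::tbl(sc, "my_view")

# sparklyr -> SparkR: register the tbl_spark under a table name,
# then read that table back as a SparkDataFrame.
sdf_register(sdf, "my_view2")
sparkr_df2 <- SparkR::tableToDF("my_view2")
```

Both directions stay lazy; no data is collected to the driver.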

Adding name of file when using sparklyr::spark_read_json

谁说我不能喝 · Submitted on 2021-02-10 06:14:30
Question: I have millions of JSON files, each containing the same columns, say x and y. Note that x and y have equal length within a single file, but their lengths can differ between files. The problem is that the only thing separating the data is the file name, so when combining the files I'd like the file name included as a third column. Is this possible using sparklyr::spark_read_json, i.e. when using wildcards? MWE:
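One approach worth trying (a sketch, not tested against the poster's data; the path and table name are assumptions) is Spark SQL's input_file_name() function, which sparklyr forwards untranslated inside mutate(), so each row records the file it came from:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read all JSON files at once via a wildcard; input_file_name()
# attaches the source path of each row as a third column.
combined <- spark_read_json(sc, name = "jsons",
                            path = "/data/jsons/*.json") %>%
  mutate(filename = input_file_name())
```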

How to read all files in an S3 folder/bucket using sparklyr in R?

只谈情不闲聊 · Submitted on 2021-02-07 17:24:30
Question: I have tried the code below, and combinations of it, in order to read all files in an S3 folder, but nothing seems to be working. Sensitive information/code has been removed from the script. There are 6 files, each 6.5 GB.

  # Spark connection
  sc <- spark_connect(master = "local", config = config)
  rd_1 <- spark_read_csv(sc, name = "Retail_1",
                         path = "s3a://mybucket/xyzabc/Retail_Industry/*/*",
                         header = F, delimiter = "|")
  # This is the S3 bucket/folder for the files [One of the file names Industry_Raw
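A hedged sketch of the usual s3a setup (the credential values and bucket path are placeholders, and the exact hadoop-aws package needed depends on the cluster's Spark/Hadoop versions):

```r
library(sparklyr)

config <- spark_config()
# Hadoop s3a credentials -- replace with your own credential mechanism
config$spark.hadoop.fs.s3a.access.key <- "ACCESS_KEY"
config$spark.hadoop.fs.s3a.secret.key <- "SECRET_KEY"

sc <- spark_connect(master = "local", config = config)

# memory = FALSE maps the files without eagerly caching ~40 GB of CSV;
# the wildcard matches every file under the folder's subdirectories.
rd_1 <- spark_read_csv(sc, name = "Retail_1",
                       path = "s3a://mybucket/xyzabc/Retail_Industry/*/*",
                       header = FALSE, delimiter = "|",
                       memory = FALSE)
```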

FPGrowth/Association Rules using Sparklyr

点点圈 · Submitted on 2021-01-29 16:32:08
Question: I am trying to build an association-rules algorithm using sparklyr and have been following this blog post, which is really well explained. However, just after fitting the FPGrowth algorithm, the author extracts the rules from the returned "FPGrowthModel object", and I am not able to reproduce that extraction. The section where I am struggling is this piece of code:

  rules = FPGmodel %>% invoke("associationRules")

Could someone please explain where FPGmodel comes
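For context, FPGmodel in the blog presumably refers to the fitted model's underlying Java object, which invoke() operates on. sparklyr's own wrappers let you skip the raw invoke() call entirely; a sketch (the items column name and thresholds are assumptions):

```r
library(sparklyr)

# Fit FP-Growth on a Spark dataframe whose `items` column holds arrays
fp_model <- ml_fpgrowth(sdf, items_col = "items",
                        min_support = 0.01, min_confidence = 0.5)

# Extract the association rules as a tbl_spark, no invoke() needed
rules <- ml_association_rules(fp_model)
```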

Factor Analysis using sparklyr in Databricks

丶灬走出姿态 · Submitted on 2021-01-29 06:13:50
Question: I would like to perform a factor analysis in Databricks by using dplyr::collect(), but because of the data's size I am getting this error:

  Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 82.0 B

Is there a function in sparklyr with which I can do this analysis without collecting the data?

Source: https://stackoverflow.com/questions/64113459/factor-analysis-using-sparklyr-in
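If an exact factor analysis is not strictly required, one distributed alternative is sparklyr's PCA wrapper (a sketch only: PCA is related to, but not the same as, factor analysis, and the feature column names are assumptions). It runs on the cluster and returns only a small loadings matrix to the driver, avoiding the maxResultSize limit:

```r
library(sparklyr)

# Principal components computed on the cluster; only the k x p
# loadings matrix comes back to the driver, not the rows themselves.
pca_model <- ml_pca(sdf, features = c("x1", "x2", "x3"), k = 2)
pca_model$pc   # component loadings, small enough to inspect locally
```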

sparklyr mutate behaviour with stringr

守給你的承諾、 · Submitted on 2021-01-07 03:52:43
Question: I am trying to use sparklyr to process a Parquet file. The table has the structure:

  key       | requestid | operation
  type: str | type: str | type: str

I am running the code:

  txt %>%
    select(key, requestid, operation) %>%
    mutate(object = stringr::str_split(key, '/', simplify = TRUE) %>% dplyr::last())

where txt is a valid Spark frame. I get:

  Error in stri_split_regex(string, pattern, n = n, simplify = simplify, : object 'key' not found
  Traceback: 1. txt2 %>% select(key, requestid, operation) %>% mutate
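The error arises because stringr::str_split has no Spark SQL translation: dplyr tries to evaluate it locally in R, where the column key does not exist as an object. Inside mutate() on a tbl_spark, unrecognized functions are forwarded to Spark SQL as-is, so a sketch using Spark's own split() and element_at() (negative indexing needs Spark 2.4+; this assumes the goal is the last path segment after '/'):

```r
library(sparklyr)
library(dplyr)

txt %>%
  select(key, requestid, operation) %>%
  # split() and element_at() are Spark SQL functions, passed through
  # untranslated; element_at(..., -1) picks the last array element
  mutate(object = element_at(split(key, "/"), -1L))
```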

Convert spark dataframe to sparklyR table “tbl_spark”

微笑、不失礼 · Submitted on 2020-12-02 08:00:18
Question: I'm trying to convert a Spark dataframe (org.apache.spark.sql.DataFrame) to a sparklyr table (tbl_spark). I tried sdf_register, but it failed with the following error. Here, df is the Spark dataframe.

  sdf_register(df, name = "my_tbl")

The error is:

  Error: org.apache.spark.sql.AnalysisException: Table not found: my_tbl; line 2 pos 17
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$
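If df here is a raw spark_jobj wrapping an org.apache.spark.sql.DataFrame (e.g. obtained via invoke()), one hedged workaround is to register the view on the Java side first and then reference it lazily from sparklyr (the view name is a placeholder; sc is the sparklyr connection):

```r
library(sparklyr)
library(dplyr)

# Register the Java-side DataFrame as a temp view so Spark's
# catalog can resolve the name
invoke(df, "createOrReplaceTempView", "my_tbl")

# Now the name resolves, yielding a tbl_spark backed by that view
my_tbl <- dplyr::tbl(sc, "my_tbl")
```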
