I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
Spark has been updated since then, and many functions have been deprecated or renamed. Andy's answer above works for Spark 1.4; here is the updated procedure that worked for me on Spark 2.3.
1. Download the latest version of Apache Spark from https://spark.apache.org/downloads.html (point 3 on that page) and extract the .tgz file.
2. Install the devtools package in RStudio:

install.packages('devtools')
3. Open a terminal and run the following:
# This is the folder of extracted spark `.tgz` of point 1 above
export SPARK_HOME=extracted-spark-folder-path
cd $SPARK_HOME/R/lib/SparkR/
R -e "devtools::install('.')"
4. Go back to RStudio:
# load the SparkR package
library(SparkR)
# initialize a new Spark session
sc <- sparkR.session(master = "local")
# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(read.parquet('path/to/file.parquet'))
# terminate Spark session
sparkR.stop()