How do I read a Parquet in R and convert it to an R DataFrame?

北荒 2020-12-28 13:04

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

9 Answers
  •  独厮守ぢ
    2020-12-28 13:18

    Spark has been updated since then, and many functions have been deprecated or renamed.

    Andy's answer above works for Spark 1.4, but on Spark 2.3 the following updated steps worked for me.

    1. Download the latest version of Apache Spark from https://spark.apache.org/downloads.html (point 3 on that page)

    2. Extract the .tgz file.

    3. Install the devtools package in RStudio:

      install.packages('devtools')
      
    4. Open a terminal and run the following:

      # Path to the Spark folder extracted from the `.tgz` in step 2
      export SPARK_HOME=extracted-spark-folder-path 
      cd $SPARK_HOME/R/lib/SparkR/
      R -e "devtools::install('.')"
      
    5. Go back to RStudio:

      # load the SparkR package
      library(SparkR)
      
      # initialize sparkSession which starts a new Spark session
      sc <- sparkR.session(master="local")
      
      # load a Parquet file into a Spark DataFrame, then collect() it
      # into a regular R data frame
      df <- collect(read.parquet('path-to-parquet-file'))
      
      # terminate Spark session
      sparkR.stop()
      
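If you don't need a full Spark installation, the standalone `arrow` package on CRAN can read Parquet files directly into an R data frame. A minimal sketch (assuming the `arrow` package is installed; the round-trip through a temp file is just for illustration):

```r
# Assumes: install.packages('arrow') has been run
library(arrow)

# Write a small example Parquet file, then read it back
tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)

# read_parquet() returns a tibble/data frame; coerce to a base data frame
df <- as.data.frame(read_parquet(tmp))
stopifnot(identical(dim(df), dim(mtcars)))
```

For a file produced by Spark, you would point `read_parquet()` at one of the part files, or use `arrow::open_dataset()` on the output directory.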
