Question:
I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr
Answer 1:
You can use the arrow package for this. It is the same thing as pyarrow in Python, but nowadays it also comes packaged for R without needing Python. As it is not yet available on CRAN, you have to manually install Arrow C++ first:
git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release
# It is important to statically link to boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
Then you can install the R arrow package:
devtools::install_github("apache/arrow/r")
And use it to load a Parquet file:
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> The following objects are masked from 'package:base':
#>
#> array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> …
Answer 2:
If you're using Spark, this is now relatively simple with the release of Spark 1.4; see the sample code below, which uses the SparkR package that is now part of the Apache Spark core framework.
# install the SparkR package
devtools::install_github('apache/spark', ref='master', subdir='R/pkg')
# load the SparkR package
library('SparkR')
# initialize sparkContext which starts a new Spark session
sc <- sparkR.init(master="local")
# initialize sqlContext
sq <- sparkRSQL.init(sc)
# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(parquetFile(sq, "/path/to/filename"))
# terminate Spark session
sparkR.stop()
An expanded example is shown at https://gist.github.com/andyjudson/6aeff07bbe7e65edc665
I'm not aware of any other package that you could use if you weren't using Spark.
Answer 3:
As an alternative to SparkR, you could now use sparklyr:
# install.packages("sparklyr")
library(sparklyr)
sc <- spark_connect(master = "local")
spark_tbl_handle <- spark_read_parquet(sc, "tbl_name_in_spark", "/path/to/parquetdir")
regular_df <- collect(spark_tbl_handle)
spark_disconnect(sc)
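One advantage of this approach is that you can push work down to Spark with dplyr verbs before collecting, so only the reduced result is pulled into R. A minimal sketch along those lines (my own addition; the column name year and the paths are placeholders, adjust them to your data):
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl <- spark_read_parquet(sc, "tbl_name_in_spark", "/path/to/parquetdir")
# the filter is executed by Spark; only the matching rows are collected into R
small_df <- tbl %>% filter(year == 2015) %>% collect()
spark_disconnect(sc)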
Answer 4:
With reticulate you can use pandas from Python to read Parquet files. This can save you the hassle of running a Spark instance.
library(reticulate)
library(dplyr)
pandas <- import("pandas")
read_parquet <- function(path, columns = NULL) {
  path <- path.expand(path)
  path <- normalizePath(path)
  if (!is.null(columns)) columns <- as.list(columns)
  xdf <- pandas$read_parquet(path, columns = columns)
  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)
  dplyr::tbl_df(xdf)
}
read_parquet(PATH_TO_PARQUET_FILE)
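Since pandas$read_parquet accepts a column list, the wrapper's columns argument can restrict the read to a subset of columns. A hedged usage example ("x" and "y" are placeholder column names, not from the original answer):
read_parquet("somefile.parquet", columns = c("x", "y"))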
Answer 5:
You can simply use the arrow package:
install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
Answer 6:
Spark has been updated, and many functions are now deprecated or renamed.
Andy's answer above works for Spark v1.4, but on Spark v2.3 this is the update that worked for me.

1. Download the latest version of Apache Spark from https://spark.apache.org/downloads.html (point 3 in the link) and extract the .tgz file.

2. Install the devtools package in RStudio:
install.packages('devtools')

3. Open a terminal and follow these steps:
# This is the folder of the extracted Spark .tgz of point 1 above
export SPARK_HOME=extracted-spark-folder-path
cd $SPARK_HOME/R/lib/SparkR/
R -e "devtools::install('.')"

4. Go back to RStudio:
# load the SparkR package
library(SparkR)
# initialize sparkSession which starts a new Spark session
sc <- sparkR.session(master="local")
# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(read.parquet('.parquet-file-path'))
# terminate Spark session
sparkR.stop()
Answer 7:
For reading a parquet file in an Amazon S3 bucket, try using s3a instead of s3n. That worked for me when reading parquet files using EMR 1.4.0, RStudio and Spark 1.5.0.
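A hedged sketch of what that looks like with the SparkR API from Answer 2 (the bucket name and key here are placeholders, not from the original answer):
library(SparkR)
sc <- sparkR.init(master = "local")
sq <- sparkRSQL.init(sc)
# note the s3a:// scheme instead of s3n://
df <- collect(parquetFile(sq, "s3a://my-bucket/path/to/data.parquet"))
sparkR.stop()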
Answer 8:
miniparquet is a new dedicated package. Install with:
devtools::install_github("hannesmuehleisen/miniparquet")
Example taken from the documentation:
library(miniparquet)
f <- system.file("extdata/userdata1.parquet", package="miniparquet")
df <- parquet_read(f)
str(df)
# 'data.frame': 1000 obs. of 13 variables:
# $ registration_dttm: POSIXct, format: "2016-02-03 07:55:29" "2016-02-03 17:04:03" "2016-02-03 01:09:31" ...
# $ id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ first_name : chr "Amanda" "Albert" "Evelyn" "Denise" ...
# $ last_name : chr "Jordan" "Freeman" "Morgan" "Riley" ...
# $ email : chr "ajordan0@com.com" "afreeman1@is.gd" "emorgan2@altervista.org" "driley3@gmpg.org" ...
# $ gender : chr "Female" "Male" "Female" "Female" ...
# $ ip_address : chr "1.197.201.2" "218.111.175.34" "7.161.136.94" "140.35.109.83" ...
# $ cc : chr "6759521864920116" "" "6767119071901597" "3576031598965625" ...
# $ country : chr "Indonesia" "Canada" "Russia" "China" ...
# $ birthdate : chr "3/8/1971" "1/16/1968" "2/1/1960" "4/8/1997" ...
# $ salary : num 49757 150280 144973 90263 NA ...
# $ title : chr "Internal Auditor" "Accountant IV" "Structural Engineer" "Senior Cost Accountant" ...
# $ comments : chr "1E+02" "" "" "" ...
Source: https://stackoverflow.com/questions/30402253/how-do-i-read-a-parquet-in-r-and-convert-it-to-an-r-dataframe