SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows


Question


I read a Parquet file from HDFS:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)

root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)
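
Note that all four columns come back as binary rather than string. For reference, the same file can also be loaded through SparkR's generic data-source API; this is just a sketch assuming the same sqlContext and path as above:

# Equivalent load via the generic data-source reader in SparkR 1.4;
# "parquet" names the data source, the path is the same HDFS location.
AppDF <- read.df(sqlContext, "hdfs://part_2015", source = "parquet")
printSchema(AppDF)  # shows the same all-binary schema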

class(AppDF)

[1] "DataFrame"
attr(,"package")
[1] "SparkR"

collect(AppDF)
.....error:
arguments imply differing number of rows: 46021, 39175, 62744, 27137

head(AppDF)
.....error:
arguments imply differing number of rows: 36, 30, 48
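
This error message does not come from Spark itself; it is raised by base R's data.frame() when the column vectors it receives have different lengths. A minimal illustration in plain R:

# data.frame() refuses columns of unequal length with exactly this message:
data.frame(a = 1:3, b = 1:2)
# Error in data.frame(a = 1:3, b = 1:2) :
#   arguments imply differing number of rows: 3, 2

So the numbers in the message (46021, 39175, ...) are the lengths of the deserialized columns, which suggests the binary columns are not coming back as one value per row.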

I've read some threads about this problem, but they don't match my case. In fact, I just read a table from the Parquet file and call head() or collect() on it. My Parquet table looks like the following:

app   category  date        user
aaa   test      20150101    123
aaa   test      20150102    345
aaa   test      20150103    678
aaaa  testA     20150104    123
aaaa  testA     20150105    234
aaaa  testA     20150106    4345
bbbb  testB     20150101    5435

I'm using spark-1.4.0-bin-hadoop2.6, and I run this on a cluster using

./sparkR --master yarn-client

I've also tried it locally, and the same problem occurs.

showDF(AppDF)

+-----------+-----------+-----------+-----------+
|        app|   category|       date|       user|
+-----------+-----------+-----------+-----------+
|[B@217fa749|[B@43bfbacd|[B@60810b7a|[B@3818a815|
|[B@5ac31778|[B@3e39f5d5|[B@4f3a92dd| [B@e8013ce|
|[B@7a9440d1|[B@1b2b9836|[B@4b160f29|[B@153d7342|
|[B@7559fcf2|[B@66edb00e|[B@7ec19bec|[B@58e3e3f7|
|[B@598b9ab8|[B@5c5ad3f5|[B@4f11a931|[B@107af885|
|[B@7951ec36|[B@716b0b73|[B@2abce531|[B@576b09e2|
|[B@34560144|[B@7a6d3233|[B@16faf110|[B@34e85d39|
| [B@3406452|[B@787a4528|[B@235282e3|[B@7e0f1732|
|[B@10bc1446|[B@2bd7083f|[B@325e7695|[B@57bb4a08|
|[B@48f98037|[B@7450c04e|[B@61817c8a|[B@7c177a08|
|[B@694ce2dd|[B@36c2512d| [B@f5f7d71|[B@46248d99|
|[B@479dee25|[B@517de3de|[B@1ffb2d9e|[B@236ff079|
|[B@52ac196f|[B@20b9f0d0| [B@f70f879|[B@41c8d7da|
|[B@68d34af3| [B@7ddcd49|[B@72d077a7|[B@545fafd4|
|[B@5610b292|[B@623bbb62|[B@3f8b5150|[B@53877bc7|
|[B@63cf70a8|[B@47ed58c9|[B@2f601903|[B@4e0a2c41|
|[B@7ddf876d|[B@5e3445aa|[B@39c9cc37|[B@6f7e4c84|
|[B@4cd1a74b|[B@583e5453|[B@64124267|[B@6ac5ab84|
|[B@577f9ddf|[B@7b55c859|[B@3cd48a51|[B@25c4eb0a|
|[B@2322f0e5|[B@4af55c68|[B@3285d64a|[B@70b7ae2f|
+-----------+-----------+-----------+-----------+
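
Those [B@... values are Java's default toString() for a byte[], which confirms every column really is raw binary on the JVM side. A possible workaround (only a sketch, assuming the binary columns actually hold UTF-8 text) is to cast everything to string on the Spark side before collecting:

# Hypothetical workaround: cast each binary column to string in Spark SQL,
# then collect the resulting all-string DataFrame into R.
AppStr <- selectExpr(AppDF,
                     "cast(app as string) as app",
                     "cast(category as string) as category",
                     "cast(date as string) as date",
                     "cast(user as string) as user")
head(AppStr)

If the file was written by a tool that stores strings as plain binary (older Hive/Impala output, for example), the Spark SQL option spark.sql.parquet.binaryAsString=true may also make the columns read back as strings directly, though I have not verified that from SparkR.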

I've also tried reading this Parquet file in Scala and doing a collect() operation; everything works fine there. So it seems to be an issue specific to SparkR.

Source: https://stackoverflow.com/questions/31555667/sparkr-collect-and-head-error-for-spark-dataframe-arguments-imply-differing
