Loading data from on-premises HDFS to local SparkR

Submitted by 两盒软妹~ on 2019-12-12 02:45:48

Question


I'm trying to load data from an on-premises HDFS into RStudio with SparkR.

When I do this:

df_hadoop <- read.df(sqlContext, "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                     source = "com.databricks.spark.csv")

and then this:

str(df_hadoop)

I get this:

Formal class 'DataFrame' [package "SparkR"] with 2 slots 
..@ env: <environment: 0x000000000xxxxxxx>  
..@ sdf:Class 'jobj' <environment: 0x000000000xxxxxx>  

This is not, however, the DataFrame I'm looking for: the CSV I'm trying to load from HDFS has 13 fields.

I have a schema with the 13 fields of the CSV, but where or how do I pass it to SparkR?


Answer 1:


If you try the following:

df <- createDataFrame(sqlContext,
                      data.frame(a=c(1,2,3),
                                 b=c(2,3,4),
                                 c=c(3,4,5)))

str(df)

you likewise get:

Formal class 'DataFrame' [package "SparkR"] with 2 slots
  ..@ env:<environment: 0x139235d18> 
  ..@ sdf:Class 'jobj' <environment: 0x139230e68> 

str() shows you the internal representation of df, which is a pointer to the Spark-side object rather than a local data.frame. Instead, just use

df

or

show(df)
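
If you want to see the actual contents rather than the pointer, printSchema(), head(), and collect() bring the information back to local R (a quick sketch, reusing the df created above):

printSchema(df)          # column names and types of the Spark DataFrame
head(df)                 # first rows, returned as a local R data.frame
local_df <- collect(df)  # the whole DataFrame as a local data.frame
str(local_df)            # str() now shows the columns a, b and c

Note that collect() pulls everything to the driver, so it is only safe for data that fits in local memory.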


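As for the schema part of the original question: in SparkR 1.x, read.df() accepts a schema argument built with structType() and structField(), so the 13 fields can be declared explicitly. A minimal sketch, with placeholder field names and types since the real ones aren't shown:

library(SparkR)

# Placeholder schema -- substitute the actual 13 field names and types
customSchema <- structType(
  structField("field1", "string"),
  structField("field2", "integer"),
  structField("field3", "double"))

df_hadoop <- read.df(sqlContext,
                     "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                     source = "com.databricks.spark.csv",
                     schema = customSchema)

printSchema(df_hadoop)  # should now list the declared fields

With the spark-csv source you can also pass header = "true" if the file has a header row, or inferSchema = "true" to let the reader guess the types instead of declaring them.
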
Source: https://stackoverflow.com/questions/33211140/loading-data-from-on-premises-hdfs-to-local-sparkr
