Question
I'm trying to load data from an on-premises HDFS into RStudio with SparkR.
When I do this:
df_hadoop <- read.df(sqlContext, "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                     source = "com.databricks.spark.csv")
and then this:
str(df_hadoop)
I get this:
Formal class 'DataFrame' [package "SparkR"] with 2 slots
..@ env: <environment: 0x000000000xxxxxxx>
..@ sdf:Class 'jobj' <environment: 0x000000000xxxxxx>
However, this is not the DataFrame I'm looking for: the CSV I'm trying to load from HDFS has 13 fields. I have a schema with those 13 fields, but where or how do I tell SparkR about it?
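I assume the schema should be built with something like structType()/structField(), but I don't know where to pass it. This is roughly the shape of what I have (field names and types here are placeholders for my real 13 columns):

schema <- structType(structField("field1", "string"),
                     structField("field2", "integer"),
                     # ... one structField per column, 13 in total
                     structField("field13", "double"))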
Answer 1:
If you try the following:
df <- createDataFrame(sqlContext,
                      data.frame(a = c(1, 2, 3),
                                 b = c(2, 3, 4),
                                 c = c(3, 4, 5)))
str(df)
you get the same kind of output:
Formal class 'DataFrame' [package "SparkR"] with 2 slots
..@ env:<environment: 0x139235d18>
..@ sdf:Class 'jobj' <environment: 0x139230e68>
str() shows you the internal representation of df, which is a pointer to a DataFrame on the Spark side rather than a local data.frame. To look at the data itself, just use
df
or
show(df)
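Beyond that, head(df) returns the first rows as a local data.frame and collect(df) pulls the whole DataFrame into R, so either of those will show the actual 13 columns once the load works.

As for passing your schema: read.df() accepts a schema argument built with structType()/structField(). A minimal sketch, assuming SparkR 1.x with the spark-csv package (the field names and types are placeholders for the real 13 columns):

# Placeholder schema -- replace names/types with the real 13 fields
customSchema <- structType(structField("field1", "string"),
                           structField("field2", "integer"),
                           structField("field13", "double"))

# Pass it via the schema argument so spark-csv uses it instead of
# treating every column as a string
df_hadoop <- read.df(sqlContext,
                     "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                     source = "com.databricks.spark.csv",
                     schema = customSchema)

printSchema(df_hadoop)  # should list the 13 fields with their types
head(df_hadoop)         # first rows as a local data.frame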
Source: https://stackoverflow.com/questions/33211140/loading-data-from-on-premises-hdfs-to-local-sparkr