I'm working with pandas and with Spark dataframes. The dataframes are always very large (> 20 GB), and the standard Spark functions are not sufficient for those sizes. Currently
From https://issues.apache.org/jira/browse/SPARK-6235, "Support for parallelizing R data.frame larger than 2GB" is resolved.
From https://pandas.pydata.org/pandas-docs/stable/r_interface.html ("Converting DataFrames into R objects"), you can convert a pandas DataFrame to an R data.frame.
So perhaps the route is pandas -> R -> Spark -> HDFS?
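For the pandas -> R hop, a minimal sketch using rpy2 (the library the pandas R-interface page refers to) might look like the following; the exact converter call depends on the rpy2 version (py2rpy in rpy2 3.x, py2ri in older 2.x), and the R -> Spark -> HDFS part would then happen on the R side via SparkR, which is not shown here.

    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    # Small example frame; in practice this would be the large pandas DataFrame.
    pdf = pd.DataFrame({"id": range(5), "value": [0.1, 0.2, 0.3, 0.4, 0.5]})

    # Convert the pandas DataFrame into an R data.frame inside a local
    # conversion context, so the pandas<->R conversion rules only apply here.
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_df = ro.conversion.py2rpy(pdf)

    # r_df is now an R data.frame in the embedded R session; from R it could
    # be handed to SparkR's createDataFrame and written out to HDFS, which is
    # the remaining part of the proposed pipeline (not shown here).
    print(type(r_df))

Whether this actually helps with > 20 GB of data is a separate question, since the data still has to pass through a single R session before Spark parallelizes it.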