Spark RDD to DataFrame in Python

Submitted anonymously (unverified) on 2019-12-03 01:52:01

Question:

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and examples where the schema is passed to the sqlContext.createDataFrame(rdd, schema) function.

But I have 38 columns or fields, and this number will increase further. If I manually write the schema, specifying each field's information, it is going to be a very tedious job.

Is there any other way to specify the schema without knowing the column information beforehand?

Answer 1:

There are two ways to convert an RDD to a DataFrame in Spark:

toDF() and createDataFrame(rdd, schema)

I will show you how you can do that dynamically.

toDF()

The toDF() method gives you a way to convert an RDD[Row] to a DataFrame. The point is that the Row() constructor can receive its fields as keyword arguments (**kwargs), so there is an easy way to build the rows dynamically.

from pyspark.sql import Row

# Build a dict from each record so Row(**kwargs) can name the fields;
# the column names are simply the string indices "0", "1", ...
def f(x):
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d

# Convert every record to a Row, then let Spark infer the DataFrame
df = rdd.map(lambda x: Row(**f(x))).toDF()

This way you can create the DataFrame dynamically, no matter how many columns the RDD has.
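For example, here is a minimal end-to-end sketch, assuming a running pyspark shell where sc and sqlContext already exist; the sample tuples are made up for illustration, and enumerate replaces the explicit loop in f above:

from pyspark.sql import Row

rdd = sc.parallelize([("a", 1, 2.0), ("b", 3, 4.0)])

# Same idea as f above, written with enumerate
df = rdd.map(lambda x: Row(**{str(i): v for i, v in enumerate(x)})).toDF()

df.show()
# +---+---+---+
# |  0|  1|  2|
# +---+---+---+
# |  a|  1|2.0|
# |  b|  3|4.0|
# +---+---+---+

Note that Spark infers the column types here (string, long, double), which saves you from declaring them yourself.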

createDataFrame(rdd, schema)

The other way to do it is to create the schema dynamically. How?

This way:

from pyspark.sql.types import StructType, StructField, StringType

# One nullable StringType column per field; the range should match
# the number of columns in the RDD (38 in the question)
schema = StructType([StructField(str(i), StringType(), True) for i in range(38)])

df = sqlContext.createDataFrame(rdd, schema)

This second way is the cleaner way to do it.
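If you happen to know the column names but not the types, the same idea extends naturally. A sketch under that assumption, with a hypothetical col_names list and sample data that are not from the original answer:

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names for illustration; replace with your own 38 names
col_names = ["name", "age", "city"]

schema = StructType([StructField(c, StringType(), True) for c in col_names])

rdd = sc.parallelize([("alice", "34", "paris"), ("bob", "36", "lyon")])
df = sqlContext.createDataFrame(rdd, schema)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)
#  |-- city: string (nullable = true)

Everything lands as a string here; you can cast individual columns afterwards if you need numeric types.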

So this is how you can create dataframes dynamically.


