How to create new DataFrame with dict

坚强是说给别人听的谎言 submitted on 2019-12-21 18:57:13

Question


I have a dict like this:

cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}

and a DataFrame A like this:

+---+
|key|
+---+
| k1|
| k2|
| k3|
| k4|
+---+

The DataFrame above was created with this code:

# Each row must be a tuple; ('k1') is just the string 'k1', so use ('k1',)
data = [('k1',),
    ('k2',),
    ('k3',),
    ('k4',)]
A = spark.createDataFrame(data, ['key'])

I want to get a new DataFrame like this:

+---+----------+----------+
|key|   v1     |    v2    |
+---+----------+----------+
| k1|true      |false     |
| k2|true      |false     |
| k3|false     |true      |
| k4|false     |true      |
+---+----------+----------+

Any suggestions would be appreciated, thanks!


Answer 1:


I just wanted to contribute a different and possibly easier way to solve this.

In my code I first convert the dict to a pandas DataFrame, which I find much easier, and then convert the pandas DataFrame directly to a Spark DataFrame.

import pandas as pd

data = {'visitor': ['foo', 'bar', 'jelmer'],
        'A': [0, 1, 0],
        'B': [1, 0, 1],
        'C': [1, 0, 0]}

df = pd.DataFrame(data)
# spark is an existing SparkSession
ddf = spark.createDataFrame(df)
ddf.show()

Output:
+---+---+---+-------+
|  A|  B|  C|visitor|
+---+---+---+-------+
|  0|  1|  1|    foo|
|  1|  0|  0|    bar|
|  0|  1|  0| jelmer|
+---+---+---+-------+



Answer 2:


The dictionary can be converted to a DataFrame and joined with the other one. My code:

from pyspark.sql import functions as F

# sc is the SparkContext; turn the dict into a (key, val) DataFrame
data = sc.parallelize([(k, v) for k, v in cMap.items()]).toDF(['key', 'val'])
keys = sc.parallelize([('k1',), ('k2',), ('k3',), ('k4',)]).toDF(["key"])
# Note: this yields the strings "True"/"False", not boolean values
newDF = data.join(keys, 'key').select(
    "key",
    F.when(F.col("val") == "v1", "True").otherwise("False").alias("v1"),
    F.when(F.col("val") == "v2", "True").otherwise("False").alias("v2"))

 >>> newDF.show()
 +---+-----+-----+
 |key|   v1|   v2|
 +---+-----+-----+
 | k1| True|False|
 | k2| True|False|
 | k3|False| True|
 | k4|False| True|
 +---+-----+-----+

If there are more values, you can wrap that when clause in a UDF and reuse it.
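
For a larger set of values, a UDF may not be strictly necessary: one column per distinct value can be generated in a loop. The following is a minimal sketch, assuming the `data` and `keys` DataFrames from the code above. The helper names `distinct_values` and `add_indicator_columns` are hypothetical and not part of the original answer. Unlike the code above, it yields real booleans rather than the strings "True"/"False".

```python
def distinct_values(mapping):
    """Return the distinct dict values in sorted order, e.g. ['v1', 'v2']."""
    return sorted(set(mapping.values()))


def add_indicator_columns(df, values, val_col="val"):
    """Select 'key' plus one boolean column per value (true where val_col equals it)."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F

    indicators = [(F.col(val_col) == v).alias(v) for v in values]
    return df.select("key", *indicators)


cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}
values = distinct_values(cMap)
# With the `data` and `keys` DataFrames from the answer above:
# newDF = add_indicator_columns(data.join(keys, "key"), values)
```

Deriving the value list from the dict on the driver avoids an extra Spark job to collect distinct values, which seems reasonable as long as the dict is small enough to live in driver memory.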




Answer 3:


I parallelize cMap.items() and check whether each value equals v1 or v2, then join the result back to DataFrame A on the key column.

from pyspark.sql import Row

# example dataframe A
df_A = spark.sparkContext.parallelize(['k1', 'k2', 'k3', 'k4']).map(lambda x: Row(**{'key': x})).toDF()

# one Row per dict entry, with a boolean column for each value
cmap_rdd = spark.sparkContext.parallelize(list(cMap.items()))
cmap_df = cmap_rdd.map(lambda x: Row(**dict([('key', x[0]), ('v1', x[1]=='v1'), ('v2', x[1]=='v2')]))).toDF()

df_A.join(cmap_df, on='key').orderBy('key').show()

Resulting DataFrame:

+---+-----+-----+
|key|   v1|   v2|
+---+-----+-----+
| k1| true|false|
| k2| true|false|
| k3|false| true|
| k4|false| true|
+---+-----+-----+



Answer 4:


Thanks, everyone, for the suggestions. I found another way to solve my problem using pivot. The code is:

cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}
a_cMap = [(k,)+(v,) for k,v in cMap.items()] 
data = spark.createDataFrame(a_cMap, ['key','val'])

from pyspark.sql.functions import count
data = data.groupBy('key').pivot('val').agg(count('val'))
data.show()

+---+----+----+
|key|  v1|  v2|
+---+----+----+
| k2|   1|null|
| k4|null|   1|
| k1|   1|null|
| k3|null|   1|
+---+----+----+

data = data.na.fill(0)
data.show()

+---+---+---+
|key| v1| v2|
+---+---+---+
| k2|  1|  0|
| k4|  0|  1|
| k1|  1|  0|
| k3|  0|  1|
+---+---+---+

keys = spark.createDataFrame([('k1','2'),('k2','3'),('k3','4'),('k4','5'),('k5','6')], ["key",'temp'])

newDF = keys.join(data,'key')
newDF.show()
+---+----+---+---+
|key|temp| v1| v2|
+---+----+---+---+
| k2|   3|  1|  0|
| k4|   5|  0|  1|
| k1|   2|  1|  0|
| k3|   4|  0|  1|
+---+----+---+---+

However, I have not been able to convert 1 to true and 0 to false.
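
One possible way to turn the filled counts into booleans is to compare each pivoted column with zero. This is only a sketch, assuming the `newDF` DataFrame built above. The helper names `value_columns` and `counts_to_booleans` are hypothetical. Note that any nulls would stay null, so it assumes `na.fill(0)` has already been applied as in the code above.

```python
def value_columns(columns, non_value_columns):
    """Pick the pivoted columns, i.e. every column not listed as a non-value column."""
    return [c for c in columns if c not in non_value_columns]


def counts_to_booleans(df, cols):
    """Replace each count column in `cols` with a boolean column (count > 0)."""
    # Imported lazily so value_columns() can be used without Spark installed.
    from pyspark.sql import functions as F

    for c in cols:
        df = df.withColumn(c, F.col(c) > 0)
    return df


cols = value_columns(["key", "temp", "v1", "v2"], ["key", "temp"])
# With the `newDF` DataFrame from above:
# newDF = counts_to_booleans(newDF, value_columns(newDF.columns, ["key", "temp"]))
```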




Answer 5:


I just wanted to add an easy way to create a DataFrame using PySpark:

values = [("K1","true","false"),("K2","true","false")]
columns = ['Key', 'V1', 'V2']
df = spark.createDataFrame(values, columns)
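
Rather than hard-coding the rows, they could also be derived from the dict in the question. The sketch below is an illustration under that assumption. The helper name `indicator_rows` is hypothetical, and the rows hold Python booleans instead of the strings "true"/"false" used above.

```python
def indicator_rows(mapping, values):
    """Build one tuple per key: (key, value == values[0], value == values[1], ...)."""
    return [(k,) + tuple(v == candidate for candidate in values)
            for k, v in sorted(mapping.items())]


cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}
values = ["v1", "v2"]
rows = indicator_rows(cMap, values)
# rows == [('k1', True, False), ('k2', True, False),
#          ('k3', False, True), ('k4', False, True)]

# With a SparkSession named `spark` and DataFrame A from the question:
# df = spark.createDataFrame(rows, ["key"] + values)
# result = A.join(df, "key")  # keep only the keys present in A
```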


Source: https://stackoverflow.com/questions/43751509/how-to-create-new-dataframe-with-dict
