Question
I have a PySpark DataFrame with a column named Filters of type "array<struct<Op:string,Type:string,Val:string>>".
I want to save my DataFrame to a CSV file, and for that I need to cast the array to a string type.
I tried to cast it with DF.Filters.tostring() and with DF.Filters.cast(StringType()), but both produce a value like the following for every row of the Filters column:
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19
The code is as follows:
from pyspark.sql.types import StringType
DF.printSchema()
 |-- ClientNum: string (nullable = true)
 |-- Filters: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Op: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Val: string (nullable = true)
DF_cast = DF.select('ClientNum', DF.Filters.cast(StringType()))
DF_cast.printSchema()
|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)
DF_cast.show()
| ClientNum | Filters
| 32103 | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d9e517ce
| 218056 | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3c744494
Sample JSON data:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
Thanks !!
Answer 1:
I created a sample JSON dataset to match that schema and loaded it as the DataFrame s:
{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}
A plain cast reproduces the problem:
s.select(s.col("ClientNum"), s.col("Filters").cast(StringType)).show(false)
+---------+------------------------------------------------------------------+
|ClientNum|Filters |
+---------+------------------------------------------------------------------+
|abc123 |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@60fca57e|
+---------+------------------------------------------------------------------+
Your problem is best solved with the explode() function, which turns each array element into its own row, followed by star expansion to turn the struct fields into columns:
s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()
+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+
To get a single comma-separated string column instead:
s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()
+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+
Explode Array reference: Flattening Rows in Spark
Star expand reference for "struct" type: How to flatten a struct in a spark dataframe?
Answer 2:
For me, in PySpark, the to_json() function did the job.
Compared with a simple cast to string, it has the advantage of keeping the struct keys as well as the struct values. For the example in the question, the result looks like this:
[{"Op":"foo","Type":"bar","Val":"baz"}]
This was much more useful to me, since I had to write the results to a Postgres table. In this format I can easily use the JSON functions that Postgres supports.
Answer 3:
You can try this:
DF = DF.withColumn('Filters', DF.Filters.cast("string"))
Source: https://stackoverflow.com/questions/43347098/pyspark-cast-array-with-nested-struct-to-string