Question
I have a Spark DataFrame df with the following schema:
root
|-- k: integer (nullable = false)
|-- v: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: double (nullable = false)
| | |-- c: string (nullable = true)
Is it possible to select just a and b from the structs in v, directly from df, without doing a map? In particular, df is loaded from a Parquet file, and I don't want the values for c to even be loaded/read.
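For context, a setup like the one in the question could be reproduced as follows (a minimal sketch; the sample data, the temporary path, and the local session are all invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Structs (a, b, c) inside an array column v, matching the schema above
case class Rec(a: Int, b: Double, c: String)
val sample = Seq(
  (1, Seq(Rec(1, 1.0, "x"), Rec(2, 2.0, "y"))),
  (2, Seq(Rec(3, 3.0, "z")))
).toDF("k", "v")

// Round-trip through Parquet so df is read the same way as in the question
sample.write.mode("overwrite").parquet("/tmp/df.parquet")
val df = spark.read.parquet("/tmp/df.parquet")
df.printSchema()
```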
Answer 1:
It depends on exactly what you expect as an output, which is not clear from your question. Let me clarify. You can do

df.select($"v.a", $"v.b").show()

however, the result may not be what you want: since v is an array, this yields, per row, one array of values for a and one for b. What you may want instead is to explode the array v, then select from the exploded DataFrame:

df.select(explode($"v").as("v" :: Nil)).select($"v.a", $"v.b").show()

This flattens v into a table with one row per array element. In either case, Spark/Parquet should be smart enough to use column pruning (projection pushdown) and not read c at all.
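The two variants above can be compared end to end with a self-contained sketch (assuming a local SparkSession; the sample rows are invented for illustration and do not come from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Rec(a: Int, b: Double, c: String)
val df = Seq(
  (1, Seq(Rec(1, 1.0, "x"), Rec(2, 2.0, "y")))
).toDF("k", "v")

// Variant 1: direct select, keeps the array shape --
// one row per original row, with array-typed columns for a and b
df.select($"v.a", $"v.b").show()

// Variant 2: explode first, then select --
// one row per struct element, with scalar columns a and b
df.select(explode($"v").as("v" :: Nil))
  .select($"v.a", $"v.b")
  .show()
```

Note that `.as("v" :: Nil)` uses the Seq[String] overload of Column.as, which is the form accepted for aliasing generator output such as explode.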
Source: https://stackoverflow.com/questions/37172254/select-specific-columns-in-spark-dataframes-from-array-of-struct