Question
I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:
root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there should be case-insensitive (like I did for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have any ideas for a straightforward way to do this?
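A minimal sketch to make the setup concrete (assuming an existing SparkSession named spark; the sample data is made up for illustration and is not from the original question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows matching the schema above: a string column and an array-of-strings column
dataframe = spark.createDataFrame(
    [("John", ["Smith", "Jones"]), ("Jane", ["Smith"]), ("John", ["Doe"])],
    ["name", "lastName"],
)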
Answer 1:
An update in 2019: Spark 2.4.0 introduced higher-order functions such as transform, which can be combined with array_contains (see the official document), so this can now be done in SQL.
For your problem, it should be:
dataframe.filter('array_contains(transform(lastName, x -> upper(x)), "SMITH")')
This is better than the previous solution that uses the RDD as a bridge, because DataFrame operations are much faster than RDD operations.
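For completeness, a sketch of the full filter (both the name and the lastName conditions) using the DataFrame API; this assumes Spark 2.4+ and passes the transform expression through F.expr, since a native Python transform function only appears in later Spark releases:
from pyspark.sql import functions as F

result = (
    dataframe
    .filter(F.upper(F.col("name")) == "JOHN")
    .filter(F.expr("array_contains(transform(lastName, x -> upper(x)), 'SMITH')"))
)
result.show()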
Answer 2:
You could consider working on the underlying RDD directly.
def my_filter(row):
    # Keep the row only if name is 'JOHN' (case-insensitive) and the
    # lastName array contains 'SMITH' (case-insensitive).
    if row.name.upper() == 'JOHN':
        for it in row.lastName:
            if it.upper() == 'SMITH':
                yield row
                break  # yield each matching row only once

dataframe = dataframe.rdd.flatMap(my_filter).toDF()
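As a usage note, toDF() with no arguments re-infers the schema from the surviving rows; one way to keep the original schema (an optional tweak, not part of the original answer) is to pass it in explicitly:
filtered = dataframe.rdd.flatMap(my_filter).toDF(dataframe.schema)
filtered.show()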
Source: https://stackoverflow.com/questions/38019921/pyspark-dataframes-filter-where-some-value-is-in-array-column