How to efficiently check if a list of words is contained in a Spark Dataframe?


Using PySpark dataframes I'm trying to do the following as efficiently as possible. I have a dataframe with a column which contains text and a list of words I want to filte


2 Answers

温柔的废话

2020-12-17 06:31

You should consider using the functions in the pyspark.sql module instead of writing a UDF; several of them are regexp-based.

First, let's start with a more complete sample data frame:

df = sc.parallelize([["a","b","foo is tasty"],["12","34","blah blahhh"],["yeh","0","bar of yums"], 
                     ['haha', '1', 'foobar none'], ['hehe', '2', 'something bar else']])\
    .toDF(["col1","col2","col_with_text"])

If you want to filter lines based on whether they contain one of the words in words_list, you can use rlike:

import pyspark.sql.functions as psf
words_list = ['foo','bar']
df.filter(psf.col('col_with_text').rlike(r'(^|\s)(' + '|'.join(words_list) + r')(\s|$)')).show()

    +----+----+------------------+
    |col1|col2|     col_with_text|
    +----+----+------------------+
    |   a|   b|      foo is tasty|
    | yeh|   0|       bar of yums|
    |hehe|   2|something bar else|
    +----+----+------------------+
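
If the words can contain regex metacharacters (for example `c++` or `a.b`), joining them with `|` as-is would change the meaning of the pattern. Below is a minimal sketch of one way to build the same pattern with each word escaped first. The helper names `build_word_pattern` and `filter_rows_with_words` are hypothetical, not part of PySpark, and the escaping is done with Python's `re.escape`, which I assume is also acceptable to the Java regex engine Spark uses for `rlike`.

```python
import re


def build_word_pattern(words):
    """Return a regex matching any of `words` as a whole whitespace-delimited token."""
    # Escape each word so characters such as '.' or '+' are matched literally.
    alternation = "|".join(re.escape(w) for w in words)
    return r"(^|\s)(" + alternation + r")(\s|$)"


def filter_rows_with_words(df, column, words):
    """Keep rows of `df` whose `column` contains at least one of `words`."""
    # Imported lazily so the pattern helper can be tried without Spark installed.
    import pyspark.sql.functions as psf
    return df.filter(psf.col(column).rlike(build_word_pattern(words)))


pattern = build_word_pattern(["foo", "bar"])

# Local sanity check with Python's `re`; Spark itself evaluates the pattern
# with Java's regex engine, so treat this only as an approximation.
samples = ["foo is tasty", "blah blahhh", "bar of yums",
           "foobar none", "something bar else"]
matched = [s for s in samples if re.search(pattern, s)]
print(matched)
```

With the sample data frame above, the Spark-side call would be `filter_rows_with_words(df, 'col_with_text', words_list).show()`.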

If you want to extract the strings matching the regular expression, you can use regexp_extract:

df.withColumn(
        'extracted_word', 
        psf.regexp_extract('col_with_text', r'(?=^|\s)(' + '|'.join(words_list) + r')(?=\s|$)', 0))\
    .show()

    +----+----+------------------+--------------+
    |col1|col2|     col_with_text|extracted_word|
    +----+----+------------------+--------------+
    |   a|   b|      foo is tasty|           foo|
    |  12|  34|       blah blahhh|              |
    | yeh|   0|       bar of yums|           bar|
    |haha|   1|       foobar none|              |
    |hehe|   2|something bar else|              |
    +----+----+------------------+--------------+
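
Note that the last row is empty even though `something bar else` contains `bar`. The leading `(?=^|\s)` is a lookahead, so it checks the character at the position where the word itself has to start. That can only succeed at the start of the string. The sketch below reproduces this with Python's `re` module and compares it with a boundary written as "not preceded and not followed by a non-whitespace character". The variable names and the `first_match` helper are mine; Spark uses Java's regex engine, which as far as I know accepts the same syntax for both patterns.

```python
import re

words_list = ["foo", "bar"]
alternation = "|".join(words_list)

# Pattern from the answer: the leading group is a lookahead, so a word in the
# middle of the text can never satisfy it (a word does not start with whitespace).
lookahead_pattern = r"(?=^|\s)(" + alternation + r")(?=\s|$)"

# Alternative boundary: no non-whitespace character directly before or after.
boundary_pattern = r"(?<!\S)(" + alternation + r")(?!\S)"


def first_match(pattern, text):
    """Mimic regexp_extract(..., 0): the first match, or '' when there is none."""
    found = re.search(pattern, text)
    return found.group(0) if found else ""


for text in ["foo is tasty", "foobar none", "something bar else"]:
    print(repr(text),
          repr(first_match(lookahead_pattern, text)),
          repr(first_match(boundary_pattern, text)))
```

Passing `boundary_pattern` to `psf.regexp_extract('col_with_text', boundary_pattern, 0)` should then extract `bar` for the last row as well, while `foobar none` still gives an empty string.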


春和景丽

2020-12-17 06:42

I tried this. If you change the word list, for example to

words_list = ['foo', 'is', 'bar']

the result stays the same: the extracted column still shows only one word per row and not the other matching words.

    +----+----+------------------+--------------+
    |col1|col2|     col_with_text|extracted_word|
    +----+----+------------------+--------------+
    |   a|   b|      foo is tasty|           foo|
    |  12|  34|       blah blahhh|              |
    | yeh|   0|       bar of yums|           bar|
    |haha|   1|       foobar none|              |
    |hehe|   2|something bar else|              |
    +----+----+------------------+--------------+
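
This is expected: `regexp_extract` returns only the first match in each row, so adding `is` to the list cannot change the first row. One possible way to get every listed word that occurs in the text is to split the column on whitespace and intersect the tokens with the word list. The sketch below assumes Spark 2.4 or later, where `array_intersect` is available. The names `matched_words`, `with_matched_words` and the output column `matched_words` are hypothetical, and the pure-Python function is only a rough analogue for trying the idea without a cluster.

```python
def matched_words(text, words):
    """Pure-Python analogue: distinct tokens of `text` that appear in `words`."""
    wanted = set(words)
    result = []
    for token in text.split():
        # Keep each wanted token once, in order of first appearance.
        if token in wanted and token not in result:
            result.append(token)
    return result


def with_matched_words(df, column, words, out_column="matched_words"):
    """Add an array column holding the listed words found in `column`."""
    # Imported lazily so the analogue above can be tried without Spark installed.
    import pyspark.sql.functions as psf
    tokens = psf.split(psf.col(column), r"\s+")        # text -> array of tokens
    wanted = psf.array(*[psf.lit(w) for w in words])   # word list as array literal
    return df.withColumn(out_column, psf.array_intersect(tokens, wanted))


print(matched_words("foo is tasty", ["foo", "is", "bar"]))
```

On the sample data frame this would be called as `with_matched_words(df, 'col_with_text', words_list).show()`. Because it compares whole tokens, `foobar none` should produce an empty array.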
