How to efficiently check if a list of words is contained in a Spark Dataframe?

春和景丽 2020-12-17 06:07

Using PySpark dataframes, I'm trying to do the following as efficiently as possible. I have a dataframe with a column which contains text and a list of words I want to filte

2 Answers
  • 2020-12-17 06:31

    You should consider using the built-in pyspark.sql functions instead of writing a UDF; several of them are regexp-based.

    First, let's start with a more complete sample data frame:

    df = sc.parallelize([["a","b","foo is tasty"],["12","34","blah blahhh"],["yeh","0","bar of yums"], 
                         ['haha', '1', 'foobar none'], ['hehe', '2', 'something bar else']])\
        .toDF(["col1","col2","col_with_text"])
    

    If you want to filter rows based on whether they contain one of the words in words_list, you can use rlike:

    import pyspark.sql.functions as psf
    words_list = ['foo','bar']
    # Raw strings keep \s intact for the regex engine
    df.filter(psf.col('col_with_text').rlike(r'(^|\s)(' + '|'.join(words_list) + r')(\s|$)')).show()
    
        +----+----+------------------+
        |col1|col2|     col_with_text|
        +----+----+------------------+
        |   a|   b|      foo is tasty|
        | yeh|   0|       bar of yums|
        |hehe|   2|something bar else|
        +----+----+------------------+
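
The filter above joins the words straight into the regex, so a word containing a regex metacharacter (for example `c++`) would change the pattern's meaning. Below is a minimal sketch of one way to guard against that, assuming the words should be matched literally. `build_pattern` and `filter_rows` are hypothetical helper names, not part of PySpark, and the sketch assumes that the escapes produced by Python's `re.escape` are read the same way by Spark's Java regex engine for ordinary punctuation.

```python
import re

def build_pattern(words):
    """Return a regex matching any of `words` as a whitespace-delimited token."""
    # Escape each word so characters such as '+' or '.' are matched literally.
    alternatives = '|'.join(re.escape(w) for w in words)
    return r'(^|\s)(' + alternatives + r')(\s|$)'

def filter_rows(df, column, words):
    """Keep the rows of `df` whose `column` contains at least one of `words`."""
    # Imported here so build_pattern can be tried without a Spark installation.
    import pyspark.sql.functions as psf
    return df.filter(psf.col(column).rlike(build_pattern(words)))

# Check the pattern locally on the sample texts before running it on the cluster.
pattern = build_pattern(['foo', 'bar'])
samples = ['foo is tasty', 'blah blahhh', 'bar of yums',
           'foobar none', 'something bar else']
matched = [s for s in samples if re.search(pattern, s)]
print(matched)
```

With the sample data frame, `filter_rows(df, 'col_with_text', words_list).show()` would be the Spark-side call.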
    

    If you want to extract the strings matching the regular expression, you can use regexp_extract:

    df.withColumn(
            'extracted_word', 
            psf.regexp_extract('col_with_text', r'(?=^|\s)(' + '|'.join(words_list) + r')(?=\s|$)', 0))\
        .show()
    
        +----+----+------------------+--------------+
        |col1|col2|     col_with_text|extracted_word|
        +----+----+------------------+--------------+
        |   a|   b|      foo is tasty|           foo|
        |  12|  34|       blah blahhh|              |
        | yeh|   0|       bar of yums|           bar|
        |haha|   1|       foobar none|              |
        |hehe|   2|something bar else|              |
        +----+----+------------------+--------------+
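
Note that the row `something bar else` comes back empty here even though `rlike` matched it. A likely reason is that the leading group `(?=^|\s)` is a lookahead rather than a lookbehind, so a word can only follow it at the very start of the string. The sketch below uses Python's `re` module to show the difference. The replacement pattern `(?<!\S)(...)(?!\S)` ("neither preceded nor followed by a non-whitespace character") is a suggestion and not part of the original answer, and it assumes that Spark's Java regex engine treats these lookarounds the same way Python does.

```python
import re

words_list = ['foo', 'bar']
alternatives = '|'.join(words_list)

# Pattern from the answer: after the leading lookahead, a word can only
# match at the start of the string.
lookahead_pattern = r'(?=^|\s)(' + alternatives + r')(?=\s|$)'
# Suggested variant: negative lookarounds used as whitespace-aware boundaries.
boundary_pattern = r'(?<!\S)(' + alternatives + r')(?!\S)'

def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or '' if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ''

for text in ['foo is tasty', 'foobar none', 'something bar else']:
    print(text, '|',
          first_match(lookahead_pattern, text), '|',
          first_match(boundary_pattern, text))
```

If the assumption about the Java engine holds, passing `boundary_pattern` to `psf.regexp_extract('col_with_text', boundary_pattern, 0)` should also extract `bar` from the last row.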
    
  • 2020-12-17 06:42

    I tried this with a changed word list:

    words_list = ['foo', 'is', 'bar']

    The result stays the same: the additional matching word is not shown.

        +----+----+------------------+--------------+
        |col1|col2|     col_with_text|extracted_word|
        +----+----+------------------+--------------+
        |   a|   b|      foo is tasty|           foo|
        |  12|  34|       blah blahhh|              |
        | yeh|   0|       bar of yums|           bar|
        |haha|   1|       foobar none|              |
        |hehe|   2|something bar else|              |
        +----+----+------------------+--------------+
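
The unchanged result is consistent with `regexp_extract` returning only the first match in each row, so `is` never appears next to `foo`. One way to collect every listed word is sketched below, assuming that splitting the text on whitespace is an acceptable tokenization. `matching_words` and `add_matching_words` are hypothetical names. `split`, `array`, `lit` and `array_intersect` are functions from pyspark.sql.functions, and `array_intersect` requires Spark 2.4 or later.

```python
def matching_words(text, words):
    """Pure-Python illustration: listed words that occur as tokens in `text`."""
    wanted = set(words)
    found = []
    for token in text.split():
        # Keep each listed word once, in order of first appearance.
        if token in wanted and token not in found:
            found.append(token)
    return found

def add_matching_words(df, column, words):
    """Add an array column holding every listed word found in `column`."""
    # Imported here so matching_words can be tried without a Spark installation.
    import pyspark.sql.functions as psf
    tokens = psf.split(psf.col(column), r'\s+')
    wanted = psf.array(*[psf.lit(w) for w in words])
    return df.withColumn('extracted_words', psf.array_intersect(tokens, wanted))

print(matching_words('foo is tasty', ['foo', 'is', 'bar']))
```

On the sample data frame, the Spark-side call would be `add_matching_words(df, 'col_with_text', words_list).show()`, which yields an array per row instead of a single string.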
