Efficient string suffix detection

一生所求 2020-12-10 16:39

I am working with PySpark on a huge dataset, where I want to filter the data frame based on strings in another data frame. For example,

dd = spark.createData         
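
A plausible reconstruction of the truncated snippet above, with the column names domains and gooddomains and sample rows inferred from the answer below (an assumption, not the original data):

from pyspark.sql.types import StringType

# Hypothetical example data, inferred from the expected output shown in the answer
dd = spark.createDataFrame(
    [
        "something.google.com",
        "something.google.com.somethingelse.ac.uk",
        "something.good.com.cy",
        "something.good.com.cy.mal.org",
    ],
    StringType(),
).toDF("domains")

dd1 = spark.createDataFrame(
    ["google.com", "good.com.cy"],
    StringType(),
).toDF("gooddomains")

The goal, as addressed in the answer below, is to drop every row of dd whose domain ends with one of the gooddomains in dd1.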


        
2 Answers
  •  情书的邮戳
    2020-12-10 17:18

    If I understand correctly, you just want a left anti join using a simple SQL string matching pattern.

    from pyspark.sql.functions import expr
    
    dd.alias("l")\
        .join(
            dd1.alias("r"), 
            on=expr("l.domains LIKE concat('%', r.gooddomains)"), 
            how="leftanti"
        )\
        .select("l.*")\
        .show(truncate=False)
    #+----------------------------------------+
    #|domains                                 |
    #+----------------------------------------+
    #|something.google.com.somethingelse.ac.uk|
    #|something.good.com.cy.mal.org           |
    #+----------------------------------------+
    

    The expression concat('%', r.gooddomains) prepends a wildcard to r.gooddomains.

    Next, we use l.domains LIKE concat('%', r.gooddomains) to find the rows which match this pattern.

    Finally, specify how="leftanti" in order to keep only the rows that don't match.


    Update: As pointed out in the comments by @user10938362, there are two flaws with this approach:

    1) Since this only looks at matching suffixes, there are edge cases where this produces the wrong results. For example:

    example.com should match example.com and subdomain.example.com, but not fakeexample.com
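
    To see the flaw concretely (a quick sketch):

    spark.sql(
        "SELECT 'fakeexample.com' LIKE concat('%', 'example.com') AS naive_match"
    ).show()
    #+-----------+
    #|naive_match|
    #+-----------+
    #|       true|
    #+-----------+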

    There are two ways to approach this. The first is to modify the LIKE expression. Since we know these are all valid domains, we can check either for an exact match or for a suffix consisting of a dot followed by the good domain:

    like_expr = " OR ".join(
        [
            "(l.domains = r.gooddomains)",
            "(l.domains LIKE concat('%.', r.gooddomains))"
        ]
    )
    
    dd.alias("l")\
        .join(
            dd1.alias("r"), 
            on=expr(like_expr), 
            how="leftanti"
        )\
        .select("l.*")\
        .show(truncate=False)
    

    Similarly, one could use RLIKE with a regular-expression pattern containing a look-behind, as sketched below.
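
    A rough sketch of that variant (it uses a start-of-string/dot alternation in place of an actual look-behind, and assumes the same l and r aliases):

    rlike_expr = "l.domains RLIKE concat('(^|[.])', r.gooddomains, '$')"

    dd.alias("l")\
        .join(dd1.alias("r"), on=expr(rlike_expr), how="leftanti")\
        .select("l.*")\
        .show(truncate=False)
    # Caveat: dots inside r.gooddomains are treated as regex wildcards here;
    # escape them (e.g. with regexp_replace) if that matters for your data.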

    2) The larger issue is that, as explained here, joining on a LIKE expression will cause a Cartesian Product. If dd1 is small enough to be broadcast, then this isn't an issue.

    Otherwise, you may run into performance issues and will have to try a different approach.
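
    For the small-dd1 case, one way to make the broadcast explicit is the broadcast hint (a sketch, reusing like_expr from above):

    from pyspark.sql.functions import broadcast

    dd.alias("l")\
        .join(broadcast(dd1.alias("r")), on=expr(like_expr), how="leftanti")\
        .select("l.*")\
        .show(truncate=False)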


    More on the Spark SQL LIKE operator, from the Apache Hive docs:

    A LIKE B:

    TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon, and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
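
    Those rules are easy to verify directly (a quick sketch):

    spark.sql("""
        SELECT 'foobar' LIKE 'foo'    AS like_foo,
               'foobar' LIKE 'foo___' AS like_underscores,
               'foobar' LIKE 'foo%'   AS like_percent
    """).show()
    #+--------+----------------+------------+
    #|like_foo|like_underscores|like_percent|
    #+--------+----------------+------------+
    #|   false|            true|        true|
    #+--------+----------------+------------+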


    Note: This exploits the "trick" of using pyspark.sql.functions.expr to pass in a column value as a parameter to a function.
