Efficient string suffix detection

一生所求 2020-12-10 16:39

I am working with PySpark on a huge dataset, where I want to filter the data frame based on strings in another data frame. For example,

dd = spark.createData         
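
A plausible reconstruction of the truncated snippet above, with the column names domains and gooddomains and sample rows inferred from the answer below (an assumption, not the original data):

from pyspark.sql.types import StringType

# Hypothetical example data, inferred from the expected output shown in the answer
dd = spark.createDataFrame(
    [
        "something.google.com",
        "something.google.com.somethingelse.ac.uk",
        "something.good.com.cy",
        "something.good.com.cy.mal.org",
    ],
    StringType(),
).toDF("domains")

dd1 = spark.createDataFrame(
    ["google.com", "good.com.cy"],
    StringType(),
).toDF("gooddomains")

The goal, as addressed in the answer below, is to drop every row of dd whose domain ends with one of the gooddomains in dd1.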


        
2 Answers
  •  情书的邮戳
    2020-12-10 17:18

    If I understand correctly, you just want a left anti join using a simple SQL string matching pattern.

    from pyspark.sql.functions import expr
    
    dd.alias("l")\
        .join(
            dd1.alias("r"), 
            on=expr("l.domains LIKE concat('%', r.gooddomains)"), 
            how="leftanti"
        )\
        .select("l.*")\
        .show(truncate=False)
    #+----------------------------------------+
    #|domains                                 |
    #+----------------------------------------+
    #|something.google.com.somethingelse.ac.uk|
    #|something.good.com.cy.mal.org           |
    #+----------------------------------------+
    

    The expression concat('%', r.gooddomains) prepends a wildcard to r.gooddomains.

    Next, we use l.domains LIKE concat('%', r.gooddomains) to find the rows which match this pattern.

    Finally, specify how="leftanti" in order to keep only the rows that don't match.


    Update: As pointed out in the comments by @user10938362, there are two flaws with this approach:

    1) Since this only looks at matching suffixes, there are edge cases where this produces the wrong results. For example:

    example.com should match example.com and subdomain.example.com, but not fakeexample.com
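
    To see the flaw concretely (a quick sketch):

    spark.sql(
        "SELECT 'fakeexample.com' LIKE concat('%', 'example.com') AS naive_match"
    ).show()
    #+-----------+
    #|naive_match|
    #+-----------+
    #|       true|
    #+-----------+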

    There are two ways to approach this. The first is to modify the LIKE expression. Since we know these are all valid domains, we can check either for an exact match or for a suffix consisting of a dot followed by the good domain:

    like_expr = " OR ".join(
        [
            "(l.domains = r.gooddomains)",
            "(l.domains LIKE concat('%.', r.gooddomains))"
        ]
    )
    
    dd.alias("l")\
        .join(
            dd1.alias("r"), 
            on=expr(like_expr), 
            how="leftanti"
        )\
        .select("l.*")\
        .show(truncate=False)
    

    Similarly, one could use RLIKE with a regular-expression pattern containing a look-behind, as sketched below.
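
    A rough sketch of that variant (it uses a start-of-string/dot alternation in place of an actual look-behind, and assumes the same l and r aliases):

    rlike_expr = "l.domains RLIKE concat('(^|[.])', r.gooddomains, '$')"

    dd.alias("l")\
        .join(dd1.alias("r"), on=expr(rlike_expr), how="leftanti")\
        .select("l.*")\
        .show(truncate=False)
    # Caveat: dots inside r.gooddomains are treated as regex wildcards here;
    # escape them (e.g. with regexp_replace) if that matters for your data.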

    2) The larger issue is that, as explained here, joining on a LIKE expression will cause a Cartesian Product. If dd1 is small enough to be broadcast, then this isn't an issue.

    Otherwise, you may run into performance issues and will have to try a different approach.
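
    For the small-dd1 case, one way to make the broadcast explicit is the broadcast hint (a sketch, reusing like_expr from above):

    from pyspark.sql.functions import broadcast

    dd.alias("l")\
        .join(broadcast(dd1.alias("r")), on=expr(like_expr), how="leftanti")\
        .select("l.*")\
        .show(truncate=False)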


    More on the Spark SQL LIKE operator, from the Apache Hive docs:

    A LIKE B:

    TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon, and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
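
    Those rules are easy to verify directly (a quick sketch):

    spark.sql("""
        SELECT 'foobar' LIKE 'foo'    AS like_foo,
               'foobar' LIKE 'foo___' AS like_underscores,
               'foobar' LIKE 'foo%'   AS like_percent
    """).show()
    #+--------+----------------+------------+
    #|like_foo|like_underscores|like_percent|
    #+--------+----------------+------------+
    #|   false|            true|        true|
    #+--------+----------------+------------+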


    Note: This exploits the "trick" of using pyspark.sql.functions.expr to pass in a column value as a parameter to a function.
