What is the best way to remove accents with Apache Spark dataframes in PySpark?

Asked by -上瘾入骨i on 2020-12-06 16:20 · 4 answers · 945 views

I need to remove accents from characters in Spanish and other languages across different datasets.

I already wrote a function based on the code provided in this post.

4 Answers
  •  感情败类
    2020-12-06 17:14

    This solution uses Python only, but it is only useful if the number of possible accents is low (e.g. a single language like Spanish) and the character replacements are specified manually.

    There seems to be no built-in way to do what you asked for directly without UDFs; however, you can chain many regexp_replace calls to replace each possible accented character. I tested the performance of this solution, and it only runs faster if you have a very limited set of accents to replace. In that case it can be faster than UDFs because it is optimized outside of Python.

    from pyspark.sql.functions import col, regexp_replace
    
    accent_replacements_spanish = [
        ('á', 'a'), ('Á', 'A'),
        ('é', 'e'), ('É', 'E'),
        ('í', 'i'), ('Í', 'I'),
        ('ó', 'o'), ('Ó', 'O'),
        ('ú|ü', 'u'), ('Ú|Ü', 'U'),
        ('ñ', 'n'), ('Ñ', 'N'),
        # see http://stackoverflow.com/a/18123985/3810493 for other characters
    
        # this converts any remaining non-ASCII character to a question mark:
        (r'[^\x00-\x7F]', '?')
    ]
    
    def remove_accents(column):
        # chain one regexp_replace per replacement pair over the column
        r = col(column)
        for a, b in accent_replacements_spanish:
            r = regexp_replace(r, a, b)
        return r.alias('remove_accents(' + column + ')')
    
    # `spark` is a SparkSession; on older versions use sqlContext.createDataFrame
    df = spark.createDataFrame([['Olà'], ['Olé'], ['Núñez']], ['str'])
    df.select(remove_accents('str')).show()
    # 'Olà' becomes 'Ol?': à is not in the replacement list, so the
    # catch-all [^\x00-\x7F] pattern turns it into a question mark
    

    I haven't compared the performance with the other answers, and this function is not as general, but it is at least worth considering because you don't need to add Scala or Java to your build process.
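    For the more general case the answer alludes to (arbitrary accents from any language), a common alternative is a Python UDF that strips combining marks via Unicode NFD normalization. A minimal sketch, assuming Python 3 (the `strip_accents` name is mine, not from the post):

    ```python
    import unicodedata

    def strip_accents(s):
        # Decompose each character into base + combining marks (NFD),
        # then drop the marks (Unicode category 'Mn' = nonspacing mark).
        return ''.join(
            c for c in unicodedata.normalize('NFD', s)
            if unicodedata.category(c) != 'Mn'
        )

    # To apply it to a DataFrame column, wrap it as a UDF, e.g.:
    #   from pyspark.sql.functions import udf
    #   from pyspark.sql.types import StringType
    #   df.select(udf(strip_accents, StringType())('str'))
    ```

    Unlike the regexp chain above, this runs through Python (so it loses the Catalyst optimization), but it handles accented characters from any language, as long as they decompose into a base letter plus combining marks.
    
    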
