What is the best way to remove accents with Apache Spark dataframes in PySpark?

-上瘾入骨i 2020-12-06 16:20

I need to remove accents from characters in Spanish and other languages across different datasets.

I already wrote a function based on the code provided in this post t

4 answers
  •  一整个雨季
    2020-12-06 16:53

    Here's my implementation. Apart from accents, I also remove special characters, because I needed to pivot and save a table, and you can't save a table whose column names contain any of the characters " ,;{}()\n\t=\/".

    
    import re
    
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructType, StructField
    from unidecode import unidecode
    
    spark = SparkSession.builder.getOrCreate()
    data = [(1, "  \\ / \\ {____} aŠdá_ \t =  \n () asd ____aa 2134_ 23_"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"),
            (3, None)]
    schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
    df = spark.createDataFrame(data, schema)
    df.show()
    
    for col_name in ["txt"]:
        # Map each distinct original value to its cleaned form.
        tmp_dict = {}
        for col_value in [row[0] for row in df.select(col_name).distinct().toLocalIterator()
                          if row[0] is not None]:
            # Replace characters that are not allowed in column names with "_".
            new_col_value = re.sub(r"[ ,;{}()\n\t=\\/]", "_", col_value)
            # Collapse runs of "_" and trim one from each end.
            new_col_value = re.sub("_+", "_", new_col_value).strip("_")
            # Transliterate accented characters to ASCII, then lowercase.
            new_col_value = unidecode(new_col_value)
            tmp_dict[col_value] = new_col_value.lower()
        df = df.na.replace(to_replace=tmp_dict, subset=[col_name])
    df.show()
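
The per-value cleanup inside the loop can also be pulled out into a plain function, which makes it easy to check without a Spark session. This is only a sketch: the name `clean_name` and the optional `strip_accents` parameter are hypothetical, not part of the answer's code, and the accent-removal step is left pluggable so that `unidecode` or any other callable can be passed in.

```python
import re


def clean_name(value, strip_accents=None):
    """Return a lowercase, underscore-separated version of `value`.

    `strip_accents` is a hypothetical hook: any callable str -> str
    (for example `unidecode`) applied before lowercasing.
    """
    # Replace characters that are not allowed in column names with "_".
    cleaned = re.sub(r"[ ,;{}()\n\t=\\/]", "_", value)
    # Collapse runs of "_" and drop leading/trailing underscores.
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")
    if strip_accents is not None:
        cleaned = strip_accents(cleaned)
    return cleaned.lower()


if __name__ == "__main__":
    print(clean_name("  {Total} (EUR);\t2020 "))  # total_eur_2020
```

Building `tmp_dict` then reduces to `{v: clean_name(v, unidecode) for v in distinct_values}`, where `distinct_values` stands for the non-null values collected from the column.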
    

    If you can't use external libraries (as in my case), you can replace unidecode with:

    new_col_value = new_col_value.translate(str.maketrans(
                        "ä,ö,ü,ẞ,á,ä,č,ď,é,ě,í,ĺ,ľ,ň,ó,ô,ŕ,š,ť,ú,ů,ý,ž,Ä,Ö,Ü,ẞ,Á,Ä,Č,Ď,É,Ě,Í,Ĺ,Ľ,Ň,Ó,Ô,Ŕ,Š,Ť,Ú,Ů,Ý,Ž",
                        "a,o,u,s,a,a,c,d,e,e,i,l,l,n,o,o,r,s,t,u,u,y,z,A,O,U,S,A,A,C,D,E,E,I,L,L,N,O,O,R,S,T,U,U,Y,Z"))
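
Another option that needs no external library and no hand-written character table is the standard `unicodedata` module. The sketch below is a different technique from the `translate` table above: it decomposes each character (NFD) and drops the combining marks. The function name `strip_accents` is hypothetical. Note that letters without a canonical decomposition, such as `ß` or `ł`, pass through unchanged, so it is not a full replacement for `unidecode`.

```python
import unicodedata


def strip_accents(text):
    """Remove combining diacritical marks from `text`."""
    # NFD splits e.g. "á" into "a" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("aŠdá"))   # aSda
    print(strip_accents("Ñandú"))  # Nandu
```

In the loop above, `new_col_value = strip_accents(new_col_value)` would take the place of the `unidecode` call.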
    
