Filtering DataFrame using the length of a column

耶瑟儿~ 2020-12-02 22:28

I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.

3 Answers
  •  情深已故
    2020-12-02 23:05

    In Spark >= 1.5 you can use the size function:

    from pyspark.sql.functions import col, size
    
    df = sqlContext.createDataFrame([
        (["L", "S", "Y", "S"],  ),
        (["L", "V", "I", "S"],  ),
        (["I", "A", "N", "A"],  ),
        (["I", "L", "S", "A"],  ),
        (["E", "N", "N", "Y"],  ),
        (["E", "I", "M", "A"],  ),
        (["O", "A", "N", "A"],  ),
        (["S", "U", "S"],  )], 
        ("tokens", ))
    
    df.where(size(col("tokens")) <= 3).show()
    
    ## +---------+
    ## |   tokens|
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
    
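    As a side note, the same size column works anywhere a Column is accepted, not just inside where. A minimal sketch reusing the df defined above (the n_tokens alias is just an illustrative name):

    # Project the array length out as a regular column
    df.select(col("tokens"), size(col("tokens")).alias("n_tokens")).show(3)
    
    ## +------------+--------+
    ## |      tokens|n_tokens|
    ## +------------+--------+
    ## |[L, S, Y, S]|       4|
    ## |[L, V, I, S]|       4|
    ## |[I, A, N, A]|       4|
    ## +------------+--------+
    ## only showing top 3 rows
    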

    In Spark < 1.5 a UDF should do the trick:

    from pyspark.sql.types import IntegerType
    from pyspark.sql.functions import udf
    
    size_ = udf(lambda xs: len(xs), IntegerType())
    
    df.where(size_(col("tokens")) <= 3).show()
    
    ## +---------+
    ## |   tokens|
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
    
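    One caveat with the plain Python UDF, not covered above: len raises a TypeError when the column value is null, because PySpark passes None into the function. If tokens can be null, you may want to guard for that (size_safe is just an illustrative name):

    from pyspark.sql.types import IntegerType
    from pyspark.sql.functions import udf
    
    # Sketch: return null for null arrays instead of failing inside the UDF;
    # null <= 3 evaluates to null, so those rows are simply filtered out
    size_safe = udf(lambda xs: len(xs) if xs is not None else None, IntegerType())
    
    df.where(size_safe(col("tokens")) <= 3).show()
    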

    If you use a HiveContext, then the size UDF with raw SQL should work with any version:

    df.registerTempTable("df")
    sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
    
    ## +--------------------+
    ## |              tokens|
    ## +--------------------+
    ## |ArrayBuffer(S, U, S)|
    ## +--------------------+
    
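    Equivalently, the same predicate can be passed as an expression string straight to where / filter, which avoids registering a temp table (a sketch, assuming the same HiveContext as above):

    # where accepts a SQL expression string as well as a Column
    df.where("size(tokens) <= 3").show()
    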

    For string columns you can use either a UDF defined as above or the length function:

    from pyspark.sql.functions import length
    
    df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
    df.where(length(col("k")) <= 3).show()
    
    ## +---+
    ## |  k|
    ## +---+
    ## |bar|
    ## +---+
    
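    To inspect the computed lengths next to the values, the same length call can be projected out (a small sketch on the df just defined; len_k is just an illustrative alias):

    df.select(col("k"), length(col("k")).alias("len_k")).show()
    
    ## +----+-----+
    ## |   k|len_k|
    ## +----+-----+
    ## |fooo|    4|
    ## | bar|    3|
    ## +----+-----+
    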
