Selecting a range of elements in an array in Spark SQL

Submitted anonymously (unverified) on 2019-12-03 01:26:01

Question:

I use the Spark shell to do the operations below.

I recently loaded a table with an array column in Spark SQL.

Here is the DDL for the table:

create table test_emp_arr(
    dept_id string,
    dept_nm string,
    emp_details Array<string>
)

The data looks something like this:

+-------+-------+-------------------------------+
|dept_id|dept_nm|                    emp_details|
+-------+-------+-------------------------------+
|     10|Finance|[Jon, Snow, Castle, Black, Ned]|
|     20|     IT|            [Ned, is, no, more]|
+-------+-------+-------------------------------+

I can query the emp_details column like this:

sqlContext.sql("select emp_details[0] from emp_details").show

Problem

I want to query a range of elements in the collection.

Queries I expected to work:

sqlContext.sql("select emp_details[0-2] from emp_details").show 

or

sqlContext.sql("select emp_details[0:2] from emp_details").show 

Expected output

+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+

In pure Scala, if I have an array such as

val emp_details = Array("Jon","Snow","Castle","Black") 

I can get the elements from index 0 to 2 using

emp_details.slice(0,3) 

which returns

Array(Jon, Snow, Castle)

I am not able to apply the above operation to the array in Spark SQL. Any help?

Thanks

Answer 1:

Here is a solution using a user-defined function, which has the advantage of working for any slice size you want. It simply builds a UDF around the Scala built-in slice method:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val slice = udf((array: Seq[String], from: Int, to: Int) => array.slice(from, to))

Example with a sample of your data:

val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show

Produces the expected output:

+--------------------+-------------------+
|         emp_details|              slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+

You can also register the UDF in your sqlContext and use it directly inside a SQL statement; you won't need lit anymore with that approach.
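A minimal sketch of that registration, assuming the same sqlContext and table names as above:

// register the same slicing function under the name "slice" so SQL can call it
sqlContext.udf.register("slice", (array: Seq[String], from: Int, to: Int) => array.slice(from, to))
sqlContext.sql("select emp_details, slice(emp_details, 0, 3) as slice from emp_details").show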



Answer 2:

Edit 2: for those who want to avoid a UDF at the expense of readability ;-)

If you really want to do it in one step, you will have to use Scala to create a lambda function returning a sequence of Column and wrap it in an array. This is a bit involved, but it is one step:

val df = List(List("Jon", "Snow", "Castle", "Black", "Ned")).toDF("emp_details")

df.withColumn("slice", array((0 until 3).map(i => $"emp_details"(i)): _*)).show(false)

+-------------------------------+-------------------+
|emp_details                    |slice              |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+

The : _* works a bit of magic to pass a list to a so-called variadic function (array in this case, which constructs the SQL array). But I would advise against using this solution as is; put the lambda function in a named function

def slice(from: Int, to: Int) = array((from until to).map(i => $"emp_details"(i)): _*)

for code readability. Note that, in general, sticking to Column expressions (without using udf) gives better performance.
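For example, using the named function with the df defined above:

df.withColumn("slice", slice(0, 3)).show(false)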

Edit: in order to do it in a SQL statement (as you asked in your question...), following the same logic, you would generate the SQL query using Scala (not saying it's the most readable):

def sliceSql(emp_details: String, from: Int, to: Int): String =
  "Array(" + (from until to).map(i => emp_details + "[" + i + "]").mkString(",") + ")"

val sqlQuery = "select emp_details, " + sliceSql("emp_details", 0, 3) + " as slice from emp_details"

sqlContext.sql(sqlQuery).show

+-------------------------------+-------------------+
|emp_details                    |slice              |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+

Note that you can replace until with to in order to specify the last element taken rather than the element at which the iteration stops.
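As a quick illustration of the difference in plain Scala:

(0 until 3).toList // List(0, 1, 2) - upper bound excluded
(0 to 3).toList    // List(0, 1, 2, 3) - upper bound included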



Answer 3:

You can use the array function to build a new array out of the three values:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val input = sqlContext.sql("select emp_details from emp_details")

val arr: Column = col("emp_details")
val result = input.select(array(arr(0), arr(1), arr(2)) as "emp_details")

result.show()
// +-------------------+
// |        emp_details|
// +-------------------+
// |[Jon, Snow, Castle]|
// |      [Ned, is, no]|
// +-------------------+


Answer 4:

Use the selectExpr() and split() functions in Apache Spark.

For example:

fs.selectExpr(
  "(split(emp_details, ','))[0] as e1",
  "(split(emp_details, ','))[1] as e2",
  "(split(emp_details, ','))[2] as e3")
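Note that this treats emp_details as a comma-separated string rather than an array column, and it yields three scalar columns. Under that same assumption, a sketch that instead returns a sliced array column would wrap the pieces with array:

// assumes fs has emp_details stored as a comma-separated string
fs.selectExpr("array((split(emp_details, ','))[0], (split(emp_details, ','))[1], (split(emp_details, ','))[2]) as slice")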


Answer 5:

Here is my generic slice UDF, which supports arrays of any element type. It is a little ugly because you need to know the element type in advance:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

def arraySlice(arr: Seq[AnyRef], from: Int, until: Int): Seq[AnyRef] =
  if (arr == null) null else arr.slice(from, until)

def slice(elemType: DataType): UserDefinedFunction =
  udf(arraySlice _, ArrayType(elemType))

fs.select(slice(StringType)($"emp_details", lit(1), lit(2)))
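The same factory works for other element types as well; for example, with a hypothetical integer-array column named scores:

fs.select(slice(IntegerType)($"scores", lit(0), lit(3))) // scores is a hypothetical Array<int> column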

