Question:
I use the Spark shell to do the operations below.
I recently loaded a table with an array column in spark-sql.
Here is the DDL for it:
create table test_emp_arr(dept_id string, dept_nm string, emp_details array<string>)
The data looks something like this:
+-------+-------+-------------------------------+
|dept_id|dept_nm|                    emp_details|
+-------+-------+-------------------------------+
|     10|Finance|[Jon, Snow, Castle, Black, Ned]|
|     20|     IT|            [Ned, is, no, more]|
+-------+-------+-------------------------------+
I can query the emp_details column like this:
sqlContext.sql("select emp_details[0] from emp_details").show
Problem
I want to query a range of elements in the collection:
Expected query to work
sqlContext.sql("select emp_details[0-2] from emp_details").show
or
sqlContext.sql("select emp_details[0:2] from emp_details").show
Expected output
+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon","Snow","Castle","Black")
I can get the elements in the range 0 to 2 using
emp_details.slice(0,3)
which returns
Array(Jon, Snow, Castle)
I am not able to apply this operation to the array in spark-sql. Any help?
Thanks
Answer 1:
Here is a solution using a User Defined Function, which has the advantage of working for any slice size you want. It simply builds a UDF around the Scala built-in slice method:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val slice = udf((array: Seq[String], from: Int, to: Int) => array.slice(from, to))
Example with a sample of your data:
val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show
Produces the expected output
+--------------------+-------------------+
|         emp_details|              slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+
You can also register the UDF in your sqlContext and use it directly in a SQL statement; you won't need lit anymore with that approach.
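A minimal sketch of what that could look like (the register call and query below are an assumption about the setup, reusing the emp_details table from the question):

// Register the same slicing logic under a name callable from SQL.
sqlContext.udf.register("slice", (array: Seq[String], from: Int, to: Int) => array.slice(from, to))

// Plain integer literals can be written directly in the SQL text, no lit(...) needed.
sqlContext.sql("select slice(emp_details, 0, 3) as slice from emp_details").show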
Answer 2:
Edit 2: For those who want to avoid a udf at the expense of readability ;-)
If you really want to do it in one step, you will have to use Scala to create a lambda function returning a sequence of Column and wrap it in an array. This is a bit involved, but it is still one step:
val df = List(List("Jon", "Snow", "Castle", "Black", "Ned")).toDF("emp_details")
df.withColumn("slice", array((0 until 3).map(i => $"emp_details"(i)): _*)).show(false)

+-------------------------------+-------------------+
|emp_details                    |slice              |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
The : _* works a bit of magic, passing a list to a so-called variadic function (array in this case, which constructs the SQL array). But I would advise against using this solution as is. For code readability, put the lambda in a named function:
def slice(from: Int, to: Int) = array((from until to).map(i => $"emp_details"(i)): _*)
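With the df defined above, a call could then look like this (just a sketch):

df.withColumn("slice", slice(0, 3)).show(false)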
Note that in general, sticking to Column expressions (without using a udf) gives better performance.
Edit: In order to do it in a SQL statement (as you ask in your question...), following the same logic you would generate the SQL query using Scala logic (not saying it's the most readable):
def sliceSql(emp_details: String, from: Int, to: Int): String =
  "Array(" + (from until to).map(i => emp_details + "[" + i.toString + "]").mkString(",") + ")"

val sqlQuery = "select emp_details, " + sliceSql("emp_details", 0, 3) + " as slice from emp_details"
sqlContext.sql(sqlQuery).show(false)

+-------------------------------+-------------------+
|emp_details                    |slice              |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
Note that you can replace until with to in order to provide the last element taken rather than the element at which the iteration stops.
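For instance, in plain Scala both of these produce the indices 0, 1, 2:

(0 until 3)  // upper bound excluded
(0 to 2)     // upper bound included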
Answer 3:
You can use the array function to build a new array out of the three values:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val input = sqlContext.sql("select emp_details from emp_details")
val arr: Column = col("emp_details")
val result = input.select(array(arr(0), arr(1), arr(2)) as "emp_details")

result.show()
// +-------------------+
// |        emp_details|
// +-------------------+
// |[Jon, Snow, Castle]|
// |      [Ned, is, no]|
// +-------------------+
Answer 4:
Use the selectExpr() and split() functions in Apache Spark. For example:
fs.selectExpr("((split(emp_details, ','))[0]) as e1,((split(emp_details, ','))[1]) as e2,((split(emp_details, ','))[2]) as e3);
Answer 5:
Here is my generic slice UDF; it supports arrays of any type. It is a little bit ugly because you need to know the element type in advance.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction

def arraySlice(arr: Seq[AnyRef], from: Int, until: Int): Seq[AnyRef] =
  if (arr == null) null else arr.slice(from, until)

def slice(elemType: DataType): UserDefinedFunction = udf(arraySlice _, ArrayType(elemType))

fs.select(slice(StringType)($"emp_details", lit(1), lit(2)))
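Since the element type is just a parameter, the same helper could presumably be reused for other array types, for example with a hypothetical integer-array column named scores (assuming import sqlContext.implicits._ for the $ syntax):

// scores is a hypothetical array<int> column, used only for illustration.
df.select(slice(IntegerType)($"scores", lit(0), lit(3)))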