apache-spark

How to register a Java Spark UDF in spark-shell?

Submitted by 三世轮回 on 2021-02-19 07:35:34
Question: Below is my Java UDF code:

    package com.udf;

    import org.apache.spark.sql.api.java.UDF1;

    public class SparkUDF implements UDF1<String, String> {
        @Override
        public String call(String arg) throws Exception {
            if (validateString(arg))
                return arg;
            return "INVALID";
        }

        public static boolean validateString(String arg) {
            // || (not |) short-circuits, so arg.length() is not evaluated when arg is null
            if (arg == null || arg.length() != 11)
                return false;
            else
                return true;
        }
    }

I am building the jar with this class as SparkUdf-1.0-SNAPSHOT.jar. I have a table named sample in Hive
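
A minimal sketch of one way to register and call such a UDF from spark-shell, assuming the jar is passed with --jars and that the Hive table sample has a string column ("phone_number" below is a placeholder name):

    // launch: spark-shell --jars /path/to/SparkUdf-1.0-SNAPSHOT.jar   (path illustrative)
    import org.apache.spark.sql.types.StringType

    // expose the Java UDF1 under a name that SQL can call
    spark.udf.register("validate", new com.udf.SparkUDF(), StringType)

    // "phone_number" is a placeholder column name for the Hive table
    spark.sql("SELECT validate(phone_number) FROM sample").show()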

Understanding huge shuffle spill sizes in spark

Submitted by 空扰寡人 on 2021-02-19 05:58:49
Question: With Spark 2.3 I'm running the following code:

    rdd
      .persist(DISK_ONLY) // this is 3GB according to the Storage tab
      .groupBy(_.key)
      .mapValues(iter => iter.map(x => CaseClass(x._1, x._2)))
      .mapValues(iter => func(iter))

I have a SQL dataframe of 300M rows. I convert it to an RDD, then persist it: the Storage tab indicates it's 3GB. I do a groupBy; one of my keys receives 100M items, so roughly 1GB if I go by the RDD size. I map each item after the shuffle to a case class. This case class only has 2 "double
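
A hedged, self-contained sketch (illustrative names and data, not the asker's pipeline; assumes the sc provided by spark-shell): when the per-key work can be expressed as an incremental aggregation, reduceByKey pre-combines values on the map side, which usually keeps shuffle and spill sizes far below what groupBy on a skewed key produces.

    // toy pairs standing in for the real RDD; the shape of the pipeline is the point
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

    // groupBy ships and materializes every value of a key on one executor;
    // reduceByKey combines map-side, so a 100M-item key never travels as one group
    val perKeyAverage = pairs
      .mapValues(v => (v, 1L))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    perKeyAverage.collect().foreach(println)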

PySpark takeOrdered Multiple Fields (Ascending and Descending)

Submitted by 余生长醉 on 2021-02-19 05:20:26
Question: The takeOrdered method from pyspark.RDD gets the N elements from an RDD ordered in ascending order, or as specified by the optional key function, as described here: pyspark.RDD.takeOrdered. The example shows the following code with one key:

    >>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
    [10, 9, 7, 6, 5, 4]

Is it also possible to define more keys, e.g. x, y, z, for data that has 3 columns? The keys should be in different orders, such as x = asc, y = desc, z = asc. That
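
A hedged sketch (shown in Scala, with illustrative data): for numeric fields, a mixed ascending/descending order can be expressed by negating the descending fields inside a composite key, the same idea as key=lambda x: -x above; the equivalent PySpark key would be a tuple-returning function such as key=lambda r: (r[0], -r[1], r[2]), again assuming the descending fields are numeric.

    // rows of (x, y, z); order by x asc, y desc, z asc
    val rows = sc.parallelize(Seq((1, 9.0, "c"), (1, 5.0, "a"), (2, 7.0, "b")))

    val top = rows.takeOrdered(3)(Ordering.by { case (x, y, z) => (x, -y, z) })
    top.foreach(println)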

Spark: Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47
Question: From a dataframe I want to get the names of the columns which contain at least one null value. Considering the dataframe below:

    val dataset = sparkSession.createDataFrame(Seq(
      (7, null, 18, 1.0),
      (8, "CA", null, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

    id  country  hour  clicked
    7   null     18    1
    8   "CA"     null  0
    9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

    val cols = dataset
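
A hedged sketch of one single-pass alternative (illustrative data built with Option so the nulls encode cleanly; assumes spark-shell, where spark and its implicits are available): count the nulls in every column, then keep the names whose count is non-zero.

    import org.apache.spark.sql.functions.{col, count, when}
    import spark.implicits._

    val df = Seq(
      (7, Option.empty[String], Option(18), 1.0),
      (8, Option("CA"), Option.empty[Int], 0.0),
      (9, Option("NZ"), Option(15), 0.0)
    ).toDF("id", "country", "hour", "clicked")

    // count(when(col.isNull, ...)) counts only the rows where the column is null
    val nullCounts = df
      .select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*)
      .first()

    val columnsWithNulls = df.columns.filter(c => nullCounts.getAs[Long](c) > 0)
    // Array(country, hour)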

Does Hive preserve file order when selecting data

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 04:05:44
Question: If I do select * from table1;, in which order will the data be retrieved: file order or random order?

Answer 1: Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers): after the splits are calculated, each process starts reading some piece of a file, or a few files, depending on the splits calculated. All the parallel processes can process different volumes of data and run on different nodes; the load is not the same each time, so they start returning rows and finishing
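
If a stable order matters, the hedged takeaway is to request it explicitly rather than rely on read order; a minimal illustration via spark.sql (the same applies in Hive itself; "id" is a placeholder sort key for the question's table1):

    // order is only guaranteed when it is asked for
    spark.sql("SELECT * FROM table1 ORDER BY id").show()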

Spark: Replace Null value in a Nested column

Submitted by 跟風遠走 on 2021-02-19 03:53:08
Question: I would like to replace all the n/a values in the dataframe below with "unknown". A column can be either scalar or a complex nested column. If it's a StructField column I can loop through the columns and replace n/a using withColumn, but I would like this done in a generic way regardless of the column type, as I don't want to specify the column names explicitly since there are hundreds of them in my case.

    case class Bar(x: Int, y: String, z: String)
    case class Foo(id: Int, name: String, status: String, bar:
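
A hedged, generic sketch (it handles only scalar and struct columns, not arrays or maps, and assumes the n/a values are nulls in string fields): walk the schema recursively and coalesce every string field, at any depth, to "unknown".

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{coalesce, col, lit, struct}
    import org.apache.spark.sql.types.{StringType, StructType}

    def fillNested(schema: StructType, prefix: Option[String] = None): Seq[Column] =
      schema.fields.toSeq.map { f =>
        val path = prefix.map(_ + "." + f.name).getOrElse(f.name)
        f.dataType match {
          case st: StructType => struct(fillNested(st, Some(path)): _*).alias(f.name)
          case StringType     => coalesce(col(path), lit("unknown")).alias(f.name)
          case _              => col(path).alias(f.name)
        }
      }

    // usage, with df being the Foo/Bar dataframe from the question:
    // val cleaned = df.select(fillNested(df.schema): _*)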

How to use spark with large decimal numbers?

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 03:51:46
Question: My database has numeric values that go up to 256-bit unsigned integers. However, Spark's DecimalType has a limit of Decimal(38,18). When I try to do calculations on the column, exceptions are thrown:

    java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any third-party library or workaround that solves this issue? Or is Spark designed only for numbers smaller than Decimal(38,18)?

Source: https://stackoverflow.com/questions/53074721/how-to-use
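
One hedged workaround sketch (not a general fix for DecimalType's precision-38 cap; names and data are illustrative): carry the 256-bit values as strings in the DataFrame and do the arithmetic with java.math.BigInteger inside a UDF.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // adds two arbitrarily large unsigned integers carried as strings
    val addBig = udf { (a: String, b: String) =>
      new java.math.BigInteger(a).add(new java.math.BigInteger(b)).toString
    }

    val df = Seq(
      ("115792089237316195423570985008687907853269984665640564039457584007913129639935", "1") // 2^256 - 1
    ).toDF("a", "b")

    df.select(addBig($"a", $"b").alias("sum")).show(false)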

Access to WrappedArray elements

Submitted by 百般思念 on 2021-02-19 02:37:39
Question: I have a Spark dataframe, and here is the schema:

    |-- eid: long (nullable = true)
    |-- age: long (nullable = true)
    |-- sex: long (nullable = true)
    |-- father: array (nullable = true)
    |    |-- element: array (containsNull = true)
    |    |    |-- element: long (containsNull = true)

and a sample of rows:

    df.select(df['father']).show()
    +--------------------+
    |              father|
    +--------------------+
    |[WrappedArray(-17...|
    |[WrappedArray(-11...|
    |[WrappedArray(13,...|
    +--------------------+

and the type is DataFrame
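
A hedged sketch using a tiny frame of the same shape as the father column (data illustrative; assumes spark-shell implicits): index into the nested arrays with column expressions, or read them back on the driver as Seq, since Spark returns array columns as Seq (WrappedArray implements Seq), not Array.

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val df = Seq((1L, Seq(Seq(-17L, 13L), Seq(2L)))).toDF("eid", "father")

    // 1) column expressions: index the outer array, then the inner one
    df.select(col("father")(0)(0).alias("first_of_first")).show()

    // 2) on the driver: read nested arrays back with getAs[Seq[...]]
    val father = df.select("father").head().getAs[Seq[Seq[Long]]]("father")
    println(father.head.head)  // -17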