apache-spark

How to register a Java Spark UDF in spark-shell?

Submitted by 三世轮回 on 2021-02-19 07:35:34
Question: Below is my Java UDF code:

    package com.udf;

    import org.apache.spark.sql.api.java.UDF1;

    public class SparkUDF implements UDF1<String, String> {
        @Override
        public String call(String arg) throws Exception {
            if (validateString(arg))
                return arg;
            return "INVALID";
        }

        public static boolean validateString(String arg) {
            // || (not |) short-circuits, so arg.length() is not evaluated when arg is null
            if (arg == null || arg.length() != 11)
                return false;
            else
                return true;
        }
    }

I am building the jar with this class as SparkUdf-1.0-SNAPSHOT.jar. I have a table named sample in Hive
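
A minimal sketch of one way to register and call such a UDF from spark-shell, assuming the jar is passed with --jars and that the Hive table sample has a string column ("phone_number" below is a placeholder name):

    // launch: spark-shell --jars /path/to/SparkUdf-1.0-SNAPSHOT.jar   (path illustrative)
    import org.apache.spark.sql.types.StringType

    // expose the Java UDF1 under a name that SQL can call
    spark.udf.register("validate", new com.udf.SparkUDF(), StringType)

    // "phone_number" is a placeholder column name for the Hive table
    spark.sql("SELECT validate(phone_number) FROM sample").show()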

Understanding huge shuffle spill sizes in spark

Submitted by 空扰寡人 on 2021-02-19 05:58:49
Question: With Spark 2.3 I'm running the following code:

    rdd
      .persist(DISK_ONLY) // this is 3GB according to the Storage tab
      .groupBy(_.key)
      .mapValues(iter => iter.map(x => CaseClass(x._1, x._2)))
      .mapValues(iter => func(iter))

I have a SQL dataframe of 300M rows. I convert it to an RDD, then persist it: the Storage tab indicates it's 3GB. I do a groupBy; one of my keys receives 100M items, so roughly 1GB if I go by the RDD size. I map each item after the shuffle to a case class. This case class only has 2 "double
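
A hedged, self-contained sketch (illustrative names and data, not the asker's pipeline; assumes the sc provided by spark-shell): when the per-key work can be expressed as an incremental aggregation, reduceByKey pre-combines values on the map side, which usually keeps shuffle and spill sizes far below what groupBy on a skewed key produces.

    // toy pairs standing in for the real RDD; the shape of the pipeline is the point
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

    // groupBy ships and materializes every value of a key on one executor;
    // reduceByKey combines map-side, so a 100M-item key never travels as one group
    val perKeyAverage = pairs
      .mapValues(v => (v, 1L))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    perKeyAverage.collect().foreach(println)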

PySpark takeOrdered Multiple Fields (Ascending and Descending)

Submitted by 余生长醉 on 2021-02-19 05:20:26
Question: The takeOrdered method from pyspark.RDD gets the N elements from an RDD ordered in ascending order, or as specified by the optional key function, as described here: pyspark.RDD.takeOrdered. The example shows the following code with one key:

    >>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
    [10, 9, 7, 6, 5, 4]

Is it also possible to define more keys, e.g. x, y, z, for data that has 3 columns? The keys should be in different orders, such as x = asc, y = desc, z = asc. That
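
A hedged sketch (shown in Scala, with illustrative data): for numeric fields, a mixed ascending/descending order can be expressed by negating the descending fields inside a composite key, the same idea as key=lambda x: -x above; the equivalent PySpark key would be a tuple-returning function such as key=lambda r: (r[0], -r[1], r[2]), again assuming the descending fields are numeric.

    // rows of (x, y, z); order by x asc, y desc, z asc
    val rows = sc.parallelize(Seq((1, 9.0, "c"), (1, 5.0, "a"), (2, 7.0, "b")))

    val top = rows.takeOrdered(3)(Ordering.by { case (x, y, z) => (x, -y, z) })
    top.foreach(println)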

Spark: Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47
Question: From a dataframe I want to get the names of the columns which contain at least one null value. Considering the dataframe below:

    val dataset = sparkSession.createDataFrame(Seq(
      (7, null, 18, 1.0),
      (8, "CA", null, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

    id  country  hour  clicked
    7   null     18    1
    8   "CA"     null  0
    9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

    val cols = dataset
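
A hedged sketch of one single-pass alternative (illustrative data built with Option so the nulls encode cleanly; assumes spark-shell, where spark and its implicits are available): count the nulls in every column, then keep the names whose count is non-zero.

    import org.apache.spark.sql.functions.{col, count, when}
    import spark.implicits._

    val df = Seq(
      (7, Option.empty[String], Option(18), 1.0),
      (8, Option("CA"), Option.empty[Int], 0.0),
      (9, Option("NZ"), Option(15), 0.0)
    ).toDF("id", "country", "hour", "clicked")

    // count(when(col.isNull, ...)) counts only the rows where the column is null
    val nullCounts = df
      .select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*)
      .first()

    val columnsWithNulls = df.columns.filter(c => nullCounts.getAs[Long](c) > 0)
    // Array(country, hour)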

Does Hive preserve file order when selecting data

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 04:05:44
Question: If I do select * from table1;, in which order will the data be retrieved: file order or random order?

Answer 1: Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers): after the splits are calculated, each process starts reading some piece of a file, or a few files, depending on the splits calculated. All the parallel processes can process different volumes of data and run on different nodes; the load is not the same each time, so they start returning rows and finishing
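
If a stable order matters, the hedged takeaway is to request it explicitly rather than rely on read order; a minimal illustration via spark.sql (the same applies in Hive itself; "id" is a placeholder sort key for the question's table1):

    // order is only guaranteed when it is asked for
    spark.sql("SELECT * FROM table1 ORDER BY id").show()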

Spark: Replace Null value in a Nested column

Submitted by 跟風遠走 on 2021-02-19 03:53:08
Question: I would like to replace all the n/a values in the dataframe below with "unknown". A column can be either scalar or a complex nested column. If it's a StructField column I can loop through the columns and replace n/a using withColumn, but I would like this done in a generic way regardless of the column type, as I don't want to specify the column names explicitly since there are hundreds of them in my case.

    case class Bar(x: Int, y: String, z: String)
    case class Foo(id: Int, name: String, status: String, bar:
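
A hedged, generic sketch (it handles only scalar and struct columns, not arrays or maps, and assumes the n/a values are nulls in string fields): walk the schema recursively and coalesce every string field, at any depth, to "unknown".

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{coalesce, col, lit, struct}
    import org.apache.spark.sql.types.{StringType, StructType}

    def fillNested(schema: StructType, prefix: Option[String] = None): Seq[Column] =
      schema.fields.toSeq.map { f =>
        val path = prefix.map(_ + "." + f.name).getOrElse(f.name)
        f.dataType match {
          case st: StructType => struct(fillNested(st, Some(path)): _*).alias(f.name)
          case StringType     => coalesce(col(path), lit("unknown")).alias(f.name)
          case _              => col(path).alias(f.name)
        }
      }

    // usage, with df being the Foo/Bar dataframe from the question:
    // val cleaned = df.select(fillNested(df.schema): _*)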

How to use spark with large decimal numbers?

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 03:51:46
Question: My database has numeric values that go up to 256-bit unsigned integers. However, Spark's DecimalType has a limit of Decimal(38,18). When I try to do calculations on the column, exceptions are thrown:

    java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any third-party library or workaround that solves this issue? Or is Spark designed only for numbers smaller than Decimal(38,18)?

Source: https://stackoverflow.com/questions/53074721/how-to-use
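
One hedged workaround sketch (not a general fix for DecimalType's precision-38 cap; names and data are illustrative): carry the 256-bit values as strings in the DataFrame and do the arithmetic with java.math.BigInteger inside a UDF.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // adds two arbitrarily large unsigned integers carried as strings
    val addBig = udf { (a: String, b: String) =>
      new java.math.BigInteger(a).add(new java.math.BigInteger(b)).toString
    }

    val df = Seq(
      ("115792089237316195423570985008687907853269984665640564039457584007913129639935", "1") // 2^256 - 1
    ).toDF("a", "b")

    df.select(addBig($"a", $"b").alias("sum")).show(false)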

Access to WrappedArray elements

Submitted by 百般思念 on 2021-02-19 02:37:39
Question: I have a Spark dataframe, and here is the schema:

    |-- eid: long (nullable = true)
    |-- age: long (nullable = true)
    |-- sex: long (nullable = true)
    |-- father: array (nullable = true)
    |    |-- element: array (containsNull = true)
    |    |    |-- element: long (containsNull = true)

and a sample of rows:

    df.select(df['father']).show()
    +--------------------+
    |              father|
    +--------------------+
    |[WrappedArray(-17...|
    |[WrappedArray(-11...|
    |[WrappedArray(13,...|
    +--------------------+

and the type is DataFrame
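
A hedged sketch using a tiny frame of the same shape as the father column (data illustrative; assumes spark-shell implicits): index into the nested arrays with column expressions, or read them back on the driver as Seq, since Spark returns array columns as Seq (WrappedArray implements Seq), not Array.

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val df = Seq((1L, Seq(Seq(-17L, 13L), Seq(2L)))).toDF("eid", "father")

    // 1) column expressions: index the outer array, then the inner one
    df.select(col("father")(0)(0).alias("first_of_first")).show()

    // 2) on the driver: read nested arrays back with getAs[Seq[...]]
    val father = df.select("father").head().getAs[Seq[Seq[Long]]]("father")
    println(father.head.head)  // -17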