apache-spark

Hive/Impala performance with string partition key vs integer partition key

Submitted by 梦想的初衷 on 2021-02-07 19:53:16
Question: Are numeric columns recommended for partition keys? Will there be any performance difference when we run a select query on numeric-column partitions versus string-column partitions?

Answer 1: No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), but either way the folder name is a string. So the partition key is stored as a string and cast during read/write. Partition key value is not packed
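
A minimal PySpark sketch (not from the original question; the data, path and column names are made up) showing that the partition value ends up encoded as text in a key=value directory name regardless of the column's declared type:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data with an integer partition column.
    df = spark.createDataFrame(
        [(1, "north", 2021), (2, "south", 2022)],
        ["id", "region", "year"],
    )

    # Whether 'year' is an int or a string, the output lands in
    # directories named year=2021/ and year=2022/; the partition value
    # is always encoded as text in the folder name.
    df.write.mode("overwrite").partitionBy("year").parquet("/tmp/sales_by_year")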

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

Submitted by 冷暖自知 on 2021-02-07 19:42:06
Question: The following is my PySpark startup snippet, which is pretty reliable (I've been using it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That normally triggers the dependency downloads, which Spark performs automatically:

    import sys, os, multiprocessing
    from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
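
For reference, a hedged sketch of one way to set such coordinates programmatically before the session starts; the Kafka coordinate shown is only an example, not necessarily the one the question used:

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession

    # The coordinates must be set before the SparkSession (and its JVM)
    # is created; setting them afterwards has no effect.
    conf = SparkConf()
    conf.set(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0",  # example coordinate only
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()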

Spark error - Decimal precision 39 exceeds max precision 38

Submitted by 拜拜、爱过 on 2021-02-07 19:34:20
Question: When I try to collect data from a Spark dataframe, I get an error stating "java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38". All the data in the Spark dataframe comes from an Oracle database, where I believe the decimal precision is < 38. Is there any way I can achieve this without modifying the data?

    # Load the required table into memory from the Oracle database
    df <- loadDF(sqlContext, source = "jdbc", url = "jdbc:oracle:thin:usr/pass@url.com:1521" ,
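
One possible workaround, sketched here in PySpark rather than the SparkR of the question, with placeholder connection details and a hypothetical table/column name, is the JDBC reader's customSchema option, which lets you pin the column to a precision Spark can hold:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:usr/pass@url.com:1521")  # placeholder URL from the question
        .option("dbtable", "my_table")                            # hypothetical table name
        .option("customSchema", "AMOUNT DECIMAL(38, 10)")         # hypothetical column, capped at Spark's max precision of 38
        .load()
    )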

Spark SQL - rlike ignore case

Submitted by 前提是你 on 2021-02-07 19:18:07
Question: I am using Spark SQL and trying to compare a string using rlike. It works fine, but I would like to understand how to ignore case. This returns true:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ Week Ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

However, this returns false:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

Answer 1: Spark uses the standard Scala regex library, so you can inline the processing flags in the pattern, for example (?i) for case-insensitive matching
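
A short PySpark sketch of the answer's suggestion, reusing the question's own pattern with the (?i) flag prepended:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The (?i) inline flag makes the whole pattern case-insensitive,
    # so the lower-case "week ending" pattern now matches "Week Ending".
    spark.sql(
        "SELECT '1 Week Ending Jan 14, 2018' RLIKE "
        r"'(?i)^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}' AS is_match"
    ).show()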

How to read all files in an S3 folder/bucket using sparklyr in R?

Submitted by 只谈情不闲聊 on 2021-02-07 17:24:30
Question: I have tried the code below, and several combinations of it, in order to read all the files in an S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each about 6.5 GB.

    #Spark Connection
    sc<-spark_connect(master = "local" , config=config)
    rd_1<-spark_read_csv(sc,name = "Retail_1",path = "s3a://mybucket/xyzabc/Retail_Industry/*/*",header = F,delimiter = "|") # This is the S3 bucket/folder for files [One of the file names Industry_Raw
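
The question uses sparklyr, but the same idea, a glob wildcard in an s3a:// path, can be sketched in PySpark with the question's placeholder bucket and prefix:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder path: the trailing wildcards make Spark read every
    # matching object under the prefix rather than a single file.
    rd_1 = (
        spark.read
        .option("header", "false")
        .option("delimiter", "|")
        .csv("s3a://mybucket/xyzabc/Retail_Industry/*/*")
    )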

Java Spark: Stack Overflow Error on GroupBy

Submitted by 坚强是说给别人听的谎言 on 2021-02-07 16:08:31
Question: I am using Spark 2.3.1 with Java. I have a Dataset that I want to group in order to run some aggregations (say a count(), for the example). The grouping must be done according to a given list of columns. My function is the following:

    public Dataset<Row> compute(Dataset<Row> data, List<String> columns){
        final List<Column> columns_col = new ArrayList<Column>();
        for (final String tag : columns) {
            columns_col.add(new Column(tag));
        }
        Seq<Column> columns_seq = JavaConverters
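
The question's code is Java; purely as an illustration of the same compute step (group by a runtime-supplied list of columns, then count), here is a hedged PySpark sketch with hypothetical column names:

    from typing import List
    from pyspark.sql import DataFrame

    def compute(data: DataFrame, columns: List[str]) -> DataFrame:
        # Group by whichever columns were supplied at runtime,
        # then count the rows in each group.
        return data.groupBy(*columns).count()

    # e.g. result = compute(df, ["tag1", "tag2"])  # hypothetical column names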