apache-spark

Hive/Impala performance with string partition key vs integer partition key

Submitted by 梦想的初衷 on 2021-02-07 19:53:16
Question: Are numeric columns recommended for partition keys? Will there be any performance difference when we run a select query on numeric-column partitions versus string-column partitions?

Answer 1: No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), but either way the folder name is a string. So the partition key is stored as a string and cast during read/write. Partition key value is not packed
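
A minimal PySpark sketch (not from the original question; the data, path and column names are made up) showing that the partition value ends up encoded as text in a key=value directory name regardless of the column's declared type:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data with an integer partition column.
    df = spark.createDataFrame(
        [(1, "north", 2021), (2, "south", 2022)],
        ["id", "region", "year"],
    )

    # Whether 'year' is an int or a string, the output lands in
    # directories named year=2021/ and year=2022/; the partition value
    # is always encoded as text in the folder name.
    df.write.mode("overwrite").partitionBy("year").parquet("/tmp/sales_by_year")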

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

Submitted by 冷暖自知 on 2021-02-07 19:42:06
Question: The following is my PySpark startup snippet, which is pretty reliable (I've been using it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That normally triggers the dependency downloads, which Spark performs automatically:

    import sys, os, multiprocessing
    from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
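
For reference, a hedged sketch of one way to set such coordinates programmatically before the session starts; the Kafka coordinate shown is only an example, not necessarily the one the question used:

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession

    # The coordinates must be set before the SparkSession (and its JVM)
    # is created; setting them afterwards has no effect.
    conf = SparkConf()
    conf.set(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0",  # example coordinate only
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()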

Spark error - Decimal precision 39 exceeds max precision 38

Submitted by 拜拜、爱过 on 2021-02-07 19:34:20
Question: When I try to collect data from a Spark dataframe, I get an error stating "java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38". All the data in the Spark dataframe comes from an Oracle database, where I believe the decimal precision is < 38. Is there any way I can achieve this without modifying the data?

    # Load the required table into memory from the Oracle database
    df <- loadDF(sqlContext, source = "jdbc", url = "jdbc:oracle:thin:usr/pass@url.com:1521" ,
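
One possible workaround, sketched here in PySpark rather than the SparkR of the question, with placeholder connection details and a hypothetical table/column name, is the JDBC reader's customSchema option, which lets you pin the column to a precision Spark can hold:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:usr/pass@url.com:1521")  # placeholder URL from the question
        .option("dbtable", "my_table")                            # hypothetical table name
        .option("customSchema", "AMOUNT DECIMAL(38, 10)")         # hypothetical column, capped at Spark's max precision of 38
        .load()
    )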

Spark SQL - rlike ignore case

Submitted by 前提是你 on 2021-02-07 19:18:07
Question: I am using Spark SQL and trying to compare a string using rlike. It works fine, but I would like to understand how to ignore case. This returns true:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ Week Ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

However, this returns false:

    select "1 Week Ending Jan 14, 2018" rlike "^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}"

Answer 1: Spark uses the standard Scala regex library, so you can inline the processing flags in the pattern, for example (?i) for case-insensitive matching
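
A short PySpark sketch of the answer's suggestion, reusing the question's own pattern with the (?i) flag prepended:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The (?i) inline flag makes the whole pattern case-insensitive,
    # so the lower-case "week ending" pattern now matches "Week Ending".
    spark.sql(
        "SELECT '1 Week Ending Jan 14, 2018' RLIKE "
        r"'(?i)^\\d+ week ending [a-z, A-Z]{3} \\d{2}, \\d{4}' AS is_match"
    ).show()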

How to read all files in an S3 folder/bucket using sparklyr in R?

Submitted by 只谈情不闲聊 on 2021-02-07 17:24:30
Question: I have tried the code below, and several combinations of it, in order to read all the files in an S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each about 6.5 GB.

    #Spark Connection
    sc<-spark_connect(master = "local" , config=config)
    rd_1<-spark_read_csv(sc,name = "Retail_1",path = "s3a://mybucket/xyzabc/Retail_Industry/*/*",header = F,delimiter = "|") # This is the S3 bucket/folder for files [One of the file names Industry_Raw
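
The question uses sparklyr, but the same idea, a glob wildcard in an s3a:// path, can be sketched in PySpark with the question's placeholder bucket and prefix:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder path: the trailing wildcards make Spark read every
    # matching object under the prefix rather than a single file.
    rd_1 = (
        spark.read
        .option("header", "false")
        .option("delimiter", "|")
        .csv("s3a://mybucket/xyzabc/Retail_Industry/*/*")
    )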

Java Spark: Stack Overflow Error on GroupBy

Submitted by 坚强是说给别人听的谎言 on 2021-02-07 16:08:31
Question: I am using Spark 2.3.1 with Java. I have a Dataset that I want to group in order to run some aggregations (say a count(), for the example). The grouping must be done according to a given list of columns. My function is the following:

    public Dataset<Row> compute(Dataset<Row> data, List<String> columns){
        final List<Column> columns_col = new ArrayList<Column>();
        for (final String tag : columns) {
            columns_col.add(new Column(tag));
        }
        Seq<Column> columns_seq = JavaConverters
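
The question's code is Java; purely as an illustration of the same compute step (group by a runtime-supplied list of columns, then count), here is a hedged PySpark sketch with hypothetical column names:

    from typing import List
    from pyspark.sql import DataFrame

    def compute(data: DataFrame, columns: List[str]) -> DataFrame:
        # Group by whichever columns were supplied at runtime,
        # then count the rows in each group.
        return data.groupBy(*columns).count()

    # e.g. result = compute(df, ["tag1", "tag2"])  # hypothetical column names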