apache-spark

In Apache Spark, can I incrementally cache an RDD partition?

Submitted by 懵懂的女人 on 2021-02-11 13:56:45
Question: I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism will only cache that part, and the other part will be computed on demand. Unfortunately, the following experiment seems to indicate otherwise:

val acc = new LongAccumulator()
TestSC.register(acc)
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}

rdd.persist()
val sliced = rdd
  .mapPartitions { itr =>
    itr.slice(0, 2)
  }
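A minimal, runnable version of the same experiment is sketched below (assuming a local SparkContext named sc in place of the question's TestSC, and a named accumulator from sc.longAccumulator); the accumulator counts how many elements are actually evaluated when only two elements of each cached partition are consumed.

import org.apache.spark.{SparkConf, SparkContext}

object IncrementalCacheCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-check").setMaster("local[2]"))
    val acc = sc.longAccumulator("evaluations")

    val rdd = sc.parallelize(1 to 100, 16).map { v =>
      acc.add(1)   // counts every element that actually gets computed
      v
    }
    rdd.persist()

    // Consume only the first 2 elements of each of the 16 partitions.
    val sliced = rdd.mapPartitions(itr => itr.slice(0, 2))
    sliced.collect()

    // If caching were incremental this would be about 32 (2 x 16 partitions);
    // the question's experiment suggests whole partitions are materialised instead.
    println(s"elements evaluated: ${acc.value}")

    sc.stop()
  }
}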

Spark reading Partitioned avro significantly slower than pointing to exact location

Submitted by 十年热恋 on 2021-02-11 13:35:22
Question: I am trying to read partitioned Avro data which is partitioned based on Year, Month and Day, and it seems to be significantly slower than pointing directly at the exact path. In the physical plan I can see that the partition filters are being passed on, so it is not scanning the entire set of directories, but it is still significantly slower. E.g. reading the partitioned data like this:

profitLossPath = "abfss://raw@" + datalakename + ".dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/"
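For reference, a sketch of the two read paths being compared, assuming Spark 2.4+ with the spark-avro module on the classpath and partition columns literally named Year, Month and Day; the first read relies on partition pruning, the second points straight at one leaf directory.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-partition-read").getOrCreate()

val datalakename = "<your-datalake>"   // placeholder
val profitLossPath = "abfss://raw@" + datalakename + ".dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/"

// 1) Read the partitioned root and let Spark prune partitions via the filter.
val pruned = spark.read
  .format("avro")
  .load(profitLossPath)
  .where("Year = 2021 AND Month = 2 AND Day = 11")

// 2) Point directly at a single partition directory (no pruning involved).
val direct = spark.read
  .format("avro")
  .load(profitLossPath + "Year=2021/Month=2/Day=11/")

pruned.count()
direct.count()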

Unable to locate Hive jars to connect to metastore while using a PySpark job to connect to Athena tables

Submitted by 坚强是说给别人听的谎言 on 2021-02-11 13:19:44
Question: We are using a SageMaker instance to connect to EMR in AWS. We have some PySpark scripts that unload Athena tables and process them as part of a pipeline. We access the Athena tables through the Glue catalog, but when we try to run the job via spark-submit, our job fails. Code snippet:

from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe

def process_data():
    conf = SparkConf()
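The snippet above is cut off, but for comparison, here is a hedged sketch (shown in Scala; the PySpark builder accepts the same configuration keys) of how a session is typically wired up to read Glue-catalog tables on EMR; the Glue factory class name is the one documented for EMR and should be verified against your release.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("athena-via-glue")
  // Use the Hive-backed catalog so table lookups go through a metastore.
  .config("spark.sql.catalogImplementation", "hive")
  // On EMR, point the Hive client at the AWS Glue Data Catalog.
  .config("hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()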

Unable to stop Kerberos debug logging

Submitted by 你说的曾经没有我的故事 on 2021-02-11 13:01:35
Question: I'm using a Kerberos-enabled Spark cluster for running our Spark applications. Kerberos was set up previously by other members of the organization, and I have no idea how it works. In the early days we used the Kerberos debug logs to understand the exception "Unable to obtain password from user", which was being raised due to the absence of a JCE certificate in the cacerts folder of the JRE security directory. However, we no longer require the logs and thus used the -Dsun.security.krb5.debug
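The flag name above is truncated; as a point of comparison, a sketch (in Scala) of the standard JDK switches being forced off for driver and executors follows. The property names are real JDK debug flags; whether setting them here actually overrides your cluster's defaults depends on how the original options were injected.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("no-krb-debug")
  // Executor JVMs are launched after this point, so the flag takes effect there.
  .config("spark.executor.extraJavaOptions",
    "-Dsun.security.krb5.debug=false -Dsun.security.spnego.debug=false")
  // The driver JVM is already running by this point in client mode, so this key
  // normally has to be supplied at launch time instead
  // (spark-submit --conf or spark-defaults.conf).
  .config("spark.driver.extraJavaOptions",
    "-Dsun.security.krb5.debug=false -Dsun.security.spnego.debug=false")
  .getOrCreate()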

How to parse dynamic JSON with dynamic keys inside it in Scala

Submitted by 一个人想着一个人 on 2021-02-11 12:56:58
Question: I am trying to parse a JSON structure which is dynamic in nature and load it into a database, but I am facing difficulty where the JSON has dynamic keys inside it. Below is my sample JSON. I have tried using the explode function, but it didn't help. A mostly similar problem is described here: How to parse a dynamic JSON key in a Nested JSON result?

{
  "_id": {
    "planId": "5f34dab0c661d8337097afb9",
    "version": { "$numberLong": "1" },
    "period": { "name": "3Q20", "startDate": 20200629, "endDate": 20200927 },
    "line": "b443e9c0
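One common workaround, sketched below with an illustrative payload rather than the poster's exact schema: model the fixed fields explicitly and the dynamic part as a MAP<STRING, STRING>, then explode the map so the unknown keys become rows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("dynamic-json").master("local[*]").getOrCreate()
import spark.implicits._

// Toy payload: "periods" holds the dynamic keys (illustrative only).
val raw = Seq(
  """{"planId": "5f34dab0c661d8337097afb9", "periods": {"3Q20": "20200629", "4Q20": "20200928"}}""",
  """{"planId": "b443e9c0", "periods": {"1Q21": "20210104"}}"""
).toDF("json")

// Fixed part as named fields, dynamic part as a map.
val schema = StructType(Seq(
  StructField("planId", StringType),
  StructField("periods", MapType(StringType, StringType))
))

val parsed = raw
  .select(from_json($"json", schema).as("j"))
  .select($"j.planId", explode($"j.periods").as(Seq("periodKey", "periodValue")))

parsed.show(false)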

Count of values in a row in a Spark dataframe using Scala

Submitted by 笑着哭i on 2021-02-11 12:52:24
Question: I have a dataframe that contains the amount of sales for different items across different sales outlets. The dataframe shown below only shows a few of the items across a few sales outlets. There is a benchmark of 100 sales per day for each item. Each item that sold more than 100 is marked as "Yes", and those below 100 are marked as "No":

val df1 = Seq(
  ("Mumbai", 90, 109, , 101, 78, ............., "No", "Yes", "Yes", "No", .....),
  ("Singapore", 149, 129, , 201, 107, ............., "Yes
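A sketch of one way to compute the Yes/No flags and a per-row count of items over the benchmark, using a small made-up dataframe (the item column names are hypothetical).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().appName("row-counts").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical subset of the real data: one row per outlet, one column per item.
val df1 = Seq(
  ("Mumbai",     90, 109, 101,  78),
  ("Singapore", 149, 129, 201, 107)
).toDF("outlet", "item1", "item2", "item3", "item4")

val itemCols = df1.columns.filter(_.startsWith("item"))

// One "Yes"/"No" flag per item column.
val withFlags = itemCols.foldLeft(df1) { (acc, c) =>
  acc.withColumn(s"${c}_flag", when(col(c) > 100, "Yes").otherwise("No"))
}

// Count of items above 100 in each row.
val withCount = withFlags.withColumn(
  "items_above_100",
  itemCols.map(c => when(col(c) > 100, 1).otherwise(0)).reduce(_ + _)
)

withCount.show(false)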

Spark - Scope, Data Frame, and memory management

Submitted by 空扰寡人 on 2021-02-11 12:41:43
Question: I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files, each independently loaded into a DataFrame, some operation is performed, and then we write dfOutput to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files) {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")         // Delimiter is tab
    .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
    .schema
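Below is a sketch of the loop the question describes, with a hypothetical file list and output path and the built-in CSV reader standing in for the spark-csv package; an explicit unpersist is included only to make the intended per-iteration memory lifecycle visible.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("scope-demo").master("local[*]").getOrCreate()

// Hypothetical stand-in for getListOfFiles("outputs/emailsSplit").
val files = Seq("outputs/emailsSplit/part-0.tsv", "outputs/emailsSplit/part-1.tsv")

for (file <- files) {
  // df is scoped to a single iteration; once nothing references it, any cached
  // blocks it holds become eligible for cleanup.
  val df: DataFrame = spark.read
    .option("delimiter", "\t")
    .option("header", "true")
    .csv(file)

  val dfOutput = df.filter("body IS NOT NULL")   // placeholder for "some operation"

  dfOutput.write.mode("overwrite").parquet(file + ".out")

  df.unpersist()   // a no-op unless df was persisted; shown for clarity
}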

RDS to S3 - Data Transformation AWS

Submitted by 半城伤御伤魂 on 2021-02-11 12:31:32
Question: I have about 30 tables in my RDS Postgres/Oracle instance (haven't decided whether it is Oracle or Postgres yet). I want to fetch all the records that have been inserted or updated in the last 4 hours (configurable), create a CSV file for each table, and store the files in S3. I want this whole process to be transactional: if there is any error in fetching data from one table, I don't want data pertaining to the other 29 tables to be persisted in S3. The data isn't very large; it should be
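If Spark ends up doing the extract, a rough sketch of the per-table pull is shown below: a JDBC read filtered on an assumed last_updated timestamp column, written as CSV to a staging prefix that is promoted only once every table has succeeded. All connection details, column names and bucket paths are placeholders.

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rds-to-s3").getOrCreate()

val jdbcUrl = "jdbc:postgresql://<rds-endpoint>:5432/mydb"   // placeholder
val props = new Properties()
props.setProperty("user", "<user>")                          // placeholder
props.setProperty("password", "<password>")                  // placeholder

val tables = Seq("orders", "customers")                      // stand-in for the ~30 tables
val stagePrefix = "s3a://my-bucket/staging/"                 // placeholder

// Write every table's 4-hour window into the staging prefix first; only after
// all of them succeed is the prefix copied/promoted to its final location, and
// on any failure the staging prefix is deleted, keeping the output all-or-nothing.
tables.foreach { t =>
  val query = s"(SELECT * FROM $t WHERE last_updated > now() - interval '4 hours') AS src"
  spark.read.jdbc(jdbcUrl, query, props)
    .write.mode("overwrite")
    .option("header", "true")
    .csv(stagePrefix + t)
}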

Apache Spark + Parquet not Respecting Configuration to use “Partitioned” Staging S3A Committer

Submitted by 妖精的绣舞 on 2021-02-11 12:31:30
Question: I am writing partitioned data (Parquet files) to AWS S3 using Apache Spark (3.0) from my local machine, without having Hadoop installed on the machine. I was getting a FileNotFoundException while writing to S3 when I had a lot of files to write to around 50 partitions (partitionBy = date). Then I came across the new S3A committers, so I tried to configure the "partitioned" committer instead. But I can still see that Spark uses ParquetOutputCommitter instead of PartitionedStagingCommitter when the file
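For comparison, a sketch of the settings usually combined to route Parquet writes through the S3A staging committers; it assumes the spark-hadoop-cloud and hadoop-aws modules are on the classpath, and the key and class names are the ones documented for Hadoop 3.x with Spark 3.x, so verify them against your versions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-partitioned-committer")
  // Select the S3A "partitioned" staging committer.
  .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  // Make Spark's commit protocol delegate to the Hadoop path output committer,
  // so Parquet no longer forces ParquetOutputCommitter.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// Example partitioned write (path and column are placeholders):
// df.write.partitionBy("date").parquet("s3a://my-bucket/table/")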