apache-spark

How to get the size of a data frame before doing the broadcast join in pyspark

Posted by 时光总嘲笑我的痴心妄想 on 2021-02-08 09:14:02
Question: I am new to Spark and I want to do a broadcast join; before that, I am trying to get the size of the DataFrame I want to broadcast. Is there any way to find the size of a DataFrame? I am using Python as my programming language for Spark. Any help is much appreciated.

Answer 1: If you are looking for the size in bytes as well as the size in row count, follow this - Alternative 1

// ### Alternative -1
/**
 * file content
 * spark-test-data.json
 * --------------------
 * {"id":1,"name":"abc1"}
 * {"id":2,"name":
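
A minimal PySpark sketch of one way to get both numbers before the join (not necessarily the rest of this answer's approach): count() gives the row count, and the cost-mode plan printed by explain() shows the optimizer's sizeInBytes estimate (Spark 3.0+). The file path and the join key "id" are taken from the sample data above; the second input is hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-size-check").getOrCreate()

# Size in row count.
small_df = spark.read.json("spark-test-data.json")
print(small_df.count())

# Size in bytes: the cost-mode plan prints a "Statistics(sizeInBytes=...)" line.
small_df.explain(mode="cost")

# If the estimate is below spark.sql.autoBroadcastJoinThreshold (10 MB by default),
# an explicit broadcast hint is usually safe.
big_df = spark.read.json("big-data.json")  # hypothetical larger input
joined = big_df.join(broadcast(small_df), "id")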

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

Posted by 天大地大妈咪最大 on 2021-02-08 08:44:17
Question: Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example, if the micro batches below are what I receive, then I want to keep the most recent record (sorted on the timestamp field) for each country.

batchId: 0
Australia, 10, 2020-05-05 00:00:06
Belarus, 10, 2020-05-05 00:00:06

batchId: 1
Australia, 10, 2020-05-05 00:00:08
Belarus, 10, 2020-05-05 00:00:03

Then output
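
A hedged sketch of one common workaround (not necessarily the answer the thread settled on): process the stream with foreachBatch and keep, per country, the row with the latest timestamp using a window function. The column names country, amount and ts are assumed from the sample rows, and the sink path is hypothetical; note this deduplicates within each micro batch only, so keeping the latest across batches still needs stateful processing or an upsert-capable sink.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def keep_latest(batch_df, batch_id):
    # Rank rows per country by timestamp, newest first, and keep rank 1.
    w = Window.partitionBy("country").orderBy(F.col("ts").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))
    latest.write.mode("append").parquet("/tmp/dedup-output")  # hypothetical sink

# query = (stream_df.writeStream
#          .foreachBatch(keep_latest)
#          .outputMode("update")
#          .start())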

How to extract a bz2 file in spark

Posted by 雨燕双飞 on 2021-02-08 08:39:17
Question: I have a CSV file compressed in bz2 format. Like on unix/linux, do we have any single-line command to extract/decompress the file file.csv.bz2 to file.csv in spark-scala?

Answer 1: You can use the built-in function in SparkContext (sc); this worked for me: sc.textFile("file.csv.bz2").saveAsTextFile("file.csv")

Source: https://stackoverflow.com/questions/52981195/how-to-extract-a-bz2-file-in-spark
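
A minimal PySpark sketch of the same idea, with one caveat: Spark decompresses .bz2 transparently when reading text, and saveAsTextFile("file.csv") creates a directory named file.csv containing part files rather than a single CSV file; coalesce(1) at least keeps it to one part file.

rdd = sc.textFile("file.csv.bz2")            # bz2 is decompressed on read
rdd.coalesce(1).saveAsTextFile("file.csv")   # writes a directory of part files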

Pyspark - Looping through structType and ArrayType to do typecasting in the structfield

Posted by 让人想犯罪 __ on 2021-02-08 08:38:15
Question: I am quite new to pyspark and this problem is boggling me. Basically, I am looking for a scalable way to loop over a structType or ArrayType and typecast the fields inside. Example of my data schema:

root
 |-- _id: string (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- card_rates: struct (nullable = true)
 |    |-- rate_1: integer (nullable = true)
 |    |-- rate_2: integer (nullable = true)
 |    |-- rate_3: integer (nullable = true)
 |    |-- card_fee: integer (nullable = true)
 |    |-- payment_method: string
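
A hedged sketch of one general pattern for this kind of schema: walk the schema recursively and rebuild the column expression, casting as you go. It assumes the goal is to cast every IntegerType leaf inside nested structs and arrays to DoubleType (adjust the predicate and target type as needed) and assumes Spark 3.1+ for functions.transform with a Python lambda.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType, IntegerType, DoubleType

def cast_nested(col, data_type):
    # Rebuild the column expression recursively, casting integer leaves.
    if isinstance(data_type, StructType):
        return F.struct(*[
            cast_nested(col.getField(f.name), f.dataType).alias(f.name)
            for f in data_type.fields
        ])
    if isinstance(data_type, ArrayType):
        return F.transform(col, lambda x: cast_nested(x, data_type.elementType))
    if isinstance(data_type, IntegerType):
        return col.cast(DoubleType())
    return col

# df2 = df.withColumn(
#     "card_rates",
#     cast_nested(F.col("card_rates"), df.schema["card_rates"].dataType))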

Merge Maps in scala dataframe

Posted by 南楼画角 on 2021-02-08 08:32:30
Question: I have a dataframe with columns col1, col2, col3. col1 and col2 are strings. col3 is a Map[String,String] defined as below:

|-- col3: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

I have grouped by col1, col2 and aggregated using collect_list to get an array of maps, stored in col4.

df.groupBy($"col1", $"col2").agg(collect_list($"col3").as("col4"))

|-- col4: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: string
|    |    |-- value:
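
A hedged sketch of one way to flatten col4 into a single map (shown in PySpark for consistency with the other snippets here; the same SQL expression works from Scala via expr). It assumes Spark 2.4+ for the aggregate higher-order function, and if the maps can share keys, spark.sql.mapKeyDedupPolicy must be set to LAST_WIN (Spark 3.0+) so map_concat keeps the last value instead of failing.

from pyspark.sql import functions as F

spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df4 = df.groupBy("col1", "col2").agg(F.collect_list("col3").alias("col4"))
merged = df4.withColumn(
    "col4_merged",
    # Fold the array of maps into one map, concatenating as we go.
    F.expr("aggregate(col4, map(), (acc, m) -> map_concat(acc, m))"))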

spark.sql.hive.filesourcePartitionFileCacheSize

Posted by 喜夏-厌秋 on 2021-02-08 08:20:37
Question: Just wondering if anyone is aware of this warning info:

18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance

I've seen this a lot when trying to load some big dataframe with many partitions from S3 into Spark. It never really causes any issues for the job; I just wonder what the use of that config property is and how to
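
The property caps the in-memory cache of partition file metadata that Spark keeps for partitioned file-source tables; the warning only says that cache is full and entries are being evicted, which can slow query planning for tables with many partitions. A hedged sketch of raising the limit at session creation (the 1 GB value is purely illustrative, not a recommendation):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-metadata-cache")
         # Roughly 1 GB for cached partition file metadata (default is 262144000 bytes, ~250 MB).
         .config("spark.sql.hive.filesourcePartitionFileCacheSize", 1024 * 1024 * 1024)
         .getOrCreate())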

How can I estimate the size in bytes of each column in a Spark DataFrame?

Posted by 僤鯓⒐⒋嵵緔 on 2021-02-08 08:16:03
Question: I have a very large Spark DataFrame with a number of columns, and I want to make an informed judgement about whether or not to keep them in my pipeline, in part based on how big they are. By "how big," I mean the size in bytes in RAM when this DataFrame is cached, which I expect to be a decent estimate of the computational cost of processing this data. Some columns are simple types (e.g. doubles, integers) but others are complex types (e.g. arrays and maps of variable length). An approach I
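
A rough, hedged sketch of one possible approximation (my own, not taken from the thread): sample some rows, measure each column's Python-serialized size, and scale by the total row count. This tracks Spark's cached size only loosely, but it is often enough for keep-or-drop decisions on individual columns.

import pickle

def approx_column_sizes(df, sample_rows=10000):
    # Extrapolate per-column byte size from a pickled sample of rows.
    total = df.count()
    sample = df.limit(sample_rows).collect()
    if not sample:
        return {}
    sizes = {}
    for field in df.schema.fields:
        sampled = sum(len(pickle.dumps(row[field.name])) for row in sample)
        sizes[field.name] = sampled / len(sample) * total
    return sizes

# for name, size in approx_column_sizes(df).items():
#     print(f"{name}: ~{size / 1024 / 1024:.1f} MB")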

Multiply SparseVectors element-wise

Posted by 不想你离开。 on 2021-02-08 08:15:17
Question: I have 2 RDDs and I want to do an element-wise multiplication between them. Let's say that I have the following RDDs (example):

a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)

It can be seen that b is sparse and I want to avoid multiplying a zero value with another value. I am doing the following:

from pyspark.mllib.linalg import Vectors
def create_sparce_matrix(a_list):
    length = len(a_list
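
A hedged sketch of one way to finish the idea (not necessarily how the thread resolved it): convert b to SparseVectors, join the two RDDs on their key, and multiply only at the indices actually stored on the sparse side, so zero entries are never touched.

from pyspark.mllib.linalg import Vectors

a = [(1, [0.28, 1, 0.55]), (2, [0.28, 1, 0.55]), (3, [0.28, 1, 0.55])]
b = [(1, [0.28, 0, 0]), (2, [0, 0, 0]), (3, [0, 1, 0])]

aRDD = sc.parallelize(a)
# Keep only the non-zero entries of b as SparseVectors.
bRDD = sc.parallelize(b).mapValues(
    lambda xs: Vectors.sparse(len(xs), [(i, v) for i, v in enumerate(xs) if v != 0]))

def sparse_mult(pair):
    dense, sparse = pair
    # Multiply only at the sparse vector's stored indices.
    return Vectors.sparse(len(dense),
                          [(int(i), dense[int(i)] * v)
                           for i, v in zip(sparse.indices, sparse.values)])

result = aRDD.join(bRDD).mapValues(sparse_mult)
# e.g. result.collect() includes (1, SparseVector(3, {0: 0.0784})) and (3, SparseVector(3, {1: 1.0}))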