apache-spark

How to get the size of a data frame before doing the broadcast join in pyspark

Posted by 时光总嘲笑我的痴心妄想 on 2021-02-08 09:14:02
Question: I am new to Spark and I want to do a broadcast join; before that, I am trying to get the size of the DataFrame I want to broadcast. Is there any way to find the size of a DataFrame? I am using Python as my programming language for Spark. Any help is much appreciated.

Answer 1: If you are looking for the size in bytes as well as the size in row count, follow this - Alternative 1

// ### Alternative -1
/**
 * file content
 * spark-test-data.json
 * --------------------
 * {"id":1,"name":"abc1"}
 * {"id":2,"name":
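
A minimal PySpark sketch of one way to get both numbers before the join (not necessarily the rest of this answer's approach): count() gives the row count, and the cost-mode plan printed by explain() shows the optimizer's sizeInBytes estimate (Spark 3.0+). The file path and the join key "id" are taken from the sample data above; the second input is hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-size-check").getOrCreate()

# Size in row count.
small_df = spark.read.json("spark-test-data.json")
print(small_df.count())

# Size in bytes: the cost-mode plan prints a "Statistics(sizeInBytes=...)" line.
small_df.explain(mode="cost")

# If the estimate is below spark.sql.autoBroadcastJoinThreshold (10 MB by default),
# an explicit broadcast hint is usually safe.
big_df = spark.read.json("big-data.json")  # hypothetical larger input
joined = big_df.join(broadcast(small_df), "id")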

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

Posted by 天大地大妈咪最大 on 2021-02-08 08:44:17
Question: Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example, if the micro batches below are what I receive, then I want to keep the most recent record (sorted on the timestamp field) for each country.

batchId: 0
Australia, 10, 2020-05-05 00:00:06
Belarus, 10, 2020-05-05 00:00:06

batchId: 1
Australia, 10, 2020-05-05 00:00:08
Belarus, 10, 2020-05-05 00:00:03

Then output
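
A hedged sketch of one common workaround (not necessarily the answer the thread settled on): process the stream with foreachBatch and keep, per country, the row with the latest timestamp using a window function. The column names country, amount and ts are assumed from the sample rows, and the sink path is hypothetical; note this deduplicates within each micro batch only, so keeping the latest across batches still needs stateful processing or an upsert-capable sink.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def keep_latest(batch_df, batch_id):
    # Rank rows per country by timestamp, newest first, and keep rank 1.
    w = Window.partitionBy("country").orderBy(F.col("ts").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))
    latest.write.mode("append").parquet("/tmp/dedup-output")  # hypothetical sink

# query = (stream_df.writeStream
#          .foreachBatch(keep_latest)
#          .outputMode("update")
#          .start())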

How to extract a bz2 file in spark

Posted by 雨燕双飞 on 2021-02-08 08:39:17
Question: I have a CSV file compressed in bz2 format. Like on unix/linux, do we have any single-line command to extract/decompress the file file.csv.bz2 to file.csv in spark-scala?

Answer 1: You can use the built-in function in SparkContext (sc); this worked for me: sc.textFile("file.csv.bz2").saveAsTextFile("file.csv")

Source: https://stackoverflow.com/questions/52981195/how-to-extract-a-bz2-file-in-spark
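
A minimal PySpark sketch of the same idea, with one caveat: Spark decompresses .bz2 transparently when reading text, and saveAsTextFile("file.csv") creates a directory named file.csv containing part files rather than a single CSV file; coalesce(1) at least keeps it to one part file.

rdd = sc.textFile("file.csv.bz2")            # bz2 is decompressed on read
rdd.coalesce(1).saveAsTextFile("file.csv")   # writes a directory of part files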

Pyspark - Looping through structType and ArrayType to do typecasting in the structfield

Posted by 让人想犯罪 __ on 2021-02-08 08:38:15
Question: I am quite new to pyspark and this problem is boggling me. Basically, I am looking for a scalable way to loop over a structType or ArrayType and typecast the fields inside. Example of my data schema:

root
 |-- _id: string (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- card_rates: struct (nullable = true)
 |    |-- rate_1: integer (nullable = true)
 |    |-- rate_2: integer (nullable = true)
 |    |-- rate_3: integer (nullable = true)
 |    |-- card_fee: integer (nullable = true)
 |    |-- payment_method: string
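
A hedged sketch of one general pattern for this kind of schema: walk the schema recursively and rebuild the column expression, casting as you go. It assumes the goal is to cast every IntegerType leaf inside nested structs and arrays to DoubleType (adjust the predicate and target type as needed) and assumes Spark 3.1+ for functions.transform with a Python lambda.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType, IntegerType, DoubleType

def cast_nested(col, data_type):
    # Rebuild the column expression recursively, casting integer leaves.
    if isinstance(data_type, StructType):
        return F.struct(*[
            cast_nested(col.getField(f.name), f.dataType).alias(f.name)
            for f in data_type.fields
        ])
    if isinstance(data_type, ArrayType):
        return F.transform(col, lambda x: cast_nested(x, data_type.elementType))
    if isinstance(data_type, IntegerType):
        return col.cast(DoubleType())
    return col

# df2 = df.withColumn(
#     "card_rates",
#     cast_nested(F.col("card_rates"), df.schema["card_rates"].dataType))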

Merge Maps in scala dataframe

Posted by 南楼画角 on 2021-02-08 08:32:30
Question: I have a dataframe with columns col1, col2, col3. col1 and col2 are strings. col3 is a Map[String,String] defined as below:

|-- col3: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

I have grouped by col1, col2 and aggregated using collect_list to get an array of maps, stored in col4.

df.groupBy($"col1", $"col2").agg(collect_list($"col3").as("col4"))

|-- col4: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: string
|    |    |-- value:
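
A hedged sketch of one way to flatten col4 into a single map (shown in PySpark for consistency with the other snippets here; the same SQL expression works from Scala via expr). It assumes Spark 2.4+ for the aggregate higher-order function, and if the maps can share keys, spark.sql.mapKeyDedupPolicy must be set to LAST_WIN (Spark 3.0+) so map_concat keeps the last value instead of failing.

from pyspark.sql import functions as F

spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df4 = df.groupBy("col1", "col2").agg(F.collect_list("col3").alias("col4"))
merged = df4.withColumn(
    "col4_merged",
    # Fold the array of maps into one map, concatenating as we go.
    F.expr("aggregate(col4, map(), (acc, m) -> map_concat(acc, m))"))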

spark.sql.hive.filesourcePartitionFileCacheSize

Posted by 喜夏-厌秋 on 2021-02-08 08:20:37
Question: Just wondering if anyone is aware of this warning info:

18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance

I've seen this a lot when trying to load some big dataframe with many partitions from S3 into Spark. It never really causes any issues for the job; I just wonder what the use of that config property is and how to
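
The property caps the in-memory cache of partition file metadata that Spark keeps for partitioned file-source tables; the warning only says that cache is full and entries are being evicted, which can slow query planning for tables with many partitions. A hedged sketch of raising the limit at session creation (the 1 GB value is purely illustrative, not a recommendation):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-metadata-cache")
         # Roughly 1 GB for cached partition file metadata (default is 262144000 bytes, ~250 MB).
         .config("spark.sql.hive.filesourcePartitionFileCacheSize", 1024 * 1024 * 1024)
         .getOrCreate())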

How can I estimate the size in bytes of each column in a Spark DataFrame?

Posted by 僤鯓⒐⒋嵵緔 on 2021-02-08 08:16:03
Question: I have a very large Spark DataFrame with a number of columns, and I want to make an informed judgement about whether or not to keep them in my pipeline, in part based on how big they are. By "how big," I mean the size in bytes in RAM when this DataFrame is cached, which I expect to be a decent estimate of the computational cost of processing this data. Some columns are simple types (e.g. doubles, integers) but others are complex types (e.g. arrays and maps of variable length). An approach I
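
A rough, hedged sketch of one possible approximation (my own, not taken from the thread): sample some rows, measure each column's Python-serialized size, and scale by the total row count. This tracks Spark's cached size only loosely, but it is often enough for keep-or-drop decisions on individual columns.

import pickle

def approx_column_sizes(df, sample_rows=10000):
    # Extrapolate per-column byte size from a pickled sample of rows.
    total = df.count()
    sample = df.limit(sample_rows).collect()
    if not sample:
        return {}
    sizes = {}
    for field in df.schema.fields:
        sampled = sum(len(pickle.dumps(row[field.name])) for row in sample)
        sizes[field.name] = sampled / len(sample) * total
    return sizes

# for name, size in approx_column_sizes(df).items():
#     print(f"{name}: ~{size / 1024 / 1024:.1f} MB")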

Multiply SparseVectors element-wise

Posted by 不想你离开。 on 2021-02-08 08:15:17
Question: I have 2 RDDs and I want to do an element-wise multiplication between them. Let's say that I have the following RDDs (example):

a = ((1,[0.28,1,0.55]),(2,[0.28,1,0.55]),(3,[0.28,1,0.55]))
aRDD = sc.parallelize(a)
b = ((1,[0.28,0,0]),(2,[0,0,0]),(3,[0,1,0]))
bRDD = sc.parallelize(b)

It can be seen that b is sparse and I want to avoid multiplying a zero value with another value. I am doing the following:

from pyspark.mllib.linalg import Vectors
def create_sparce_matrix(a_list):
    length = len(a_list
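
A hedged sketch of one way to finish the idea (not necessarily how the thread resolved it): convert b to SparseVectors, join the two RDDs on their key, and multiply only at the indices actually stored on the sparse side, so zero entries are never touched.

from pyspark.mllib.linalg import Vectors

a = [(1, [0.28, 1, 0.55]), (2, [0.28, 1, 0.55]), (3, [0.28, 1, 0.55])]
b = [(1, [0.28, 0, 0]), (2, [0, 0, 0]), (3, [0, 1, 0])]

aRDD = sc.parallelize(a)
# Keep only the non-zero entries of b as SparseVectors.
bRDD = sc.parallelize(b).mapValues(
    lambda xs: Vectors.sparse(len(xs), [(i, v) for i, v in enumerate(xs) if v != 0]))

def sparse_mult(pair):
    dense, sparse = pair
    # Multiply only at the sparse vector's stored indices.
    return Vectors.sparse(len(dense),
                          [(int(i), dense[int(i)] * v)
                           for i, v in zip(sparse.indices, sparse.values)])

result = aRDD.join(bRDD).mapValues(sparse_mult)
# e.g. result.collect() includes (1, SparseVector(3, {0: 0.0784})) and (3, SparseVector(3, {1: 1.0}))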