apache-spark

SparkContext Error - File not found /tmp/spark-events does not exist

被刻印的时光 ゝ submitted on 2021-02-05 20:07:09
Question: I am running a Python Spark application via an API call. On submitting the application, the response is "Failed". After SSH-ing into the worker, my Python application exists in /root/spark/work/driver-id/wordcount.py and the error can be found in /root/spark/work/driver-id/stderr, which shows the following:
Traceback (most recent call last):
  File "/root/wordcount.py", line 34, in <module>
    main()
  File "/root/wordcount.py", line 18, in main
    sc = SparkContext(conf=conf)
  File "/root/spark/python/lib/pyspark.zip/pyspark/context.py
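This error usually means Spark event logging is enabled (spark.eventLog.enabled=true) while the directory it points at, which defaults to /tmp/spark-events, does not exist on that machine. A minimal sketch of the two usual remedies, assuming a standalone PySpark app (the application name and the alternative path below are illustrative):

import os
from pyspark import SparkConf, SparkContext

# Remedy 1: create the default event-log directory before starting the context.
os.makedirs("/tmp/spark-events", exist_ok=True)

# Remedy 2: point the event log at a directory that is known to exist
# (that directory must also be present on the node).
conf = (SparkConf()
        .setAppName("wordcount")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "file:///var/log/spark-events"))
sc = SparkContext(conf=conf)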

Difference between explode and explode_outer

六眼飞鱼酱① submitted on 2021-02-05 11:39:22
Question: What is the difference between explode and explode_outer? The documentation for both functions is the same, and the examples for both functions are identical:
SELECT explode(array(10, 20));
10
20
and
SELECT explode_outer(array(10, 20));
10
20
The Spark source suggests that there is a difference between the two functions, expression[Explode]("explode") versus expressionGeneratorOuter[Explode]("explode_outer"), but what is the effect of expressionGeneratorOuter compared to expression?
Answer 1: explode
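A minimal PySpark sketch of the behavioral difference (the sample rows below are illustrative): explode drops rows whose array is null or empty, while explode_outer keeps them and produces a null element instead.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer

spark = SparkSession.builder.getOrCreate()

# Three cases: a populated array, an empty array, and a null array.
df = spark.createDataFrame(
    [(1, [10, 20]), (2, []), (3, None)],
    "id INT, values ARRAY<INT>",
)

# explode: rows 2 and 3 disappear, because there is no element to generate.
df.select("id", explode("values").alias("v")).show()

# explode_outer: rows 2 and 3 survive with v = null.
df.select("id", explode_outer("values").alias("v")).show()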

pySpark, aggregate complex function (difference of consecutive events)

喜你入骨 submitted on 2021-02-05 11:16:29
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days he/she was active. For instance, for a given user the DataFrame may look something like this:
userid day
1      2016-09-18
1      2016-09-20
1      2016-09-25
If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this:
import numpy as np
np.mean(np.diff(df[df.userid == 1].day))
However, this is quite
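One way to express the same computation directly in PySpark (a sketch; it assumes day is a date column and that the average gap in whole days is what is wanted) is a window function: lag gives the previous active day per user, datediff the gap between consecutive days, and a groupBy the per-user average.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("day")

gaps = (df
        .withColumn("prev_day", F.lag("day").over(w))             # previous active day, null for each user's first row
        .withColumn("gap_days", F.datediff("day", "prev_day")))   # difference of consecutive events

avg_gap = gaps.groupBy("userid").agg(F.avg("gap_days").alias("avg_interval_days"))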

Normalize a complex nested JSON file

心不动则不痛 submitted on 2021-02-05 09:37:41
Question: I'm trying to normalize the JSON file below into 4 tables: "content", "Modules", "Images", and everything else in another table.
{
  "id": "0000050a",
  "revision": 1580225050941,
  "slot": "product-description",
  "type": "E",
  "create_date": 1580225050941,
  "modified_date": 1580225050941,
  "creator": "Auto",
  "modifier": "Auto",
  "audit_info": {
    "date": 1580225050941,
    "source": "AutoService",
    "username": "Auto"
  },
  "total_ID": 1,
  "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
  "content": [{
    "ID": ["B01"],
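A minimal PySpark sketch of one way to split such a document; since the JSON above is truncated, the field names past "content" ("Modules", "Images") are assumptions and the file name is illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# multiLine is needed because the document spans several lines.
raw = spark.read.option("multiLine", True).json("product.json")

# "Everything else": the top-level fields minus the nested array.
everything_else = raw.drop("content")

# "content": one row per element of the content array, keyed back to the parent id.
content = (raw
           .select("id", F.explode("content").alias("c"))
           .select("id", "c.*"))

# "Modules" / "Images": if each content element carries arrays with those names
# (hypothetical, given the truncation), they can be exploded the same way, e.g.:
# modules = content.select("id", F.explode("Modules").alias("m")).select("id", "m.*")
# images  = content.select("id", F.explode("Images").alias("i")).select("id", "i.*")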

Does the state also get removed on event timeout with Spark Structured Streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Does the state get timed out and removed at the same time, or does it only time out while the state itself remains, for both processing-time and event-time timeouts? I was doing some experiments with mapGroupsWithState/flatMapGroupsWithState and am confused about the state timeout. Consider that I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say:
ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState(

Spark Structured Streaming with Kafka leads to only one batch (PySpark)

扶醉桌前 submitted on 2021-02-05 08:47:26
Question: I have the following code and I'm wondering why it generates only one batch:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "IP").option("subscribe", "Topic").option("startingOffsets", "earliest").load()
// group by on sliding windows
query = slidingWindowsDF.writeStream.queryName("bla").outputMode("complete").format("memory").start()
The application is launched with the following parameters:
spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure
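A likely explanation, stated as an assumption because the configuration above is truncated: the spark.streaming.backpressure.* settings belong to the old DStream API and are ignored by Structured Streaming, so nothing limits the first micro-batch and the whole backlog read with startingOffsets=earliest lands in a single batch. The Kafka source option maxOffsetsPerTrigger is the Structured Streaming way to cap batch size; a sketch with an illustrative limit:

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "IP")
      .option("subscribe", "Topic")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 10000)   # cap on records per micro-batch (illustrative value)
      .load())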

Unable to write PySpark DataFrame created from two zipped DataFrames

有些话、适合烂在心里 submitted on 2021-02-05 08:32:40
Question: I am trying to follow the example given here for combining two DataFrames without a shared join key (combining by "index" as in a database table or Pandas DataFrame, except that PySpark does not have that concept).
My code:
left_df = left_df.repartition(right_df.rdd.getNumPartitions())  # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame
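RDD.zip requires the two RDDs to have both the same number of partitions and the same number of elements per partition, which repartitioning alone does not guarantee. A common workaround is to attach an explicit index with zipWithIndex and join on it; the sketch below assumes equal row counts and no overlapping column names, and the helper with_index and the _idx column are hypothetical names.

from pyspark.sql import Row

def with_index(df, name="_idx"):
    # Append a 0-based row index as a regular column.
    return (df.rdd
              .zipWithIndex()
              .map(lambda pair: Row(**pair[0].asDict(), **{name: pair[1]}))
              .toDF())

joined = (with_index(left_df)
          .join(with_index(right_df), on="_idx", how="inner")
          .drop("_idx"))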

How to distribute data evenly in Kafka when producing messages through Spark?

依然范特西╮ submitted on 2021-02-05 08:10:45
Question: I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) receives more data than the others.
+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      |
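For context, Spark's Kafka sink hands each row's key column to the Kafka producer, whose default partitioner hashes the key to pick a partition, so a skewed or constant key produces exactly this picture. A sketch of one way to spread the load, assuming the data contains a reasonably uniform column such as userid and that a JSON value is acceptable (the broker address, topic name and checkpoint path are illustrative):

from pyspark.sql import functions as F

(df
 .withColumn("key", F.col("userid").cast("string"))        # well-distributed key -> hashed across partitions
 .withColumn("value", F.to_json(F.struct(*df.columns)))    # serialize the original columns as the payload
 .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 .writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "events")
 .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
 .start())

If the skew instead comes from records that legitimately share a key, leaving the key null (so the producer round-robins or sticky-partitions them) or supplying a custom partitioner via the kafka.partitioner.class producer option are alternatives.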