apache-spark-sql

Maximum number of concurrent tasks in 1 DPU in AWS Glue

Submitted by 亡梦爱人 on 2020-08-26 04:23:31

Question: A standard DPU in AWS Glue comes with 4 vCPUs and 2 executors. I am confused about the maximum number of concurrent tasks that can run in parallel with this configuration. Is it 4 or 8 on a single DPU with 4 vCPUs and 2 executors?

Answer 1: I had a similar discussion with the AWS Glue support team about this; I'll share what they told me about the Glue configuration. Take the Standard and the G1.X configurations as examples. Standard DPU configuration: 1 DPU reserved for the master node, 1 executor
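Whatever the exact Glue worker layout turns out to be, the Spark-level rule of thumb is that the ceiling on parallel tasks equals the number of executors times the cores per executor. A minimal sketch of that arithmetic, assuming (since the answer above is cut off) that each Standard executor is configured with 4 cores:

def max_concurrent_tasks(num_executors: int, cores_per_executor: int) -> int:
    # Each Spark task occupies one task slot (one core), so the ceiling on
    # tasks running in parallel is executors multiplied by cores per executor.
    return num_executors * cores_per_executor

# One Standard DPU provides 2 executors; 4 cores per executor is an assumption here.
print(max_concurrent_tasks(num_executors=2, cores_per_executor=4))  # 8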

Filter Pyspark Dataframe with udf on entire row

Submitted by 那年仲夏 on 2020-08-25 07:33:49

Question: Is there a way to select the entire row as a column to pass into a PySpark filter UDF? I have a complex filtering function my_filter that I want to apply to the entire DataFrame:

my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation. I know that I can convert the DataFrame to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back
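A common workaround (not taken from the truncated post itself) is to bundle the whole row into a single struct column and hand that to the UDF, which then receives it as a Row-like object. A minimal PySpark sketch, with my_filter stubbed out as a hypothetical predicate:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

def my_filter(row):
    # Hypothetical row-level predicate; fields are accessed by name on the struct.
    return row.id > 1

my_filter_udf = udf(my_filter, BooleanType())

# struct(*df.columns) packs every column into one struct column,
# so the UDF sees the entire row as a single argument.
new_df = df.filter(my_filter_udf(struct(*df.columns)))
new_df.show()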

Joining Two Datasets with Predicate Pushdown

Submitted by China☆狼群 on 2020-08-25 04:04:09

Question: I have a Dataset that I created from an RDD and am trying to join it with another Dataset that is created from my Phoenix table:

val dfToJoin = sparkSession.createDataset(rddToJoin)
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")

When I execute it, it seems that the whole database table is loaded to do the join. Is there a way to do such a join so that the
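The post is truncated, but one way to avoid scanning the whole Phoenix table is to push the join keys down as a filter on the table read before joining, and to broadcast the small side. A PySpark sketch of that idea (the question's code is Scala; the table name, zkURL, and join column are taken from it, everything else is assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

# Small, locally built dataset standing in for the Dataset created from the RDD.
df_to_join = spark.createDataFrame([(1,), (2,), (3,)], ["columnToJoinOn"])

table_df = (spark.read
            .format("org.apache.phoenix.spark")
            .option("table", "table")
            .option("zkURL", "localhost")
            .load())

# Collect the (assumed small) set of join keys on the driver and push them
# into the scan as an IN filter; a source that supports predicate pushdown
# can then prune the table server-side instead of loading it whole.
keys = [r["columnToJoinOn"] for r in df_to_join.select("columnToJoinOn").distinct().collect()]
filtered = table_df.filter(col("columnToJoinOn").isin(keys))

# Broadcasting the small side keeps the join itself from shuffling the big table.
joined_df = filtered.join(broadcast(df_to_join), "columnToJoinOn")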

How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming

Submitted by 陌路散爱 on 2020-08-24 10:33:59

Question: I have a static DataFrame with millions of rows, as follows.

Static DataFrame:
--------------
|id|time_stamp|
--------------
|1 |1540527851|
|2 |1540525602|
|3 |1530529187|
|4 |1520529185|
|5 |1510529182|
|6 |1578945709|
--------------

Now in every batch, a streaming DataFrame is formed that contains the id and an updated time_stamp after some operations, like below.

In the first batch:
--------------
|id|time_stamp|
--------------
|1 |1540527888|
|2 |1540525999|
|3 |1530529784|
--------------

Now in every
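The question is cut off mid-sentence, but a common pattern for this kind of update is to merge each micro-batch into the static data inside foreachBatch, keeping the newest time_stamp per id. A rough PySpark sketch of that idea (schema and names assumed from the tables above, stream_df standing in for the streaming DataFrame):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Static side, assumed to be available as an ordinary DataFrame.
static_df = spark.createDataFrame(
    [(1, 1540527851), (2, 1540525602), (3, 1530529187)], ["id", "time_stamp"])

def merge_batch(batch_df, batch_id):
    # foreachBatch runs this function on the driver, so reassigning the
    # module-level static_df is possible; keep only the newest time_stamp per id.
    global static_df
    merged = static_df.unionByName(batch_df)
    w = Window.partitionBy("id").orderBy(F.col("time_stamp").desc())
    static_df = (merged.withColumn("rn", F.row_number().over(w))
                       .filter("rn = 1")
                       .drop("rn"))

# stream_df.writeStream.foreachBatch(merge_batch).start()  # hypothetical stream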

JDBC to Spark Dataframe - How to ensure even partitioning?

Submitted by 岁酱吖の on 2020-08-24 08:16:23

Question: I am new to Spark and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc. I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions. The documentation seems to indicate that these fields are optional. What happens if I don't provide them? How does Spark know how to partition the queries? How efficient will that be? If I DO specify these options, how do I ensure that the
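For reference, the four partitioning options only take effect together. A hedged PySpark sketch showing both variants (the connection details and table name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:postgresql://db-host:5432/mydb"  # placeholder connection string
props = {"user": "reader", "password": "secret", "driver": "org.postgresql.Driver"}

# Without the partitioning options Spark issues a single query, so the whole
# table ends up in one partition and one task does all the reading.
single_partition_df = spark.read.jdbc(url, "my_table", properties=props)

# With all four options Spark splits [lowerBound, upperBound) into numPartitions
# equal strides on the partition column and runs one query per stride; partitions
# only come out even if the column's values are roughly uniform over that range.
partitioned_df = spark.read.jdbc(
    url, "my_table",
    column="id", lowerBound=1, upperBound=1_000_000, numPartitions=10,
    properties=props)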

Apache Spark Dataset API: head(n: Int) vs take(n: Int)

Submitted by 老子叫甜甜 on 2020-08-23 03:45:46

Question: The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int). The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?

Answer 1: I have experimented and found that head(n) and take(n) give exactly the same output. Both produce output in the form of Row objects. DF.head(2) [Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1
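For what it's worth, the same aliasing is visible from Python as well; a quick sketch with toy data, showing both calls returning the same list of Row objects:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# Both calls return a plain Python list of Row objects; in PySpark,
# DataFrame.head(n) simply delegates to DataFrame.take(n).
print(df.head(2))  # [Row(id=1, name='a'), Row(id=2, name='b')]
print(df.take(2))  # [Row(id=1, name='a'), Row(id=2, name='b')]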

Spark: Read an inputStream instead of File

Submitted by Deadly on 2020-08-22 09:27:20

Question: I'm using Spark SQL in a Java application to do some processing on CSV files, using Databricks' parser. The data I am processing comes from different sources (a remote URL, a local file, Google Cloud Storage), and I'm in the habit of turning everything into an InputStream so that I can parse and process data without knowing where it came from. All the documentation I've seen on Spark reads files from a path, e.g.

SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
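The question is cut off, but when the data is already behind an InputStream, one workaround is to read the stream yourself, parallelize the lines, and hand them to the CSV reader rather than a path. A PySpark sketch of that idea (the original is Java, where the analogous route would pass a Dataset of strings to the CSV reader); the URL is a placeholder, and this only suits data that fits in driver memory:

import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sandbox").master("local").getOrCreate()

# Placeholder source; any stream you have already opened would do.
url = "https://example.com/data.csv"
with urllib.request.urlopen(url) as stream:
    lines = stream.read().decode("utf-8").splitlines()

# Parallelize the in-memory lines; in PySpark, DataFrameReader.csv also
# accepts an RDD of strings holding CSV rows, not just a path.
rdd = spark.sparkContext.parallelize(lines)
df = spark.read.csv(rdd, header=True, inferSchema=True)
df.show()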