apache-spark-sql

Maximum number of concurrent tasks in 1 DPU in AWS Glue

Submitted by 亡梦爱人 on 2020-08-26 04:23:31

Question: A standard DPU in AWS Glue comes with 4 vCPUs and 2 executors. I am confused about the maximum number of concurrent tasks that can run in parallel with this configuration. Is it 4 or 8 on a single DPU with 4 vCPUs and 2 executors?

Answer 1: I had a similar discussion with the AWS Glue support team about this; I'll share what they told me about the Glue configuration. Take the Standard and the G1.X configurations as examples. Standard DPU configuration: 1 DPU reserved for the master node, 1 executor
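Whatever the exact Glue worker layout turns out to be, the Spark-level rule of thumb is that the ceiling on parallel tasks equals the number of executors times the cores per executor. A minimal sketch of that arithmetic, assuming (since the answer above is cut off) that each Standard executor is configured with 4 cores:

def max_concurrent_tasks(num_executors: int, cores_per_executor: int) -> int:
    # Each Spark task occupies one task slot (one core), so the ceiling on
    # tasks running in parallel is executors multiplied by cores per executor.
    return num_executors * cores_per_executor

# One Standard DPU provides 2 executors; 4 cores per executor is an assumption here.
print(max_concurrent_tasks(num_executors=2, cores_per_executor=4))  # 8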

Filter Pyspark Dataframe with udf on entire row

Submitted by 那年仲夏 on 2020-08-25 07:33:49

Question: Is there a way to select the entire row as a column to pass into a PySpark filter UDF? I have a complex filtering function my_filter that I want to apply to the entire DataFrame:

my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))

But col("*") throws an error because that's not a valid operation. I know that I can convert the DataFrame to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back
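A common workaround (not taken from the truncated post itself) is to bundle the whole row into a single struct column and hand that to the UDF, which then receives it as a Row-like object. A minimal PySpark sketch, with my_filter stubbed out as a hypothetical predicate:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

def my_filter(row):
    # Hypothetical row-level predicate; fields are accessed by name on the struct.
    return row.id > 1

my_filter_udf = udf(my_filter, BooleanType())

# struct(*df.columns) packs every column into one struct column,
# so the UDF sees the entire row as a single argument.
new_df = df.filter(my_filter_udf(struct(*df.columns)))
new_df.show()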

Joining Two Datasets with Predicate Pushdown

Submitted by China☆狼群 on 2020-08-25 04:04:09

Question: I have a Dataset that I created from an RDD and am trying to join it with another Dataset that is created from my Phoenix table:

val dfToJoin = sparkSession.createDataset(rddToJoin)
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")

When I execute it, it seems that the whole database table is loaded to do the join. Is there a way to do such a join so that the
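The post is truncated, but one way to avoid scanning the whole Phoenix table is to push the join keys down as a filter on the table read before joining, and to broadcast the small side. A PySpark sketch of that idea (the question's code is Scala; the table name, zkURL, and join column are taken from it, everything else is assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

# Small, locally built dataset standing in for the Dataset created from the RDD.
df_to_join = spark.createDataFrame([(1,), (2,), (3,)], ["columnToJoinOn"])

table_df = (spark.read
            .format("org.apache.phoenix.spark")
            .option("table", "table")
            .option("zkURL", "localhost")
            .load())

# Collect the (assumed small) set of join keys on the driver and push them
# into the scan as an IN filter; a source that supports predicate pushdown
# can then prune the table server-side instead of loading it whole.
keys = [r["columnToJoinOn"] for r in df_to_join.select("columnToJoinOn").distinct().collect()]
filtered = table_df.filter(col("columnToJoinOn").isin(keys))

# Broadcasting the small side keeps the join itself from shuffling the big table.
joined_df = filtered.join(broadcast(df_to_join), "columnToJoinOn")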

How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming

Submitted by 陌路散爱 on 2020-08-24 10:33:59

Question: I have a static DataFrame with millions of rows, as follows.

Static DataFrame:
--------------
|id|time_stamp|
--------------
|1 |1540527851|
|2 |1540525602|
|3 |1530529187|
|4 |1520529185|
|5 |1510529182|
|6 |1578945709|
--------------

Now in every batch, a streaming DataFrame is formed that contains the id and an updated time_stamp after some operations, like below.

In the first batch:
--------------
|id|time_stamp|
--------------
|1 |1540527888|
|2 |1540525999|
|3 |1530529784|
--------------

Now in every
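The question is cut off mid-sentence, but a common pattern for this kind of update is to merge each micro-batch into the static data inside foreachBatch, keeping the newest time_stamp per id. A rough PySpark sketch of that idea (schema and names assumed from the tables above, stream_df standing in for the streaming DataFrame):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Static side, assumed to be available as an ordinary DataFrame.
static_df = spark.createDataFrame(
    [(1, 1540527851), (2, 1540525602), (3, 1530529187)], ["id", "time_stamp"])

def merge_batch(batch_df, batch_id):
    # foreachBatch runs this function on the driver, so reassigning the
    # module-level static_df is possible; keep only the newest time_stamp per id.
    global static_df
    merged = static_df.unionByName(batch_df)
    w = Window.partitionBy("id").orderBy(F.col("time_stamp").desc())
    static_df = (merged.withColumn("rn", F.row_number().over(w))
                       .filter("rn = 1")
                       .drop("rn"))

# stream_df.writeStream.foreachBatch(merge_batch).start()  # hypothetical stream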

JDBC to Spark Dataframe - How to ensure even partitioning?

Submitted by 岁酱吖の on 2020-08-24 08:16:23

Question: I am new to Spark and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc. I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions. The documentation seems to indicate that these fields are optional. What happens if I don't provide them? How does Spark know how to partition the queries? How efficient will that be? If I DO specify these options, how do I ensure that the
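For reference, the four partitioning options only take effect together. A hedged PySpark sketch showing both variants (the connection details and table name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:postgresql://db-host:5432/mydb"  # placeholder connection string
props = {"user": "reader", "password": "secret", "driver": "org.postgresql.Driver"}

# Without the partitioning options Spark issues a single query, so the whole
# table ends up in one partition and one task does all the reading.
single_partition_df = spark.read.jdbc(url, "my_table", properties=props)

# With all four options Spark splits [lowerBound, upperBound) into numPartitions
# equal strides on the partition column and runs one query per stride; partitions
# only come out even if the column's values are roughly uniform over that range.
partitioned_df = spark.read.jdbc(
    url, "my_table",
    column="id", lowerBound=1, upperBound=1_000_000, numPartitions=10,
    properties=props)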

Apache Spark Dataset API: head(n: Int) vs take(n: Int)

Submitted by 老子叫甜甜 on 2020-08-23 03:45:46

Question: The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int). The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?

Answer 1: I have experimented and found that head(n) and take(n) give exactly the same output. Both produce output in the form of Row objects. DF.head(2) [Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1
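For what it's worth, the same aliasing is visible from Python as well; a quick sketch with toy data, showing both calls returning the same list of Row objects:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# Both calls return a plain Python list of Row objects; in PySpark,
# DataFrame.head(n) simply delegates to DataFrame.take(n).
print(df.head(2))  # [Row(id=1, name='a'), Row(id=2, name='b')]
print(df.take(2))  # [Row(id=1, name='a'), Row(id=2, name='b')]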

Spark: Read an inputStream instead of File

Submitted by Deadly on 2020-08-22 09:27:20

Question: I'm using Spark SQL in a Java application to do some processing on CSV files, using Databricks' parser. The data I am processing comes from different sources (a remote URL, a local file, Google Cloud Storage), and I'm in the habit of turning everything into an InputStream so that I can parse and process data without knowing where it came from. All the documentation I've seen on Spark reads files from a path, e.g.

SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
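The question is cut off, but when the data is already behind an InputStream, one workaround is to read the stream yourself, parallelize the lines, and hand them to the CSV reader rather than a path. A PySpark sketch of that idea (the original is Java, where the analogous route would pass a Dataset of strings to the CSV reader); the URL is a placeholder, and this only suits data that fits in driver memory:

import urllib.request
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sandbox").master("local").getOrCreate()

# Placeholder source; any stream you have already opened would do.
url = "https://example.com/data.csv"
with urllib.request.urlopen(url) as stream:
    lines = stream.read().decode("utf-8").splitlines()

# Parallelize the in-memory lines; in PySpark, DataFrameReader.csv also
# accepts an RDD of strings holding CSV rows, not just a path.
rdd = spark.sparkContext.parallelize(lines)
df = spark.read.csv(rdd, header=True, inferSchema=True)
df.show()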