pyspark

Pivot row to column level

Submitted by 独自空忆成欢 on 2021-02-04 21:46:27
Question: I have a Spark dataframe t which is the result of a spark.sql("...") query. Here are the first few rows from t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status      | has_policy | count |
|------------|------|-------------|---------|-------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact  | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact  | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented | 1          | 64    |
| 2020-08-18 | 2    | second_name |
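A minimal sketch of one common approach, assuming the goal is to turn the has_policy values into separate count columns; the output column names are illustrative, not from the question:

```python
from pyspark.sql import functions as F

# Pivot has_policy (0/1) into separate columns, summing the pre-aggregated counts.
# Assumes t is the dataframe shown above.
pivoted = (
    t.groupBy("yyyy_mm_dd", "x_id", "x_name", "b_app", "status")
     .pivot("has_policy", [0, 1])
     .agg(F.sum("count"))
     .withColumnRenamed("0", "has_policy_0")
     .withColumnRenamed("1", "has_policy_1")
)
pivoted.show()
```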

PySpark: create dataframe from random uniform distribution

Submitted by 笑着哭i on 2021-02-04 19:08:37
Question: I am trying to create a dataframe using a random uniform distribution in Spark. I couldn't find anything on how to create a dataframe directly, but when I read the documentation I found that pyspark.mllib.random has a RandomRDDs object with a uniformRDD method that can create RDDs from a random uniform distribution. The problem is that it doesn't create two-dimensional RDDs. Is there a way I can create a two-dimensional RDD or (preferably) a dataframe? I can create a few rdds and use them to create
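A minimal sketch of one way to do this, using RandomRDDs.uniformVectorRDD (which does produce a two-dimensional random uniform dataset as an RDD of vectors) and converting it to a dataframe; the row/column counts and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs

spark = SparkSession.builder.appName("uniform-df").getOrCreate()
sc = spark.sparkContext

num_rows, num_cols = 100, 4

# uniformVectorRDD yields an RDD of vectors, i.e. a two-dimensional
# random uniform dataset; unpack each vector into a tuple of floats.
rdd = RandomRDDs.uniformVectorRDD(sc, num_rows, num_cols)
df = rdd.map(lambda v: tuple(float(x) for x in v)).toDF(
    ["c{}".format(i) for i in range(num_cols)]
)
df.show(5)
```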

How to solve pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing spark's memory?

Submitted by 半腔热情 on 2021-02-04 18:59:28
Question: I'm running a job in pyspark where at one point I use a grouped aggregate Pandas UDF. This results in the following (here abbreviated) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer. I'm fairly sure this is because one of the groups the pandas UDF receives is huge, and if I reduce the dataset and remove enough rows I can run my UDF with no problems. However, I want to run with my original dataset, and even if I run this spark job on a machine with
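A sketch of the memory-related settings that are usually raised first in this situation. Note that the Arrow error is about a single buffer growing too large, so whether more memory helps depends on the group sizes; splitting the offending group (for example by salting the group key) may be needed in addition to, or instead of, more memory. The values below are illustrative:

```python
from pyspark.sql import SparkSession

# Memory-related settings commonly increased for large pandas UDF jobs.
# Each group is shipped to a grouped UDF as a single Arrow batch, so very
# large groups can still overflow a buffer regardless of these settings.
spark = (
    SparkSession.builder
    .appName("big-grouped-udf")
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)
```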

Spark Structured Streaming with Confluent Cloud Kafka connectivity issue

Submitted by 落爺英雄遲暮 on 2021-02-04 16:41:16
Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the Spark readStream() function is too shallow and doesn't say much about the optional parameters, especially the auth mechanism part. I am not sure which parameter is wrong and breaking the connectivity. Can anyone with experience in Spark help me get this connection started? Required parameters: Consumer({'bootstrap.servers': 'cluster.gcp.confluent.cloud:9092
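A sketch of the auth-related options typically needed for Confluent Cloud (SASL_SSL with the PLAIN mechanism), passed to the Kafka source with the kafka. prefix. The topic name and the API key/secret placeholders are illustrative; the broker address is taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("confluent-kafka-stream").getOrCreate()

# JAAS config carrying the Confluent Cloud API key/secret (placeholders).
jaas = (
    'org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="<API_KEY>" password="<API_SECRET>";'
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .option("subscribe", "<topic_name>")
    .option("startingOffsets", "latest")
    .load()
)
```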

Flatten Nested Struct in PySpark Array

Submitted by 跟風遠走 on 2021-02-04 16:37:26
Question: Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

How can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisor.* and then
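A minimal sketch of one way to do this without exploding the array, using the transform higher-order function (Spark 2.4+) to rebuild each struct element; df and the field names are assumed to match the schema above:

```python
from pyspark.sql import functions as F

# Rebuild each array element so the advisors struct is flattened into the
# element itself, keeping the array intact.
flattened = df.withColumn(
    "degrees",
    F.expr(
        "transform(degrees, d -> struct("
        "  d.school as school,"
        "  d.advisors.advisor1 as advisor1,"
        "  d.advisors.advisor2 as advisor2))"
    ),
)
flattened.printSchema()
```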

Cassandra Sink for PySpark Structured Streaming from Kafka topic

Submitted by 久未见 on 2021-02-04 16:34:14
Question: I want to write Structured Streaming data into Cassandra using the PySpark Structured Streaming API. My data flow is as below:

REST API -> Kafka -> Spark Structured Streaming (PySpark) -> Cassandra

Source and versions:

Spark version: 2.4.3
DataStax DSE: 6.7.6-1

Initialize Spark:

spark = SparkSession.builder\
    .master("local[*]")\
    .appName("Analytics")\
    .config("kafka.bootstrap.servers", "localhost:9092")\
    .config("spark.cassandra.connection.host", "localhost:9042")\
    .getOrCreate()
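One common pattern for Spark 2.4 is to write each micro-batch with foreachBatch through the DataStax spark-cassandra-connector. A sketch under that assumption; the keyspace/table names, the checkpoint path, and parsed_stream (the dataframe built from the Kafka source) are illustrative:

```python
# Write each micro-batch to Cassandra via the spark-cassandra-connector.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="analytics", table="events")
        .save())

query = (
    parsed_stream.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/cassandra_sink")
    .start()
)
query.awaitTermination()
```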