pyspark

Pivot row to column level

Submitted by 独自空忆成欢 on 2021-02-04 21:46:27
Question: I have a Spark dataframe t which is the result of a spark.sql("...") query. Here are the first few rows from t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status      | has_policy | count |
|------------|------|-------------|---------|-------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact  | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact  | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented | 1          | 64    |
| 2020-08-18 | 2    | second_name |
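A minimal sketch of one common approach, assuming the goal is to turn the has_policy values into separate count columns; the output column names are illustrative, not from the question:

```python
from pyspark.sql import functions as F

# Pivot has_policy (0/1) into separate columns, summing the pre-aggregated counts.
# Assumes t is the dataframe shown above.
pivoted = (
    t.groupBy("yyyy_mm_dd", "x_id", "x_name", "b_app", "status")
     .pivot("has_policy", [0, 1])
     .agg(F.sum("count"))
     .withColumnRenamed("0", "has_policy_0")
     .withColumnRenamed("1", "has_policy_1")
)
pivoted.show()
```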

PySpark: create dataframe from random uniform distribution

Submitted by 笑着哭i on 2021-02-04 19:08:37
Question: I am trying to create a dataframe using a random uniform distribution in Spark. I couldn't find anything on how to create a dataframe directly, but when I read the documentation I found that pyspark.mllib.random has a RandomRDDs object with a uniformRDD method that can create RDDs from a random uniform distribution. The problem is that it doesn't create two-dimensional RDDs. Is there a way I can create a two-dimensional RDD or (preferably) a dataframe? I can create a few rdds and use them to create
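A minimal sketch of one way to do this, using RandomRDDs.uniformVectorRDD (which does produce a two-dimensional random uniform dataset as an RDD of vectors) and converting it to a dataframe; the row/column counts and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs

spark = SparkSession.builder.appName("uniform-df").getOrCreate()
sc = spark.sparkContext

num_rows, num_cols = 100, 4

# uniformVectorRDD yields an RDD of vectors, i.e. a two-dimensional
# random uniform dataset; unpack each vector into a tuple of floats.
rdd = RandomRDDs.uniformVectorRDD(sc, num_rows, num_cols)
df = rdd.map(lambda v: tuple(float(x) for x in v)).toDF(
    ["c{}".format(i) for i in range(num_cols)]
)
df.show(5)
```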

How to solve pyspark `org.apache.arrow.vector.util.OversizedAllocationException` error by increasing spark's memory?

Submitted by 半腔热情 on 2021-02-04 18:59:28
Question: I'm running a job in pyspark where at one point I use a grouped aggregate Pandas UDF. This results in the following (here abbreviated) error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer. I'm fairly sure this is because one of the groups the pandas UDF receives is huge, and if I reduce the dataset and remove enough rows I can run my UDF with no problems. However, I want to run with my original dataset, and even if I run this spark job on a machine with
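A sketch of the memory-related settings that are usually raised first in this situation. Note that the Arrow error is about a single buffer growing too large, so whether more memory helps depends on the group sizes; splitting the offending group (for example by salting the group key) may be needed in addition to, or instead of, more memory. The values below are illustrative:

```python
from pyspark.sql import SparkSession

# Memory-related settings commonly increased for large pandas UDF jobs.
# Each group is shipped to a grouped UDF as a single Arrow batch, so very
# large groups can still overflow a buffer regardless of these settings.
spark = (
    SparkSession.builder
    .appName("big-grouped-udf")
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)
```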

Spark Structured Streaming with Confluent Cloud Kafka connectivity issue

Submitted by 落爺英雄遲暮 on 2021-02-04 16:41:16
Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the Spark readStream() function is too shallow and doesn't say much about the optional parameters, especially the auth mechanism part. I am not sure which parameter is wrong and breaking the connectivity. Can anyone with experience in Spark help me get this connection started? Required parameters: Consumer({'bootstrap.servers': 'cluster.gcp.confluent.cloud:9092
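A sketch of the auth-related options typically needed for Confluent Cloud (SASL_SSL with the PLAIN mechanism), passed to the Kafka source with the kafka. prefix. The topic name and the API key/secret placeholders are illustrative; the broker address is taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("confluent-kafka-stream").getOrCreate()

# JAAS config carrying the Confluent Cloud API key/secret (placeholders).
jaas = (
    'org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="<API_KEY>" password="<API_SECRET>";'
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .option("subscribe", "<topic_name>")
    .option("startingOffsets", "latest")
    .load()
)
```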

Flatten Nested Struct in PySpark Array

Submitted by 跟風遠走 on 2021-02-04 16:37:26
Question: Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

How can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisor.* and then
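A minimal sketch of one way to do this without exploding the array, using the transform higher-order function (Spark 2.4+) to rebuild each struct element; df and the field names are assumed to match the schema above:

```python
from pyspark.sql import functions as F

# Rebuild each array element so the advisors struct is flattened into the
# element itself, keeping the array intact.
flattened = df.withColumn(
    "degrees",
    F.expr(
        "transform(degrees, d -> struct("
        "  d.school as school,"
        "  d.advisors.advisor1 as advisor1,"
        "  d.advisors.advisor2 as advisor2))"
    ),
)
flattened.printSchema()
```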

Cassandra Sink for PySpark Structured Streaming from Kafka topic

Submitted by 久未见 on 2021-02-04 16:34:14
Question: I want to write Structured Streaming data into Cassandra using the PySpark Structured Streaming API. My data flow is as below:

REST API -> Kafka -> Spark Structured Streaming (PySpark) -> Cassandra

Source and versions:

Spark version: 2.4.3
DataStax DSE: 6.7.6-1

Initialize Spark:

spark = SparkSession.builder\
    .master("local[*]")\
    .appName("Analytics")\
    .config("kafka.bootstrap.servers", "localhost:9092")\
    .config("spark.cassandra.connection.host", "localhost:9042")\
    .getOrCreate()
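One common pattern for Spark 2.4 is to write each micro-batch with foreachBatch through the DataStax spark-cassandra-connector. A sketch under that assumption; the keyspace/table names, the checkpoint path, and parsed_stream (the dataframe built from the Kafka source) are illustrative:

```python
# Write each micro-batch to Cassandra via the spark-cassandra-connector.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="analytics", table="events")
        .save())

query = (
    parsed_stream.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/cassandra_sink")
    .start()
)
query.awaitTermination()
```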