pyspark

PySpark OutOfMemoryErrors when performing many dataframe joins

╄→гoц情女王★ Submitted on 2020-01-15 02:38:08
Question: There are many posts about this issue, but none have answered my question. I'm running into OutOfMemoryErrors in PySpark while attempting to join many different dataframes together. My local machine has 16GB of memory, and I've set my Spark configuration as follows:

    class SparkRawConsumer:
        def __init__(self, filename, reference_date, FILM_DATA):
            self.sparkContext = SparkContext(master='local[*]', appName='my_app')
            SparkContext.setSystemProperty('spark.executor.memory', '3g')
            SparkContext
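A minimal sketch of how local-mode memory is usually configured up front (the 8g value, partition count, and app name are illustrative, not from the question): in local[*] the executors live inside the driver JVM, so spark.driver.memory is the setting that matters, and it has to be supplied before the context starts rather than via setSystemProperty afterwards.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Memory settings only take effect if they are in place before the JVM
    # starts; changing spark.executor.memory after the context exists does nothing.
    conf = (SparkConf()
            .setMaster('local[*]')
            .setAppName('my_app')
            .set('spark.driver.memory', '8g')            # local mode: the driver JVM does the work
            .set('spark.sql.shuffle.partitions', '64'))  # fewer, larger partitions on a single box

    spark = SparkSession.builder.config(conf=conf).getOrCreate()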

Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts

不打扰是莪最后的温柔 Submitted on 2020-01-14 19:26:27
Question: I am using Spark Streaming on Google Cloud Dataproc to execute a framework (written in Python) which consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via 2 clusters, one with 3 worker nodes and one with 4. Running this Spark Streaming framework on top of Dataproc has been fairly stable until the beginning
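For context, a skeleton of the kind of pipeline described above, written as a Structured Streaming sketch (the broker address, topic name, and the console stand-in for the Bigtable write are assumptions; the actual framework may use the older DStream API, and the spark-sql-kafka package must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('kafka_to_bigtable').getOrCreate()

    # Read one Kafka topic as a stream (broker and topic names are placeholders).
    events = (spark.readStream
              .format('kafka')
              .option('kafka.bootstrap.servers', 'broker-1:9092')
              .option('subscribe', 'events')
              .load()
              .selectExpr('CAST(value AS STRING) AS payload'))

    def write_batch(batch_df, batch_id):
        # Stand-in for the Bigtable sink; the real job would use the
        # HBase/Bigtable connector configured on the Dataproc cluster.
        batch_df.show(5, truncate=False)

    # foreachBatch requires Spark >= 2.4.
    query = events.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()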

Normalize values of multiple columns in Spark DataFrame, using only DataFrame API

前提是你 Submitted on 2020-01-14 17:15:29
Question: I am trying to normalize the values of multiple columns in a Spark dataframe, by subtracting the mean and dividing by the stddev of each column. Here's the code I have so far:

    from pyspark.sql import Row
    from pyspark.sql.functions import stddev_pop, avg

    df = spark.createDataFrame([Row(A=1, B=6), Row(A=2, B=7), Row(A=3, B=8),
                                Row(A=4, B=9), Row(A=5, B=10)])
    exprs = [x - (avg(x)) / stddev_pop(x) for x in df.columns]
    df.select(exprs).show()

Which gives me the result: +---------------------------
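One DataFrame-only way to do this (a sketch, not necessarily the accepted answer): collect the per-column statistics with a single agg, then apply them row-wise, which avoids mixing aggregate and non-aggregate expressions in one select.

    from pyspark.sql import Row
    from pyspark.sql.functions import avg, col, stddev_pop

    df = spark.createDataFrame([Row(A=1, B=6), Row(A=2, B=7), Row(A=3, B=8),
                                Row(A=4, B=9), Row(A=5, B=10)])

    # One pass to get the mean and population stddev of every column.
    stats = df.agg(*([avg(c).alias(c + '_avg') for c in df.columns] +
                     [stddev_pop(c).alias(c + '_std') for c in df.columns])).first()

    # Subtract the mean and divide by the stddev, column by column.
    normalized = df.select(*[((col(c) - stats[c + '_avg']) / stats[c + '_std']).alias(c)
                             for c in df.columns])
    normalized.show()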

How do I get the last item from a list using pyspark?

ε祈祈猫儿з Submitted on 2020-01-14 08:57:29
Question: Why does column 1st_from_end contain null?

    from pyspark.sql.functions import split
    df = sqlContext.createDataFrame([('a b c d',)], ['s',])
    df.select( split(df.s, ' ')[0].alias('0th'),
               split(df.s, ' ')[3].alias('3rd'),
               split(df.s, ' ')[-1].alias('1st_from_end')
             ).show()

I thought using [-1] was a Pythonic way to get the last item in a list. How come it doesn't work in PySpark?

Answer 1: If you're using Spark >= 2.4.0 see jxc's answer below. In Spark < 2.4.0, the dataframes API didn't support -1
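A sketch of the two usual workarounds, assuming a SparkSession named spark: element_at on Spark >= 2.4 (it is 1-based and accepts negative indices), or a SQL expression with size() on older versions.

    from pyspark.sql.functions import element_at, expr, split

    df = spark.createDataFrame([('a b c d',)], ['s'])

    df.select(
        split(df.s, ' ')[0].alias('0th'),
        # Spark >= 2.4: element_at is 1-based and supports negative indexing.
        element_at(split(df.s, ' '), -1).alias('1st_from_end'),
        # Older Spark: index the array with size(...) - 1 via a SQL expression.
        expr("split(s, ' ')[size(split(s, ' ')) - 1]").alias('1st_from_end_legacy'),
    ).show()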

Implausible Spark dataframe after reading ORC files from HDFS

血红的双手。 Submitted on 2020-01-14 08:01:08
Question: I have a problem using Spark 2.1.1 and Hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files) and everything works as expected:

    from pyspark.sql import SparkSession
    spark = SparkSession\
        .builder\
        .master('yarn')\
        .appName('localTest')\
        .getOrCreate()
    data = spark.read.format('orc').load('mydata/*.orc')
    data.select('colname').na.drop().describe(['colname']).show()

    +-------+------------------+
    |summary|           colname|
    +-------+------------------+
    |  count|
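Questions in this class often come down to where the path resolves once the job actually runs on YARN; a sketch using a fully qualified HDFS URI (the namenode host, port, and path are placeholders, not taken from the question):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('yarn')
             .appName('orcReadCheck')
             .getOrCreate())

    # A fully qualified URI removes any ambiguity between the local
    # filesystem and HDFS when the job runs on the cluster.
    data = spark.read.orc('hdfs://namenode:8020/user/me/mydata/*.orc')

    data.select('colname').na.drop().describe('colname').show()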

Multiplying two columns in a pyspark dataframe. One of them contains a vector and one of them contains a constant

痞子三分冷 Submitted on 2020-01-14 07:01:32
Question: I have a PySpark dataframe which has one column with vector values and one column with constant numerical values. For example:

    A | B
    1 | [2,4,5]
    5 | [6,5,3]

I want to multiply the vector column by the constant column. I'm trying to do this because I have word embeddings in the B column and some weights in the A column, and my final goal is to get weighted embeddings.

Answer 1: If your vector data is stored as an array of doubles, you can do this:

    import breeze.linalg.{Vector => BV}
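A PySpark alternative for the case where B is stored as array<double> rather than an MLlib Vector (an assumption; the data mirrors the example above): scale each element with a UDF.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    df = spark.createDataFrame([(1.0, [2.0, 4.0, 5.0]),
                                (5.0, [6.0, 5.0, 3.0])], ['A', 'B'])

    # Multiply every element of the embedding array B by the scalar weight A.
    scale = udf(lambda a, vec: [float(a) * x for x in vec], ArrayType(DoubleType()))

    df.withColumn('weighted', scale('A', 'B')).show(truncate=False)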

unzip list of tuples in pyspark dataframe

与世无争的帅哥 Submitted on 2020-01-14 06:52:24
Question: I want to unzip a list of tuples in a column of a PySpark dataframe. Say a column contains [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, with the first column as [blue, red, green] and the second column as [0.5, 0.1, 0.7].

    +-----+-------------------------------------------+
    |Topic|                                     Tokens|
    +-----+-------------------------------------------+
    |    1| ('blue', 0.5),('red', 0.1),('green', 0.7)|
    |    2| ('red', 0.9),('cyan', 0.5),('white', 0.4)|
    +-----+-----------------------------------
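Assuming Tokens is an array of (word, weight) structs (which is what a list of Python tuples becomes, with default field names _1 and _2), the struct fields can be pulled out into two array columns directly; a minimal sketch:

    from pyspark.sql.functions import col

    df = spark.createDataFrame(
        [(1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
         (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])],
        ['Topic', 'Tokens'])

    # Selecting a struct field through an array column yields an array of
    # that field's values, i.e. the "unzipped" halves.
    df.select('Topic',
              col('Tokens._1').alias('words'),
              col('Tokens._2').alias('weights')).show(truncate=False)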

Generate synthetic keys to map a many-to-many relationship

淺唱寂寞╮ Submitted on 2020-01-14 06:34:28
Question: I am trying to create a unique synthetic key after identifying relationships between the original keys.

My DataFrame:

    Key  Value
    K1   1
    K2   2
    K2   3
    K1   3
    K2   4
    K1   5
    K3   6
    K4   6
    K5   7

Expected result:

    Key  Value  New_Key
    K1   1      NK1
    K2   2      NK1
    K2   3      NK1
    K1   3      NK1
    K2   4      NK1
    K1   5      NK1
    K2   6      NK2
    K3   6      NK2
    K4   7      NK3

I look forward to a response in Python 3.0 or PySpark. I tried it with this code:

    # Import libraries
    import networkx as nx
    import pandas as pd
    # Create DF
    d1=pd.DataFrame({'Key','Value'})
    # Create empty graph
    G=nx
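A minimal pandas/networkx sketch of the connected-components idea the attempt above is heading towards (the data is copied from the question; the NK numbering depends on component iteration order, so labels may come out in a different order than the expected output):

    import networkx as nx
    import pandas as pd

    df = pd.DataFrame({'Key':   ['K1', 'K2', 'K2', 'K1', 'K2', 'K1', 'K3', 'K4', 'K5'],
                       'Value': [1, 2, 3, 3, 4, 5, 6, 6, 7]})

    # Link each Key to each Value it occurs with; keys that share any value
    # end up in the same connected component.
    G = nx.Graph()
    G.add_edges_from((k, 'v_%s' % v) for k, v in zip(df['Key'], df['Value']))

    new_key = {}
    for i, component in enumerate(nx.connected_components(G), start=1):
        for node in component:
            if not str(node).startswith('v_'):
                new_key[node] = 'NK%d' % i

    df['New_Key'] = df['Key'].map(new_key)
    print(df)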

How to assign the access control list (ACL) when writing a CSV file to AWS in pyspark (2.2.0)?

余生颓废 Submitted on 2020-01-14 06:27:27
Question: I know I can output my Spark dataframe to AWS S3 as a CSV file with

    df.repartition(1).write.csv('s3://my-bucket-name/df_name')

My question is: is there an easy way to set the Access Control List (ACL) of this file to 'bucket-owner-full-control' when writing it to S3 using PySpark?

Answer 1: I don't know about the EMR S3 connector; in the ASF S3A connector you set the option fs.s3a.acl.default when you open the connection: you can't set it on a file-by-file basis.

Answer 2: Access Control List (ACL) can
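A sketch of the S3A approach from answer 1 (the bucket name is the question's placeholder; EMR's own s3:// connector uses a different mechanism that would need to be checked separately): set fs.s3a.acl.default on the Hadoop configuration before any S3A connection is opened, then write with an s3a:// URI.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('s3a_acl_write').getOrCreate()

    # The canned ACL is a filesystem-level option in the S3A connector;
    # it applies to everything written through it, not per file.
    spark.sparkContext._jsc.hadoopConfiguration().set(
        'fs.s3a.acl.default', 'BucketOwnerFullControl')

    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
    df.repartition(1).write.csv('s3a://my-bucket-name/df_name')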