pyspark

PySpark OutOfMemoryErrors when performing many dataframe joins

╄→гoц情女王★ Submitted on 2020-01-15 02:38:08
Question: There are many posts about this issue, but none have answered my question. I'm running into OutOfMemoryErrors in PySpark while attempting to join many different dataframes together. My local machine has 16GB of memory, and I've set my Spark configuration as follows:

    class SparkRawConsumer:
        def __init__(self, filename, reference_date, FILM_DATA):
            self.sparkContext = SparkContext(master='local[*]', appName='my_app')
            SparkContext.setSystemProperty('spark.executor.memory', '3g')
            SparkContext
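A minimal sketch of how local-mode memory is usually configured up front (the 8g value, partition count, and app name are illustrative, not from the question): in local[*] the executors live inside the driver JVM, so spark.driver.memory is the setting that matters, and it has to be supplied before the context starts rather than via setSystemProperty afterwards.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Memory settings only take effect if they are in place before the JVM
    # starts; changing spark.executor.memory after the context exists does nothing.
    conf = (SparkConf()
            .setMaster('local[*]')
            .setAppName('my_app')
            .set('spark.driver.memory', '8g')            # local mode: the driver JVM does the work
            .set('spark.sql.shuffle.partitions', '64'))  # fewer, larger partitions on a single box

    spark = SparkSession.builder.config(conf=conf).getOrCreate()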

Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts

不打扰是莪最后的温柔 Submitted on 2020-01-14 19:26:27
Question: I am using Spark Streaming on Google Cloud Dataproc to execute a framework (written in Python) which consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via 2 clusters, one with 3 worker nodes and one with 4. Running this Spark Streaming framework on top of Dataproc has been fairly stable until the beginning
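For context, a skeleton of the kind of pipeline described above, written as a Structured Streaming sketch (the broker address, topic name, and the console stand-in for the Bigtable write are assumptions; the actual framework may use the older DStream API, and the spark-sql-kafka package must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('kafka_to_bigtable').getOrCreate()

    # Read one Kafka topic as a stream (broker and topic names are placeholders).
    events = (spark.readStream
              .format('kafka')
              .option('kafka.bootstrap.servers', 'broker-1:9092')
              .option('subscribe', 'events')
              .load()
              .selectExpr('CAST(value AS STRING) AS payload'))

    def write_batch(batch_df, batch_id):
        # Stand-in for the Bigtable sink; the real job would use the
        # HBase/Bigtable connector configured on the Dataproc cluster.
        batch_df.show(5, truncate=False)

    # foreachBatch requires Spark >= 2.4.
    query = events.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()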

Normalize values of multiple columns in Spark DataFrame, using only DataFrame API

前提是你 Submitted on 2020-01-14 17:15:29
Question: I am trying to normalize the values of multiple columns in a Spark dataframe, by subtracting the mean and dividing by the stddev of each column. Here's the code I have so far:

    from pyspark.sql import Row
    from pyspark.sql.functions import stddev_pop, avg

    df = spark.createDataFrame([Row(A=1, B=6), Row(A=2, B=7), Row(A=3, B=8),
                                Row(A=4, B=9), Row(A=5, B=10)])
    exprs = [x - (avg(x)) / stddev_pop(x) for x in df.columns]
    df.select(exprs).show()

Which gives me the result: +---------------------------
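One DataFrame-only way to do this (a sketch, not necessarily the accepted answer): collect the per-column statistics with a single agg, then apply them row-wise, which avoids mixing aggregate and non-aggregate expressions in one select.

    from pyspark.sql import Row
    from pyspark.sql.functions import avg, col, stddev_pop

    df = spark.createDataFrame([Row(A=1, B=6), Row(A=2, B=7), Row(A=3, B=8),
                                Row(A=4, B=9), Row(A=5, B=10)])

    # One pass to get the mean and population stddev of every column.
    stats = df.agg(*([avg(c).alias(c + '_avg') for c in df.columns] +
                     [stddev_pop(c).alias(c + '_std') for c in df.columns])).first()

    # Subtract the mean and divide by the stddev, column by column.
    normalized = df.select(*[((col(c) - stats[c + '_avg']) / stats[c + '_std']).alias(c)
                             for c in df.columns])
    normalized.show()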

How do I get the last item from a list using pyspark?

ε祈祈猫儿з Submitted on 2020-01-14 08:57:29
Question: Why does column 1st_from_end contain null?

    from pyspark.sql.functions import split
    df = sqlContext.createDataFrame([('a b c d',)], ['s',])
    df.select( split(df.s, ' ')[0].alias('0th'),
               split(df.s, ' ')[3].alias('3rd'),
               split(df.s, ' ')[-1].alias('1st_from_end')
             ).show()

I thought using [-1] was a Pythonic way to get the last item in a list. How come it doesn't work in PySpark?

Answer 1: If you're using Spark >= 2.4.0 see jxc's answer below. In Spark < 2.4.0, the dataframes API didn't support -1
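A sketch of the two usual workarounds, assuming a SparkSession named spark: element_at on Spark >= 2.4 (it is 1-based and accepts negative indices), or a SQL expression with size() on older versions.

    from pyspark.sql.functions import element_at, expr, split

    df = spark.createDataFrame([('a b c d',)], ['s'])

    df.select(
        split(df.s, ' ')[0].alias('0th'),
        # Spark >= 2.4: element_at is 1-based and supports negative indexing.
        element_at(split(df.s, ' '), -1).alias('1st_from_end'),
        # Older Spark: index the array with size(...) - 1 via a SQL expression.
        expr("split(s, ' ')[size(split(s, ' ')) - 1]").alias('1st_from_end_legacy'),
    ).show()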

Implausible Spark dataframe after reading ORC files from HDFS

血红的双手。 Submitted on 2020-01-14 08:01:08
Question: I have a problem using Spark 2.1.1 and Hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files) and everything works as expected:

    from pyspark.sql import SparkSession
    spark = SparkSession\
        .builder\
        .master('yarn')\
        .appName('localTest')\
        .getOrCreate()
    data = spark.read.format('orc').load('mydata/*.orc')
    data.select('colname').na.drop().describe(['colname']).show()

    +-------+------------------+
    |summary|           colname|
    +-------+------------------+
    |  count|
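Questions in this class often come down to where the path resolves once the job actually runs on YARN; a sketch using a fully qualified HDFS URI (the namenode host, port, and path are placeholders, not taken from the question):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('yarn')
             .appName('orcReadCheck')
             .getOrCreate())

    # A fully qualified URI removes any ambiguity between the local
    # filesystem and HDFS when the job runs on the cluster.
    data = spark.read.orc('hdfs://namenode:8020/user/me/mydata/*.orc')

    data.select('colname').na.drop().describe('colname').show()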

Multiplying two columns in a pyspark dataframe. One of them contains a vector and one of them contains a constant

痞子三分冷 Submitted on 2020-01-14 07:01:32
Question: I have a PySpark dataframe which has one column with vector values and one column with constant numerical values. For example:

    A | B
    1 | [2,4,5]
    5 | [6,5,3]

I want to multiply the vector column by the constant column. I'm trying to do this because I have word embeddings in the B column and some weights in the A column, and my final goal is to get weighted embeddings.

Answer 1: If your vector data is stored as an array of doubles, you can do this:

    import breeze.linalg.{Vector => BV}
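A PySpark alternative for the case where B is stored as array<double> rather than an MLlib Vector (an assumption; the data mirrors the example above): scale each element with a UDF.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    df = spark.createDataFrame([(1.0, [2.0, 4.0, 5.0]),
                                (5.0, [6.0, 5.0, 3.0])], ['A', 'B'])

    # Multiply every element of the embedding array B by the scalar weight A.
    scale = udf(lambda a, vec: [float(a) * x for x in vec], ArrayType(DoubleType()))

    df.withColumn('weighted', scale('A', 'B')).show(truncate=False)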

unzip list of tuples in pyspark dataframe

与世无争的帅哥 Submitted on 2020-01-14 06:52:24
Question: I want to unzip a list of tuples in a column of a PySpark dataframe. Say a column contains [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, with the first column as [blue, red, green] and the second column as [0.5, 0.1, 0.7].

    +-----+-------------------------------------------+
    |Topic|                                     Tokens|
    +-----+-------------------------------------------+
    |    1| ('blue', 0.5),('red', 0.1),('green', 0.7)|
    |    2| ('red', 0.9),('cyan', 0.5),('white', 0.4)|
    +-----+-----------------------------------
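Assuming Tokens is an array of (word, weight) structs (which is what a list of Python tuples becomes, with default field names _1 and _2), the struct fields can be pulled out into two array columns directly; a minimal sketch:

    from pyspark.sql.functions import col

    df = spark.createDataFrame(
        [(1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
         (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])],
        ['Topic', 'Tokens'])

    # Selecting a struct field through an array column yields an array of
    # that field's values, i.e. the "unzipped" halves.
    df.select('Topic',
              col('Tokens._1').alias('words'),
              col('Tokens._2').alias('weights')).show(truncate=False)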

Generate synthetic keys to map a many-to-many relationship

淺唱寂寞╮ Submitted on 2020-01-14 06:34:28
Question: I am trying to create a unique synthetic key after identifying relationships between the original keys.

My DataFrame:

    Key  Value
    K1   1
    K2   2
    K2   3
    K1   3
    K2   4
    K1   5
    K3   6
    K4   6
    K5   7

Expected result:

    Key  Value  New_Key
    K1   1      NK1
    K2   2      NK1
    K2   3      NK1
    K1   3      NK1
    K2   4      NK1
    K1   5      NK1
    K2   6      NK2
    K3   6      NK2
    K4   7      NK3

I look forward to a response in Python 3.0 or PySpark. I tried it with this code:

    # Import libraries
    import networkx as nx
    import pandas as pd
    # Create DF
    d1=pd.DataFrame({'Key','Value'})
    # Create empty graph
    G=nx
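A minimal pandas/networkx sketch of the connected-components idea the attempt above is heading towards (the data is copied from the question; the NK numbering depends on component iteration order, so labels may come out in a different order than the expected output):

    import networkx as nx
    import pandas as pd

    df = pd.DataFrame({'Key':   ['K1', 'K2', 'K2', 'K1', 'K2', 'K1', 'K3', 'K4', 'K5'],
                       'Value': [1, 2, 3, 3, 4, 5, 6, 6, 7]})

    # Link each Key to each Value it occurs with; keys that share any value
    # end up in the same connected component.
    G = nx.Graph()
    G.add_edges_from((k, 'v_%s' % v) for k, v in zip(df['Key'], df['Value']))

    new_key = {}
    for i, component in enumerate(nx.connected_components(G), start=1):
        for node in component:
            if not str(node).startswith('v_'):
                new_key[node] = 'NK%d' % i

    df['New_Key'] = df['Key'].map(new_key)
    print(df)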

How to assign the access control list (ACL) when writing a CSV file to AWS in pyspark (2.2.0)?

余生颓废 Submitted on 2020-01-14 06:27:27
Question: I know I can output my Spark dataframe to AWS S3 as a CSV file with

    df.repartition(1).write.csv('s3://my-bucket-name/df_name')

My question is: is there an easy way to set the Access Control List (ACL) of this file to 'bucket-owner-full-control' when writing it to S3 using PySpark?

Answer 1: I don't know about the EMR S3 connector; in the ASF S3A connector you set the option fs.s3a.acl.default when you open the connection: you can't set it on a file-by-file basis.

Answer 2: Access Control List (ACL) can
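A sketch of the S3A approach from answer 1 (the bucket name is the question's placeholder; EMR's own s3:// connector uses a different mechanism that would need to be checked separately): set fs.s3a.acl.default on the Hadoop configuration before any S3A connection is opened, then write with an s3a:// URI.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('s3a_acl_write').getOrCreate()

    # The canned ACL is a filesystem-level option in the S3A connector;
    # it applies to everything written through it, not per file.
    spark.sparkContext._jsc.hadoopConfiguration().set(
        'fs.s3a.acl.default', 'BucketOwnerFullControl')

    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
    df.repartition(1).write.csv('s3a://my-bucket-name/df_name')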