pyspark

Errors for block matrix multiplication in Spark

Submitted by 旧巷老猫 on 2019-12-24 20:30:29
Question: I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to compute cmat.T * cmat. I first converted cmat to a block matrix bmat:

    bmat = cmat.toBlockMatrix(1000, 1000)

However, I got an error when performing multiply():

    mtm = bmat.transpose.multiply(bmat)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'function' object has no attribute 'multiply'

The Spark version is 2.2.0 and the Scala version is 2.11.8, running on Dataproc on Google Cloud.
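A minimal sketch of one likely fix, not taken from the original post: BlockMatrix.transpose in pyspark.mllib.linalg.distributed is a method, so it has to be called with parentheses before chaining multiply().

    from pyspark.mllib.linalg.distributed import CoordinateMatrix

    # assuming cmat is an existing CoordinateMatrix
    bmat = cmat.toBlockMatrix(1000, 1000)
    mtm = bmat.transpose().multiply(bmat)  # note the () after transpose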

How do I replace a string value with a NULL in PySpark for all my columns in the dataframe?

Submitted by 泄露秘密 on 2019-12-24 20:26:26
Question: As an example, say I have a DataFrame df:

    from pyspark.sql import Row

    row = Row("v", "x", "y", "z")
    df = sc.parallelize([
        row("p", 1, 2, 3.0),
        row("NULL", 3, "NULL", 5.0),
        row("NA", None, 6, 7.0),
        row(float("Nan"), 8, "NULL", float("NaN"))
    ]).toDF()

Now I want to replace NULL, NA and NaN with PySpark's null (None) value. How do I achieve that for multiple columns together?

    from pyspark.sql.functions import when, lit, col

    def replace(column, value):
        return when(column != value, column).otherwise(lit(None))

    df =
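A minimal sketch of one way to extend this to every column, not from the original post: loop over df.columns and apply the same when/otherwise replacement, treating the placeholder strings as a set (NaN floats would still need isnan() handled separately).

    from pyspark.sql.functions import when, col, lit

    def replace(column, values):
        # keep the value unless it is one of the placeholder strings
        return when(~column.isin(*values), column).otherwise(lit(None))

    for c in df.columns:
        df = df.withColumn(c, replace(col(c), ["NULL", "NA", "NaN"]))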

CombineByKey works fine with pyspark python 2 but not python 3 [duplicate]

Submitted by 醉酒当歌 on 2019-12-24 20:24:24
Question: This question already has an answer here: Nested arguments not compiling (1 answer). Closed 3 months ago.

The following works fine with PySpark using Python 2:

    data = [
        ('A', 2.), ('A', 4.), ('A', 9.),
        ('B', 10.), ('B', 20.),
        ('Z', 3.), ('Z', 5.), ('Z', 8.), ('Z', 12.)
    ]
    rdd = sc.parallelize(data)

    sumCount = rdd.combineByKey(lambda value: (value, 1),
                                lambda x, value: (x[0] + value, x[1] + 1),
                                lambda x, y: (x[0] + y[0], x[1] + y[1]))

    averageByKey = sumCount.map(lambda (key, (totalSum, count
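A minimal sketch of the usual Python 3 fix, not from the original post: tuple parameter unpacking in lambdas was removed in Python 3 (PEP 3113), so the map step has to index into the value tuple instead of destructuring it.

    sumCount = rdd.combineByKey(lambda value: (value, 1),
                                lambda acc, value: (acc[0] + value, acc[1] + 1),
                                lambda a, b: (a[0] + b[0], a[1] + b[1]))

    # (key, (totalSum, count)) -> (key, average)
    averageByKey = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))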

Would a forced Spark DataFrame materialization work as a checkpoint?

Submitted by 橙三吉。 on 2019-12-24 19:40:30
Question: I have a large and complex DataFrame with nested structures in Spark 2.1.0 (PySpark), and I want to add an ID column to it. The way I did it was to add a column like this:

    df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

So it goes, e.g., from this:

    File   A      B
    a.txt  valA1  [valB11,valB12]
    a.txt  valA2  [valB21,valB22]

to this:

    File   A      B                ID
    a.txt  valA1  [valB11,valB12]  1
    a.txt  valA2  [valB21,valB22]  2

After I add this column, I don't immediately trigger a materialization
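A minimal sketch contrasting the two options the question is weighing, not taken from the original post (the checkpoint directory is an assumption): a forced materialization via cache() and count() computes the DataFrame but keeps its full lineage, whereas checkpoint() actually truncates the lineage by writing to reliable storage.

    sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

    df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

    # Option 1: force computation, keep lineage
    df.cache()
    df.count()

    # Option 2: truncate lineage (DataFrame.checkpoint is available from Spark 2.1)
    df = df.checkpoint(eager=True)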

PySpark adding executors makes app slower

Submitted by 断了今生、忘了曾经 on 2019-12-24 18:50:35
Question: Whenever I add more than 10 executors, my jobs start to become a lot slower. With more than 15 executors, my jobs start to crash. I generally use 4 cores per executor but have tried 2-5. I am using YARN and PySpark 2.1. Errors I receive:

    ERROR TransportRequestHandler: Error sending result RpcResponse
    WARN NettyRpcEndpointRef: Error sending message
    Future timed out after [10 seconds]

I have read that most people get this error because of OOM errors, but that is not in my stderr logs anywhere. I
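A minimal sketch of settings that are commonly raised when the RPC layer times out like this, not from the original post and with illustrative values only:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tuning-sketch")
             .config("spark.network.timeout", "300s")   # general network timeout
             .config("spark.rpc.askTimeout", "120s")    # RPC ask timeout
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "4g")
             .getOrCreate())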

Python list of USA holidays within a date range

Submitted by 青春壹個敷衍的年華 on 2019-12-24 18:43:47
Question: I need to fetch a list of holidays in a given range, i.e., if the start date is 20/12/2016 and the end date is 10/1/2017, then I should get 25/12/2016 and 1/1/2017. I can do this using Pandas, but in my case I have the limitation that I must use the AWS Glue service, and Pandas is not supported in AWS Glue. I am trying to use the native Python library holidays, but I couldn't find anything in its API documentation for fetching holidays between a from and a to date. Here is what I have tried:

    import holidays
    import datetime
    from datetime import
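A minimal sketch of one workaround, not from the original post: the holidays package exposes a dict-like object, so you can iterate over the days in the range and keep those that are US holidays.

    import holidays
    from datetime import date, timedelta

    us_holidays = holidays.US(years=[2016, 2017])

    def holidays_in_range(start, end):
        day, found = start, []
        while day <= end:
            if day in us_holidays:
                found.append((day, us_holidays[day]))
            day += timedelta(days=1)
        return found

    print(holidays_in_range(date(2016, 12, 20), date(2017, 1, 10)))
    # includes Christmas Day and New Year's Day (plus their observed dates,
    # since both fall on weekends in this range)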

Spark Streaming reduceByKeyAndWindow for moving average calculation

Submitted by 不羁岁月 on 2019-12-24 17:58:28
Question: I need to calculate a moving average from a Kinesis stream of data. I will have a sliding window size and a slide interval as inputs, and I need to calculate the moving average and plot it. I understand how to use reduceByKeyAndWindow from the docs to get a rolling sum, and I understand how to get the counts per window as well. I am not clear on how to use these to get the average, nor am I sure how to define an average calculator function in reduceByKeyAndWindow. Any help would be appreciated. Sample code
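A minimal sketch of one common approach, not from the original post (pairs is assumed to be a DStream of (key, value) records, and the window/slide durations are placeholders): keep (sum, count) pairs per key over the window, then divide to get the average.

    # requires ssc.checkpoint(...) because an inverse reduce function is used
    windowed = (pairs.mapValues(lambda v: (v, 1))
                     .reduceByKeyAndWindow(
                         lambda a, b: (a[0] + b[0], a[1] + b[1]),  # values entering the window
                         lambda a, b: (a[0] - b[0], a[1] - b[1]),  # values leaving the window
                         60,    # window duration in seconds
                         10))   # slide duration in seconds

    averages = windowed.mapValues(lambda sc: sc[0] / sc[1] if sc[1] else 0.0)
    averages.pprint()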

Python Spark Streaming example with textFileStream does not work. Why?

Submitted by 点点圈 on 2019-12-24 17:57:06
Question: I use Spark 1.3.1 and Python 2.7. This is my first experience with Spark Streaming. I am trying an example of code which reads data from a file using Spark Streaming. This is the link to the example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py

My code is the following:

    conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('..
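A minimal sketch of the usual pitfalls with this example, not from the original post (the monitored directory is an assumption): textFileStream only picks up files that are created in the directory after the stream has started, and the streaming context must be started and left running.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setMaster("local[2]")   # give the local job more than one thread
            .setAppName("My app"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    # only files dropped into this directory *after* ssc.start() are read
    lines = ssc.textFileStream("file:///tmp/stream_input")
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()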

How to bucketize a group of columns in pyspark?

Submitted by 时光怂恿深爱的人放手 on 2019-12-24 17:04:24
Question: I am trying to bucketize the columns that contain the word "road" in a 5k-row dataset and create a new DataFrame. I am not sure how to do that; here is what I have tried so far:

    from pyspark.ml.feature import Bucketizer

    spike_cols = [col for col in df.columns if "road" in col]

    for x in spike_cols:
        bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                                inputCol=x, outputCol=x + "bucket")
        bucketedData = bucketizer.transform(df)

Answer 1: Either modify df in the loop:

    from pyspark.ml.feature
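A minimal sketch of how the "modify df in the loop" suggestion could be completed, not verbatim from the original answer: reassign the transformed DataFrame on every iteration so the bucketed columns accumulate instead of being overwritten.

    from pyspark.ml.feature import Bucketizer

    spike_cols = [c for c in df.columns if "road" in c]

    for x in spike_cols:
        bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                                inputCol=x, outputCol=x + "bucket")
        df = bucketizer.transform(df)   # keep the result for the next iteration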

Improve speed of Spark app

Submitted by 被刻印的时光 ゝ on 2019-12-24 16:43:29
Question: This is part of my Python Spark code, parts of which run too slowly for my needs. In particular, I would really like to improve the speed of this part, but I don't know how. It currently takes around 1 minute for 60 million rows, and I would like to get it under 10 seconds.

    sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load()

More context from my Spark app:

    article_ids = sqlContext.read.format("org.apache.spark.sql
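A minimal sketch of a common first step with the Cassandra connector, not from the original post (the column names and the filter are hypothetical): select only the columns you need and apply a filter early so that less data is transferred from Cassandra.

    from pyspark.sql.functions import col

    axes = (sqlContext.read
            .format("org.apache.spark.sql.cassandra")
            .options(table="axes", keyspace=source)
            .load()
            .select("article_id", "at")           # hypothetical column names
            .where(col("at") > "2016-01-01"))     # pushed down when the connector supports it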