pyspark

Errors for block matrix multiplication in Spark

Submitted by 旧巷老猫 on 2019-12-24 20:30:29
Question: I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to compute cmat.T * cmat. I first converted cmat to a block matrix bmat:

    bmat = cmat.toBlockMatrix(1000, 1000)

However, I got an error when performing multiply():

    mtm = bmat.transpose.multiply(bmat)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'function' object has no attribute 'multiply'

The Spark version is 2.2.0 and the Scala version is 2.11.8, running on Dataproc on Google Cloud.
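A minimal sketch of one likely fix, not taken from the original post: BlockMatrix.transpose in pyspark.mllib.linalg.distributed is a method, so it has to be called with parentheses before chaining multiply().

    from pyspark.mllib.linalg.distributed import CoordinateMatrix

    # assuming cmat is an existing CoordinateMatrix
    bmat = cmat.toBlockMatrix(1000, 1000)
    mtm = bmat.transpose().multiply(bmat)  # note the () after transpose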

How do I replace a string value with a NULL in PySpark for all my columns in the dataframe?

Submitted by 泄露秘密 on 2019-12-24 20:26:26
Question: As an example, say I have a DataFrame df:

    from pyspark.sql import Row

    row = Row("v", "x", "y", "z")
    df = sc.parallelize([
        row("p", 1, 2, 3.0),
        row("NULL", 3, "NULL", 5.0),
        row("NA", None, 6, 7.0),
        row(float("Nan"), 8, "NULL", float("NaN"))
    ]).toDF()

Now I want to replace NULL, NA and NaN with PySpark's null (None) value. How do I achieve that for multiple columns together?

    from pyspark.sql.functions import when, lit, col

    def replace(column, value):
        return when(column != value, column).otherwise(lit(None))

    df =
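A minimal sketch of one way to extend this to every column, not from the original post: loop over df.columns and apply the same when/otherwise replacement, treating the placeholder strings as a set (NaN floats would still need isnan() handled separately).

    from pyspark.sql.functions import when, col, lit

    def replace(column, values):
        # keep the value unless it is one of the placeholder strings
        return when(~column.isin(*values), column).otherwise(lit(None))

    for c in df.columns:
        df = df.withColumn(c, replace(col(c), ["NULL", "NA", "NaN"]))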

CombineByKey works fine with pyspark python 2 but not python 3 [duplicate]

Submitted by 醉酒当歌 on 2019-12-24 20:24:24
Question: This question already has an answer here: Nested arguments not compiling (1 answer). Closed 3 months ago.

The following works fine with PySpark using Python 2:

    data = [
        ('A', 2.), ('A', 4.), ('A', 9.),
        ('B', 10.), ('B', 20.),
        ('Z', 3.), ('Z', 5.), ('Z', 8.), ('Z', 12.)
    ]
    rdd = sc.parallelize(data)

    sumCount = rdd.combineByKey(lambda value: (value, 1),
                                lambda x, value: (x[0] + value, x[1] + 1),
                                lambda x, y: (x[0] + y[0], x[1] + y[1]))

    averageByKey = sumCount.map(lambda (key, (totalSum, count
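A minimal sketch of the usual Python 3 fix, not from the original post: tuple parameter unpacking in lambdas was removed in Python 3 (PEP 3113), so the map step has to index into the value tuple instead of destructuring it.

    sumCount = rdd.combineByKey(lambda value: (value, 1),
                                lambda acc, value: (acc[0] + value, acc[1] + 1),
                                lambda a, b: (a[0] + b[0], a[1] + b[1]))

    # (key, (totalSum, count)) -> (key, average)
    averageByKey = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))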

Would a forced Spark DataFrame materialization work as a checkpoint?

Submitted by 橙三吉。 on 2019-12-24 19:40:30
Question: I have a large and complex DataFrame with nested structures in Spark 2.1.0 (PySpark), and I want to add an ID column to it. The way I did it was to add a column like this:

    df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

So it goes, e.g., from this:

    File   A      B
    a.txt  valA1  [valB11,valB12]
    a.txt  valA2  [valB21,valB22]

to this:

    File   A      B                ID
    a.txt  valA1  [valB11,valB12]  1
    a.txt  valA2  [valB21,valB22]  2

After I add this column, I don't immediately trigger a materialization
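A minimal sketch contrasting the two options the question is weighing, not taken from the original post (the checkpoint directory is an assumption): a forced materialization via cache() and count() computes the DataFrame but keeps its full lineage, whereas checkpoint() actually truncates the lineage by writing to reliable storage.

    sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

    df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

    # Option 1: force computation, keep lineage
    df.cache()
    df.count()

    # Option 2: truncate lineage (DataFrame.checkpoint is available from Spark 2.1)
    df = df.checkpoint(eager=True)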

PySpark adding executors makes app slower

Submitted by 断了今生、忘了曾经 on 2019-12-24 18:50:35
Question: Whenever I add more than 10 executors, my jobs start to become a lot slower. With more than 15 executors, my jobs start to crash. I generally use 4 cores per executor but have tried 2-5. I am using YARN and PySpark 2.1. Errors I receive:

    ERROR TransportRequestHandler: Error sending result RpcResponse
    WARN NettyRpcEndpointRef: Error sending message
    Future timed out after [10 seconds]

I have read that most people get this error because of OOM errors, but that is not in my stderr logs anywhere. I
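A minimal sketch of settings that are commonly raised when the RPC layer times out like this, not from the original post and with illustrative values only:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tuning-sketch")
             .config("spark.network.timeout", "300s")   # general network timeout
             .config("spark.rpc.askTimeout", "120s")    # RPC ask timeout
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "4g")
             .getOrCreate())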

Python list of USA holidays within a date range

Submitted by 青春壹個敷衍的年華 on 2019-12-24 18:43:47
Question: I need to fetch a list of holidays in a given range, i.e., if the start date is 20/12/2016 and the end date is 10/1/2017, then I should get 25/12/2016 and 1/1/2017. I can do this using Pandas, but in my case I have the limitation that I must use the AWS Glue service, and Pandas is not supported in AWS Glue. I am trying to use the native Python library holidays, but I couldn't find anything in its API documentation for fetching holidays between a from and a to date. Here is what I have tried:

    import holidays
    import datetime
    from datetime import
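A minimal sketch of one workaround, not from the original post: the holidays package exposes a dict-like object, so you can iterate over the days in the range and keep those that are US holidays.

    import holidays
    from datetime import date, timedelta

    us_holidays = holidays.US(years=[2016, 2017])

    def holidays_in_range(start, end):
        day, found = start, []
        while day <= end:
            if day in us_holidays:
                found.append((day, us_holidays[day]))
            day += timedelta(days=1)
        return found

    print(holidays_in_range(date(2016, 12, 20), date(2017, 1, 10)))
    # includes Christmas Day and New Year's Day (plus their observed dates,
    # since both fall on weekends in this range)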

Spark Streaming reduceByKeyAndWindow for moving average calculation

Submitted by 不羁岁月 on 2019-12-24 17:58:28
Question: I need to calculate a moving average from a Kinesis stream of data. I will have a sliding window size and a slide interval as inputs, and I need to calculate the moving average and plot it. I understand how to use reduceByKeyAndWindow from the docs to get a rolling sum, and I understand how to get the counts per window as well. I am not clear on how to use these to get the average, nor am I sure how to define an average calculator function in reduceByKeyAndWindow. Any help would be appreciated. Sample code
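A minimal sketch of one common approach, not from the original post (pairs is assumed to be a DStream of (key, value) records, and the window/slide durations are placeholders): keep (sum, count) pairs per key over the window, then divide to get the average.

    # requires ssc.checkpoint(...) because an inverse reduce function is used
    windowed = (pairs.mapValues(lambda v: (v, 1))
                     .reduceByKeyAndWindow(
                         lambda a, b: (a[0] + b[0], a[1] + b[1]),  # values entering the window
                         lambda a, b: (a[0] - b[0], a[1] - b[1]),  # values leaving the window
                         60,    # window duration in seconds
                         10))   # slide duration in seconds

    averages = windowed.mapValues(lambda sc: sc[0] / sc[1] if sc[1] else 0.0)
    averages.pprint()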

Python Spark Streaming example with textFileStream does not work. Why?

Submitted by 点点圈 on 2019-12-24 17:57:06
Question: I use Spark 1.3.1 and Python 2.7. This is my first experience with Spark Streaming. I am trying an example of code which reads data from a file using Spark Streaming. This is the link to the example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py

My code is the following:

    conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('..
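A minimal sketch of the usual pitfalls with this example, not from the original post (the monitored directory is an assumption): textFileStream only picks up files that are created in the directory after the stream has started, and the streaming context must be started and left running.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setMaster("local[2]")   # give the local job more than one thread
            .setAppName("My app"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    # only files dropped into this directory *after* ssc.start() are read
    lines = ssc.textFileStream("file:///tmp/stream_input")
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()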

How to bucketize a group of columns in pyspark?

Submitted by 时光怂恿深爱的人放手 on 2019-12-24 17:04:24
Question: I am trying to bucketize the columns that contain the word "road" in a 5k-row dataset and create a new DataFrame. I am not sure how to do that; here is what I have tried so far:

    from pyspark.ml.feature import Bucketizer

    spike_cols = [col for col in df.columns if "road" in col]

    for x in spike_cols:
        bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                                inputCol=x, outputCol=x + "bucket")
        bucketedData = bucketizer.transform(df)

Answer 1: Either modify df in the loop:

    from pyspark.ml.feature
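A minimal sketch of how the "modify df in the loop" suggestion could be completed, not verbatim from the original answer: reassign the transformed DataFrame on every iteration so the bucketed columns accumulate instead of being overwritten.

    from pyspark.ml.feature import Bucketizer

    spike_cols = [c for c in df.columns if "road" in c]

    for x in spike_cols:
        bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                                inputCol=x, outputCol=x + "bucket")
        df = bucketizer.transform(df)   # keep the result for the next iteration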

Improve speed of Spark app

Submitted by 被刻印的时光 ゝ on 2019-12-24 16:43:29
Question: This is part of my Python Spark code, parts of which run too slowly for my needs. In particular, I would really like to improve the speed of this part, but I don't know how. It currently takes around 1 minute for 60 million rows, and I would like to get it under 10 seconds.

    sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="axes", keyspace=source).load()

More context from my Spark app:

    article_ids = sqlContext.read.format("org.apache.spark.sql
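A minimal sketch of a common first step with the Cassandra connector, not from the original post (the column names and the filter are hypothetical): select only the columns you need and apply a filter early so that less data is transferred from Cassandra.

    from pyspark.sql.functions import col

    axes = (sqlContext.read
            .format("org.apache.spark.sql.cassandra")
            .options(table="axes", keyspace=source)
            .load()
            .select("article_id", "at")           # hypothetical column names
            .where(col("at") > "2016-01-01"))     # pushed down when the connector supports it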