pyspark

Saving data back into Cassandra as RDD

Submitted by 為{幸葍}努か on 2019-12-22 12:46:09
Question: I am trying to read messages from Kafka, process the data, and then add the data into Cassandra as an RDD. My trouble is saving the data back into Cassandra.

from __future__ import print_function
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkConf, SparkContext

appName = 'Kafka_Cassandra_Test'
kafkaBrokers = '1.2.3.4:9092'
topic = 'test'
cassandraHosts = '1,2,3'
sparkMaster = 'spark://mysparkmaster:7077'
if _
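One common way to handle this (a sketch, not the asker's code) is to convert each micro-batch RDD into a DataFrame and write it through the Spark Cassandra Connector. The keyspace, table, and column names below are placeholders, and the sketch assumes the connector package is on the cluster's classpath and that sc is the existing SparkContext:

# Minimal sketch: writing one RDD of (key, value) rows to Cassandra.
# Assumes a keyspace "test_ks" and a table "messages(key text PRIMARY KEY, value text)".
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def save_rdd_to_cassandra(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd, ["key", "value"])
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="test_ks", table="messages")
       .mode("append")
       .save())

# In a streaming job this would typically be applied per micro-batch:
# dstream.foreachRDD(save_rdd_to_cassandra)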

Difference between Caching mechanism in Spark SQL

Submitted by 守給你的承諾、 on 2019-12-22 11:23:15
Question: I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following code snippets?

Method 1:

cache table test_cache AS
select a, b, c from x inner join y on x.a = y.a;

Method 2:

create temporary view test_cache AS
select a, b, c from x inner join y on x.a = y.a;
cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results the very first time the temp table is created in Method 2? Or will it wait for any
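Both approaches can be tried from PySpark through spark.sql. A rough sketch follows, assuming a SparkSession named spark and that tables x and y already exist; for context, CACHE TABLE is eager by default in Spark SQL, while CACHE LAZY TABLE defers materialization until the table is first used:

# Method 1: CACHE TABLE ... AS SELECT caches the join result under the name test_cache.
spark.sql("""
    CACHE TABLE test_cache AS
    SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
""")

# Method 2: create the view first, then cache it explicitly
# (a different name is used here only so both methods can run side by side).
spark.sql("""
    CREATE TEMPORARY VIEW test_cache_v AS
    SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
""")
spark.sql("CACHE TABLE test_cache_v")          # eager by default
# spark.sql("CACHE LAZY TABLE test_cache_v")   # defers caching until first use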

Why can't I load a PySpark RandomForestClassifier model?

Submitted by 痞子三分冷 on 2019-12-22 10:50:04
Question: I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4-machine) cluster. No HDFS; everything is saved to local disks.

Build and save the model:

classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = classifier.fit(train)
result = model.transform(test)
model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program:

model =
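For reference, the usual counterpart to model.write().save(...) is the load() method of the fitted model class (RandomForestClassificationModel, not the RandomForestClassifier estimator). A minimal sketch reusing the path from the question:

from pyspark.ml.classification import RandomForestClassificationModel

# Load the fitted model; "test" is assumed to be a DataFrame with a "features" column.
model = RandomForestClassificationModel.load("/tmp/models/20161030-RF-topics-cats.model")
predictions = model.transform(test)

One pitfall worth checking in a cluster without HDFS: each executor may write its part of the saved model to its own local disk, so no single machine necessarily holds the complete model directory afterwards.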

Pyspark java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-22 10:49:28
Question: I am running a Pyspark job:

spark-submit --master yarn-client --driver-memory 150G --num-executors 8 --executor-cores 4 --executor-memory 150G benchmark_script_1.py hdfs:///tmp/data/sample150k 128 hdfs:///tmp/output/sample150k | tee ~/output/sample150k.log

The job itself is pretty standard. It just grabs some files and counts them:

print(str(datetime.now()) + " - Ingesting files...")
files = sc.wholeTextFiles(inputFileDir, partitions)
fileCount = files.count()
print(str(datetime.now()) + " -
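The "Requested array size exceeds VM limit" error generally means a single JVM array grew past Java's roughly 2 GB array limit; with wholeTextFiles, each file's entire contents becomes one record, so a single very large file can trigger it. If only the file count is needed, one workaround is to list the directory through the Hadoop FileSystem API instead of reading the files. This sketch relies on PySpark's non-public _jsc/_jvm bridge, so treat it as an assumption about internals rather than a stable API:

# Count files in a directory without reading their contents.
inputFileDir = "hdfs:///tmp/data/sample150k"   # taken from the spark-submit arguments above
hadoop_conf = sc._jsc.hadoopConfiguration()
path = sc._jvm.org.apache.hadoop.fs.Path(inputFileDir)
fs = path.getFileSystem(hadoop_conf)
fileCount = len(fs.listStatus(path))
print("Found %d files" % fileCount)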

pyspark interpreter not found in apache zeppelin

Submitted by 旧巷老猫 on 2019-12-22 10:46:04
Question: I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:

%pyspark
a = 1+3

Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine. And running z returns res0: org.apache.zeppelin.spark.ZeppelinContext = {}. Pyspark works from the CLI (using Spark 1.6.0 and Python 2.6.6). The default Python on the machine is 2.6.6, while anaconda-python 3.5 is also
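In setups like this, Zeppelin's pyspark interpreter usually needs to be pointed explicitly at the intended Spark and Python installations. A sketch of the relevant conf/zeppelin-env.sh settings; the paths are placeholders and this is a typical configuration, not a verified fix for this particular error:

# conf/zeppelin-env.sh (illustrative values)
export SPARK_HOME=/path/to/spark-1.6.0        # Spark installation Zeppelin should use
export PYSPARK_PYTHON=/usr/bin/python2.6      # Python executable for PySpark
# After editing, restart Zeppelin and check that %pyspark is listed under the
# "spark" interpreter group in the Interpreter settings page.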

Spark RDD partition by key in exclusive way

Submitted by 你说的曾经没有我的故事 on 2019-12-22 10:45:16
Question: I would like to partition an RDD by key so that each partition contains only the values of a single key. For example, if I have 100 different key values and I repartition(102), the RDD should have 2 empty partitions and 100 partitions each containing the values of one key. I tried groupByKey(k).repartition(102), but this does not guarantee that each partition holds a single key exclusively, since I see some partitions containing values of more than one key, and more than 2 empty partitions. Is
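One way to get that guarantee (a sketch, not the asker's code) is to map each distinct key to its own partition index and pass that mapping to partitionBy, so the partitioning is exact rather than hash-based:

# Example pair RDD with 100 distinct keys (placeholder data).
rdd = sc.parallelize([(k, v) for k in range(100) for v in range(10)])

# Assign each distinct key its own partition; the 2 extra partitions simply stay empty.
keys = rdd.keys().distinct().collect()
key_to_partition = {k: i for i, k in enumerate(keys)}

exclusive = rdd.partitionBy(102, partitionFunc=lambda k: key_to_partition[k])

# Sanity check: number of distinct keys per partition (expect only 0s and 1s).
print(exclusive.mapPartitions(lambda it: [len(set(k for k, _ in it))]).collect())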

Using Python's reduce() to join multiple PySpark DataFrames

Submitted by 人盡茶涼 on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of
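For comparison, the iterative variant the question alludes to might look like the sketch below; dataframes and list_of_join_columns are placeholders, and both versions build the same left-deep chain of joins:

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# reduce() version, as in the question (dataframes is a list of DataFrames).
joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), dataframes
)

# Equivalent for-loop version: same join order, built one step at a time.
joined_df_loop = dataframes[0]
for right_df in dataframes[1:]:
    joined_df_loop = joined_df_loop.join(right_df, on=list_of_join_columns)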

Group by a Pyspark Dataframe by time interval

Submitted by 好久不见. on 2019-12-22 10:24:56
Question: I have a data frame with timestamps generated for it:

from pyspark.sql.functions import avg, first

rdd = sc.parallelize(
    [
        (0, "A", 223, "201603_170302", "PORT"),
        (0, "A", 22, "201602_100302", "PORT"),
        (0, "A", 422, "201601_114300", "DOCK"),
        (1, "B", 3213, "201602_121302", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])

so I can generate a datetime:

dt_parse = udf(lambda x: datetime.strptime(x, "%Y%m%d_%H%M%S"))
df_data = df_data.withColumn('datetime',
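Once a proper timestamp column exists, one way to group by a time interval is the window function (a sketch assuming Spark 2.0+, where pyspark.sql.functions.window is available, and reusing the columns from the question; the format string and the 30-day bucket size are illustrative assumptions):

from datetime import datetime
from pyspark.sql.functions import udf, window, avg
from pyspark.sql.types import TimestampType

# Parse the string column into a real timestamp; the format must match the
# actual layout of the "date" strings (here assumed to be YYYYMM_HHMMSS).
dt_parse = udf(lambda x: datetime.strptime(x, "%Y%m_%H%M%S"), TimestampType())
df_ts = df_data.withColumn("datetime", dt_parse(df_data["date"]))

# Group rows into 30-day buckets per id and average the cost in each bucket.
result = (df_ts
          .groupBy("id", window("datetime", "30 days"))
          .agg(avg("cost").alias("avg_cost")))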

boto3 cannot create client on pyspark worker?

Submitted by 只谈情不闲聊 on 2019-12-22 10:13:05
Question: I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk to AWS. I need to send the data directly from the partitions, rather than collecting the RDD and sending the data from the driver. I am able to send messages to SQS via boto3 locally and from the Spark driver; also, I can import boto3 and create a boto3 session on the partitions. However, when I try to create a client or resource from the partitions, I receive an error. I believe boto3 is not correctly
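The usual pattern for this kind of per-partition I/O (a sketch, not a diagnosis of the specific error; the queue URL and region are placeholders) is to build the boto3 client inside the partition function, so nothing unpicklable is captured in the closure shipped from the driver:

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def send_partition(records):
    # Create the client on the worker; boto3 clients are not picklable and
    # should not be constructed on the driver and referenced inside the closure.
    sqs = boto3.client("sqs", region_name="us-east-1")
    for record in records:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=str(record))

rdd.foreachPartition(send_partition)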