pyspark

Saving data back into Cassandra as RDD

Submitted by 為{幸葍}努か on 2019-12-22 12:46:09
Question: I am trying to read messages from Kafka, process the data, and then add the data into Cassandra as an RDD. My trouble is saving the data back into Cassandra.

from __future__ import print_function
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkConf, SparkContext

appName = 'Kafka_Cassandra_Test'
kafkaBrokers = '1.2.3.4:9092'
topic = 'test'
cassandraHosts = '1,2,3'
sparkMaster = 'spark://mysparkmaster:7077'
if _
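One common way to handle this (a sketch, not the asker's code) is to convert each micro-batch RDD into a DataFrame and write it through the Spark Cassandra Connector. The keyspace, table, and column names below are placeholders, and the sketch assumes the connector package is on the cluster's classpath and that sc is the existing SparkContext:

# Minimal sketch: writing one RDD of (key, value) rows to Cassandra.
# Assumes a keyspace "test_ks" and a table "messages(key text PRIMARY KEY, value text)".
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def save_rdd_to_cassandra(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd, ["key", "value"])
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="test_ks", table="messages")
       .mode("append")
       .save())

# In a streaming job this would typically be applied per micro-batch:
# dstream.foreachRDD(save_rdd_to_cassandra)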

Difference between Caching mechanism in Spark SQL

Submitted by 守給你的承諾、 on 2019-12-22 11:23:15
Question: I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following code snippets?

Method 1:

cache table test_cache AS
select a, b, c from x inner join y on x.a = y.a;

Method 2:

create temporary view test_cache AS
select a, b, c from x inner join y on x.a = y.a;
cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results the very first time the temp table is created in Method 2? Or will it wait for any
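Both approaches can be tried from PySpark through spark.sql. A rough sketch follows, assuming a SparkSession named spark and that tables x and y already exist; for context, CACHE TABLE is eager by default in Spark SQL, while CACHE LAZY TABLE defers materialization until the table is first used:

# Method 1: CACHE TABLE ... AS SELECT caches the join result under the name test_cache.
spark.sql("""
    CACHE TABLE test_cache AS
    SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
""")

# Method 2: create the view first, then cache it explicitly
# (a different name is used here only so both methods can run side by side).
spark.sql("""
    CREATE TEMPORARY VIEW test_cache_v AS
    SELECT a, b, c FROM x INNER JOIN y ON x.a = y.a
""")
spark.sql("CACHE TABLE test_cache_v")          # eager by default
# spark.sql("CACHE LAZY TABLE test_cache_v")   # defers caching until first use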

Why can't I load a PySpark RandomForestClassifier model?

Submitted by 痞子三分冷 on 2019-12-22 10:50:04
Question: I can't load a RandomForestClassificationModel saved by Spark. Environment: Apache Spark 2.0.1, standalone mode running on a small (4-machine) cluster. No HDFS; everything is saved to local disks.

Build and save the model:

classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = classifier.fit(train)
result = model.transform(test)
model.write().save("/tmp/models/20161030-RF-topics-cats.model")

Later, in a separate program:

model =
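For reference, the usual counterpart to model.write().save(...) is the load() method of the fitted model class (RandomForestClassificationModel, not the RandomForestClassifier estimator). A minimal sketch reusing the path from the question:

from pyspark.ml.classification import RandomForestClassificationModel

# Load the fitted model; "test" is assumed to be a DataFrame with a "features" column.
model = RandomForestClassificationModel.load("/tmp/models/20161030-RF-topics-cats.model")
predictions = model.transform(test)

One pitfall worth checking in a cluster without HDFS: each executor may write its part of the saved model to its own local disk, so no single machine necessarily holds the complete model directory afterwards.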

Pyspark java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-22 10:49:28
Question: I am running a Pyspark job:

spark-submit --master yarn-client --driver-memory 150G --num-executors 8 --executor-cores 4 --executor-memory 150G benchmark_script_1.py hdfs:///tmp/data/sample150k 128 hdfs:///tmp/output/sample150k | tee ~/output/sample150k.log

The job itself is pretty standard. It just grabs some files and counts them:

print(str(datetime.now()) + " - Ingesting files...")
files = sc.wholeTextFiles(inputFileDir, partitions)
fileCount = files.count()
print(str(datetime.now()) + " -
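The "Requested array size exceeds VM limit" error generally means a single JVM array grew past Java's roughly 2 GB array limit; with wholeTextFiles, each file's entire contents becomes one record, so a single very large file can trigger it. If only the file count is needed, one workaround is to list the directory through the Hadoop FileSystem API instead of reading the files. This sketch relies on PySpark's non-public _jsc/_jvm bridge, so treat it as an assumption about internals rather than a stable API:

# Count files in a directory without reading their contents.
inputFileDir = "hdfs:///tmp/data/sample150k"   # taken from the spark-submit arguments above
hadoop_conf = sc._jsc.hadoopConfiguration()
path = sc._jvm.org.apache.hadoop.fs.Path(inputFileDir)
fs = path.getFileSystem(hadoop_conf)
fileCount = len(fs.listStatus(path))
print("Found %d files" % fileCount)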

pyspark interpreter not found in apache zeppelin

Submitted by 旧巷老猫 on 2019-12-22 10:46:04
Question: I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:

%pyspark
a = 1+3

Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine. And running z returns res0: org.apache.zeppelin.spark.ZeppelinContext = {}. Pyspark works from the CLI (using Spark 1.6.0 and Python 2.6.6). The default Python on the machine is 2.6.6, while anaconda-python 3.5 is also
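In setups like this, Zeppelin's pyspark interpreter usually needs to be pointed explicitly at the intended Spark and Python installations. A sketch of the relevant conf/zeppelin-env.sh settings; the paths are placeholders and this is a typical configuration, not a verified fix for this particular error:

# conf/zeppelin-env.sh (illustrative values)
export SPARK_HOME=/path/to/spark-1.6.0        # Spark installation Zeppelin should use
export PYSPARK_PYTHON=/usr/bin/python2.6      # Python executable for PySpark
# After editing, restart Zeppelin and check that %pyspark is listed under the
# "spark" interpreter group in the Interpreter settings page.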

Spark RDD partition by key in exclusive way

Submitted by 你说的曾经没有我的故事 on 2019-12-22 10:45:16
Question: I would like to partition an RDD by key so that each partition contains only the values of a single key. For example, if I have 100 different key values and I repartition(102), the RDD should have 2 empty partitions and 100 partitions each containing the values of one key. I tried groupByKey(k).repartition(102), but this does not guarantee that each partition holds a single key exclusively, since I see some partitions containing values of more than one key, and more than 2 empty partitions. Is
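One way to get that guarantee (a sketch, not the asker's code) is to map each distinct key to its own partition index and pass that mapping to partitionBy, so the partitioning is exact rather than hash-based:

# Example pair RDD with 100 distinct keys (placeholder data).
rdd = sc.parallelize([(k, v) for k in range(100) for v in range(10)])

# Assign each distinct key its own partition; the 2 extra partitions simply stay empty.
keys = rdd.keys().distinct().collect()
key_to_partition = {k: i for i, k in enumerate(keys)}

exclusive = rdd.partitionBy(102, partitionFunc=lambda k: key_to_partition[k])

# Sanity check: number of distinct keys per partition (expect only 0s and 1s).
print(exclusive.mapPartitions(lambda it: [len(set(k for k, _ in it))]).collect())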

Using Python's reduce() to join multiple PySpark DataFrames

Submitted by 人盡茶涼 on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of
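For comparison, the iterative variant the question alludes to might look like the sketch below; dataframes and list_of_join_columns are placeholders, and both versions build the same left-deep chain of joins:

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# reduce() version, as in the question (dataframes is a list of DataFrames).
joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), dataframes
)

# Equivalent for-loop version: same join order, built one step at a time.
joined_df_loop = dataframes[0]
for right_df in dataframes[1:]:
    joined_df_loop = joined_df_loop.join(right_df, on=list_of_join_columns)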

Group by a Pyspark Dataframe by time interval

Submitted by 好久不见. on 2019-12-22 10:24:56
Question: I have a data frame with timestamps generated for it:

from pyspark.sql.functions import avg, first

rdd = sc.parallelize(
    [
        (0, "A", 223, "201603_170302", "PORT"),
        (0, "A", 22, "201602_100302", "PORT"),
        (0, "A", 422, "201601_114300", "DOCK"),
        (1, "B", 3213, "201602_121302", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])

so I can generate a datetime:

dt_parse = udf(lambda x: datetime.strptime(x, "%Y%m%d_%H%M%S"))
df_data = df_data.withColumn('datetime',
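Once a proper timestamp column exists, one way to group by a time interval is the window function (a sketch assuming Spark 2.0+, where pyspark.sql.functions.window is available, and reusing the columns from the question; the format string and the 30-day bucket size are illustrative assumptions):

from datetime import datetime
from pyspark.sql.functions import udf, window, avg
from pyspark.sql.types import TimestampType

# Parse the string column into a real timestamp; the format must match the
# actual layout of the "date" strings (here assumed to be YYYYMM_HHMMSS).
dt_parse = udf(lambda x: datetime.strptime(x, "%Y%m_%H%M%S"), TimestampType())
df_ts = df_data.withColumn("datetime", dt_parse(df_data["date"]))

# Group rows into 30-day buckets per id and average the cost in each bucket.
result = (df_ts
          .groupBy("id", window("datetime", "30 days"))
          .agg(avg("cost").alias("avg_cost")))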

boto3 cannot create client on pyspark worker?

Submitted by 只谈情不闲聊 on 2019-12-22 10:13:05
Question: I'm trying to send data from the workers of a Pyspark RDD to an SQS queue, using boto3 to talk to AWS. I need to send the data directly from the partitions, rather than collecting the RDD and sending the data from the driver. I am able to send messages to SQS via boto3 locally and from the Spark driver; also, I can import boto3 and create a boto3 session on the partitions. However, when I try to create a client or resource from the partitions, I receive an error. I believe boto3 is not correctly
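The usual pattern for this kind of per-partition I/O (a sketch, not a diagnosis of the specific error; the queue URL and region are placeholders) is to build the boto3 client inside the partition function, so nothing unpicklable is captured in the closure shipped from the driver:

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def send_partition(records):
    # Create the client on the worker; boto3 clients are not picklable and
    # should not be constructed on the driver and referenced inside the closure.
    sqs = boto3.client("sqs", region_name="us-east-1")
    for record in records:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=str(record))

rdd.foreachPartition(send_partition)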