spark-streaming

Spark Streaming Kafka stream

扶醉桌前 submitted on 2019-12-08 19:36:23
Question: I'm having some issues while trying to read from Kafka with Spark Streaming. My code is:

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "localhost:2181",
      "group.id" -> "consumergroup",
      "metadata.broker.list" -> "localhost:9092",
      "zookeeper.connection.timeout.ms" -> "10000"
      //"kafka.auto.offset.reset" -> "smallest"
    )
    val topics = Set("test")
    val
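The snippet is cut off right where the stream is created. For context, here is a minimal sketch of how such a Kafka stream is typically wired up with the receiver-less direct API in Spark 1.x; this is not necessarily how the asker's code continues, and only the broker list from the params above is needed for it:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaIngestorSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // The direct stream only needs the broker list; ZooKeeper settings are not used here.
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Print the message values of each 2-second batch.
    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```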

What's the limit to Spark Streaming in terms of data amount?

99封情书 submitted on 2019-12-08 16:37:43
Question: I have tens of millions of rows of data. Is it possible to analyze all of them within a week or a day using Spark Streaming? What is the limit of Spark Streaming in terms of data volume? I am not sure what the upper limit is, and when I should put the data into my database, since the stream probably can't handle it anymore. I also have different time windows (1, 3, 6 hours, etc.) where I use window operations to separate the data. Please find my code below:

    conf = SparkConf().setAppName(appname)
    sc =
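Spark Streaming itself has no hard cap on data volume; the practical limit is that each batch must finish processing within the batch interval, otherwise batches queue up and delay grows. The asker's code is in Python, but as a rough illustration of the windowing part, here is a hedged Scala sketch; the source, key extraction and window sizes are made up for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/window-sketch-checkpoint") // needed once stateful/window ops are used

    // Hypothetical socket source; replace with the real input stream.
    val events = ssc.socketTextStream("localhost", 9999)

    // Count events per key over a 1-hour window, sliding every 10 minutes.
    val hourlyCounts = events
      .map(line => (line.split(",")(0), 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(60), Minutes(10))

    hourlyCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```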

Spark Streaming - obtain batch-level performance stats

大城市里の小女人 submitted on 2019-12-08 15:23:30
I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking various metrics such as batch sizes, batch processing times, etc. My Spark Streaming program is written in Scala. Questions: The Spark monitoring REST API description lists the various endpoints available. However, I couldn't find endpoints that expose batch-level info. Is there a way to get a list of all the Spark batches that have been run for an application, along with per-batch details such as the following: number of events per batch, processing
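The REST API aside, batch-level numbers are also available programmatically through a StreamingListener registered on the StreamingContext. A minimal sketch; the println-based reporting is just for illustration:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs per-batch metrics as each batch finishes.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(
      s"batch=${info.batchTime} " +
      s"records=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingTimeMs=${info.processingDelay.getOrElse(-1L)} " +
      s"totalDelayMs=${info.totalDelay.getOrElse(-1L)}")
  }
}

// Registration, given an existing StreamingContext `ssc`:
// ssc.addStreamingListener(new BatchStatsListener)
```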

Spark Streaming 1.6.1 is not working with Kinesis ASL 1.6.1 and ASL 2.0.0-preview

感情迁移 submitted on 2019-12-08 14:03:04
Question: I am trying to run a Spark Streaming job on EMR with Kinesis: Spark 1.6.1 with Kinesis ASL 1.6.1, writing a plain sample word-count example.

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
      <version>1.6.1</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-client</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-producer
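The dependency block is cut off, but for reference, a minimal Kinesis word-count sketch against the 1.6 ASL API looks roughly like this. Stream name, endpoint, region and intervals are placeholders; pinning a different amazon-kinesis-client version than the one the ASL was built against is a plausible source of such failures, though that is an assumption, not a confirmed diagnosis:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KinesisWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kinesis-wordcount-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Application name, stream name, endpoint and region are placeholders.
    val stream = KinesisUtils.createStream(
      ssc, "kinesis-wordcount-sketch", "my-stream",
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

    // Records arrive as Array[Byte]; decode and count words.
    stream.flatMap(bytes => new String(bytes, "UTF-8").split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```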

Can I say only current batch by watermarking and window logic for aggregating streaming data in Append output mode?

时间秒杀一切 submitted on 2019-12-08 13:42:59
Question: I am joining a streaming dataset (LHS) with a static dataset (RHS). Since there can be multiple matches in the static dataset for a row on the LHS, the data explodes into duplicate rows for a single LHS id during the left_outer join. I want to group all these rows, collecting the RHS matches into a list. Since it is guaranteed there will be no duplicates in the streaming data, I don't want to introduce a synthetic watermarking column and aggregate the data based on a time window around
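What the asker describes looks roughly like the following sketch; the sources and column names are invented for illustration. Note the comment on the aggregation step: in Append output mode Spark normally requires a watermark on a streaming aggregation, which is exactly the restriction the question is trying to avoid:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object StreamStaticJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-static-join-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sources and column names.
    val streaming = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    val static = spark.read.parquet("/data/reference") // columns: id, attr

    // One streaming row can match many static rows, so the join duplicates ids...
    val joined = streaming.join(static, Seq("id"), "left_outer")

    // ...and the aggregation folds the matches back into a single row per id.
    // In Append output mode this aggregation would normally require a watermark.
    val grouped = joined.groupBy($"id", $"payload").agg(collect_list($"attr").as("attrs"))

    val query = grouped.writeStream
      .outputMode("update") // Append would require a watermark here
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```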

Is it possible to recover a broadcast value from a Spark Streaming checkpoint

丶灬走出姿态 submitted on 2019-12-08 12:37:07
Question: I used hbase-spark to record pv/uv in my Spark Streaming project. When I killed the app and restarted it, I got the following exception during checkpoint recovery:

    16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast
    java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
        at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645)
        at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$
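Broadcast variables are generally not restored from a Spark Streaming checkpoint. A common workaround, sketched below under the assumption that the broadcast payload can be rebuilt from configuration, is a lazily instantiated singleton that recreates the broadcast on the new SparkContext after a restart:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated singleton: the broadcast is rebuilt on the new SparkContext
// after a restart from checkpoint instead of being read back from checkpoint data.
object ConfigBroadcast {
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Rebuild the payload from its original source (hypothetical values here).
          instance = sc.broadcast(Map("hbase.zookeeper.quorum" -> "localhost"))
        }
      }
    }
    instance
  }
}

// Inside foreachRDD, always fetch the broadcast through the singleton:
// dstream.foreachRDD { rdd =>
//   val conf = ConfigBroadcast.getInstance(rdd.sparkContext)
//   rdd.foreachPartition { iter => /* use conf.value here */ }
// }
```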

How to get a list of files from Azure Blob Storage using Spark/Scala?

可紊 submitted on 2019-12-08 12:31:00
Question: How can I get a list of files from Azure Blob Storage in Spark and Scala? I have no idea how to approach this. Answer 1: I don't know whether the Spark you are using runs on Azure or locally, so there are two cases, but they are similar. For Spark running locally, there is an official blog post that introduces how to access Azure Blob Storage from Spark. The key is that you need to configure your Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two jars hadoop-azure & azure-storage to your
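The answer is cut off, but once the account key and the hadoop-azure/azure-storage jars are in place, listing blobs typically goes through the Hadoop FileSystem API on a wasb:// URI. A sketch with placeholder account, container and key:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object ListBlobFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("list-blob-files").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Placeholder account, container and key; normally set in core-site.xml.
    hadoopConf.set(
      "fs.azure.account.key.myaccount.blob.core.windows.net", "<storage-account-key>")

    val container = new URI("wasb://mycontainer@myaccount.blob.core.windows.net/")
    val fs = FileSystem.get(container, hadoopConf)

    // List everything directly under the container root.
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}
```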

Spark Streaming from MQTT - IllegalArgumentException

我怕爱的太早我们不能终老 submitted on 2019-12-08 12:17:12
Question: I'm trying to consume messages from RabbitMQ in Spark Streaming following this answer: https://stackoverflow.com/a/38172737/1344854. I'm failing with an IllegalArgumentException and I don't know why.

    6/09/05 13:23:22 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.IllegalArgumentException: ssl://user:password@bunny.cloudamqp.com:8883/vhost
        at org.eclipse.paho.client.mqttv3.MqttConnectOptions.validateURI(MqttConnectOptions.java:458)
        at org.eclipse
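As far as I can tell, Paho's MqttConnectOptions.validateURI rejects broker URIs that carry extra components such as the /vhost path in the URI above. A hedged sketch (plain Paho, not the Spark receiver from the linked answer) of keeping the URI down to scheme://host:port and passing credentials through MqttConnectOptions instead:

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttConnectOptions}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object PahoUriSketch {
  def main(args: Array[String]): Unit = {
    // Keep the broker URI down to scheme://host:port; a "/vhost" path or inline
    // credentials are what validateURI appears to choke on.
    val brokerUri = "ssl://bunny.cloudamqp.com:8883"

    val options = new MqttConnectOptions()
    options.setUserName("user")                 // placeholder credentials
    options.setPassword("password".toCharArray)

    val client = new MqttClient(brokerUri, MqttClient.generateClientId(), new MemoryPersistence())
    client.connect(options)
    client.subscribe("some/topic")              // placeholder topic
  }
}
```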

Spark Streaming From Kafka and Write to HDFS in Avro Format

冷暖自知 submitted on 2019-12-08 11:44:46
Question: I basically want to consume data from Kafka and write it to HDFS, but it is not writing any files to HDFS; it creates empty files. Please also guide me on how to modify the code if I want to write to HDFS in Avro format. For the sake of simplicity I am writing to the local C drive.

    import org.apache.spark.SparkConf
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.Seconds
    import org.apache
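The code is cut off, but a sketch of the overall shape, assuming the Kafka 0.10 direct stream and Spark 2.4+ with the external spark-avro module on the classpath, might look like this; skipping empty batches is also a simple way to avoid writing empty files. Placeholders throughout, not the asker's actual code:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToAvroSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-avro-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "avro-writer",
      "auto.offset.reset" -> "earliest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("test"), kafkaParams))

    stream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) { // skip empty batches so no empty output is written
        val df = rdd.map(_.value()).toDF("value")
        // "avro" needs the external spark-avro module (Spark 2.4+).
        df.write.format("avro").save(s"hdfs:///data/kafka/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```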

Spark Streaming: Join DStream batches into a single output folder

人走茶凉 submitted on 2019-12-08 11:44:39
Question: I am using Spark Streaming to fetch tweets from Twitter by creating a StreamingContext as:

    val ssc = new StreamingContext("local[3]", "TwitterFeed", Minutes(1))

and creating the Twitter stream as:

    val tweetStream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(Util.config)), filters)

then saving it as a text file:

    tweets.repartition(1).saveAsTextFiles("/tmp/spark_testing/")

The problem is that the tweets are being saved as folders based on batch time, but I need all the data of each
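saveAsTextFiles writes one directory per batch by design. One simple workaround, sketched below with placeholder OAuth keys, is to append each batch to a single file from the driver inside foreachRDD; this only makes sense for a low-volume, filtered tweet stream:

```scala
import java.io.FileWriter
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilder

object SingleFileTweetsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[3]", "TwitterFeed", Minutes(1))

    // Placeholder OAuth keys; the asker builds this from Util.config instead.
    val auth = new OAuthAuthorization(new ConfigurationBuilder()
      .setOAuthConsumerKey("<consumer-key>")
      .setOAuthConsumerSecret("<consumer-secret>")
      .setOAuthAccessToken("<access-token>")
      .setOAuthAccessTokenSecret("<access-token-secret>")
      .build())

    val tweetStream = TwitterUtils.createStream(ssc, Some(auth), Seq("spark"))

    tweetStream.map(_.getText).foreachRDD { rdd =>
      // Collect each one-minute batch to the driver and append it to a single file.
      // Acceptable for a filtered tweet stream; not suitable for high-volume data.
      val writer = new FileWriter("/tmp/spark_testing/tweets.txt", true)
      try rdd.collect().foreach(tweet => writer.write(tweet + "\n"))
      finally writer.close()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```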