spark-streaming

Spark Streaming write data to HBase with Python blocked on saveAsNewAPIHadoopDataset

泪湿孤枕 submitted on 2019-12-08 09:12:15
Question: I'm using Spark Streaming with Python to read from Kafka and write to HBase, and I find that the job very easily gets blocked at the saveAsNewAPIHadoopDataset stage. As in the picture below, the duration of this stage is 8 hours. Does Spark write the data through the HBase API, or directly via the HDFS API?

Answer 1: A bit late, but here is a similar example. To save an RDD to HBase, consider an RDD containing a single line: {"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}
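The answer's example is cut off above; independently of it, here is a minimal PySpark sketch of the standard saveAsNewAPIHadoopDataset path to HBase. The table name ("cats"), column family ("cf"), and ZooKeeper quorum are assumptions, and the converter classes come from the Spark examples JAR, which must be on the classpath.

from pyspark import SparkContext

# Minimal sketch: write one record to HBase via saveAsNewAPIHadoopDataset.
# Assumptions: an HBase table "cats" with column family "cf" exists, ZooKeeper
# runs on localhost, and the Spark examples JAR (providing the converter
# classes below) is on the classpath.
sc = SparkContext(appName="HBaseWriteSketch")

conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapred.outputtable": "cats",
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

record = {"id": 3, "name": "Moony", "color": "grey", "description": "Monochrome kitty"}
row_key = str(record["id"])
# Each element is (row_key, [row_key, column_family, qualifier, value]).
puts = [(row_key, [row_key, "cf", field, str(value)])
        for field, value in record.items() if field != "id"]

sc.parallelize(puts).saveAsNewAPIHadoopDataset(
    conf=conf, keyConverter=keyConv, valueConverter=valueConv)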

Spark Streaming with Actor Never Terminates

懵懂的女人 submitted on 2019-12-08 08:49:54
Question: Using Spark 1.5 Streaming with an Actor receiver:

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ModelTest")
val ssc = new StreamingContext(conf, Seconds(2))
val models = ssc.actorStream[Model](Props(...), "ModelReceiver")
models.foreachRDD { rdd => ... }
ssc.start()
ssc.awaitTermination() // NEVER GETS HERE!

When the generated Actor is shut down, the code does not progress beyond ssc.awaitTermination(). If I kill SBT with Ctrl+C, a println after the ssc.awaitTermination() line
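For background on the symptom: awaitTermination() only returns after the StreamingContext is stopped (via ssc.stop() or a failure), so a receiver actor shutting itself down does not release it. PySpark has no actorStream, so the sketch below is only a rough illustration of that point, using a socket receiver, awaitTerminationOrTimeout polling, and a hypothetical should_stop() check for when the driver should call stop(); host and port are placeholders.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def should_stop():
    # Hypothetical application-level condition for shutting down.
    return False

conf = SparkConf().setMaster("local[4]").setAppName("ModelTest")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

models = ssc.socketTextStream("localhost", 9999)  # placeholder source
models.foreachRDD(lambda rdd: print(rdd.count()))

ssc.start()
# Poll instead of blocking forever; awaitTerminationOrTimeout returns True once
# the context has stopped, otherwise it times out after 10 seconds.
while not ssc.awaitTerminationOrTimeout(10):
    if should_stop():
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
print("Streaming context stopped")  # now reachable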

Spark Streaming group by custom function

一笑奈何 submitted on 2019-12-08 08:45:18
Question: I have input lines like the ones below:

t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2

and I want to produce output like the rows below, which is a vertical (element-wise) addition of the corresponding numbers:

file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]

I am in a Spark Streaming context and I am having a hard time figuring out how to aggregate by file name. It seems like I will need to use something like below, I am not sure how to
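The question does not name a language; here is a minimal PySpark Streaming sketch of one way to do the per-file, element-wise sum. It assumes the lines arrive over a socket (host and port are placeholders) and that all rows for a given file carry the same number of values, so they can be zipped and added pairwise.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PerFileVectorSum")
ssc = StreamingContext(sc, 2)

def parse(line):
    # "t1, file1, 1, 2, 3" -> ("file1", [1, 2, 3]); the timestamp is ignored here.
    parts = [p.strip() for p in line.split(",")]
    return parts[1], [int(x) for x in parts[2:]]

def vector_add(a, b):
    # Element-wise addition; assumes rows for the same file have equal length.
    return [x + y for x, y in zip(a, b)]

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
sums = lines.map(parse).reduceByKey(vector_add)   # aggregate per file name, per batch
sums.pprint()

ssc.start()
ssc.awaitTermination()

This aggregates within each micro-batch; to accumulate across batches, updateStateByKey or a windowed reduce would be layered on top.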

java.io.NotSerializableException with Spark Streaming Checkpoint enabled

≯℡__Kan透↙ submitted on 2019-12-08 07:24:37
Question: I have enabled checkpointing in my Spark Streaming application and encounter this error on a class that is downloaded as a dependency. Without checkpointing the application works fine. Error:

com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
- object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer@46c7c593)
- field (class: com.fasterxml.jackson
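The stack trace is cut off above and no answer is shown. As general background, a NotSerializableException with checkpointing usually means a non-serializable object (here a shaded Jackson paranamer class pulled in through a dependency) was captured in a DStream closure that checkpointing then tries to serialize; the usual remedy is to construct such objects inside the closure rather than on the driver. A PySpark sketch of that pattern, with make_json_codec() as a hypothetical stand-in for the problematic dependency and placeholder paths/ports:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def make_json_codec():
    # Hypothetical stand-in for a non-serializable helper; built per partition.
    import json
    class Codec:
        def write(self, record):
            print(json.dumps(record))
    return Codec()

sc = SparkContext(appName="CheckpointSafeClosures")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("/tmp/spark-checkpoints")          # placeholder checkpoint dir

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source

def write_partition(records):
    # Build the helper here, on the executor, instead of capturing one created
    # on the driver, which checkpointing would try to serialize.
    codec = make_json_codec()
    for record in records:
        codec.write(record)

lines.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()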

Why does Spark Streaming stop working when I send two input streams?

丶灬走出姿态 submitted on 2019-12-08 07:16:38
Question: I am developing a Spark Streaming application in which I need to use input streams from two servers in Python, each sending one JSON message per second to the Spark context. My problem is that if I perform operations on just one stream, everything works well. But if I have two streams from different servers, then Spark freezes just before it can print anything, and only starts working again once both servers have sent all the JSON messages they had to send (when it detects that '
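A common cause of this symptom (which may or may not be the asker's issue) is that each socket receiver occupies one core, so a master such as local[2] leaves no cores for processing once two receivers are running. A minimal PySpark sketch with enough local cores and the two streams unioned; hosts and ports are placeholders:

import json
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Two receivers need two dedicated cores plus at least one for processing,
# hence local[4] rather than local[2].
conf = SparkConf().setMaster("local[4]").setAppName("TwoStreams")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

stream_a = ssc.socketTextStream("server-a", 9998)
stream_b = ssc.socketTextStream("server-b", 9999)

# Union the two streams and parse the JSON messages.
messages = stream_a.union(stream_b).map(json.loads)
messages.pprint()

ssc.start()
ssc.awaitTermination()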

Spark to read a big file as an InputStream

拥有回忆 submitted on 2019-12-08 06:19:27
Question: I know Spark's built-in textFile method can partition a huge file and distribute it as an RDD. However, I am reading the file from a custom encrypted filesystem that Spark does not support natively. One way I can think of is to read an InputStream instead, load multiple lines at a time, and distribute them to the executors, reading until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?

Answer 1: You can try lines.take(n) for
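The answer is truncated above; as a sketch of the chunked-read idea described in the question (not taken from the original answer), here is a PySpark example that reads a file object in fixed-size batches of lines, parallelizes each batch, and unions the results. decrypt_open() is a hypothetical stand-in for opening a file on the custom encrypted filesystem, and the path is a placeholder.

from itertools import islice
from pyspark import SparkContext

sc = SparkContext(appName="ChunkedInputStream")

def decrypt_open(path):
    # Hypothetical stand-in for the custom encrypted filesystem; here it
    # just opens a local file.
    return open(path, "r")

def read_in_chunks(path, chunk_size=100000):
    # Yield lists of at most chunk_size lines so the driver never holds
    # the whole file in memory at once.
    with decrypt_open(path) as stream:
        while True:
            chunk = list(islice(stream, chunk_size))
            if not chunk:
                break
            yield chunk

rdds = [sc.parallelize(chunk) for chunk in read_in_chunks("/data/big_file.txt")]
lines = sc.union(rdds) if rdds else sc.emptyRDD()
print(lines.count())

Note that the data still flows through the driver; this only bounds how much of the file the driver holds at any one time.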

EMR with multiple encryption key providers

折月煮酒 submitted on 2019-12-08 06:17:01
Question: I'm running an EMR cluster with S3 client-side encryption enabled using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes: CSE with a custom key provider, and CSE-KMS. Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type? Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering whether it's possible to disable encryption on EMRFS

Controlling the data partition in Apache Spark

删除回忆录丶 submitted on 2019-12-08 05:06:48
Question: The data looks like:

col 1 col 2 col 3 col 4
row 1 row 1 row 1 row 1
row 2 row 2 row 2 row 2
row 3 row 3 row 3 row 3
row 4 row 4 row 4 row 4
row 5 row 5 row 5 row 5
row 6 row 6 row 6 row 6

Problem: I want to partition this data so that, say, row 1 and row 2 are processed as one partition, row 3 and row 4 as another, and row 5 and row 6 as another, and to create JSON by merging them together with the columns (column headers paired with the data values in the rows). The output should look like:

[ {col1:row1,col2:row1:col3
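The question does not name a language; here is a minimal PySpark sketch of one way to get the pairwise grouping and the header-to-value JSON merge. It assumes the rows arrive as an ordered RDD of lists of cell values and that the header names are known on the driver; the grouping key floor(index / 2) puts rows 1-2, 3-4, and 5-6 together.

import json
from pyspark import SparkContext

sc = SparkContext(appName="PairwisePartitionsToJson")

headers = ["col1", "col2", "col3", "col4"]
rows = sc.parallelize([
    ["row1", "row1", "row1", "row1"],
    ["row2", "row2", "row2", "row2"],
    ["row3", "row3", "row3", "row3"],
    ["row4", "row4", "row4", "row4"],
    ["row5", "row5", "row5", "row5"],
    ["row6", "row6", "row6", "row6"],
])

# Pair consecutive rows: indices 0-1 -> group 0, 2-3 -> group 1, 4-5 -> group 2,
# then merge each row with the column headers and emit one JSON array per group.
grouped = (rows.zipWithIndex()
               .map(lambda rec: (rec[1] // 2, dict(zip(headers, rec[0]))))
               .groupByKey()
               .mapValues(lambda docs: json.dumps(list(docs))))

for group_id, doc in sorted(grouped.collect()):
    print(group_id, doc)

If each pair really has to live in its own Spark partition rather than just its own group, the keyed rows can be passed through partitionBy with one partition per group before the merge.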

Scala Spark - Dealing with hierarchical data tables

旧时模样 submitted on 2019-12-08 05:02:58
Question: I have a data table with a hierarchical data model (tree structure). Here are some sample rows:

-------------------------------------------
Id  | name     | parentId | path       | depth
-------------------------------------------
55  | Canada   | null     | null       | 0
77  | Ontario  | 55       | /55        | 1
100 | Toronto  | 77       | /55/77     | 2
104 | Brampton | 100      | /55/77/100 | 3

I am looking to convert those rows into a flattened version; sample output would be:

-----------------------------------
Id | name | parentId |
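The expected output is cut off above, so the following is only a guess at one common flattening, sketched in PySpark rather than Scala: split the materialized path column into one ancestor-id column per depth level. The data and column names come from the question; everything else is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("FlattenHierarchy").getOrCreate()

rows = [
    (55,  "Canada",   None, None,         0),
    (77,  "Ontario",  55,   "/55",        1),
    (100, "Toronto",  77,   "/55/77",     2),
    (104, "Brampton", 100,  "/55/77/100", 3),
]
df = spark.createDataFrame(rows, ["Id", "name", "parentId", "path", "depth"])

# One ancestor-id column per level; splitting "/55/77" on "/" gives
# ["", "55", "77"], so ancestors start at array index 1.
max_depth = df.agg(F.max("depth")).first()[0]
parts = F.split(F.coalesce(F.col("path"), F.lit("")), "/")
flat = df.select(
    "Id", "name", "parentId",
    *[parts.getItem(i + 1).alias(f"level_{i}_id") for i in range(max_depth)]
)
flat.show()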

How to access statistics endpoint for a Spark Streaming application?

末鹿安然 submitted on 2019-12-08 04:44:28
As of Spark 2.2.0, there are new endpoints in the REST API for getting information about streaming jobs. I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: no streaming listener attached to <stream name>. I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working? This appears to be an issue specifically when running on the cluster. The same code running on Spark 2.2.0
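For reference, the endpoint in question is the monitoring REST API route /api/v1/applications/<app-id>/streaming/statistics added in Spark 2.2.0. A small sketch of querying it; the host, UI port, and application id are placeholders that depend on how the cluster exposes the driver UI:

import json
from urllib.request import urlopen

base = "http://driver-host:4040/api/v1"        # placeholder host and port
app_id = "application_0000000000000_0001"      # placeholder application id

with urlopen(f"{base}/applications/{app_id}/streaming/statistics") as resp:
    stats = json.load(resp)

# Print a couple of the reported streaming statistics.
print(stats.get("batchDuration"), stats.get("numReceivers"))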