spark-streaming

Spark Streaming write data to HBase with Python blocked on saveAsNewAPIHadoopDataset

泪湿孤枕 submitted on 2019-12-08 09:12:15
Question: I'm using Spark Streaming with Python to read from Kafka and write to HBase, and I find that the job very easily gets blocked at the saveAsNewAPIHadoopDataset stage. As in the picture below, the duration of this stage is 8 hours. Does Spark write the data through the HBase API, or directly via the HDFS API?

Answer 1: A bit late, but here is a similar example. To save an RDD to HBase, consider an RDD containing a single line: {"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}
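The answer's example is cut off above; independently of it, here is a minimal PySpark sketch of the standard saveAsNewAPIHadoopDataset path to HBase. The table name ("cats"), column family ("cf"), and ZooKeeper quorum are assumptions, and the converter classes come from the Spark examples JAR, which must be on the classpath.

from pyspark import SparkContext

# Minimal sketch: write one record to HBase via saveAsNewAPIHadoopDataset.
# Assumptions: an HBase table "cats" with column family "cf" exists, ZooKeeper
# runs on localhost, and the Spark examples JAR (providing the converter
# classes below) is on the classpath.
sc = SparkContext(appName="HBaseWriteSketch")

conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapred.outputtable": "cats",
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

record = {"id": 3, "name": "Moony", "color": "grey", "description": "Monochrome kitty"}
row_key = str(record["id"])
# Each element is (row_key, [row_key, column_family, qualifier, value]).
puts = [(row_key, [row_key, "cf", field, str(value)])
        for field, value in record.items() if field != "id"]

sc.parallelize(puts).saveAsNewAPIHadoopDataset(
    conf=conf, keyConverter=keyConv, valueConverter=valueConv)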

Spark Streaming with Actor Never Terminates

懵懂的女人 submitted on 2019-12-08 08:49:54
Question: Using Spark 1.5 Streaming with an Actor receiver:

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ModelTest")
val ssc = new StreamingContext(conf, Seconds(2))
val models = ssc.actorStream[Model](Props(...), "ModelReceiver")
models.foreachRDD { rdd => ... }
ssc.start()
ssc.awaitTermination() // NEVER GETS HERE!

When the generated Actor is shut down, the code does not progress beyond ssc.awaitTermination(). If I kill SBT with Ctrl+C, a println after the ssc.awaitTermination() line
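For background on the symptom: awaitTermination() only returns after the StreamingContext is stopped (via ssc.stop() or a failure), so a receiver actor shutting itself down does not release it. PySpark has no actorStream, so the sketch below is only a rough illustration of that point, using a socket receiver, awaitTerminationOrTimeout polling, and a hypothetical should_stop() check for when the driver should call stop(); host and port are placeholders.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def should_stop():
    # Hypothetical application-level condition for shutting down.
    return False

conf = SparkConf().setMaster("local[4]").setAppName("ModelTest")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

models = ssc.socketTextStream("localhost", 9999)  # placeholder source
models.foreachRDD(lambda rdd: print(rdd.count()))

ssc.start()
# Poll instead of blocking forever; awaitTerminationOrTimeout returns True once
# the context has stopped, otherwise it times out after 10 seconds.
while not ssc.awaitTerminationOrTimeout(10):
    if should_stop():
        ssc.stop(stopSparkContext=True, stopGraceFully=True)
print("Streaming context stopped")  # now reachable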

Spark Streaming group by custom function

一笑奈何 submitted on 2019-12-08 08:45:18
Question: I have input lines like the ones below:

t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2

and I want to produce output like the rows below, which is a vertical (element-wise) addition of the corresponding numbers:

file1 : [ 1+1+5, 1+2+5, 1+3+5 ]
file2 : [ 2+1, 2+1, 2+2, 2+2 ]

I am in a Spark Streaming context and I am having a hard time figuring out how to aggregate by file name. It seems like I will need to use something like below, I am not sure how to
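The question does not name a language; here is a minimal PySpark Streaming sketch of one way to do the per-file, element-wise sum. It assumes the lines arrive over a socket (host and port are placeholders) and that all rows for a given file carry the same number of values, so they can be zipped and added pairwise.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PerFileVectorSum")
ssc = StreamingContext(sc, 2)

def parse(line):
    # "t1, file1, 1, 2, 3" -> ("file1", [1, 2, 3]); the timestamp is ignored here.
    parts = [p.strip() for p in line.split(",")]
    return parts[1], [int(x) for x in parts[2:]]

def vector_add(a, b):
    # Element-wise addition; assumes rows for the same file have equal length.
    return [x + y for x, y in zip(a, b)]

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
sums = lines.map(parse).reduceByKey(vector_add)   # aggregate per file name, per batch
sums.pprint()

ssc.start()
ssc.awaitTermination()

This aggregates within each micro-batch; to accumulate across batches, updateStateByKey or a windowed reduce would be layered on top.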

java.io.NotSerializableException with Spark Streaming Checkpoint enabled

≯℡__Kan透↙ submitted on 2019-12-08 07:24:37
Question: I have enabled checkpointing in my Spark Streaming application and encounter this error on a class that is downloaded as a dependency. Without checkpointing the application works fine. Error:

com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
- object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer@46c7c593)
- field (class: com.fasterxml.jackson
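The stack trace is cut off above and no answer is shown. As general background, a NotSerializableException with checkpointing usually means a non-serializable object (here a shaded Jackson paranamer class pulled in through a dependency) was captured in a DStream closure that checkpointing then tries to serialize; the usual remedy is to construct such objects inside the closure rather than on the driver. A PySpark sketch of that pattern, with make_json_codec() as a hypothetical stand-in for the problematic dependency and placeholder paths/ports:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def make_json_codec():
    # Hypothetical stand-in for a non-serializable helper; built per partition.
    import json
    class Codec:
        def write(self, record):
            print(json.dumps(record))
    return Codec()

sc = SparkContext(appName="CheckpointSafeClosures")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("/tmp/spark-checkpoints")          # placeholder checkpoint dir

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source

def write_partition(records):
    # Build the helper here, on the executor, instead of capturing one created
    # on the driver, which checkpointing would try to serialize.
    codec = make_json_codec()
    for record in records:
        codec.write(record)

lines.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()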

Why does Spark Streaming stop working when I send two input streams?

丶灬走出姿态 submitted on 2019-12-08 07:16:38
Question: I am developing a Spark Streaming application in which I need to use input streams from two servers in Python, each sending one JSON message per second to the Spark context. My problem is that if I perform operations on just one stream, everything works well. But if I have two streams from different servers, then Spark freezes just before it can print anything, and only starts working again once both servers have sent all the JSON messages they had to send (when it detects that '
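A common cause of this symptom (which may or may not be the asker's issue) is that each socket receiver occupies one core, so a master such as local[2] leaves no cores for processing once two receivers are running. A minimal PySpark sketch with enough local cores and the two streams unioned; hosts and ports are placeholders:

import json
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Two receivers need two dedicated cores plus at least one for processing,
# hence local[4] rather than local[2].
conf = SparkConf().setMaster("local[4]").setAppName("TwoStreams")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

stream_a = ssc.socketTextStream("server-a", 9998)
stream_b = ssc.socketTextStream("server-b", 9999)

# Union the two streams and parse the JSON messages.
messages = stream_a.union(stream_b).map(json.loads)
messages.pprint()

ssc.start()
ssc.awaitTermination()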

Spark to read a big file as an InputStream

拥有回忆 submitted on 2019-12-08 06:19:27
Question: I know Spark's built-in textFile method can partition a huge file and distribute it as an RDD. However, I am reading the file from a custom encrypted filesystem that Spark does not support natively. One way I can think of is to read an InputStream instead, load multiple lines at a time, and distribute them to the executors, reading until the whole file is loaded, so that no executor blows up with an out-of-memory error. Is it possible to do this in Spark?

Answer 1: You can try lines.take(n) for
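The answer is truncated above; as a sketch of the chunked-read idea described in the question (not taken from the original answer), here is a PySpark example that reads a file object in fixed-size batches of lines, parallelizes each batch, and unions the results. decrypt_open() is a hypothetical stand-in for opening a file on the custom encrypted filesystem, and the path is a placeholder.

from itertools import islice
from pyspark import SparkContext

sc = SparkContext(appName="ChunkedInputStream")

def decrypt_open(path):
    # Hypothetical stand-in for the custom encrypted filesystem; here it
    # just opens a local file.
    return open(path, "r")

def read_in_chunks(path, chunk_size=100000):
    # Yield lists of at most chunk_size lines so the driver never holds
    # the whole file in memory at once.
    with decrypt_open(path) as stream:
        while True:
            chunk = list(islice(stream, chunk_size))
            if not chunk:
                break
            yield chunk

rdds = [sc.parallelize(chunk) for chunk in read_in_chunks("/data/big_file.txt")]
lines = sc.union(rdds) if rdds else sc.emptyRDD()
print(lines.count())

Note that the data still flows through the driver; this only bounds how much of the file the driver holds at any one time.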

EMR with multiple encryption key providers

折月煮酒 submitted on 2019-12-08 06:17:01
Question: I'm running an EMR cluster with S3 client-side encryption enabled using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes: CSE with a custom key provider, and CSE-KMS. Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type? Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering whether it's possible to disable encryption on EMRFS

Controlling the data partition in Apache Spark

删除回忆录丶 submitted on 2019-12-08 05:06:48
Question: The data looks like:

col 1 col 2 col 3 col 4
row 1 row 1 row 1 row 1
row 2 row 2 row 2 row 2
row 3 row 3 row 3 row 3
row 4 row 4 row 4 row 4
row 5 row 5 row 5 row 5
row 6 row 6 row 6 row 6

Problem: I want to partition this data so that, say, row 1 and row 2 are processed as one partition, row 3 and row 4 as another, and row 5 and row 6 as another, and to create JSON by merging them together with the columns (column headers paired with the data values in the rows). The output should look like:

[ {col1:row1,col2:row1:col3
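The question does not name a language; here is a minimal PySpark sketch of one way to get the pairwise grouping and the header-to-value JSON merge. It assumes the rows arrive as an ordered RDD of lists of cell values and that the header names are known on the driver; the grouping key floor(index / 2) puts rows 1-2, 3-4, and 5-6 together.

import json
from pyspark import SparkContext

sc = SparkContext(appName="PairwisePartitionsToJson")

headers = ["col1", "col2", "col3", "col4"]
rows = sc.parallelize([
    ["row1", "row1", "row1", "row1"],
    ["row2", "row2", "row2", "row2"],
    ["row3", "row3", "row3", "row3"],
    ["row4", "row4", "row4", "row4"],
    ["row5", "row5", "row5", "row5"],
    ["row6", "row6", "row6", "row6"],
])

# Pair consecutive rows: indices 0-1 -> group 0, 2-3 -> group 1, 4-5 -> group 2,
# then merge each row with the column headers and emit one JSON array per group.
grouped = (rows.zipWithIndex()
               .map(lambda rec: (rec[1] // 2, dict(zip(headers, rec[0]))))
               .groupByKey()
               .mapValues(lambda docs: json.dumps(list(docs))))

for group_id, doc in sorted(grouped.collect()):
    print(group_id, doc)

If each pair really has to live in its own Spark partition rather than just its own group, the keyed rows can be passed through partitionBy with one partition per group before the merge.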

Scala Spark - Dealing with hierarchical data tables

旧时模样 submitted on 2019-12-08 05:02:58
Question: I have a data table with a hierarchical data model (tree structure). Here are some sample rows:

-------------------------------------------
Id  | name     | parentId | path       | depth
-------------------------------------------
55  | Canada   | null     | null       | 0
77  | Ontario  | 55       | /55        | 1
100 | Toronto  | 77       | /55/77     | 2
104 | Brampton | 100      | /55/77/100 | 3

I am looking to convert those rows into a flattened version; sample output would be:

-----------------------------------
Id | name | parentId |
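The expected output is cut off above, so the following is only a guess at one common flattening, sketched in PySpark rather than Scala: split the materialized path column into one ancestor-id column per depth level. The data and column names come from the question; everything else is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("FlattenHierarchy").getOrCreate()

rows = [
    (55,  "Canada",   None, None,         0),
    (77,  "Ontario",  55,   "/55",        1),
    (100, "Toronto",  77,   "/55/77",     2),
    (104, "Brampton", 100,  "/55/77/100", 3),
]
df = spark.createDataFrame(rows, ["Id", "name", "parentId", "path", "depth"])

# One ancestor-id column per level; splitting "/55/77" on "/" gives
# ["", "55", "77"], so ancestors start at array index 1.
max_depth = df.agg(F.max("depth")).first()[0]
parts = F.split(F.coalesce(F.col("path"), F.lit("")), "/")
flat = df.select(
    "Id", "name", "parentId",
    *[parts.getItem(i + 1).alias(f"level_{i}_id") for i in range(max_depth)]
)
flat.show()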

How to access statistics endpoint for a Spark Streaming application?

末鹿安然 submitted on 2019-12-08 04:44:28
As of Spark 2.2.0, there are new endpoints in the REST API for getting information about streaming jobs. I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: no streaming listener attached to <stream name>. I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working? This appears to be an issue specifically when running on the cluster. The same code running on Spark 2.2.0
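For reference, the endpoint in question is the monitoring REST API route /api/v1/applications/<app-id>/streaming/statistics added in Spark 2.2.0. A small sketch of querying it; the host, UI port, and application id are placeholders that depend on how the cluster exposes the driver UI:

import json
from urllib.request import urlopen

base = "http://driver-host:4040/api/v1"        # placeholder host and port
app_id = "application_0000000000000_0001"      # placeholder application id

with urlopen(f"{base}/applications/{app_id}/streaming/statistics") as resp:
    stats = json.load(resp)

# Print a couple of the reported streaming statistics.
print(stats.get("batchDuration"), stats.get("numReceivers"))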