apache-flink

How to extract part of a string in json format from Kafka in Flink 1.2

允我心安 submitted on 2019-12-24 19:23:56
Question: My goal is to use Kafka to read in a string in JSON format, filter the string, select part of the message, and sink the message back out (still as a JSON string). For testing purposes, my input message looks like: {"a":1,"b":2,"c":"3"} And my implementation code is: def main(args: Array[String]): Unit = { val inputProperties = new Properties() inputProperties.setProperty("bootstrap.servers", "localhost:9092") inputProperties.setProperty("group.id", "myTest2") val inputTopic =
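A minimal sketch of this kind of job, assuming the Kafka 0.10 connector available in Flink 1.2 and Jackson for JSON parsing; the topic names, the filter condition, and the selected fields are illustrative only:

```scala
import java.util.Properties

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaProducer010}
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object KafkaJsonFilterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val inputProperties = new Properties()
    inputProperties.setProperty("bootstrap.servers", "localhost:9092")
    inputProperties.setProperty("group.id", "myTest2")

    // Read raw JSON strings such as {"a":1,"b":2,"c":"3"} from the input topic.
    val source = env.addSource(
      new FlinkKafkaConsumer010[String]("input-topic", new SimpleStringSchema(), inputProperties))

    // Keep only messages whose "a" field is 1, then re-emit a reduced JSON string with "a" and "c".
    val filtered = source
      .filter { json => new ObjectMapper().readTree(json).get("a").asInt() == 1 }
      .map { json =>
        val node = new ObjectMapper().readTree(json)
        s"""{"a":${node.get("a")},"c":${node.get("c")}}"""
      }

    // Sink the reduced JSON string back out to Kafka.
    filtered.addSink(
      new FlinkKafkaProducer010[String]("output-topic", new SimpleStringSchema(), inputProperties))

    env.execute("kafka json filter")
  }
}
```

Creating an ObjectMapper per record keeps the closures trivially serializable; in a real job it would normally live in a rich function and be created once per task in open().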

sbt publish (or publishLocal) VS sbt assembly for distribution purposes and dependency conflicts resolution

北慕城南 submitted on 2019-12-24 18:29:14
Question: The bottom line is that I want to distribute a library that can be integrated using SBT or Maven and whose dependencies won't conflict with the integrating project's dependencies or transitive dependencies. Currently I am distributing my library through SBT using the publish command, which is configured to publish the artifacts to my private JFrog Artifactory. It is working as expected in the sense that it will publish the library to Artifactory and that I can easily integrate the resulting
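For context, a minimal build.sbt sketch of that publish-to-Artifactory setup, with an optional sbt-assembly shading rule for sidestepping dependency clashes; the repository URL, credentials path, and shaded package names are placeholders:

```scala
// build.sbt (fragment) — URL and package names below are placeholders.
publishTo := Some("Artifactory Realm" at "https://mycompany.jfrog.io/artifactory/sbt-release-local")
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")

// With the sbt-assembly plugin (declared in project/plugins.sbt), conflicting dependencies can be
// shaded into a private namespace so they never clash with the consumer's transitive dependencies.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "my.lib.shaded.guava.@1").inAll
)
```

Plain publish leaves dependency resolution to the consuming build via the published POM; shading is what removes the conflict at the cost of a fatter artifact.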

Apache Flink: number of TaskManagers per machine

允我心安 submitted on 2019-12-24 11:53:29
Question: The number of CPU cores per machine is four. In Flink standalone mode, how should I set the number of TaskManagers on each machine? (1) 1 TaskManager, each with 4 slots; (2) 2 TaskManagers, each with 2 slots; (3) 4 TaskManagers, each with 1 slot. This last setting is similar to apache-storm. Answer 1: Normally you'd have one TaskManager per server, and (as per the doc that bupt_ljy referenced) one slot per physical CPU core. So I'd go with your option #1. Answer 2: There's also the
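As a rough sketch of option #1 on a 4-core machine, the relevant standalone-mode setting would look like this (value chosen per the "one slot per physical core" rule of thumb above):

```yaml
# conf/flink-conf.yaml (fragment) — standalone mode starts one TaskManager process per host
# listed in conf/slaves; the slot count below controls parallelism within that one process.
taskmanager.numberOfTaskSlots: 4
```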

Flink + Kafka, java.lang.OutOfMemoryError when parallelism > 1

廉价感情. submitted on 2019-12-24 11:22:09
Question: I have a toy Flink job which reads from three Kafka topics and then unions all three streams. That's all, no extra work. With parallelism 1 the job runs fine, but as soon as I set parallelism > 1, it fails with: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:693) at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at sun.nio.ch.Util.getTemporaryDirectBuffer(Util
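A minimal sketch of the toy job described; the topic names, the 0.10 consumer class, and the print sink are assumptions:

```scala
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object UnionOfThreeTopics {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2) // the OutOfMemoryError only shows up once this is > 1

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "union-test")

    def topic(name: String): DataStream[String] =
      env.addSource(new FlinkKafkaConsumer010[String](name, new SimpleStringSchema(), props))

    // Union the three streams and do nothing else with them.
    topic("topic-a").union(topic("topic-b"), topic("topic-c")).print()

    env.execute("union of three kafka topics")
  }
}
```

Each parallel Kafka source instance allocates its own network (direct) buffers, so the direct-memory demand grows with parallelism; checking the TaskManager JVM's -XX:MaxDirectMemorySize is a reasonable first step.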

Sorting union of streams to identify user sessions in Apache Flink

人盡茶涼 submitted on 2019-12-24 09:35:10
Question: I have two streams of events. L = (l1, l3, l8, ...) is sparser and represents user logins to an IP. E = (e2, e4, e5, e9, ...) is a stream of logs for that particular IP. The lower index represents a timestamp... If we joined the two streams together and sorted them by time we would get: l1, e2, l3, e4, e5, l8, e9, ... Would it be possible to implement custom Window / Trigger functions to group the events into sessions (the time between logins of different users): l1 - l3 : e2 l3 - l8 : e4, e5 l8 -
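A minimal sketch of the custom-trigger idea, assuming the two streams are unioned into a single Event type (the ip and isLogin fields are hypothetical): a GlobalWindows assigner keyed by IP with a trigger that fires and purges each time a login arrives, so every emitted window holds the log events seen since the previous login.

```scala
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

case class Event(ip: String, timestamp: Long, isLogin: Boolean)

// Fires and purges the window each time a login element arrives for the key (the IP),
// so the emitted window contains everything buffered since the previous login.
class LoginTrigger extends Trigger[Event, GlobalWindow] {
  override def onElement(e: Event, ts: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    if (e.isLogin) TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE
  override def onProcessingTime(t: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE
  override def onEventTime(t: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE
  override def clear(w: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {}
}

// Usage sketch (logins and logs are DataStream[Event]):
//   logins.union(logs)
//     .keyBy(_.ip)
//     .window(GlobalWindows.create())
//     .trigger(new LoginTrigger)
//     .apply(...)
```

This only yields correct sessions if the unioned stream is processed in timestamp order per key, which is exactly the sorting concern raised in the question.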

How to read and write to HBase in a Flink streaming job

强颜欢笑 submitted on 2019-12-24 09:04:36
Question: If we have to read from and write to HBase in a streaming application, how can we do that? We open a connection via the open method for writes; how can we open a connection for reads? object test { if (args.length != 11) { //print args System.exit(1) } val Array() = args println("Parameters Passed " + ...); val env = StreamExecutionEnvironment.getExecutionEnvironment val properties = new Properties() properties.setProperty("bootstrap.servers", metadataBrokerList) properties.setProperty("zookeeper
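A minimal sketch of the read side, assuming the standard HBase client API and a rich function whose open() creates the connection once per parallel instance; the table, column family, and qualifier names are placeholders:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Table}
import org.apache.hadoop.hbase.util.Bytes

// Looks up each incoming key in HBase and emits (key, value).
class HBaseLookup extends RichMapFunction[String, (String, String)] {
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(parameters: Configuration): Unit = {
    // hbase-site.xml on the classpath supplies quorum/port; they could also be set here explicitly.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf("my_table"))
  }

  override def map(key: String): (String, String) = {
    val result = table.get(new Get(Bytes.toBytes(key)))
    val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    (key, value)
  }

  override def close(): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}
```

The same open()/close() lifecycle used for the write sink applies here; only the HBase calls inside map() differ.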

Enrich a fast stream keyed by (X,Y) with a slowly changing stream keyed by (X) in Flink

故事扮演 submitted on 2019-12-24 05:15:09
Question: I need to enrich my fast-changing streamA, keyed by (userId, startTripTimestamp), with a slowly changing streamB keyed by (userId). I use Flink 1.8 with the DataStream API. I am considering 2 approaches: Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be the equivalent of a DynamicTable from the Table API? I can see some downsides of this solution: streamB needs to fit into the RAM of each worker node, which increases RAM utilization because the whole of streamB needs to be stored in the RAM of each
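A minimal sketch of the broadcast approach with Flink 1.8's broadcast state (the Trip, UserInfo, and EnrichedTrip types and their fields are assumptions): streamB is broadcast to every parallel instance as a userId → UserInfo map, and each streamA element looks its user up on arrival.

```scala
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Trip(userId: String, startTripTimestamp: Long, payload: String)
case class UserInfo(userId: String, segment: String)
case class EnrichedTrip(trip: Trip, user: Option[UserInfo])

object BroadcastEnrichmentSketch {
  // Descriptor for the broadcast userId -> UserInfo map held on every parallel instance.
  val usersDescriptor = new MapStateDescriptor[String, UserInfo](
    "users", TypeInformation.of(classOf[String]), TypeInformation.of(classOf[UserInfo]))

  def enrich(tripsA: DataStream[Trip], usersB: DataStream[UserInfo]): DataStream[EnrichedTrip] = {
    val broadcastUsers = usersB.broadcast(usersDescriptor)

    tripsA
      .keyBy(t => (t.userId, t.startTripTimestamp))
      .connect(broadcastUsers)
      .process(new KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip] {

        // Fast path: enrich every trip with whatever user info has been broadcast so far.
        override def processElement(
            trip: Trip,
            ctx: KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip]#ReadOnlyContext,
            out: Collector[EnrichedTrip]): Unit = {
          val user = Option(ctx.getBroadcastState(usersDescriptor).get(trip.userId))
          out.collect(EnrichedTrip(trip, user))
        }

        // Slow path: every broadcast element updates the shared map.
        override def processBroadcastElement(
            user: UserInfo,
            ctx: KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip]#Context,
            out: Collector[EnrichedTrip]): Unit =
          ctx.getBroadcastState(usersDescriptor).put(user.userId, user)
      })
  }
}
```

As the question notes, the trade-off is that the whole of streamB's state lives in the RAM of every parallel instance.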

Flink REST API error: Request did not match expected format JarRunRequestBody

三世轮回 submitted on 2019-12-24 04:10:11
Question: Trying to run a Flink job remotely using the REST API call below, but it throws an error: curl -X POST -H 'Content-Type: application/json' --data ' { "type": "object", "id": "urn:jsonschema:org:apache:flink:runtime:webmonitor:handlers:JarRunRequestBody", "properties": { "programArgsList" : { "type" : "array", "items" : [ "input-kafka-server": "****", "input-kafka-topics": "****", "input-job-name": "****" } } } ' http://x.x.x.x:8081/jars/810ac968-5d5f-450d-aafc-22655238d617.jar/run {"errors":[
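For comparison, a sketch of a request body in the shape JarRunRequestBody expects: programArgsList is a plain JSON array of argument strings, not a schema description. The argument values below are placeholders, and the exact flag style depends on how the job's main method parses its arguments.

```sh
curl -X POST -H 'Content-Type: application/json' \
  --data '{
    "programArgsList": [
      "--input-kafka-server", "<broker:port>",
      "--input-kafka-topics", "<topic>",
      "--input-job-name", "<name>"
    ]
  }' \
  http://x.x.x.x:8081/jars/810ac968-5d5f-450d-aafc-22655238d617.jar/run
```

Other JarRunRequestBody fields such as entryClass or parallelism can be passed alongside programArgsList in the same body if needed.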

Is reuse of a stream a copy of the stream or not?

江枫思渺然 submitted on 2019-12-24 02:52:42
Question: For example, there is a keyed stream: val keyedStream: KeyedStream[event, Key] = env .addSource(...) .keyBy(...) // several transformations on the same stream keyedStream.map(....) keyedStream.window(....) keyedStream.split(....) keyedStream...(....) I think this is reuse of the same stream in Flink. What I found is that when I reused it, the content of the stream was not affected by the other transformations, so I think it is a copy of the same stream. But I don't know whether that is right or not. If yes,

Adding patterns dynamically in Apache Flink without restarting job

人走茶凉 submitted on 2019-12-24 01:17:22
Question: My use case is that I want to apply different CEP patterns to the same datastream. The CEP patterns arrive dynamically, and I want them to be added to Flink without having to restart the job. While all conditions can be handled via custom classes that implement IterativeCondition, my main problem is that the temporal condition accepts only a TimeWindow, which cannot be handled this way. Is there some way the value passed to .within() can be set based on the input elements? Something similar was asked here:
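For reference, a minimal sketch of the kind of pattern in question (the event type, condition, and durations are illustrative): conditions can indeed be supplied through custom IterativeCondition classes instantiated at runtime, while the duration handed to .within() is fixed when the pattern is built and cannot be derived from the incoming elements.

```scala
import org.apache.flink.cep.pattern.Pattern
import org.apache.flink.cep.pattern.conditions.IterativeCondition
import org.apache.flink.streaming.api.windowing.time.Time

case class Evt(kind: String, value: Double)

// A condition built as a custom class, so it can be constructed dynamically from external input.
class ThresholdCondition(threshold: Double) extends IterativeCondition[Evt] {
  override def filter(e: Evt, ctx: IterativeCondition.Context[Evt]): Boolean = e.value > threshold
}

object DynamicConditionSketch {
  val pattern: Pattern[Evt, Evt] = Pattern
    .begin[Evt]("start").where(new ThresholdCondition(10.0))
    .next("end").where(new ThresholdCondition(20.0))
    .within(Time.minutes(5)) // fixed at pattern-definition time; cannot depend on input elements
}
```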