apache-flink

How to extract part of a string in json format from Kafka in Flink 1.2

允我心安 submitted on 2019-12-24 19:23:56
Question: My goal is to use Kafka to read in a string in JSON format, filter the string, select part of the message, and sink the message back out (still as a JSON string). For testing purposes, my input message looks like: {"a":1,"b":2,"c":"3"} And my implementation code is: def main(args: Array[String]): Unit = { val inputProperties = new Properties() inputProperties.setProperty("bootstrap.servers", "localhost:9092") inputProperties.setProperty("group.id", "myTest2") val inputTopic =
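A minimal sketch of this kind of job, assuming the Kafka 0.10 connector available in Flink 1.2 and Jackson for JSON parsing; the topic names, the filter condition, and the selected fields are illustrative only:

```scala
import java.util.Properties

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaProducer010}
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object KafkaJsonFilterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val inputProperties = new Properties()
    inputProperties.setProperty("bootstrap.servers", "localhost:9092")
    inputProperties.setProperty("group.id", "myTest2")

    // Read raw JSON strings such as {"a":1,"b":2,"c":"3"} from the input topic.
    val source = env.addSource(
      new FlinkKafkaConsumer010[String]("input-topic", new SimpleStringSchema(), inputProperties))

    // Keep only messages whose "a" field is 1, then re-emit a reduced JSON string with "a" and "c".
    val filtered = source
      .filter { json => new ObjectMapper().readTree(json).get("a").asInt() == 1 }
      .map { json =>
        val node = new ObjectMapper().readTree(json)
        s"""{"a":${node.get("a")},"c":${node.get("c")}}"""
      }

    // Sink the reduced JSON string back out to Kafka.
    filtered.addSink(
      new FlinkKafkaProducer010[String]("output-topic", new SimpleStringSchema(), inputProperties))

    env.execute("kafka json filter")
  }
}
```

Creating an ObjectMapper per record keeps the closures trivially serializable; in a real job it would normally live in a rich function and be created once per task in open().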

sbt publish (or publishLocal) VS sbt assembly for distribution purposes and dependency conflicts resolution

北慕城南 submitted on 2019-12-24 18:29:14
Question: The bottom line is that I want to distribute a library that can be integrated using SBT or Maven and whose dependencies won't conflict with the integrating project's dependencies or transitive dependencies. Currently I am distributing my library through SBT using the publish command, which is configured to publish the artifacts to my private JFrog Artifactory. It is working as expected in the sense that it will publish the library to Artifactory and that I can easily integrate the resulting
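For context, a minimal build.sbt sketch of that publish-to-Artifactory setup, with an optional sbt-assembly shading rule for sidestepping dependency clashes; the repository URL, credentials path, and shaded package names are placeholders:

```scala
// build.sbt (fragment) — URL and package names below are placeholders.
publishTo := Some("Artifactory Realm" at "https://mycompany.jfrog.io/artifactory/sbt-release-local")
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")

// With the sbt-assembly plugin (declared in project/plugins.sbt), conflicting dependencies can be
// shaded into a private namespace so they never clash with the consumer's transitive dependencies.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "my.lib.shaded.guava.@1").inAll
)
```

Plain publish leaves dependency resolution to the consuming build via the published POM; shading is what removes the conflict at the cost of a fatter artifact.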

Apache Flink: number of TaskManagers per machine

允我心安 submitted on 2019-12-24 11:53:29
Question: The number of CPU cores per machine is four. In Flink standalone mode, how should I set the number of TaskManagers on each machine? (1) 1 TaskManager, each with 4 slots; (2) 2 TaskManagers, each with 2 slots; (3) 4 TaskManagers, each with 1 slot. This last setting is similar to apache-storm. Answer 1: Normally you'd have one TaskManager per server, and (as per the doc that bupt_ljy referenced) one slot per physical CPU core. So I'd go with your option #1. Answer 2: There's also the
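As a rough sketch of option #1 on a 4-core machine, the relevant standalone-mode setting would look like this (value chosen per the "one slot per physical core" rule of thumb above):

```yaml
# conf/flink-conf.yaml (fragment) — standalone mode starts one TaskManager process per host
# listed in conf/slaves; the slot count below controls parallelism within that one process.
taskmanager.numberOfTaskSlots: 4
```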

Flink + Kafka, java.lang.OutOfMemoryError when parallelism > 1

廉价感情. submitted on 2019-12-24 11:22:09
Question: I have a toy Flink job which reads from three Kafka topics and then unions all three streams. That's all, no extra work. With parallelism 1 the job runs fine, but as soon as I set parallelism > 1, it fails with: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:693) at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at sun.nio.ch.Util.getTemporaryDirectBuffer(Util
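A minimal sketch of the toy job described; the topic names, the 0.10 consumer class, and the print sink are assumptions:

```scala
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object UnionOfThreeTopics {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2) // the OutOfMemoryError only shows up once this is > 1

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "union-test")

    def topic(name: String): DataStream[String] =
      env.addSource(new FlinkKafkaConsumer010[String](name, new SimpleStringSchema(), props))

    // Union the three streams and do nothing else with them.
    topic("topic-a").union(topic("topic-b"), topic("topic-c")).print()

    env.execute("union of three kafka topics")
  }
}
```

Each parallel Kafka source instance allocates its own network (direct) buffers, so the direct-memory demand grows with parallelism; checking the TaskManager JVM's -XX:MaxDirectMemorySize is a reasonable first step.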

Sorting union of streams to identify user sessions in Apache Flink

人盡茶涼 submitted on 2019-12-24 09:35:10
Question: I have two streams of events. L = (l1, l3, l8, ...) is sparser and represents user logins to an IP. E = (e2, e4, e5, e9, ...) is a stream of logs for that particular IP. The lower index represents a timestamp... If we joined the two streams together and sorted them by time we would get: l1, e2, l3, e4, e5, l8, e9, ... Would it be possible to implement custom Window / Trigger functions to group the events into sessions (the time between logins of different users): l1 - l3 : e2 l3 - l8 : e4, e5 l8 -
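A minimal sketch of the custom-trigger idea, assuming the two streams are unioned into a single Event type (the ip and isLogin fields are hypothetical): a GlobalWindows assigner keyed by IP with a trigger that fires and purges each time a login arrives, so every emitted window holds the log events seen since the previous login.

```scala
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

case class Event(ip: String, timestamp: Long, isLogin: Boolean)

// Fires and purges the window each time a login element arrives for the key (the IP),
// so the emitted window contains everything buffered since the previous login.
class LoginTrigger extends Trigger[Event, GlobalWindow] {
  override def onElement(e: Event, ts: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    if (e.isLogin) TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE
  override def onProcessingTime(t: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE
  override def onEventTime(t: Long, w: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE
  override def clear(w: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {}
}

// Usage sketch (logins and logs are DataStream[Event]):
//   logins.union(logs)
//     .keyBy(_.ip)
//     .window(GlobalWindows.create())
//     .trigger(new LoginTrigger)
//     .apply(...)
```

This only yields correct sessions if the unioned stream is processed in timestamp order per key, which is exactly the sorting concern raised in the question.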

How to read and write to HBase in a Flink streaming job

强颜欢笑 submitted on 2019-12-24 09:04:36
Question: If we have to read from and write to HBase in a streaming application, how can we do that? We open a connection via the open method for writes; how can we open a connection for reads? object test { if (args.length != 11) { //print args System.exit(1) } val Array() = args println("Parameters Passed " + ...); val env = StreamExecutionEnvironment.getExecutionEnvironment val properties = new Properties() properties.setProperty("bootstrap.servers", metadataBrokerList) properties.setProperty("zookeeper
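A minimal sketch of the read side, assuming the standard HBase client API and a rich function whose open() creates the connection once per parallel instance; the table, column family, and qualifier names are placeholders:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Table}
import org.apache.hadoop.hbase.util.Bytes

// Looks up each incoming key in HBase and emits (key, value).
class HBaseLookup extends RichMapFunction[String, (String, String)] {
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(parameters: Configuration): Unit = {
    // hbase-site.xml on the classpath supplies quorum/port; they could also be set here explicitly.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf("my_table"))
  }

  override def map(key: String): (String, String) = {
    val result = table.get(new Get(Bytes.toBytes(key)))
    val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    (key, value)
  }

  override def close(): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}
```

The same open()/close() lifecycle used for the write sink applies here; only the HBase calls inside map() differ.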

Enrich a fast stream keyed by (X,Y) with a slowly changing stream keyed by (X) in Flink

故事扮演 submitted on 2019-12-24 05:15:09
Question: I need to enrich my fast-changing streamA, keyed by (userId, startTripTimestamp), with a slowly changing streamB keyed by (userId). I use Flink 1.8 with the DataStream API. I am considering 2 approaches: Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be the equivalent of a DynamicTable from the Table API? I can see some downsides of this solution: streamB needs to fit into the RAM of each worker node, which increases RAM utilization because the whole of streamB needs to be stored in the RAM of each
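A minimal sketch of the broadcast approach with Flink 1.8's broadcast state (the Trip, UserInfo, and EnrichedTrip types and their fields are assumptions): streamB is broadcast to every parallel instance as a userId → UserInfo map, and each streamA element looks its user up on arrival.

```scala
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Trip(userId: String, startTripTimestamp: Long, payload: String)
case class UserInfo(userId: String, segment: String)
case class EnrichedTrip(trip: Trip, user: Option[UserInfo])

object BroadcastEnrichmentSketch {
  // Descriptor for the broadcast userId -> UserInfo map held on every parallel instance.
  val usersDescriptor = new MapStateDescriptor[String, UserInfo](
    "users", TypeInformation.of(classOf[String]), TypeInformation.of(classOf[UserInfo]))

  def enrich(tripsA: DataStream[Trip], usersB: DataStream[UserInfo]): DataStream[EnrichedTrip] = {
    val broadcastUsers = usersB.broadcast(usersDescriptor)

    tripsA
      .keyBy(t => (t.userId, t.startTripTimestamp))
      .connect(broadcastUsers)
      .process(new KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip] {

        // Fast path: enrich every trip with whatever user info has been broadcast so far.
        override def processElement(
            trip: Trip,
            ctx: KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip]#ReadOnlyContext,
            out: Collector[EnrichedTrip]): Unit = {
          val user = Option(ctx.getBroadcastState(usersDescriptor).get(trip.userId))
          out.collect(EnrichedTrip(trip, user))
        }

        // Slow path: every broadcast element updates the shared map.
        override def processBroadcastElement(
            user: UserInfo,
            ctx: KeyedBroadcastProcessFunction[(String, Long), Trip, UserInfo, EnrichedTrip]#Context,
            out: Collector[EnrichedTrip]): Unit =
          ctx.getBroadcastState(usersDescriptor).put(user.userId, user)
      })
  }
}
```

As the question notes, the trade-off is that the whole of streamB's state lives in the RAM of every parallel instance.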

Flink REST API error: Request did not match expected format JarRunRequestBody

三世轮回 submitted on 2019-12-24 04:10:11
Question: Trying to run a Flink job remotely using the REST API call below, but it throws an error: curl -X POST -H 'Content-Type: application/json' --data ' { "type": "object", "id": "urn:jsonschema:org:apache:flink:runtime:webmonitor:handlers:JarRunRequestBody", "properties": { "programArgsList" : { "type" : "array", "items" : [ "input-kafka-server": "****", "input-kafka-topics": "****", "input-job-name": "****" } } } ' http://x.x.x.x:8081/jars/810ac968-5d5f-450d-aafc-22655238d617.jar/run {"errors":[
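For comparison, a sketch of a request body in the shape JarRunRequestBody expects: programArgsList is a plain JSON array of argument strings, not a schema description. The argument values below are placeholders, and the exact flag style depends on how the job's main method parses its arguments.

```sh
curl -X POST -H 'Content-Type: application/json' \
  --data '{
    "programArgsList": [
      "--input-kafka-server", "<broker:port>",
      "--input-kafka-topics", "<topic>",
      "--input-job-name", "<name>"
    ]
  }' \
  http://x.x.x.x:8081/jars/810ac968-5d5f-450d-aafc-22655238d617.jar/run
```

Other JarRunRequestBody fields such as entryClass or parallelism can be passed alongside programArgsList in the same body if needed.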

Is reuse of a stream a copy of the stream or not?

江枫思渺然 submitted on 2019-12-24 02:52:42
Question: For example, there is a keyed stream: val keyedStream: KeyedStream[event, Key] = env .addSource(...) .keyBy(...) // several transformations on the same stream keyedStream.map(....) keyedStream.window(....) keyedStream.split(....) keyedStream...(....) I think this is reuse of the same stream in Flink. What I found is that when I reused it, the content of the stream was not affected by the other transformations, so I think it is a copy of the same stream. But I don't know whether that is right or not. If yes,

Adding patterns dynamically in Apache Flink without restarting job

人走茶凉 submitted on 2019-12-24 01:17:22
Question: My use case is that I want to apply different CEP patterns to the same datastream. The CEP patterns arrive dynamically, and I want them to be added to Flink without having to restart the job. While all conditions can be handled via custom classes that implement IterativeCondition, my main problem is that the temporal condition accepts only a TimeWindow, which cannot be handled this way. Is there some way the value passed to .within() can be set based on the input elements? Something similar was asked here:
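For reference, a minimal sketch of the kind of pattern in question (the event type, condition, and durations are illustrative): conditions can indeed be supplied through custom IterativeCondition classes instantiated at runtime, while the duration handed to .within() is fixed when the pattern is built and cannot be derived from the incoming elements.

```scala
import org.apache.flink.cep.pattern.Pattern
import org.apache.flink.cep.pattern.conditions.IterativeCondition
import org.apache.flink.streaming.api.windowing.time.Time

case class Evt(kind: String, value: Double)

// A condition built as a custom class, so it can be constructed dynamically from external input.
class ThresholdCondition(threshold: Double) extends IterativeCondition[Evt] {
  override def filter(e: Evt, ctx: IterativeCondition.Context[Evt]): Boolean = e.value > threshold
}

object DynamicConditionSketch {
  val pattern: Pattern[Evt, Evt] = Pattern
    .begin[Evt]("start").where(new ThresholdCondition(10.0))
    .next("end").where(new ThresholdCondition(20.0))
    .within(Time.minutes(5)) // fixed at pattern-definition time; cannot depend on input elements
}
```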