apache-beam

Dataflow: Look up a previous event in an event stream

Submitted by 筅森魡賤 on 2019-12-11 09:03:00
Question: To summarize, what I'm looking to do with Apache Beam on Google Dataflow is something like LAG in Azure Stream Analytics. Using a window of X minutes where I'm receiving data:

||||||  ||||||  ||||||  ||||||  ||||||  ||||||
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |  | 6  |
|id=x|  |id=x|  |id=x|  |id=x|  |id=x|  |id=x|
||||||, ||||||, ||||||, ||||||, ||||||, ||||||, ...

I need to compare data(n) with data(n-1); for example, following the previous example, it would be something like this: if data(6) inside …
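
A minimal Python-SDK sketch of one way to approach this (the question itself does not name an SDK): buffer each key's elements per fixed window, sort them by event time, and pair every element with its predecessor. The key name, window size, and toy data below are assumptions for illustration; a stateful DoFn with state and timers would be the closer streaming analogue of LAG.

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

class PairWithPrevious(beam.DoFn):
    # receives (key, iterable of (event_time, value)) for one window
    def process(self, keyed):
        key, events = keyed
        ordered = sorted(events, key=lambda e: e[0])      # order by event time
        for prev, cur in zip(ordered, ordered[1:]):
            yield key, {"previous": prev[1], "current": cur[1]}

with beam.Pipeline() as p:
    (p
     | "ToyEvents" >> beam.Create([("x", (60 * n, 10.0 * n)) for n in range(1, 7)])
     | "EventTime" >> beam.Map(lambda kv: TimestampedValue(kv, kv[1][0]))
     | "Window"    >> beam.WindowInto(FixedWindows(5 * 60))
     | "PerKey"    >> beam.GroupByKey()
     | "Lag"       >> beam.ParDo(PairWithPrevious())
     | "Print"     >> beam.Map(print))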

Apache Beam - Reading JSON and Stream

Submitted by 可紊 on 2019-12-11 08:54:01
Question: I am writing Apache Beam code in which I have to read a JSON file placed in the project folder, read the data, and stream it. This is the sample code to read the JSON. Is this the correct way of doing it?

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile",
    TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines …
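
The snippet in the question uses the Java SDK; purely as a rough Python-SDK sketch of the same idea, and assuming the file contains one JSON object per line, each line can be read with ReadFromText and parsed in a Map. Note that a PCollection cannot be printed directly; inspecting elements is usually done inside a transform.

import json
import apache_beam as beam

with beam.Pipeline() as p:   # DirectRunner by default; set options for Spark or Dataflow
    (p
     | "ReadMyFile" >> beam.io.ReadFromText(
           "/Users/xyz/eclipse-workspace/beam-prototype/test.json")
     | "ParseJson" >> beam.Map(json.loads)   # assumes newline-delimited JSON
     | "Log" >> beam.Map(print))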

Apache Beam: 'Unable to find registrar for hdfs'

Submitted by 随声附和 on 2019-12-11 08:26:28
Question: I want to run a pipeline with the Spark runner, and the data is stored on a remote machine. The following command has been used to submit the job:

./spark-submit --class org.apache.beam.examples.WordCount --master spark://192.168.1.214:6066 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 4 hdfs://192.168.1.214:9000/input/word-count-ck-0.1.jar --runner=SparkRunner

It produces the following response: Running Spark using the REST application submission protocol. Using …

Google Cloud Dataflow Write to CSV from dictionary

Submitted by 只谈情不闲聊 on 2019-12-11 08:06:16
Question: I have a dictionary of values that I would like to write to GCS as a valid .CSV file using the Python SDK. I can write the dictionary out as a newline-separated text file, but I can't seem to find an example of converting the dictionary to a valid .CSV. Can anybody suggest the best way to generate CSVs within a Dataflow pipeline? The answers to this question address reading from CSV files, but don't really address writing to CSV files. I recognize that CSV files are just text files with rules, …
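
A minimal sketch of one way to do this in the Python SDK: format each dictionary into a CSV row with the csv module and write the rows with WriteToText, passing the column header explicitly. The field names, sample record, and output path below are made up for illustration.

import csv
import io
import apache_beam as beam

FIELDS = ["name", "age", "city"]   # hypothetical column order

def dict_to_csv_row(record):
    buf = io.StringIO()
    csv.writer(buf).writerow([record.get(f, "") for f in FIELDS])
    return buf.getvalue().rstrip("\r\n")   # WriteToText appends its own newline

with beam.Pipeline() as p:
    (p
     | "Sample" >> beam.Create([{"name": "Ada", "age": 36, "city": "London"}])
     | "ToCsv"  >> beam.Map(dict_to_csv_row)
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/report",   # hypothetical path
                                       file_name_suffix=".csv",
                                       header=",".join(FIELDS)))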

Apache Beam: Error assigning event time using Withtimestamp

Submitted by 断了今生、忘了曾经 on 2019-12-11 07:50:47
Question: I have an unbounded Kafka stream sending data with the following fields:

{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}

I read the stream using the Apache Beam SDK for Kafka:

import org.apache.beam.sdk.io.kafka.KafkaIO;

pipeline.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers("kafka:9092")
    .withTopic("test")
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.of("enable.auto …
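
The question uses the Java SDK's KafkaIO; purely as a conceptual Python-SDK sketch, the event time can be parsed out of the record's ts field and attached by returning a TimestampedValue. The sample message is taken from the question, and reading from Kafka is replaced here by Create. With an unbounded source, pushing timestamps backwards beyond the allowed skew is a common cause of errors when assigning event time this way.

import json
from datetime import datetime
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

SAMPLE = '{"identifier": "xxx", "value": 10.0, "ts": "2019-01-16T10:51:26.326242+0000"}'

def with_event_time(raw):
    record = json.loads(raw)
    # parse the ISO-8601 ts field (format taken from the question) into epoch seconds
    event_time = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S.%f%z").timestamp()
    return TimestampedValue(record, event_time)

with beam.Pipeline() as p:
    (p
     | "Sample"    >> beam.Create([SAMPLE])     # stand-in for the Kafka values
     | "EventTime" >> beam.Map(with_event_time)
     | "Print"     >> beam.Map(print))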

Beam: ReadAllFromText receive string or list from DoFn different behavior?

Submitted by 瘦欲@ on 2019-12-11 07:47:18
Question: I have a pipeline that reads files from GCS via Pub/Sub:

class ExtractFileNameFn(beam.DoFn):
    def process(self, element):
        file_name = 'gs://' + "/".join(element['id'].split("/")[:-1])
        logging.info("Load file: " + file_name)
        yield file_name

class LogFn(beam.DoFn):
    def process(self, element):
        logging.info(element)
        return [element]

class LogPassThroughFn(beam.DoFn):
    def process(self, element):
        logging.info(element)
        return element

...

p | "Read Sub Message" >> beam.io.ReadFromPubSub(topic=args …
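
A small sketch of the difference the title asks about, assuming a hypothetical file name: when DoFn.process returns a bare string, Beam treats it as an iterable of outputs and emits one element per character, while returning a one-element list (or using yield) emits the whole string, which is what a downstream ReadAllFromText expects.

import apache_beam as beam

class ReturnString(beam.DoFn):
    def process(self, element):
        # a string is itself an iterable, so this emits one character at a time
        return element

class ReturnList(beam.DoFn):
    def process(self, element):
        # a list (or `yield element`) emits the whole string as one element
        return [element]

with beam.Pipeline() as p:
    names = p | beam.Create(["gs://my-bucket/some/file.csv"])   # hypothetical path
    names | "Chars" >> beam.ParDo(ReturnString()) | "LogChars" >> beam.Map(print)
    names | "Whole" >> beam.ParDo(ReturnList())   | "LogWhole" >> beam.Map(print)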

Delete data from BigQuery while streaming from Dataflow

Submitted by China☆狼群 on 2019-12-11 07:44:14
Question: Is it possible to delete data from a BigQuery table while loading data into it from an Apache Beam pipeline? Our use case is such that we need to delete data from 3 days prior from the table, based on a timestamp field (the time when Dataflow pulls the message from the Pub/Sub topic). Is it recommended to do something like this? If yes, is there any way to achieve it? Thank you.

Answer 1: I think the best way of doing this is to set up your table as a partitioned table (based on ingestion time): https://cloud.google.com …
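
As a hedged illustration of the answer's suggestion, the google-cloud-bigquery client can create an ingestion-time-partitioned table with a partition expiration, so partitions older than three days are dropped automatically instead of being deleted from inside the pipeline. The project, dataset, table, and schema names below are made up.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")                 # hypothetical project
table = bigquery.Table("my-project.my_dataset.events", schema=[
    bigquery.SchemaField("identifier", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
])
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=3 * 24 * 60 * 60 * 1000,   # partitions expire after 3 days
)
client.create_table(table)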

large numpy matrix as dataflow side input

Submitted by 佐手、 on 2019-12-11 07:37:51
Question: I'm trying to write a Dataflow pipeline in Python that requires a large numpy matrix as a side input. The matrix is saved in cloud storage. Ideally, each Dataflow worker would load the matrix directly from cloud storage. My understanding is that if I say matrix = np.load(LOCAL_PATH_TO_MATRIX), and then

p | "computation" >> beam.Map(computation, matrix)

the matrix gets shipped from my laptop to each Dataflow worker. How could I instead direct each worker to load the matrix directly from cloud …
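
A minimal sketch of one way to avoid shipping the matrix from the laptop: pass only the GCS path into the pipeline, load the file on the workers with Beam's FileSystems API, and feed the loaded matrix to the computation as a singleton side input. The bucket path and the toy computation are assumptions.

import io
import apache_beam as beam
import numpy as np
from apache_beam.io.filesystems import FileSystems

MATRIX_PATH = "gs://my-bucket/matrix.npy"   # hypothetical location

def load_matrix(path):
    # runs on a worker, so the matrix is fetched from GCS there
    with FileSystems.open(path) as f:
        return np.load(io.BytesIO(f.read()))

with beam.Pipeline() as p:
    matrix = (p
              | "MatrixPath" >> beam.Create([MATRIX_PATH])
              | "LoadMatrix" >> beam.Map(load_matrix))
    (p
     | "Data"    >> beam.Create([1, 2, 3])
     | "Compute" >> beam.Map(lambda x, m: x * float(m.sum()),
                             m=beam.pvalue.AsSingleton(matrix))
     | "Print"   >> beam.Map(print))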

Google DataFlow/Python: Import errors with save_main_session and custom modules in __main__

Submitted by 依然范特西╮ on 2019-12-11 07:03:44
Question: Could somebody please clarify the expected behavior when using save_main_session and custom modules imported in __main__? My Dataflow pipeline imports two non-standard modules - one via requirements.txt and the other via setup_file. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors. A sample error is below. From the documentation, I assumed that setting save_main_session would solve this problem, but it does not (see the error below). …
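
A minimal sketch of the two patterns at play, with the module name invented for illustration: set save_main_session so that __main__ globals are pickled, and/or import the non-standard module inside the function that uses it so the import happens on the worker after the dependency from requirements.txt / setup_file has been installed.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def use_custom_module(element):
    # importing here runs on the worker, after requirements.txt / setup_file
    # has installed the dependency there
    import mycustommodule   # hypothetical non-standard module
    return mycustommodule.transform(element)

def run(argv=None):
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True   # pickle __main__ globals
    with beam.Pipeline(options=options) as p:
        (p
         | "Sample"  >> beam.Create(["a", "b"])
         | "Process" >> beam.Map(use_custom_module))

if __name__ == "__main__":
    run()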

Google Cloud Dataflow: Specifying TempLocation via Command Line Argument

Submitted by 耗尽温柔 on 2019-12-11 06:06:15
Question: I am attempting to specify my GCS temp location by passing it as an option on the command line, as shown below.

java -jar pipeline-0.0.1-SNAPSHOT.jar --runner=DataflowRunner --project=<my_project> --tempLocation=gs://<my_bucket>/<my_folder>

However, I continue to receive a syntax error:

java.nio.file.InvalidPathException: Illegal char <:> at index 2: gs://<my_bucket>/<my_folder>

I'm referring to the following documentation: https://cloud.google.com/dataflow/pipelines/specifying-exec-params I …
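
The command above runs a Java pipeline, so this is only the Python-SDK counterpart of the same idea, sketched for comparison: the temp location is a GoogleCloudOptions field that is picked up from the --temp_location command-line flag rather than being parsed as a local filesystem path. The project and bucket names in the comment are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

# e.g. python pipeline.py --runner=DataflowRunner --project=my-project \
#          --temp_location=gs://my-bucket/tmp
options = PipelineOptions()
print("temp_location:", options.view_as(GoogleCloudOptions).temp_location)

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello"]) | beam.Map(print)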