apache-beam

Dataflow: Look up a previous event in an event stream

Submitted by 筅森魡賤 on 2019-12-11 09:03:00
Question: To summarize, what I'm looking to do with Apache Beam on Google Dataflow is something like LAG in Azure Stream Analytics. Using a window of X minutes where I'm receiving data:

||||||  ||||||  ||||||  ||||||  ||||||  ||||||
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |  | 6  |
|id=x|  |id=x|  |id=x|  |id=x|  |id=x|  |id=x|
||||||, ||||||, ||||||, ||||||, ||||||, ||||||, ...

I need to compare data(n) with data(n-1); for example, following the previous example, it would be something like this: if data(6) inside …
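
A minimal Python-SDK sketch of one way to approach this (the question itself does not name an SDK): buffer each key's elements per fixed window, sort them by event time, and pair every element with its predecessor. The key name, window size, and toy data below are assumptions for illustration; a stateful DoFn with state and timers would be the closer streaming analogue of LAG.

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

class PairWithPrevious(beam.DoFn):
    # receives (key, iterable of (event_time, value)) for one window
    def process(self, keyed):
        key, events = keyed
        ordered = sorted(events, key=lambda e: e[0])      # order by event time
        for prev, cur in zip(ordered, ordered[1:]):
            yield key, {"previous": prev[1], "current": cur[1]}

with beam.Pipeline() as p:
    (p
     | "ToyEvents" >> beam.Create([("x", (60 * n, 10.0 * n)) for n in range(1, 7)])
     | "EventTime" >> beam.Map(lambda kv: TimestampedValue(kv, kv[1][0]))
     | "Window"    >> beam.WindowInto(FixedWindows(5 * 60))
     | "PerKey"    >> beam.GroupByKey()
     | "Lag"       >> beam.ParDo(PairWithPrevious())
     | "Print"     >> beam.Map(print))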

Apache Beam - Reading JSON and Stream

Submitted by 可紊 on 2019-12-11 08:54:01
Question: I am writing Apache Beam code in which I have to read a JSON file placed in the project folder, read the data, and stream it. This is the sample code to read the JSON. Is this the correct way of doing it?

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile",
    TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines …
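
The snippet in the question uses the Java SDK; purely as a rough Python-SDK sketch of the same idea, and assuming the file contains one JSON object per line, each line can be read with ReadFromText and parsed in a Map. Note that a PCollection cannot be printed directly; inspecting elements is usually done inside a transform.

import json
import apache_beam as beam

with beam.Pipeline() as p:   # DirectRunner by default; set options for Spark or Dataflow
    (p
     | "ReadMyFile" >> beam.io.ReadFromText(
           "/Users/xyz/eclipse-workspace/beam-prototype/test.json")
     | "ParseJson" >> beam.Map(json.loads)   # assumes newline-delimited JSON
     | "Log" >> beam.Map(print))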

Apache Beam: 'Unable to find registrar for hdfs'

Submitted by 随声附和 on 2019-12-11 08:26:28
Question: I want to run a pipeline with the Spark runner, and the data is stored on a remote machine. The following command has been used to submit the job:

./spark-submit --class org.apache.beam.examples.WordCount --master spark://192.168.1.214:6066 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 4 hdfs://192.168.1.214:9000/input/word-count-ck-0.1.jar --runner=SparkRunner

It produces the following response: Running Spark using the REST application submission protocol. Using …

Google Cloud Dataflow Write to CSV from dictionary

Submitted by 只谈情不闲聊 on 2019-12-11 08:06:16
Question: I have a dictionary of values that I would like to write to GCS as a valid .CSV file using the Python SDK. I can write the dictionary out as a newline-separated text file, but I can't seem to find an example of converting the dictionary to a valid .CSV. Can anybody suggest the best way to generate CSVs within a Dataflow pipeline? The answers to this question address reading from CSV files, but don't really address writing to CSV files. I recognize that CSV files are just text files with rules, …
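
A minimal sketch of one way to do this in the Python SDK: format each dictionary into a CSV row with the csv module and write the rows with WriteToText, passing the column header explicitly. The field names, sample record, and output path below are made up for illustration.

import csv
import io
import apache_beam as beam

FIELDS = ["name", "age", "city"]   # hypothetical column order

def dict_to_csv_row(record):
    buf = io.StringIO()
    csv.writer(buf).writerow([record.get(f, "") for f in FIELDS])
    return buf.getvalue().rstrip("\r\n")   # WriteToText appends its own newline

with beam.Pipeline() as p:
    (p
     | "Sample" >> beam.Create([{"name": "Ada", "age": 36, "city": "London"}])
     | "ToCsv"  >> beam.Map(dict_to_csv_row)
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/report",   # hypothetical path
                                       file_name_suffix=".csv",
                                       header=",".join(FIELDS)))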

Apache Beam: Error assigning event time using Withtimestamp

Submitted by 断了今生、忘了曾经 on 2019-12-11 07:50:47
Question: I have an unbounded Kafka stream sending data with the following fields:

{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}

I read the stream using the Apache Beam SDK for Kafka:

import org.apache.beam.sdk.io.kafka.KafkaIO;

pipeline.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers("kafka:9092")
    .withTopic("test")
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.of("enable.auto …
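
The question uses the Java SDK's KafkaIO; purely as a conceptual Python-SDK sketch, the event time can be parsed out of the record's ts field and attached by returning a TimestampedValue. The sample message is taken from the question, and reading from Kafka is replaced here by Create. With an unbounded source, pushing timestamps backwards beyond the allowed skew is a common cause of errors when assigning event time this way.

import json
from datetime import datetime
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

SAMPLE = '{"identifier": "xxx", "value": 10.0, "ts": "2019-01-16T10:51:26.326242+0000"}'

def with_event_time(raw):
    record = json.loads(raw)
    # parse the ISO-8601 ts field (format taken from the question) into epoch seconds
    event_time = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S.%f%z").timestamp()
    return TimestampedValue(record, event_time)

with beam.Pipeline() as p:
    (p
     | "Sample"    >> beam.Create([SAMPLE])     # stand-in for the Kafka values
     | "EventTime" >> beam.Map(with_event_time)
     | "Print"     >> beam.Map(print))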

Beam: ReadAllFromText receive string or list from DoFn different behavior?

Submitted by 瘦欲@ on 2019-12-11 07:47:18
Question: I have a pipeline that reads files from GCS via Pub/Sub:

class ExtractFileNameFn(beam.DoFn):
    def process(self, element):
        file_name = 'gs://' + "/".join(element['id'].split("/")[:-1])
        logging.info("Load file: " + file_name)
        yield file_name

class LogFn(beam.DoFn):
    def process(self, element):
        logging.info(element)
        return [element]

class LogPassThroughFn(beam.DoFn):
    def process(self, element):
        logging.info(element)
        return element

...

p | "Read Sub Message" >> beam.io.ReadFromPubSub(topic=args …
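
A small sketch of the difference the title asks about, assuming a hypothetical file name: when DoFn.process returns a bare string, Beam treats it as an iterable of outputs and emits one element per character, while returning a one-element list (or using yield) emits the whole string, which is what a downstream ReadAllFromText expects.

import apache_beam as beam

class ReturnString(beam.DoFn):
    def process(self, element):
        # a string is itself an iterable, so this emits one character at a time
        return element

class ReturnList(beam.DoFn):
    def process(self, element):
        # a list (or `yield element`) emits the whole string as one element
        return [element]

with beam.Pipeline() as p:
    names = p | beam.Create(["gs://my-bucket/some/file.csv"])   # hypothetical path
    names | "Chars" >> beam.ParDo(ReturnString()) | "LogChars" >> beam.Map(print)
    names | "Whole" >> beam.ParDo(ReturnList())   | "LogWhole" >> beam.Map(print)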

Delete data from BigQuery while streaming from Dataflow

Submitted by China☆狼群 on 2019-12-11 07:44:14
Question: Is it possible to delete data from a BigQuery table while loading data into it from an Apache Beam pipeline? Our use case is such that we need to delete data from 3 days prior from the table, based on a timestamp field (the time when Dataflow pulls the message from the Pub/Sub topic). Is it recommended to do something like this? If yes, is there any way to achieve it? Thank you.

Answer 1: I think the best way of doing this is to set up your table as a partitioned table (based on ingestion time): https://cloud.google.com …
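
As a hedged illustration of the answer's suggestion, the google-cloud-bigquery client can create an ingestion-time-partitioned table with a partition expiration, so partitions older than three days are dropped automatically instead of being deleted from inside the pipeline. The project, dataset, table, and schema names below are made up.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")                 # hypothetical project
table = bigquery.Table("my-project.my_dataset.events", schema=[
    bigquery.SchemaField("identifier", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
])
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=3 * 24 * 60 * 60 * 1000,   # partitions expire after 3 days
)
client.create_table(table)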

large numpy matrix as dataflow side input

Submitted by 佐手、 on 2019-12-11 07:37:51
Question: I'm trying to write a Dataflow pipeline in Python that requires a large numpy matrix as a side input. The matrix is saved in cloud storage. Ideally, each Dataflow worker would load the matrix directly from cloud storage. My understanding is that if I say matrix = np.load(LOCAL_PATH_TO_MATRIX), and then

p | "computation" >> beam.Map(computation, matrix)

the matrix gets shipped from my laptop to each Dataflow worker. How could I instead direct each worker to load the matrix directly from cloud …
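
A minimal sketch of one way to avoid shipping the matrix from the laptop: pass only the GCS path into the pipeline, load the file on the workers with Beam's FileSystems API, and feed the loaded matrix to the computation as a singleton side input. The bucket path and the toy computation are assumptions.

import io
import apache_beam as beam
import numpy as np
from apache_beam.io.filesystems import FileSystems

MATRIX_PATH = "gs://my-bucket/matrix.npy"   # hypothetical location

def load_matrix(path):
    # runs on a worker, so the matrix is fetched from GCS there
    with FileSystems.open(path) as f:
        return np.load(io.BytesIO(f.read()))

with beam.Pipeline() as p:
    matrix = (p
              | "MatrixPath" >> beam.Create([MATRIX_PATH])
              | "LoadMatrix" >> beam.Map(load_matrix))
    (p
     | "Data"    >> beam.Create([1, 2, 3])
     | "Compute" >> beam.Map(lambda x, m: x * float(m.sum()),
                             m=beam.pvalue.AsSingleton(matrix))
     | "Print"   >> beam.Map(print))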

Google DataFlow/Python: Import errors with save_main_session and custom modules in __main__

Submitted by 依然范特西╮ on 2019-12-11 07:03:44
Question: Could somebody please clarify the expected behavior when using save_main_session and custom modules imported in __main__? My Dataflow pipeline imports two non-standard modules - one via requirements.txt and the other via setup_file. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors. A sample error is below. From the documentation, I assumed that setting save_main_session would solve this problem, but it does not (see the error below). …
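
A minimal sketch of the two patterns at play, with the module name invented for illustration: set save_main_session so that __main__ globals are pickled, and/or import the non-standard module inside the function that uses it so the import happens on the worker after the dependency from requirements.txt / setup_file has been installed.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def use_custom_module(element):
    # importing here runs on the worker, after requirements.txt / setup_file
    # has installed the dependency there
    import mycustommodule   # hypothetical non-standard module
    return mycustommodule.transform(element)

def run(argv=None):
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True   # pickle __main__ globals
    with beam.Pipeline(options=options) as p:
        (p
         | "Sample"  >> beam.Create(["a", "b"])
         | "Process" >> beam.Map(use_custom_module))

if __name__ == "__main__":
    run()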

Google Cloud Dataflow: Specifying TempLocation via Command Line Argument

Submitted by 耗尽温柔 on 2019-12-11 06:06:15
Question: I am attempting to specify my GCS temp location by passing it as an option on the command line, as shown below.

java -jar pipeline-0.0.1-SNAPSHOT.jar --runner=DataflowRunner --project=<my_project> --tempLocation=gs://<my_bucket>/<my_folder>

However, I continue to receive a syntax error:

java.nio.file.InvalidPathException: Illegal char <:> at index 2: gs://<my_bucket>/<my_folder>

I'm referring to the following documentation: https://cloud.google.com/dataflow/pipelines/specifying-exec-params I …
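
The command above runs a Java pipeline, so this is only the Python-SDK counterpart of the same idea, sketched for comparison: the temp location is a GoogleCloudOptions field that is picked up from the --temp_location command-line flag rather than being parsed as a local filesystem path. The project and bucket names in the comment are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

# e.g. python pipeline.py --runner=DataflowRunner --project=my-project \
#          --temp_location=gs://my-bucket/tmp
options = PipelineOptions()
print("temp_location:", options.view_as(GoogleCloudOptions).temp_location)

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello"]) | beam.Map(print)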