apache-beam

Writing TFRecords in apache_beam with Java

牧云@^-^@ submitted on 2020-04-18 01:06:00
Question: How can I write the following code in Java? If I have a list of records/dicts in Java, how can I write the Beam code that writes them to TFRecords where tf.train.Examples are serialized? There are lots of examples of doing this with Python; below is one example in Python, so how can I write the same logic in Java? import tensorflow as tf import apache_beam as beam from apache_beam.runners.interactive import interactive_runner from apache_beam.coders import ProtoCoder class Foo(beam.DoFn): def process
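The Python example the excerpt refers to is cut off; as a point of reference, a minimal Python sketch of the kind of pipeline described (turning dicts into serialized tf.train.Example records and writing them with WriteToTFRecord) might look like the following. The field names and sample records are assumptions, not the asker's actual code; a Java port would go through the Java SDK's TFRecord sink instead.

```python
# Hypothetical sketch: serialize dicts to tf.train.Example and write TFRecords.
# Field names ("id", "text") and the sample records are made up for illustration.
import apache_beam as beam
import tensorflow as tf


def to_example(record):
    """Convert one dict into a serialized tf.train.Example."""
    features = {
        'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[record['id']])),
        'text': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[record['text'].encode('utf-8')])),
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()


def run():
    records = [{'id': 1, 'text': 'foo'}, {'id': 2, 'text': 'bar'}]
    with beam.Pipeline() as p:
        (p
         | beam.Create(records)
         | beam.Map(to_example)
         | beam.io.WriteToTFRecord('/tmp/output', file_name_suffix='.tfrecord'))
```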

Exception Handling in Apache Beam pipelines using Python

天涯浪子 submitted on 2020-04-13 16:47:12
Question: I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a simple WriteToBigQuery example: output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name') I tried to put this inside a try/except block, but it doesn't work: when it fails, the exception seems to be thrown in a Java layer outside my Python execution: INFO
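A try/except around the pipeline construction cannot catch failures that happen on the workers. One commonly suggested pattern (a sketch under assumptions, not the asker's pipeline; table names, topic, and parse logic are placeholders) is to catch exceptions inside a DoFn and route bad records to a dead-letter output:

```python
# Sketch of a dead-letter pattern: failures are tagged and written elsewhere.
# Streaming options, schemas, and table/topic names are omitted or assumed.
import json
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput


class ParseOrDeadLetter(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:  # route bad records instead of failing the bundle
            yield TaggedOutput('dead_letter', {'raw': str(element), 'error': str(err)})


with beam.Pipeline() as p:
    parsed = (p
              | beam.io.ReadFromPubSub(topic='projects/some-project/topics/some-topic')
              | beam.ParDo(ParseOrDeadLetter()).with_outputs('dead_letter', main='rows'))

    parsed.rows | 'Write rows' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')
    parsed.dead_letter | 'Write failures' >> beam.io.WriteToBigQuery(
        'some-project:dataset.table_name_errors')
```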

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

。_饼干妹妹 submitted on 2020-03-26 02:23:31
Question: I have a pipeline that takes URLs for files and downloads them, generating BigQuery table rows for each line apart from the header. To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table. For this to work I need either to store the history in a database allowing unique values, or it might be easier to use BigQuery for this as well, but then access to the table must be
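One workaround that is sometimes suggested for forcing a step onto a single worker (a sketch only, and not necessarily the best design for deduplication) is to key every element to the same constant key and group, so the downstream DoFn sees all elements in one bundle. The history lookup below is a placeholder:

```python
# Sketch: funnel all URLs through a single key so one worker processes the step.
import apache_beam as beam


class FilterAlreadyDownloaded(beam.DoFn):
    def process(self, keyed_urls):
        _, urls = keyed_urls
        for url in urls:
            if not self.already_downloaded(url):  # e.g. a lookup against the history table
                yield url

    def already_downloaded(self, url):
        # Placeholder for a BigQuery / database lookup.
        return False


with beam.Pipeline() as p:
    (p
     | beam.Create(['http://example.com/a.csv', 'http://example.com/b.csv'])
     | 'Single key' >> beam.Map(lambda url: ('all', url))
     | beam.GroupByKey()  # groups everything under one key, processed by one worker
     | beam.ParDo(FilterAlreadyDownloaded()))
```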

How to create Google Cloud Dataflow Wordcount custom template in Python?

ぃ、小莉子 submitted on 2020-03-25 21:48:43
Question: I can't create a custom Google Cloud Dataflow template using the wordcount example following the instructions here: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates I get an error about the RuntimeValueProvider being inaccessible. What am I doing wrong? My main function wordcount.py: """A word-counting workflow.""" from __future__ import absolute_import import argparse import logging import re from past.builtins import unicode import apache_beam as beam from
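The RuntimeValueProvider error usually means an option's value is read at pipeline-construction time; templated pipelines have to pass the ValueProvider object itself to sources and sinks that accept one. A minimal sketch of that pattern, with assumed option names rather than the asker's actual wordcount.py:

```python
# Sketch: declare template parameters with add_value_provider_argument and
# hand the ValueProvider objects directly to ReadFromText / WriteToText,
# which resolve them at run time.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WordcountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input', type=str, help='Input file pattern')
        parser.add_value_provider_argument('--output', type=str, help='Output prefix')


def run(argv=None):
    options = PipelineOptions(argv)
    wordcount_options = options.view_as(WordcountOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText(wordcount_options.input)   # pass the ValueProvider, not .get()
         | beam.FlatMap(lambda line: line.split())
         | beam.combiners.Count.PerElement()
         | beam.MapTuple(lambda word, count: '%s: %d' % (word, count))
         | beam.io.WriteToText(wordcount_options.output))
```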

Testing Dataflow with DirectRunner and getting lots of verifyUnmodifiedThrowingCheckedExceptions warnings

前提是你 submitted on 2020-03-18 15:54:19
Question: I was testing my Dataflow pipeline using DirectRunner on my Mac and got lots of "WARNING" messages like the one below. How can I get rid of them? There are so many that I cannot even see my debug messages. Thanks. Apr 05, 2018 2:14:48 PM org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector verifyUnmodifiedThrowingCheckedExceptions WARNING: Coder of type class org.apache.beam.sdk.coders.SerializableCoder has a #structuralValue method which does not return true when the

How to install Python dependencies for Dataflow

◇◆丶佛笑我妖孽 submitted on 2020-03-05 06:05:08
Question: I have a very small Python Dataflow package; the structure of the package looks like this: . ├── __pycache__ ├── pubsubtobigq.py ├── requirements.txt └── venv The content of requirements.txt is: protobuf==3.11.2 protobuf3-to-dict==0.1.5 I ran my pipeline using this command: python -m pubsubtobigq \ --input_topic "projects/project_name/topics/topic_name" \ --job_name "job_name" \ --output "gs://mybucket/wordcount/outputs" \ --runner DataflowRunner \ --project "project_name" \ --region "us-central1" \ -
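The standard way to ship PyPI dependencies to Dataflow workers is the requirements_file setup option (equivalently the --requirements_file command-line flag). A sketch of setting it programmatically, with the project, bucket, and transforms as placeholders:

```python
# Sketch: point Dataflow at requirements.txt so workers pip-install the deps.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='project_name',
    region='us-central1',
    temp_location='gs://mybucket/tmp',
)
options.view_as(SetupOptions).requirements_file = 'requirements.txt'

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```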

Ways of using value provider parameter in Python Apache Beam

心已入冬 submitted on 2020-03-05 04:24:12
Question: Right now I'm only able to grab the runtime value inside a class using a ParDo; is there another way to use the runtime parameter, for example in my functions? This is the code I have right now: class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, parser): parser.add_value_provider_argument('--firestore_document',default='') def run(argv=None): parser = argparse.ArgumentParser() pipeline_options = PipelineOptions() user_options = pipeline_options.view_as(UserOptions)
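A runtime parameter can only be resolved on the workers, so one common approach (sketched here around the question's --firestore_document option; the DoFn and elements are placeholders) is to pass the ValueProvider object into the DoFn and defer .get() to process():

```python
# Sketch: hand the ValueProvider to the DoFn and call .get() only at run time.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--firestore_document', default='')


class UseDocument(beam.DoFn):
    def __init__(self, document_vp):
        self.document_vp = document_vp      # still a ValueProvider here

    def process(self, element):
        document = self.document_vp.get()   # resolved at run time on the worker
        yield (document, element)


def run(argv=None):
    options = PipelineOptions(argv)
    user_options = options.view_as(UserOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(['a', 'b'])
         | beam.ParDo(UseDocument(user_options.firestore_document)))
```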

org.apache.kafka.common.errors.RecordTooLargeException - Dropping messages larger than the max limit and pushing them into another Kafka topic

╄→гoц情女王★ submitted on 2020-03-04 18:18:13
Question: org.apache.kafka.common.errors.RecordTooLargeException: There are some messages at [Partition=Offset]: {binlog-0=170421} whose size is larger than the fetch size 1048576 and hence cannot be returned. Hi, I'm getting the above exception and my Apache Beam data pipeline fails. I want the Kafka reader to ignore messages larger than the default size and maybe push them into another topic for logging purposes. Properties kafkaProps = new Properties(); kafkaProps.setProperty("errors.tolerance",

Google Dataflow - Failed to import custom Python modules

喜你入骨 submitted on 2020-03-03 05:14:20
Question: My Apache Beam pipeline implements custom Transforms and ParDos in Python modules which in turn import other modules written by me. On the local runner this works fine, as all the files are available in the same path. With the Dataflow runner, the pipeline fails with a module import error. How do I make custom modules available to all the Dataflow workers? Please advise. Below is an example: ImportError: No module named DataAggregation at find_class (/usr/lib/python2.7/pickle.py:1130) at find
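The usual approach is to package the local modules and ship them to the workers via the setup_file option (passed as --setup_file ./setup.py on the pipeline command line). A minimal sketch of such a setup.py; the package name and version are placeholders, and the local modules (e.g. the DataAggregation package) are assumed to live next to it:

```python
# setup.py sketch: bundle local modules so Dataflow workers can import them.
import setuptools

setuptools.setup(
    name='my_dataflow_pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=[],  # runtime PyPI dependencies could be listed here as well
)
```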