apache-beam

Writing TFRecords in apache_beam with Java

牧云@^-^@ submitted on 2020-04-18 01:06:00
Question: How can I write the following code in Java? If I have a list of records/dicts in Java, how can I write the Beam code that writes them to TFRecords where tf.train.Examples are serialized? There are lots of examples of doing this with Python; below is one example in Python, so how can I write the same logic in Java? import tensorflow as tf import apache_beam as beam from apache_beam.runners.interactive import interactive_runner from apache_beam.coders import ProtoCoder class Foo(beam.DoFn): def process
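The Python example the excerpt refers to is cut off; as a point of reference, a minimal Python sketch of the kind of pipeline described (turning dicts into serialized tf.train.Example records and writing them with WriteToTFRecord) might look like the following. The field names and sample records are assumptions, not the asker's actual code; a Java port would go through the Java SDK's TFRecord sink instead.

```python
# Hypothetical sketch: serialize dicts to tf.train.Example and write TFRecords.
# Field names ("id", "text") and the sample records are made up for illustration.
import apache_beam as beam
import tensorflow as tf


def to_example(record):
    """Convert one dict into a serialized tf.train.Example."""
    features = {
        'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[record['id']])),
        'text': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[record['text'].encode('utf-8')])),
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()


def run():
    records = [{'id': 1, 'text': 'foo'}, {'id': 2, 'text': 'bar'}]
    with beam.Pipeline() as p:
        (p
         | beam.Create(records)
         | beam.Map(to_example)
         | beam.io.WriteToTFRecord('/tmp/output', file_name_suffix='.tfrecord'))
```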

Exception Handling in Apache Beam pipelines using Python

天涯浪子 submitted on 2020-04-13 16:47:12
Question: I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a simple WriteToBigQuery example: output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name') I tried to put this inside a try/except block, but it doesn't work: when it fails, the exception seems to be thrown in a Java layer outside my Python execution: INFO
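A try/except around the pipeline construction cannot catch failures that happen on the workers. One commonly suggested pattern (a sketch under assumptions, not the asker's pipeline; table names, topic, and parse logic are placeholders) is to catch exceptions inside a DoFn and route bad records to a dead-letter output:

```python
# Sketch of a dead-letter pattern: failures are tagged and written elsewhere.
# Streaming options, schemas, and table/topic names are omitted or assumed.
import json
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput


class ParseOrDeadLetter(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:  # route bad records instead of failing the bundle
            yield TaggedOutput('dead_letter', {'raw': str(element), 'error': str(err)})


with beam.Pipeline() as p:
    parsed = (p
              | beam.io.ReadFromPubSub(topic='projects/some-project/topics/some-topic')
              | beam.ParDo(ParseOrDeadLetter()).with_outputs('dead_letter', main='rows'))

    parsed.rows | 'Write rows' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')
    parsed.dead_letter | 'Write failures' >> beam.io.WriteToBigQuery(
        'some-project:dataset.table_name_errors')
```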

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

。_饼干妹妹 submitted on 2020-03-26 02:23:31
Question: I have a pipeline that takes URLs for files and downloads them, generating BigQuery table rows for each line apart from the header. To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table. For this to work I need either to store the history in a database allowing unique values, or it might be easier to use BigQuery for this as well, but then access to the table must be
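One workaround that is sometimes suggested for forcing a step onto a single worker (a sketch only, and not necessarily the best design for deduplication) is to key every element to the same constant key and group, so the downstream DoFn sees all elements in one bundle. The history lookup below is a placeholder:

```python
# Sketch: funnel all URLs through a single key so one worker processes the step.
import apache_beam as beam


class FilterAlreadyDownloaded(beam.DoFn):
    def process(self, keyed_urls):
        _, urls = keyed_urls
        for url in urls:
            if not self.already_downloaded(url):  # e.g. a lookup against the history table
                yield url

    def already_downloaded(self, url):
        # Placeholder for a BigQuery / database lookup.
        return False


with beam.Pipeline() as p:
    (p
     | beam.Create(['http://example.com/a.csv', 'http://example.com/b.csv'])
     | 'Single key' >> beam.Map(lambda url: ('all', url))
     | beam.GroupByKey()  # groups everything under one key, processed by one worker
     | beam.ParDo(FilterAlreadyDownloaded()))
```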

How to create Google Cloud Dataflow Wordcount custom template in Python?

ぃ、小莉子 submitted on 2020-03-25 21:48:43
Question: I can't create a custom Google Cloud Dataflow template using the wordcount example following the instructions here: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates I get an error about the RuntimeValueProvider being inaccessible. What am I doing wrong? My main function wordcount.py: """A word-counting workflow.""" from __future__ import absolute_import import argparse import logging import re from past.builtins import unicode import apache_beam as beam from
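The RuntimeValueProvider error usually means an option's value is read at pipeline-construction time; templated pipelines have to pass the ValueProvider object itself to sources and sinks that accept one. A minimal sketch of that pattern, with assumed option names rather than the asker's actual wordcount.py:

```python
# Sketch: declare template parameters with add_value_provider_argument and
# hand the ValueProvider objects directly to ReadFromText / WriteToText,
# which resolve them at run time.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WordcountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input', type=str, help='Input file pattern')
        parser.add_value_provider_argument('--output', type=str, help='Output prefix')


def run(argv=None):
    options = PipelineOptions(argv)
    wordcount_options = options.view_as(WordcountOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromText(wordcount_options.input)   # pass the ValueProvider, not .get()
         | beam.FlatMap(lambda line: line.split())
         | beam.combiners.Count.PerElement()
         | beam.MapTuple(lambda word, count: '%s: %d' % (word, count))
         | beam.io.WriteToText(wordcount_options.output))
```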

Testing Dataflow with DirectRunner and getting lots of verifyUnmodifiedThrowingCheckedExceptions warnings

前提是你 submitted on 2020-03-18 15:54:19
Question: I was testing my Dataflow pipeline using DirectRunner on my Mac and got lots of "WARNING" messages like the one below. How can I get rid of them? There are so many that I cannot even see my debug messages. Thanks. Apr 05, 2018 2:14:48 PM org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector verifyUnmodifiedThrowingCheckedExceptions WARNING: Coder of type class org.apache.beam.sdk.coders.SerializableCoder has a #structuralValue method which does not return true when the

How to install Python dependencies for Dataflow

◇◆丶佛笑我妖孽 submitted on 2020-03-05 06:05:08
Question: I have a very small Python Dataflow package; the structure of the package looks like this: . ├── __pycache__ ├── pubsubtobigq.py ├── requirements.txt └── venv The content of requirements.txt is: protobuf==3.11.2 protobuf3-to-dict==0.1.5 I ran my pipeline using this command: python -m pubsubtobigq \ --input_topic "projects/project_name/topics/topic_name" \ --job_name "job_name" \ --output "gs://mybucket/wordcount/outputs" \ --runner DataflowRunner \ --project "project_name" \ --region "us-central1" \ -
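The standard way to ship PyPI dependencies to Dataflow workers is the requirements_file setup option (equivalently the --requirements_file command-line flag). A sketch of setting it programmatically, with the project, bucket, and transforms as placeholders:

```python
# Sketch: point Dataflow at requirements.txt so workers pip-install the deps.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='project_name',
    region='us-central1',
    temp_location='gs://mybucket/tmp',
)
options.view_as(SetupOptions).requirements_file = 'requirements.txt'

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```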

Ways of using value provider parameter in Python Apache Beam

心已入冬 submitted on 2020-03-05 04:24:12
Question: Right now I'm only able to grab the runtime value inside a class using a ParDo; is there another way to use the runtime parameter, for example in my functions? This is the code I have right now: class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, parser): parser.add_value_provider_argument('--firestore_document',default='') def run(argv=None): parser = argparse.ArgumentParser() pipeline_options = PipelineOptions() user_options = pipeline_options.view_as(UserOptions)
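A runtime parameter can only be resolved on the workers, so one common approach (sketched here around the question's --firestore_document option; the DoFn and elements are placeholders) is to pass the ValueProvider object into the DoFn and defer .get() to process():

```python
# Sketch: hand the ValueProvider to the DoFn and call .get() only at run time.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--firestore_document', default='')


class UseDocument(beam.DoFn):
    def __init__(self, document_vp):
        self.document_vp = document_vp      # still a ValueProvider here

    def process(self, element):
        document = self.document_vp.get()   # resolved at run time on the worker
        yield (document, element)


def run(argv=None):
    options = PipelineOptions(argv)
    user_options = options.view_as(UserOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(['a', 'b'])
         | beam.ParDo(UseDocument(user_options.firestore_document)))
```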

org.apache.kafka.common.errors.RecordTooLargeException - Dropping messages larger than the max limit and pushing them into another Kafka topic

╄→гoц情女王★ submitted on 2020-03-04 18:18:13
Question: org.apache.kafka.common.errors.RecordTooLargeException: There are some messages at [Partition=Offset]: {binlog-0=170421} whose size is larger than the fetch size 1048576 and hence cannot be returned. Hi, I'm getting the above exception and my Apache Beam data pipeline fails. I want the Kafka reader to ignore messages larger than the default size and maybe push them into another topic for logging purposes. Properties kafkaProps = new Properties(); kafkaProps.setProperty("errors.tolerance",

Google Dataflow - Failed to import custom Python modules

喜你入骨 submitted on 2020-03-03 05:14:20
Question: My Apache Beam pipeline implements custom Transforms and ParDos in Python modules which in turn import other modules written by me. On the local runner this works fine, as all the files are available in the same path. With the Dataflow runner, the pipeline fails with a module import error. How do I make custom modules available to all the Dataflow workers? Please advise. Below is an example: ImportError: No module named DataAggregation at find_class (/usr/lib/python2.7/pickle.py:1130) at find
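The usual approach is to package the local modules and ship them to the workers via the setup_file option (passed as --setup_file ./setup.py on the pipeline command line). A minimal sketch of such a setup.py; the package name and version are placeholders, and the local modules (e.g. the DataAggregation package) are assumed to live next to it:

```python
# setup.py sketch: bundle local modules so Dataflow workers can import them.
import setuptools

setuptools.setup(
    name='my_dataflow_pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=[],  # runtime PyPI dependencies could be listed here as well
)
```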