google-cloud-dataflow

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

Submitted by 。_饼干妹妹 on 2020-03-26 02:23:31
Question: I have a pipeline that takes URLs for files, downloads them, and generates a BigQuery table row for each line apart from the header. To avoid duplicate downloads, I want to check each URL against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table. For this to work I need to either store the history in a database that enforces unique values, or it might be easier to use BigQuery for this as well, but then access to the table must be
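One way to avoid needing a single-threaded step at all is to read the already-downloaded URLs once and use them as a side input to a filtering step. The following is a minimal sketch, assuming a hypothetical history table my_project.my_dataset.download_history with a url column and a text file of candidate URLs; the names and the ReadFromBigQuery call may need adjusting to your Beam SDK version:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical table and bucket names; adjust to your project and schema.
HISTORY_QUERY = 'SELECT url FROM `my_project.my_dataset.download_history`'


class FilterSeenUrls(beam.DoFn):
    """Drop URLs that already appear in the history side input."""

    def process(self, url, seen_urls):
        if url not in seen_urls:
            yield url


def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        history = (
            p
            | 'ReadHistory' >> beam.io.ReadFromBigQuery(
                query=HISTORY_QUERY, use_standard_sql=True)
            | 'ExtractUrl' >> beam.Map(lambda row: row['url']))

        new_urls = (
            p
            | 'ReadUrls' >> beam.io.ReadFromText('gs://my-bucket/urls.txt')
            | 'DropSeen' >> beam.ParDo(
                FilterSeenUrls(), seen_urls=beam.pvalue.AsList(history)))
        # new_urls can then be downloaded and written back to BigQuery.
```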

How to create Google Cloud Dataflow Wordcount custom template in Python?

Submitted by ぃ、小莉子 on 2020-03-25 21:48:43
Question: I can't create a custom Google Cloud Dataflow template using the wordcount example, following the instructions here: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates I get an error relating to the RuntimeValueProvider being inaccessible. What am I doing wrong? My main function wordcount.py : """A word-counting workflow.""" from __future__ import absolute_import import argparse import logging import re from past.builtins import unicode import apache_beam as beam from
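The usual cause of this error is reading a template parameter through a plain argparse argument (or calling .get() while the pipeline graph is being built) instead of passing a ValueProvider through to the IO transforms. A minimal sketch of a template-compatible wordcount, assuming --input and --output are the parameters you want to defer; ReadFromText and WriteToText accept ValueProvider arguments directly:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WordcountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Template parameters must be ValueProviders, not plain argparse args.
        parser.add_value_provider_argument(
            '--input',
            default='gs://dataflow-samples/shakespeare/kinglear.txt',
            help='Path of the file to read from')
        parser.add_value_provider_argument(
            '--output',
            help='Output file prefix to write results to')


def run(argv=None):
    options = PipelineOptions(argv)
    wc_options = options.view_as(WordcountOptions)
    with beam.Pipeline(options=options) as p:
        (p
         # The ValueProviders are handed to the IO transforms unresolved;
         # they are only read when the template is actually executed.
         | 'Read' >> beam.io.ReadFromText(wc_options.input)
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         | 'Write' >> beam.io.WriteToText(wc_options.output))


if __name__ == '__main__':
    run()
```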

Test Dataflow with DirectRunner and got lots of verifyUnmodifiedThrowingCheckedExceptions

Submitted by 前提是你 on 2020-03-18 15:54:19
Question: I was testing my Dataflow pipeline using DirectRunner on my Mac and got lots of WARNING messages like the one below. May I know how to get rid of them? There are so many that I cannot even see my debug messages. Thanks. Apr 05, 2018 2:14:48 PM org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector verifyUnmodifiedThrowingCheckedExceptions WARNING: Coder of type class org.apache.beam.sdk.coders.SerializableCoder has a #structuralValue method which does not return true when the

Using custom docker containers in Dataflow

Submitted by 不打扰是莪最后的温柔 on 2020-03-14 05:04:21
Question: From this link I found that Google Cloud Dataflow uses Docker containers for its workers: Image for Google Cloud Dataflow instances. I see it's possible to find out the image name of the Docker container. But is there a way I can get this Docker container (i.e. from which repository do I go to get it?), modify it, and then indicate that my Dataflow job should use this new Docker container? The reason I ask is that we need to install various C++ and Fortran and other library code on our Docker images so that
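With newer Beam SDKs and Dataflow Runner v2, the supported route is not to modify Google's internal worker image but to build a custom SDK container on top of a published apache/beam_python*_sdk base image and point the job at it. A minimal sketch, assuming a hypothetical image already pushed to gcr.io/my-project/beam-custom:latest; the exact option name (sdk_container_image versus the older worker_harness_container_image) depends on your SDK version:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The image referenced here is hypothetical: it would typically be built
# FROM a published apache/beam_python*_sdk base image, with the extra
# C++/Fortran libraries installed, then pushed to a registry the Dataflow
# workers can pull from.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    experiments=['use_runner_v2'],
    sdk_container_image='gcr.io/my-project/beam-custom:latest',
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['hello']) | beam.Map(print)
```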

How to install python dependencies for dataflow

Submitted by ◇◆丶佛笑我妖孽 on 2020-03-05 06:05:08
Question: I have a very small Python Dataflow package; the structure of the package looks like this: . ├── __pycache__ ├── pubsubtobigq.py ├── requirements.txt └── venv The content of requirements.txt is: protobuf==3.11.2 protobuf3-to-dict==0.1.5 I ran my pipeline using this command: python -m pubsubtobigq \ --input_topic "projects/project_name/topics/topic_name" \ --job_name "job_name" \ --output "gs://mybucket/wordcount/outputs" \ --runner DataflowRunner \ --project "project_name" \ --region "us-central1" \ -
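Assuming the goal is to get the packages from requirements.txt installed on the Dataflow workers, the standard mechanism is the --requirements_file pipeline option. A minimal sketch of what the inside of pubsubtobigq.py might look like; the programmatic form shown is equivalent to adding --requirements_file requirements.txt to the command line above:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


def run(argv=None):
    options = PipelineOptions(argv)
    # Stage requirements.txt so each Dataflow worker pip-installs protobuf
    # and protobuf3-to-dict at startup. This mirrors passing
    # --requirements_file requirements.txt on the command line.
    options.view_as(SetupOptions).requirements_file = 'requirements.txt'

    with beam.Pipeline(options=options) as p:
        _ = p | 'Create' >> beam.Create(['placeholder'])


if __name__ == '__main__':
    run()
```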

Ways of using value provider parameter in Python Apache Beam

Submitted by 心已入冬 on 2020-03-05 04:24:12
Question: Right now I'm only able to grab the runtime value inside a class using a ParDo. Is there another way to use the runtime parameter, for example in my functions? This is the code I have right now: class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, parser): parser.add_value_provider_argument('--firestore_document',default='') def run(argv=None): parser = argparse.ArgumentParser() pipeline_options = PipelineOptions() user_options = pipeline_options.view_as(UserOptions)
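The ValueProvider object itself can be handed to a plain function as a side argument of beam.Map (or beam.FlatMap), so you are not limited to a DoFn class; the only rule is that .get() is called at element-processing time, not while the graph is built. A minimal sketch, reusing the UserOptions class from the question and a hypothetical lookup function:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--firestore_document', default='')


def lookup_document(element, doc_provider):
    # .get() is only legal at runtime, so it is called inside the function
    # body, never while the pipeline graph is being constructed.
    doc_path = doc_provider.get()
    return (element, doc_path)


def run(argv=None):
    pipeline_options = PipelineOptions(argv)
    user_options = pipeline_options.view_as(UserOptions)
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.Create(['a', 'b'])
         # The ValueProvider is passed as an extra argument; the plain
         # function defers the .get() call until each element is processed.
         | beam.Map(lookup_document, user_options.firestore_document))
```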

Google Dataflow - Failed to import custom python modules

Submitted by 喜你入骨 on 2020-03-03 05:14:20
Question: My Apache Beam pipeline implements custom Transforms and ParDos in Python modules that in turn import other modules written by me. On the local runner this works fine, since all the files are available on the same path. With the Dataflow runner, the pipeline fails with a module import error. How do I make custom modules available to all the Dataflow workers? Please advise. Below is an example: ImportError: No module named DataAggregation at find_class (/usr/lib/python2.7/pickle.py:1130) at find
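The usual fix is to package the local modules and tell Dataflow to install that package on every worker via the --setup_file option. A minimal sketch of such a setup.py, placed next to the pipeline's entry point; the distribution name is hypothetical and DataAggregation is assumed to be a local package directory with an __init__.py:

```python
# setup.py
import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',  # hypothetical distribution name
    version='0.0.1',
    # find_packages() picks up local packages such as DataAggregation/
    # so they get staged and installed on every Dataflow worker.
    packages=setuptools.find_packages(),
)
```

The job is then launched with --setup_file ./setup.py (or, programmatically, options.view_as(SetupOptions).setup_file = './setup.py'); --save_main_session can additionally help when the failing imports live in the main module.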
