google-cloud-dataflow

BigQueryIO Read vs fromQuery

Submitted by 喜欢而已 on 2020-01-24 00:25:35

Question: Say that in a Dataflow/Apache Beam program I am trying to read a table whose data is growing exponentially, and I want to improve the performance of the read.

BigQueryIO.Read.from("projectid:dataset.tablename")

or

BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")

Will the performance of my read improve if I select only the required columns, rather than the entire table as above? I am aware that selecting a few columns reduces the cost. But …
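The question compares the Java SDK's two read forms; purely as an illustration of the same trade-off, here is a rough sketch of both read modes in the Beam Python SDK, using the table and column names from the question as placeholders. A query-based read asks BigQuery to run the query (and materialize its result) before the export, while a plain table read exports the table as-is, so selecting only A and B mainly reduces the data scanned and carried into the pipeline.

import apache_beam as beam

with beam.Pipeline() as p:
    # Plain table read: the whole table is exported, every column included.
    full_table = p | "ReadTable" >> beam.io.ReadFromBigQuery(
        table="projectid:dataset.tablename")

    # Query-based read: BigQuery runs the query first, so only columns A and B
    # reach the pipeline. Both reads need a GCS temp location (for example via
    # the pipeline's temp_location option).
    selected_cols = p | "ReadColumns" >> beam.io.ReadFromBigQuery(
        query="SELECT A, B FROM `projectid.dataset.tablename`",
        use_standard_sql=True)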

Beam/Google Cloud Dataflow ReadFromPubsub Missing Data

Submitted by 守給你的承諾、 on 2020-01-23 03:32:07

Question: I have 2 Dataflow streaming pipelines (Pub/Sub to BigQuery) with the following code:

class transform_class(beam.DoFn):

    def process(self, element, publish_time=beam.DoFn.TimestampParam, *args, **kwargs):
        logging.info(element)
        yield element

class identify_and_transform_tables(beam.DoFn):
    # Adding Publish Timestamp
    # Since I'm reading from a topic that consists of data from multiple tables,
    # the function here is to identify the tables and split them apart

def run(pipeline_args=None):
    # `save_main_session` …
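The excerpt is cut off before the actual splitting logic. Purely as a sketch of the pattern the comments describe (attach the Pub/Sub publish time and work out which table a record belongs to), and assuming the messages are JSON with a table field (that field name is a guess, not from the question):

import json
import apache_beam as beam

class IdentifyAndTagTables(beam.DoFn):
    """Attach the publish timestamp and key each record by its source table."""

    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        record = json.loads(element.decode("utf-8"))
        record["publish_time"] = float(publish_time)  # seconds since the epoch
        # Hypothetical 'table' field used to split the stream downstream.
        yield record.get("table", "unknown"), record

One thing worth checking for the "missing data" symptom itself: two streaming pipelines that need the same messages must each read from their own Pub/Sub subscription, since a message on a shared subscription is delivered to only one of the subscribers.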

How to run dynamic second query in google cloud dataflow?

Submitted by 青春壹個敷衍的年華 on 2020-01-23 03:29:05

Question: I'm attempting to do an operation wherein I get a list of IDs via a query, transform them into a comma-separated string (i.e. "1,2,3"), and then use it in a secondary query. When attempting to run the second query, I'm given a syntax error: "Target type of a lambda conversion must be an interface"

String query = "SELECT DISTINCT campaignId FROM `" + options.getEligibilityInputTable() + "` ";
Pipeline p = Pipeline.create(options);
p.apply("GetCampaignIds", BigQueryIO.readTableRows() …
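The question's pipeline is the Java SDK, and a BigQuery source query generally has to be known when the pipeline is constructed. One common workaround is to run the first query with a plain BigQuery client, build the comma-separated list, and splice it into the second query before the pipeline is built. A rough Python-SDK sketch of that shape, with project, dataset, and table names as placeholders:

import apache_beam as beam
from google.cloud import bigquery

# Run the first query outside the pipeline and build the IN (...) list.
client = bigquery.Client()
rows = client.query(
    "SELECT DISTINCT campaignId FROM `my-project.my_dataset.eligibility`").result()
campaign_ids = ",".join(str(row["campaignId"]) for row in rows)  # e.g. "1,2,3"
# (String-typed IDs would need quoting before being spliced into SQL.)

second_query = (
    "SELECT * FROM `my-project.my_dataset.events` "
    f"WHERE campaignId IN ({campaign_ids})")

with beam.Pipeline() as p:
    events = p | "RunSecondQuery" >> beam.io.ReadFromBigQuery(
        query=second_query, use_standard_sql=True)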

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

Submitted by 巧了我就是萌 on 2020-01-23 01:39:26

Question: I'm setting up a slowly-changing lookup Map in my Apache Beam pipeline, and it continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits this exception:

org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey

Is anything wrong with this code snippet? If I use .discardingFiredPanes() instead, I will lose information from the last emit.

pipeline …
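The question's side input is the Java SDK's View.asMap(), which rejects a view in which the same key appears more than once, and repeated trigger firings in accumulating mode can easily re-emit a key that is already in the view. As an illustration only (not the asker's code), one way to express the "latest value per key" intent is to reduce to a single value per key before materializing the map-style side input. A minimal Python-SDK sketch of that shape:

import apache_beam as beam
from apache_beam.transforms import combiners

with beam.Pipeline() as p:
    updates = p | "LookupUpdates" >> beam.Create(
        [("mykey", "old"), ("mykey", "new")])

    # Keep one (latest) value per key so the side-input map never sees a
    # duplicate key. In a streaming pipeline the element timestamps decide
    # which value is latest; with Create all timestamps are equal.
    latest = updates | "LatestPerKey" >> combiners.Latest.PerKey()

    main = p | "Main" >> beam.Create(["a", "b"])
    looked_up = main | "Join" >> beam.Map(
        lambda x, lookup: (x, lookup.get("mykey")),
        lookup=beam.pvalue.AsDict(latest))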

Dataflow/apache beam: manage custom module dependencies

Submitted by 不打扰是莪最后的温柔 on 2020-01-22 22:58:11

Question: I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. I have a structure like this:

├── mymain.py
└── myothermodule.py

I import myothermodule.py in mymain.py like this:

import myothermodule

When I run locally on DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get an error that says:

ImportError: No module named myothermodule

So I want to know what I should do if I want this module to be found when running …
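One commonly used way to handle this (a sketch, assuming the two-file layout above) is to package the local module with the job: add a setup.py next to mymain.py and point the pipeline at it with the --setup_file option, so the Dataflow workers install it before running.

# setup.py, placed in the same directory as mymain.py and myothermodule.py
import setuptools

setuptools.setup(
    name="my-dataflow-job",        # placeholder project name
    version="0.0.1",
    py_modules=["myothermodule"],  # or packages=setuptools.find_packages()
)

The job would then be launched with something like python mymain.py --runner DataflowRunner --setup_file ./setup.py plus the usual project, region, and temp-location options.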

How to log incoming messages in apache beam pipeline

Submitted by 为君一笑 on 2020-01-16 19:06:49

Question: I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it in BigQuery. For hours I thought I was not even able to read a message, as I was simply trying to log the input to the console:

events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION)
logging.info(events)

When I write this to text it works fine! However, my call to the logger never happens. How do people develop / debug these streaming pipelines? I have tried adding the …
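logging.info(events) runs once while the pipeline is being constructed and only logs the PCollection object itself, not the messages flowing through it; to see each element, the logging has to happen inside a transform that the runner executes per message. A minimal sketch (the subscription path is a placeholder):

import logging
import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"  # placeholder

def log_and_return(msg):
    logging.info("received: %s", msg)
    return msg

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    events = (
        p
        | "Read PubSub" >> ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Log messages" >> beam.Map(log_and_return))

When the job runs on Dataflow, these per-element log lines appear in the worker logs for that step (Cloud Logging) rather than on the local console.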

Streaming data to Google Cloud Storage from PubSub using Cloud Dataflow

Submitted by …衆ロ難τιáo~ on 2020-01-15 10:09:07

Question: I am listening to data from Pub/Sub using streaming data in Dataflow. Then I need to upload it to storage, process the data, and upload it to BigQuery. Here is my code:

public class BotPipline {

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(MY_PROJECT);
        options.setStagingLocation(MY_STAGING_LOCATION);
        options.setStreaming(true); …
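The snippet uses the old Dataflow 1.x Java SDK (BlockingDataflowPipelineRunner). Purely as an illustration of the overall shape of such a job, here is a rough sketch in the current Apache Beam Python SDK: read from Pub/Sub, window the unbounded stream, and write windowed text files to a GCS path (topic, bucket, and window size are placeholders):

import apache_beam as beam
from apache_beam.io import ReadFromPubSub, fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadPubSub" >> ReadFromPubSub(topic="projects/my-project/topics/my-topic")
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     # An unbounded stream must be windowed before a file-based write.
     | "Window" >> beam.WindowInto(FixedWindows(60))
     | "WriteToGCS" >> fileio.WriteToFiles(
         path="gs://my-bucket/bot-output/",
         sink=lambda dest: fileio.TextSink()))

A separate branch of the same pipeline would then parse the records and write them to BigQuery, as the question describes.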
