apache-beam

Input of apache_beam.examples.wordcount

大兔子大兔子 submitted on 2019-11-29 17:04:42
I was trying to run the Beam Python SDK wordcount example (https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide#UsageGuide-RunaPython-SDKPipeline), but I had a problem reading the input. When I used gs://dataflow-samples/shakespeare/kinglear.txt as the input, the error was:

```
apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://dataflow-samples/shakespeare/kinglear.txt': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)}
```

When I used my local file, it seemed it didn't actually read the file, and output nothing. The result didn't include …
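For reference, a minimal sketch of the wordcount pipeline reading a local file; the input path and output prefix are hypothetical placeholders. If this runs but the gs:// path fails with the error above, the problem is likely in the GCS client dependencies rather than the pipeline itself:

```python
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal wordcount sketch; '/tmp/kinglear.txt' and '/tmp/wordcount_out'
# are placeholder paths.
with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromText('/tmp/kinglear.txt')
        | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        | 'Pair' >> beam.Map(lambda word: (word, 1))
        | 'Count' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
        | 'Write' >> beam.io.WriteToText('/tmp/wordcount_out')
    )
```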

Google Dataflow Pipeline with Instance Local Cache + External REST API calls

人盡茶涼 submitted on 2019-11-29 15:40:33
Question: We want to build a Cloud Dataflow streaming pipeline which ingests events from Pub/Sub and performs multiple ETL-like operations on each individual event. One of these operations is that each event has a device-id which needs to be transformed to a different value (let's call it mapped-id), the mapping from device-id to mapped-id being provided by an external service over a REST API. The same device-id might be repeated across multiple events, so these device-id → mapped-id mappings can be …
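One common pattern for this (a sketch, not the definitive design): keep the device-id → mapped-id cache as instance state on the DoFn, since a DoFn instance is reused across bundles on the same worker. `lookup_mapped_id` is a hypothetical wrapper around the REST call:

```python
import apache_beam as beam


def lookup_mapped_id(device_id):
    """Hypothetical REST client call: device-id -> mapped-id."""
    raise NotImplementedError


class MapDeviceIdFn(beam.DoFn):
    def setup(self):
        # Per-instance cache; survives across bundles on the same worker.
        self._cache = {}

    def process(self, event):
        device_id = event['device_id']
        mapped_id = self._cache.get(device_id)
        if mapped_id is None:
            mapped_id = lookup_mapped_id(device_id)  # REST call on cache miss
            self._cache[device_id] = mapped_id
        yield dict(event, mapped_id=mapped_id)
```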

How to combine streaming data with large history data set in Dataflow/Beam

泪湿孤枕 submitted on 2019-11-29 12:34:00
Question: I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of the user's sessions from the last month. I have looked at the following approaches:

- Use a 30-day fixed window: most likely too large a window to fit into memory, and I do not need to update the user's history, just refer to it.
- Use CoGroupByKey to join two data sets, but the two data sets must have the same window size …
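For illustration, the basic mechanics of the CoGroupByKey approach over two keyed PCollections (a sketch with made-up in-memory data; in streaming, both inputs must share compatible windowing, which is exactly the constraint the question runs into):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    live = p | 'Live' >> beam.Create([('user1', 'page_view')])
    history = p | 'History' >> beam.Create([('user1', 'session_2019_10')])

    joined = (
        {'live': live, 'history': history}
        | beam.CoGroupByKey()
        # Each output: (user_id, {'live': [...], 'history': [...]})
        | beam.Map(lambda kv: (kv[0], list(kv[1]['live']),
                               list(kv[1]['history'])))
    )
```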

import apache_beam metaclass conflict

可紊 submitted on 2019-11-29 12:22:08
Question: When I try to import apache beam I get the following error:

```
>>> import apache_beam
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/__init__.py", line 78, in <module>
    from apache_beam import io
  File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/io/__init__.py", line 21, in <module>
  ...
  from apitools.base.protorpclite import messages
  File "/home/toor/pfff/local/lib/python2.7/site-packages…
```

Apache Beam: FlatMap vs Map?

↘锁芯ラ submitted on 2019-11-29 00:06:47
Question: I want to understand in which scenario I should use FlatMap and in which I should use Map. The documentation did not seem clear to me. Could someone give me an example so I can understand their difference? I understand the difference between FlatMap and Map in Spark, but I am not sure whether there is any similarity here.

Answer 1: These transforms in Beam are exactly the same as in Spark (Scala too). A Map transform maps from a PCollection of …
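A small runnable sketch of the difference: Map emits exactly one output element per input, while FlatMap emits zero or more, flattening the iterable its callable returns:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['to be', 'or not'])

    # Map: one output per input -> [['to', 'be'], ['or', 'not']]
    nested = lines | 'Map' >> beam.Map(lambda line: line.split())

    # FlatMap: the returned iterable is flattened -> ['to', 'be', 'or', 'not']
    words = lines | 'FlatMap' >> beam.FlatMap(lambda line: line.split())
```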

Join two JSON files in Google Cloud Platform with Dataflow

巧了我就是萌 submitted on 2019-11-28 20:58:26
I want to find only the female employees from the two different JSON files, select only the fields we are interested in, and write the output into another JSON file. I am also trying to implement this on Google Cloud Platform using Dataflow. Can someone please provide sample Java code that can be used to get this result?

Employee JSON:

```
{"emp_id":"OrgEmp#1","emp_name":"Adam","emp_dept":"OrgDept#1","emp_country":"USA","emp_gender":"female","emp_birth_year":"1980","emp_salary":"$100000"}
{"emp_id":"OrgEmp#1","emp_name":"Scott","emp_dept":"OrgDept#3","emp_country":"USA","emp_gender"…
```
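The question asks for Java, but a compact Python sketch shows the shape of the pipeline: filter, key by department, CoGroupByKey, format. The department file path and its fields (dept_id, dept_name) are assumptions, since only the Employee JSON is shown:

```python
import json

import apache_beam as beam


def format_joined(kv):
    dept_id, grouped = kv
    depts = list(grouped['dept']) or [{}]  # keep employees with no matching dept
    for emp in grouped['emp']:
        for dept in depts:
            yield json.dumps({
                'emp_name': emp['emp_name'],
                'emp_dept': dept_id,
                'dept_name': dept.get('dept_name'),  # assumed department field
            })


with beam.Pipeline() as p:
    employees = (
        p
        | 'ReadEmp' >> beam.io.ReadFromText('gs://my-bucket/employees.json')
        | 'ParseEmp' >> beam.Map(json.loads)
        | 'FemaleOnly' >> beam.Filter(lambda e: e['emp_gender'] == 'female')
        | 'KeyEmpByDept' >> beam.Map(lambda e: (e['emp_dept'], e))
    )
    departments = (
        p
        | 'ReadDept' >> beam.io.ReadFromText('gs://my-bucket/departments.json')
        | 'ParseDept' >> beam.Map(json.loads)
        | 'KeyDept' >> beam.Map(lambda d: (d['dept_id'], d))  # assumed schema
    )
    (
        {'emp': employees, 'dept': departments}
        | 'Join' >> beam.CoGroupByKey()
        | 'Format' >> beam.FlatMap(format_joined)
        | 'Write' >> beam.io.WriteToText('gs://my-bucket/female_employees.json')
    )
```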

What is Apache Beam? [closed]

拟墨画扇 submitted on 2019-11-28 18:19:42
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 2 years ago.

I was going through the Apache posts and found a new term called Beam. Can anybody explain what exactly Apache Beam is? I tried to Google it but was unable to get a clear answer.

Answer 1: Apache Beam is an open source, unified model for defining and executing both batch and streaming …

Dataflow/Apache Beam - how to access current filename when passing in pattern?

China☆狼群 submitted on 2019-11-28 12:19:05
I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since Apache Beam added splittable DoFn functionality for Python. How would I access the filename of the current file being processed when passing a file pattern for a GCS bucket? I want to pass the filename into my transform function:

```python
with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam…
```
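Since the SDK added the apache_beam.io.fileio module, one way to keep the filename alongside the contents looks roughly like this (a sketch; the bucket pattern is a placeholder, and downstream parsing is left out):

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    files_with_names = (
        p
        | 'Match' >> fileio.MatchFiles('gs://my-bucket/path/*.json')
        | 'Read' >> fileio.ReadMatches()
        # Each element is a ReadableFile; the path is on .metadata.path.
        | 'ToNameAndText' >> beam.Map(
            lambda f: (f.metadata.path, f.read_utf8()))
    )
```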

How to do Async Http Call with Apache Beam (Java)?

夙愿已清 submitted on 2019-11-28 10:51:27
Question: The input PCollection is HTTP requests, which is a bounded dataset. I want to make an async HTTP call (Java) in a ParDo, parse the response, and put the results into an output PCollection. My code is below. I am getting the following exception and couldn't figure out the reason; I need some guidance:

```
java.util.concurrent.CompletionException: java.lang.IllegalStateException: Can't add element ValueInGlobalWindow{value=streaming.mapserver.backfill.EnrichedPoint@2c59e, pane=PaneInfo.NO_FIRING} to committed bundle in …
```
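This exception usually means a future's completion callback tried to output after processElement returned and the bundle was committed; outputs have to be emitted while the bundle is still live. The question is in Java, but a Python sketch of one workaround shows the shape: batch elements, run the calls concurrently, and block for the results inside process. `fetch` is a hypothetical async HTTP call:

```python
import asyncio

import apache_beam as beam


async def fetch(request):
    """Hypothetical async HTTP call returning a parsed response."""
    raise NotImplementedError


class AsyncHttpBatchFn(beam.DoFn):
    def process(self, batch):
        loop = asyncio.new_event_loop()
        try:
            # Run the whole batch concurrently, but wait for completion here,
            # so every result is emitted while this bundle is still open.
            responses = loop.run_until_complete(
                asyncio.gather(*(fetch(r) for r in batch)))
        finally:
            loop.close()
        for response in responses:
            yield response


# Usage sketch:
# results = (requests
#            | beam.BatchElements(min_batch_size=10, max_batch_size=50)
#            | beam.ParDo(AsyncHttpBatchFn()))
```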

Apache Beam windowing: consider late data but emit only one pane

£可爱£侵袭症+ submitted on 2019-11-28 09:44:43
Question: I would like to emit a single pane when the watermark reaches x minutes past the end of the window. This lets me ensure I handle some late data, but still emit only one pane. I am currently working in Java. At the moment I can't find a proper solution to this problem:

- I could emit a single pane when the watermark reaches the end of the window, but then any late data is dropped.
- I could emit the pane at the end of the window and then again when I receive late data, however in this case I am …
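In the Java SDK, one commonly suggested approach is to trigger with Never.ever() and set allowed lateness with ClosingBehavior.FIRE_ALWAYS, so the window fires exactly once, when the allowed lateness expires. For reference, a Python sketch of the windowing knobs involved (fixed one-minute windows, five minutes of allowed lateness, discarding mode); treat the trigger choice here as an assumption to adapt, since the plain AfterWatermark trigger still produces separate on-time and late panes:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as p:
    events = p | beam.Create([('key', 1)])  # stand-in for the streaming source
    windowed = events | beam.WindowInto(
        window.FixedWindows(60),                # 1-minute windows
        trigger=AfterWatermark(),               # fires once the watermark passes the window end
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=300)                   # accept data up to 5 minutes late
```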