apache-beam

Failed to update work status Exception in Python Cloud Dataflow

非 Y 不嫁゛ submitted on 2020-01-24 19:18:22
Question: I have a Python Cloud Dataflow job that works fine on smaller subsets, but seems to be failing for no obvious reason on the complete dataset. The only error I get in the Dataflow interface is the standard error message: A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. Analysing the Stackdriver logs only shows this error: Exception in worker loop: Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages
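When a pipeline succeeds on a sample but loses workers on the full dataset, the cause is often worker resources rather than the code itself. A minimal sketch of giving the workers more headroom through pipeline options; the machine type, disk size, project, region and bucket below are assumptions for illustration, not values from the question:

```python
# Hedged sketch: every value here is a placeholder, not taken from the original job.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    machine_type='n1-highmem-4',         # larger workers reduce memory-related disconnects
    disk_size_gb=100,
)
```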

Google-cloud-dataflow: Failed to insert json data to bigquery through `WriteToBigQuery/BigQuerySink` with `BigQueryDisposition.WRITE_TRUNCATE`

瘦欲@ submitted on 2020-01-24 13:04:07
Question: Given the data set below {"slot":"reward","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42544} {"slot":"reward_dlg","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42545} ...more json data of this type here I try to filter this json data and insert it into BigQuery with the Python SDK as follows: ba_schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER' class ParseJsonDoFn(beam.DoFn): B_TYPE = 'tag
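The question is essentially about wiring a JSON-parsing DoFn into WriteToBigQuery with WRITE_TRUNCATE. A minimal sketch of that wiring, assuming the fields map as the excerpt suggests; the input path, project, dataset and table names are placeholders:

```python
# Hedged sketch: the field mapping (p_type -> play_type, type == 'ba' filter)
# is an assumption based on the excerpt, and all paths/tables are placeholders.
import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

ba_schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER'


class ParseJsonDoFn(beam.DoFn):
    def process(self, element):
        record = json.loads(element)
        if record.get('type') == 'ba':
            # Rename p_type to play_type so the row matches ba_schema.
            yield {
                'slot': record['slot'],
                'result': record['result'],
                'play_type': record['p_type'],
                'level': record['level'],
            }


with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.json')
     | 'Parse' >> beam.ParDo(ParseJsonDoFn())
     | 'Write' >> WriteToBigQuery(
           'my-project:my_dataset.ba_events',
           schema=ba_schema,
           create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=BigQueryDisposition.WRITE_TRUNCATE))
```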

Apache Beam - What are the key concepts for writing efficient data processing pipelines I should be aware of?

元气小坏坏 submitted on 2020-01-24 01:12:28
Question: I've been using Beam for some time now and I'd like to know what the key concepts are for writing efficient and optimized Beam pipelines. I have a little Spark background and I know that we may prefer reduceByKey over groupByKey to avoid shuffling and optimise network traffic. Is it the same case for Beam? I'd appreciate some tips or materials/best practices. Answer 1: Some items to consider: Graph Design Considerations: Filter first; place filter operations as high in the DAG as
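On the reduceByKey vs groupByKey point: Beam's closest analogue is to prefer a Combine transform such as CombinePerKey, which pre-combines values on each worker before the shuffle, over a raw GroupByKey. A small illustrative sketch with toy data (not from the question):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | 'Create' >> beam.Create([('a', 1), ('b', 2), ('a', 3)])

    # Preferred: partial sums are computed worker-side before the shuffle,
    # much like Spark's reduceByKey.
    combined = pairs | 'SumPerKey' >> beam.CombinePerKey(sum)

    # Works, but ships every value for a key across the network first.
    grouped = (pairs
               | 'Group' >> beam.GroupByKey()
               | 'SumGroups' >> beam.MapTuple(lambda k, vs: (k, sum(vs))))

    combined | 'Print' >> beam.Map(print)
```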

Apache-Beam: Read parquet files from nested HDFS directories

谁说我不能喝 submitted on 2020-01-24 00:27:30
Question: How could I read all parquet files stored in HDFS using the Apache-Beam 2.13.0 Python SDK with the direct runner if the directory structure is the following:
data/
├── a
│   ├── file_1.parquet
│   └── file_2.parquet
└── b
    ├── file_3.parquet
    └── file_4.parquet
I tried beam.io.ReadFromParquet and hdfs://data/*/* : import apache_beam as beam from apache_beam.options.pipeline_options import PipelineOptions HDFS_HOSTNAME = 'my-hadoop-master-node.com' HDFS_PORT = 50070 HDFS_USER = "my-user-name" pipeline
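For reference, a hedged sketch of how this layout could be read with Beam's HadoopFileSystemOptions and a per-level glob; the hdfs_host/hdfs_port/hdfs_user option names exist in the Python SDK, while the exact pattern 'hdfs://data/*/*.parquet' is an assumption about this two-level layout:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

HDFS_HOSTNAME = 'my-hadoop-master-node.com'
HDFS_PORT = 50070
HDFS_USER = 'my-user-name'

# hdfs_host / hdfs_port / hdfs_user come from HadoopFileSystemOptions.
options = PipelineOptions(
    runner='DirectRunner',
    hdfs_host=HDFS_HOSTNAME,
    hdfs_port=HDFS_PORT,
    hdfs_user=HDFS_USER,
)

with beam.Pipeline(options=options) as p:
    (p
     # One wildcard per directory level; adjust the glob if nesting is deeper.
     | 'ReadParquet' >> beam.io.ReadFromParquet('hdfs://data/*/*.parquet')
     | 'Print' >> beam.Map(print))
```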

Beam/Google Cloud Dataflow ReadFromPubsub Missing Data

守給你的承諾、 submitted on 2020-01-23 03:32:07
Question: I have 2 dataflow streaming pipelines (Pub/Sub to BigQuery) with the following code: class transform_class(beam.DoFn): def process(self, element, publish_time=beam.DoFn.TimestampParam, *args, **kwargs): logging.info(element) yield element class identify_and_transform_tables(beam.DoFn): #Adding Publish Timestamp #Since I'm reading from a topic that consists of data from multiple tables, #the function here is to identify the tables and split them apart def run(pipeline_args=None): # `save_main_session`
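A stripped-down version of the streaming flow described in the excerpt, offered only as a sketch; the topic path, table name and write disposition are placeholders, not values from the question:

```python
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class TransformClass(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        # Log each message together with its Pub/Sub publish timestamp.
        logging.info('%s published at %s', element, publish_time)
        yield element


def run(pipeline_args=None):
    options = PipelineOptions(pipeline_args, streaming=True, save_main_session=True)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
         | 'Transform' >> beam.ParDo(TransformClass())
         | 'Write' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
```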

How to run dynamic second query in google cloud dataflow?

青春壹個敷衍的年華 submitted on 2020-01-23 03:29:05
Question: I'm attempting to do an operation wherein I get a list of Ids via a query, transform them into a string separated by commas (i.e. "1,2,3") and then use it in a secondary query. When attempting to run the second query, I'm given a syntax error: "Target type of a lambda conversion must be an interface" String query = "SELECT DISTINCT campaignId FROM `" + options.getEligibilityInputTable() + "` "; Pipeline p = Pipeline.create(options); p.apply("GetCampaignIds", BigQueryIO.readTableRows()
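The excerpt is Java, but the underlying pattern (use the result of a first BigQuery query to build and run a second one) can be sketched with the Python SDK for consistency with the rest of this page. Everything below is a hedged sketch: the table names and queries are placeholders, and issuing the second query from a DoFn via the google-cloud-bigquery client, fed by a side input, is just one possible approach rather than the canonical answer:

```python
import apache_beam as beam
from apache_beam.pvalue import AsSingleton


class RunSecondQuery(beam.DoFn):
    def process(self, _, id_string):
        # Import on the worker so the client library is only needed there.
        from google.cloud import bigquery
        client = bigquery.Client()
        query = ('SELECT * FROM `my_dataset.events` '          # placeholder table
                 'WHERE campaignId IN (%s)' % id_string)
        for row in client.query(query).result():
            yield dict(row.items())


with beam.Pipeline() as p:
    ids = (p
           | 'GetCampaignIds' >> beam.io.Read(beam.io.BigQuerySource(
                 query='SELECT DISTINCT campaignId FROM `my_dataset.eligibility`',
                 use_standard_sql=True))
           | 'ExtractId' >> beam.Map(lambda row: str(row['campaignId'])))

    # Collapse all ids into one comma-separated string, e.g. "1,2,3".
    id_string = ids | 'JoinIds' >> beam.CombineGlobally(lambda values: ','.join(values))

    (p
     | 'Trigger' >> beam.Create([None])
     | 'SecondQuery' >> beam.ParDo(RunSecondQuery(), AsSingleton(id_string))
     | 'Print' >> beam.Map(print))
```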

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

巧了我就是萌 submitted on 2020-01-23 01:39:26
Question: I'm setting up a slow-changing lookup Map in my Apache-Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits an exception: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey Is anything wrong with this code snippet? If I use .discardingFiredPanes() instead, I will lose information in the last emit. pipeline
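The "Duplicate values" error usually indicates that a map-style side input received more than one value for the same key (common with accumulating panes); typical remedies are to reduce to one value per key before building the view, or to use a multimap view instead. A hedged illustration of the per-key de-duplication idea, written against the Python SDK for consistency with the other snippets on this page, with toy data rather than the asker's:

```python
import apache_beam as beam
from apache_beam.pvalue import AsDict

with beam.Pipeline() as p:
    lookup = p | 'Lookup' >> beam.Create([
        ('mykey', 'v1'), ('mykey', 'v2'), ('other', 'x')])

    # Keep exactly one value per key (here simply the max) so the map-style
    # side input never sees duplicate keys.
    deduped = lookup | 'OnePerKey' >> beam.CombinePerKey(max)

    main = p | 'Main' >> beam.Create(['mykey', 'other'])
    (main
     | 'Enrich' >> beam.Map(lambda k, m: (k, m.get(k)), m=AsDict(deduped))
     | 'Print' >> beam.Map(print))
```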

Dataflow/apache beam: manage custom module dependencies

不打扰是莪最后的温柔 submitted on 2020-01-22 22:58:11
Question: I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. I have a structure like this:
├── mymain.py
└── myothermodule.py
I import myothermodule.py in mymain.py like this: import myothermodule When I run locally on DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get an error that says: ImportError: No module named myothermodule So I want to know what I should do if I want this module to be found when running
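A common remedy is to package the extra module and point the runner at it with --setup_file, so Dataflow installs it on every worker. A minimal sketch, assuming the two files sit in the same directory; the package name and version are arbitrary:

```python
# setup.py, placed next to mymain.py
import setuptools

setuptools.setup(
    name='mypipeline',             # arbitrary package name
    version='0.0.1',
    py_modules=['myothermodule'],  # ships myothermodule.py to the workers
)
```

The job would then be launched with something like `python mymain.py --runner DataflowRunner --setup_file ./setup.py ...` (the other required Dataflow options are omitted here).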
