apache-beam

How to log incoming messages in an Apache Beam pipeline

为君一笑 submitted on 2020-01-16 19:06:49
Question: I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it into BigQuery. For hours I thought I was not even able to read a message, because I was simply trying to log the input to the console: events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION) logging.info(events) When I write this to text it works fine! However, my call to the logger never happens. How do people develop / debug these streaming pipelines? I have tried adding the
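
A minimal sketch of one common way to see each message (reusing the p and SUBSCRIPTION names from the snippet above): logging.info(events) at pipeline-construction time only prints the PCollection object, so the logging call has to run per element, for example inside a Map or DoFn, and when running locally the root logger usually needs its level set to INFO.

import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub

def log_element(element):
    # Runs once per message at execution time, so it shows up in the
    # worker / console logs rather than at graph-construction time.
    logging.info('Received: %s', element)
    return element

logging.getLogger().setLevel(logging.INFO)

events = (p
          | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION)
          | 'Log messages' >> beam.Map(log_element))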

Can't read from Kafka with KafkaIO in Beam

纵然是瞬间 submitted on 2020-01-16 16:31:53
Question: I have written a very simple pipeline in Apache Beam, as follows, to read data from my Kafka cluster on Confluent Cloud: Pipeline pipeline = Pipeline.create(options); Map<String, Object> propertyBuilder = new HashMap(); propertyBuilder.put("ssl.endpoint.identification.algorithm", "https"); propertyBuilder.put("sasl.mechanism","PLAIN"); propertyBuilder.put("request.timeout.ms","20000"); propertyBuilder.put("retry.backoff.ms","500"); pipeline .apply(KafkaIO.<byte[], byte[]>readBytes()
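
For reference, a hedged sketch of the consumer properties Confluent Cloud typically requires, shown here with the Python SDK's cross-language ReadFromKafka rather than the Java KafkaIO used in the question; the broker address, topic name and credentials are placeholders, not values from the question.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

# Placeholder values. Confluent Cloud normally also needs
# security.protocol=SASL_SSL and a sasl.jaas.config entry in
# addition to sasl.mechanism=PLAIN.
consumer_config = {
    'bootstrap.servers': '<BOOTSTRAP_SERVER>:9092',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.jaas.config': (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";'),
}

with beam.Pipeline() as p:
    # ReadFromKafka is a cross-language transform, so the runner needs
    # Java available to start the expansion service.
    records = (p
               | 'Read Kafka' >> ReadFromKafka(consumer_config=consumer_config,
                                               topics=['<TOPIC>']))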

TypeError: Receiver() takes no arguments (when running any pipeline with DirectRunner)

我是研究僧i submitted on 2020-01-16 08:42:17
Question: I had my pipeline working locally with DirectRunner, and suddenly it started to fail. To make sure it was not me messing with the code, I tested again with the standard wordcount.py example. Same result: TypeError: Receiver() takes no arguments Any idea why I am getting that error now with any pipeline, please? Python version is 3.7.6, Apache Beam version is 2.16.0. Command line output below (wordcount.py example code, untouched): (venv37) PS C:\Users\mnight\Documents\myproject

How do I perform a “diff” on two Sources given a key using the Apache Beam Python SDK?

萝らか妹 submitted on 2020-01-15 07:27:12
Question: I posed the question generically, because maybe there is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first 2 columns.

Table A
C1 C2 C3
-----------
a  a  1
a  b  1
a  c  1

Table B
C1 C2 C3   # Notes if comparing B to A
-------------------------------------
a  a  1    # No change to the key a + a
a  b  2    # Key a + b changed from 1 to 2
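
A hedged sketch of one way to compute such a diff with the Python SDK: key both tables by the composite key and use CoGroupByKey. The project, dataset and table names are placeholders, and older SDK versions would read via beam.io.BigQuerySource instead of ReadFromBigQuery.

import apache_beam as beam

def classify(kv):
    # One composite key together with the matching rows from each table.
    key, grouped = kv
    a_rows = list(grouped['table_a'])
    b_rows = list(grouped['table_b'])
    if not a_rows:
        yield ('ADDED', key, b_rows)
    elif not b_rows:
        yield ('DELETED', key, a_rows)
    elif a_rows != b_rows:
        yield ('MODIFIED', key, (a_rows, b_rows))
    # Identical rows produce no diff output.

def key_by_composite(row):
    # Composite key from the first two columns.
    return ((row['C1'], row['C2']), row)

with beam.Pipeline() as p:
    table_a = (p
               | 'Read A' >> beam.io.ReadFromBigQuery(table='project:dataset.table_a')
               | 'Key A' >> beam.Map(key_by_composite))
    table_b = (p
               | 'Read B' >> beam.io.ReadFromBigQuery(table='project:dataset.table_b')
               | 'Key B' >> beam.Map(key_by_composite))
    diff = ({'table_a': table_a, 'table_b': table_b}
            | 'Group by key' >> beam.CoGroupByKey()
            | 'Classify' >> beam.FlatMap(classify))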

Shutting down JVM after 8 consecutive periods of measured GC thrashing

こ雲淡風輕ζ submitted on 2020-01-15 04:44:11
Question: I am writing an Apache Beam batch Dataflow pipeline that writes from GCS to BQ. My data contains 4 million records. I have specified the n1-highmem-8 machine type. My Dataflow job works for a small amount of data. In my use case the schema is not fixed, so I have used the .getFailedInserts() method to get the records that failed schema validation and were not inserted. I have grouped them and I am writing them to BQ using a BQ load job via GCS in the same Dataflow job. For this amount of data I am getting the following error 7 times and then my Dataflow job errors

Migration from DynamoDB to Spanner/BigTable

左心房为你撑大大i submitted on 2020-01-14 10:48:26
Question: I have a use case where I need to migrate 70 TB of data from DynamoDB to BigTable and Spanner. Tables with a single index will go to BigTable; otherwise they will go to Spanner. I can easily handle the historical loads by exporting the data to S3 --> GCS --> Spanner/BigTable. But the challenging part is handling the incremental streaming loads happening simultaneously on DynamoDB. There are 300 tables in DynamoDB. How can this be handled in the best possible manner? Has anyone done this before?

Join 2 JSON inputs linked by Primary Key

三世轮回 submitted on 2020-01-11 11:28:11
Question: I am trying to merge 2 JSON inputs (this example is from a file, but it will come from a Google Pub/Sub input later). They are as follows:

orderID.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}

combined.json:
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
{"barcode":"95593","name":"Dog","quantity":6,"orderID
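
A hedged sketch of joining the two inputs on orderID with CoGroupByKey in the Python SDK; the tag names 'orders' and 'lines' and the local file paths are illustrative assumptions, not part of the question.

import json

import apache_beam as beam

def merge(kv):
    # For each orderID, pair the order header with every line item.
    order_id, grouped = kv
    for order in grouped['orders']:
        for line in grouped['lines']:
            yield {**order, **line}

with beam.Pipeline() as p:
    orders = (p
              | 'Read orders' >> beam.io.ReadFromText('orderID.json')
              | 'Parse orders' >> beam.Map(json.loads)
              | 'Key orders' >> beam.Map(lambda d: (d['orderID'], d)))
    lines = (p
             | 'Read lines' >> beam.io.ReadFromText('combined.json')
             | 'Parse lines' >> beam.Map(json.loads)
             | 'Key lines' >> beam.Map(lambda d: (d['orderID'], d)))
    joined = ({'orders': orders, 'lines': lines}
              | 'Join on orderID' >> beam.CoGroupByKey()
              | 'Merge' >> beam.FlatMap(merge)
              | 'To JSON' >> beam.Map(json.dumps))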

Dataflow GroupBy -> multiple outputs based on keys

本小妞迷上赌 submitted on 2020-01-07 05:02:13
Question: Is there any simple way that I can redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create()) .apply(ParDo.named("Print Bins").of( ... ) .apply(TextIO.Write.to(*Output file based on key*)) If Sink is the solution, would you please share sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
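
As an illustrative aside for the Python SDK (not the Java DynamicDestinations API referenced in the answer above), a comparable "one file per key" pattern in newer Beam releases is fileio.WriteToFiles with a destination callable; the output directory, element shapes and key format below are assumptions.

import apache_beam as beam
from apache_beam.io import fileio

def format_record(kv):
    # Turn a (key, value) pair into a text line that still starts with
    # the key, so the destination callable can recover it.
    key, value = kv
    return '%s,%s' % (key, value)

with beam.Pipeline() as p:
    _ = (p
         | 'Toy data' >> beam.Create([('binA', 1), ('binA', 2), ('binB', 3)])
         | 'Format' >> beam.Map(format_record)
         | 'Write per key' >> fileio.WriteToFiles(
               path='./grouped_output',
               destination=lambda line: line.split(',', 1)[0],
               sink=lambda dest: fileio.TextSink(),
               file_naming=fileio.destination_prefix_naming()))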