apache-beam

How to log incoming messages in an Apache Beam pipeline

为君一笑 submitted on 2020-01-16 19:06:49
Question: I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it into BigQuery. For hours I thought I was not even able to read a message, because I was simply trying to log the input to the console: events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION) logging.info(events) When I write this to text it works fine! However, my call to the logger never happens. How do people develop / debug these streaming pipelines? I have tried adding the
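
A minimal sketch of one common way to see each message (reusing the p and SUBSCRIPTION names from the snippet above): logging.info(events) at pipeline-construction time only prints the PCollection object, so the logging call has to run per element, for example inside a Map or DoFn, and when running locally the root logger usually needs its level set to INFO.

import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub

def log_element(element):
    # Runs once per message at execution time, so it shows up in the
    # worker / console logs rather than at graph-construction time.
    logging.info('Received: %s', element)
    return element

logging.getLogger().setLevel(logging.INFO)

events = (p
          | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION)
          | 'Log messages' >> beam.Map(log_element))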

Can't read from Kafka with KafkaIO in Beam

纵然是瞬间 submitted on 2020-01-16 16:31:53
Question: I have written a very simple pipeline in Apache Beam, as follows, to read data from my Kafka cluster on Confluent Cloud: Pipeline pipeline = Pipeline.create(options); Map<String, Object> propertyBuilder = new HashMap(); propertyBuilder.put("ssl.endpoint.identification.algorithm", "https"); propertyBuilder.put("sasl.mechanism","PLAIN"); propertyBuilder.put("request.timeout.ms","20000"); propertyBuilder.put("retry.backoff.ms","500"); pipeline .apply(KafkaIO.<byte[], byte[]>readBytes()
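
For reference, a hedged sketch of the consumer properties Confluent Cloud typically requires, shown here with the Python SDK's cross-language ReadFromKafka rather than the Java KafkaIO used in the question; the broker address, topic name and credentials are placeholders, not values from the question.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

# Placeholder values. Confluent Cloud normally also needs
# security.protocol=SASL_SSL and a sasl.jaas.config entry in
# addition to sasl.mechanism=PLAIN.
consumer_config = {
    'bootstrap.servers': '<BOOTSTRAP_SERVER>:9092',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.jaas.config': (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";'),
}

with beam.Pipeline() as p:
    # ReadFromKafka is a cross-language transform, so the runner needs
    # Java available to start the expansion service.
    records = (p
               | 'Read Kafka' >> ReadFromKafka(consumer_config=consumer_config,
                                               topics=['<TOPIC>']))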

TypeError: Receiver() takes no arguments (when running any pipeline with DirectRunner)

我是研究僧i submitted on 2020-01-16 08:42:17
Question: I had my pipeline working locally with DirectRunner, and suddenly it started to fail. To make sure it was not me messing with the code, I tested again with the standard wordcount.py example. Same result: TypeError: Receiver() takes no arguments Any idea why I am getting that error now with any pipeline, please? Python version is 3.7.6, Apache Beam version is 2.16.0. Command line output below (wordcount.py example code, untouched): (venv37) PS C:\Users\mnight\Documents\myproject

How do I perform a “diff” on two Sources given a key using the Apache Beam Python SDK?

萝らか妹 submitted on 2020-01-15 07:27:12
Question: I posed the question generically, because maybe there is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first 2 columns.

Table A
C1 C2 C3
-----------
a  a  1
a  b  1
a  c  1

Table B
C1 C2 C3   # Notes if comparing B to A
-------------------------------------
a  a  1    # No change to the key a + a
a  b  2    # Key a + b changed from 1 to 2
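
A hedged sketch of one way to compute such a diff with the Python SDK: key both tables by the composite key and use CoGroupByKey. The project, dataset and table names are placeholders, and older SDK versions would read via beam.io.BigQuerySource instead of ReadFromBigQuery.

import apache_beam as beam

def classify(kv):
    # One composite key together with the matching rows from each table.
    key, grouped = kv
    a_rows = list(grouped['table_a'])
    b_rows = list(grouped['table_b'])
    if not a_rows:
        yield ('ADDED', key, b_rows)
    elif not b_rows:
        yield ('DELETED', key, a_rows)
    elif a_rows != b_rows:
        yield ('MODIFIED', key, (a_rows, b_rows))
    # Identical rows produce no diff output.

def key_by_composite(row):
    # Composite key from the first two columns.
    return ((row['C1'], row['C2']), row)

with beam.Pipeline() as p:
    table_a = (p
               | 'Read A' >> beam.io.ReadFromBigQuery(table='project:dataset.table_a')
               | 'Key A' >> beam.Map(key_by_composite))
    table_b = (p
               | 'Read B' >> beam.io.ReadFromBigQuery(table='project:dataset.table_b')
               | 'Key B' >> beam.Map(key_by_composite))
    diff = ({'table_a': table_a, 'table_b': table_b}
            | 'Group by key' >> beam.CoGroupByKey()
            | 'Classify' >> beam.FlatMap(classify))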

Shutting down JVM after 8 consecutive periods of measured GC thrashing

こ雲淡風輕ζ submitted on 2020-01-15 04:44:11
Question: I am writing an Apache Beam batch Dataflow pipeline that writes from GCS to BQ. My data contains 4 million records. I have specified the n1-highmem-8 machine type. My Dataflow job works for a small amount of data. In my use case the schema is not fixed, so I have used the .getFailedInserts() method to get the records that failed schema validation and were not inserted. I have grouped them and I am writing them to BQ using a BQ load job via GCS in the same Dataflow job. For this amount of data I am getting the following error 7 times and then my Dataflow job errors

Migration from DynamoDB to Spanner/BigTable

左心房为你撑大大i submitted on 2020-01-14 10:48:26
Question: I have a use case where I need to migrate 70 TB of data from DynamoDB to BigTable and Spanner. Tables with a single index will go to BigTable; otherwise they will go to Spanner. I can easily handle the historical loads by exporting the data to S3 --> GCS --> Spanner/BigTable. But the challenging part is handling the incremental streaming loads happening simultaneously on DynamoDB. There are 300 tables in DynamoDB. How can this be handled in the best possible manner? Has anyone done this before?

Join 2 JSON inputs linked by Primary Key

三世轮回 submitted on 2020-01-11 11:28:11
Question: I am trying to merge 2 JSON inputs (this example is from a file, but it will come from a Google Pub/Sub input later). They are as follows:

orderID.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}

combined.json:
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
{"barcode":"95593","name":"Dog","quantity":6,"orderID
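
A hedged sketch of joining the two inputs on orderID with CoGroupByKey in the Python SDK; the tag names 'orders' and 'lines' and the local file paths are illustrative assumptions, not part of the question.

import json

import apache_beam as beam

def merge(kv):
    # For each orderID, pair the order header with every line item.
    order_id, grouped = kv
    for order in grouped['orders']:
        for line in grouped['lines']:
            yield {**order, **line}

with beam.Pipeline() as p:
    orders = (p
              | 'Read orders' >> beam.io.ReadFromText('orderID.json')
              | 'Parse orders' >> beam.Map(json.loads)
              | 'Key orders' >> beam.Map(lambda d: (d['orderID'], d)))
    lines = (p
             | 'Read lines' >> beam.io.ReadFromText('combined.json')
             | 'Parse lines' >> beam.Map(json.loads)
             | 'Key lines' >> beam.Map(lambda d: (d['orderID'], d)))
    joined = ({'orders': orders, 'lines': lines}
              | 'Join on orderID' >> beam.CoGroupByKey()
              | 'Merge' >> beam.FlatMap(merge)
              | 'To JSON' >> beam.Map(json.dumps))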

Dataflow GroupBy -> multiple outputs based on keys

本小妞迷上赌 submitted on 2020-01-07 05:02:13
Question: Is there any simple way that I can redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create()) .apply(ParDo.named("Print Bins").of( ... ) .apply(TextIO.Write.to(*Output file based on key*)) If Sink is the solution, would you please share sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
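
As an illustrative aside for the Python SDK (not the Java DynamicDestinations API referenced in the answer above), a comparable "one file per key" pattern in newer Beam releases is fileio.WriteToFiles with a destination callable; the output directory, element shapes and key format below are assumptions.

import apache_beam as beam
from apache_beam.io import fileio

def format_record(kv):
    # Turn a (key, value) pair into a text line that still starts with
    # the key, so the destination callable can recover it.
    key, value = kv
    return '%s,%s' % (key, value)

with beam.Pipeline() as p:
    _ = (p
         | 'Toy data' >> beam.Create([('binA', 1), ('binA', 2), ('binB', 3)])
         | 'Format' >> beam.Map(format_record)
         | 'Write per key' >> fileio.WriteToFiles(
               path='./grouped_output',
               destination=lambda line: line.split(',', 1)[0],
               sink=lambda dest: fileio.TextSink(),
               file_naming=fileio.destination_prefix_naming()))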