apache-beam

Batch PCollection in Beam/Dataflow

天涯浪子 submitted on 2019-12-02 10:12:37
Question: I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing its elements one by one, I need to combine them "by N", something like grouped(N). So, in the bounded case, it would group items into batches of 10 and put whatever is left into the last batch. Is this possible in Apache Beam?

Answer 1: Edit: this looks like Google Dataflow "elementCountExact" aggregation. You should be able to do something similar by assigning elements to the global window and using AfterPane.elementCountAtLeast(N). You still need to…
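A minimal Python sketch of that trigger-based idea (AfterCount is the Python counterpart of Java's AfterPane.elementCountAtLeast; the key and batch size here are illustrative, not from the original answer):

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    batches = (
        p
        | beam.Create(range(25))
        | beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(10)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.Map(lambda x: ('all', x))   # single key so GroupByKey gathers each fired pane
        | beam.GroupByKey())

If exact batch sizes matter less than throughput, the Python SDK's beam.BatchElements (and, in newer SDKs, GroupIntoBatches) batches elements without manual triggering.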

Using start_bundle() in apache-beam job not working. Unpickleable storage.Client()

妖精的绣舞 submitted on 2019-12-02 10:06:33
Question: I'm getting this error:

pickle.PicklingError: Pickling client objects is explicitly not supported. Clients have non-trivial state that is local and unpickleable.

when trying to use beam.ParDo to call a function that looks like this:

class ExtractBlobs(beam.DoFn):
    def start_bundle(self):
        self.storageClient = storage.Client()

    def process(self, element):
        client = self.storageClient
        bucket = client.get_bucket(element)
        blobs = list(bucket.list_blobs(max_results=100))
        return blobs

I thought the whole…
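A minimal sketch of one way to keep the client out of pickling entirely, assuming the google-cloud-storage library: build the client lazily in start_bundle and yield plain blob names rather than Blob objects (which keep a reference back to the client). This is an illustration, not necessarily the fix the poster settled on:

import apache_beam as beam
from google.cloud import storage

class ExtractBlobNames(beam.DoFn):
    def start_bundle(self):
        # Created on the worker at bundle start, so it is never pickled
        # along with the DoFn instance.
        self.storage_client = storage.Client()

    def process(self, element):
        bucket = self.storage_client.get_bucket(element)
        # Emit plain strings; Blob objects carry a client reference and
        # can reintroduce the pickling problem downstream.
        for blob in bucket.list_blobs(max_results=100):
            yield blob.name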

AssertionError: assertion failed: copyAndReset must return a zero value copy

时光总嘲笑我的痴心妄想 submitted on 2019-12-02 09:44:21
When I applied ParDo.of(new ParDoFn()) to the PCollection named textInput, the program throws this exception, but it runs normally when I delete .apply(ParDo.of(new ParDoFn())).

//SparkRunner
private static void testHadoop(Pipeline pipeline){
    Class<? extends FileInputFormat<LongWritable, Text>> inputFormatClass =
        (Class<? extends FileInputFormat<LongWritable, Text>>) (Class<?>) TextInputFormat.class;
    @SuppressWarnings("unchecked")
    //hdfs://localhost:9000
    HadoopIO.Read.Bound<LongWritable, Text> readPTransfom_1 =
        HadoopIO.Read.from("hdfs://localhost:9000/tmp/kinglear.txt",
            inputFormatClass,…

Unable to read XML File stored in GCS Bucket

人盡茶涼 submitted on 2019-12-02 09:28:59
I have tried to follow this documentation as precisely as I could: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html

Please find my code below:

public static void main(String args[]) {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setTempLocation("gs://balajee_test/stagging");
    options.setProject("test-1-130106");
    Pipeline p = Pipeline.create(options);
    PCollection<XMLFormatter> record = p.apply(XmlIO.<XMLFormatter>read()
        .from("gs://balajee_test/sample_3.xml")
        .withRootElement("book")…
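XmlIO is a Java-only transform, so purely for comparison, here is a rough Python-SDK sketch of the same idea (match the file in GCS, read it whole, and emit one record per <book> element). The path and element name come from the snippet above; the parsing and the dict output shape are assumptions, standing in for the JAXB-annotated XMLFormatter class:

import xml.etree.ElementTree as ET
import apache_beam as beam
from apache_beam.io import fileio

def parse_books(readable_file):
    # Parse the whole file and emit one dict per <book> element.
    root = ET.fromstring(readable_file.read().decode('utf-8'))
    for book in root.iter('book'):
        yield {child.tag: child.text for child in book}

with beam.Pipeline() as p:
    books = (
        p
        | fileio.MatchFiles('gs://balajee_test/sample_3.xml')
        | fileio.ReadMatches()
        | beam.FlatMap(parse_books))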

Why is my RabbitMQ message impossible to serialize using Apache Beam?

血红的双手。 submitted on 2019-12-02 08:22:55
I'm trying to read a RabbitMQ queue using Apache Beam. I've written some transformation code to have messages written to Kafka, and I've even tested my scenario using simple text messages. Now I'm trying to deploy it with the actual messages this transform is meant to run on: JSON messages of quite moderate size. Strangely, when I try to read the "production" messages, I get this exception stack trace.

java.lang.IllegalArgumentException: Unable to encode element 'ValueWithRecordId{id=[], value=org.apache.beam.sdk.io.rabbitmq.RabbitMqMessage@f179a7f}' with coder 'ValueWithRecordId…

Google Cloud Dataflow Worker Threading

青春壹個敷衍的年華 submitted on 2019-12-02 08:14:11
Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores? Where would this kind of information be available? One worker thread is used per core, and each worker thread independently processes a chunk of the input space.

Source: https://stackoverflow.com/questions/47777639/google-cloud-dataflow-worker-threading

reading files and folders in order with apache beam

六月ゝ 毕业季﹏ submitted on 2019-12-02 07:59:55
I have a folder structure of the type year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder. Is it possible to do this with Apache Beam? What I would do is add timestamps to each element according to the file path. As a test I used the following example.…
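A minimal sketch of that timestamping idea, assuming elements arrive as (file_path, line) pairs with paths like 2018/01/05/13/part-0.log; the pairing and the parsing are assumptions, since the original example is cut off above:

import datetime
import apache_beam as beam
from apache_beam import window

class AddTimestampFromPath(beam.DoFn):
    def process(self, element):
        path, line = element
        year, month, day, hour = (int(p) for p in path.split('/')[:4])
        ts = datetime.datetime(
            year, month, day, hour, tzinfo=datetime.timezone.utc).timestamp()
        # Attach the folder-derived event time so downstream windowing and
        # watermarks follow the chronological folder order.
        yield window.TimestampedValue(line, ts)

# usage: timestamped = pairs | beam.ParDo(AddTimestampFromPath())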

Error while splitting pcollections on Dataflow runner

荒凉一梦 submitted on 2019-12-02 07:30:18
Question: I have an Apache Beam pipeline built in Python. I am reading rows from a CSV file. Then there are generic pipeline steps for all pcollections, and this works fine. For pcollections which come from a specific filename, I want to perform a couple of additional steps, so I tag the pcollections from that file and run the additional steps only for those tagged collections. When I run the pipeline on Dataflow it gives me the error "Workflow failed. Causes: Expected custom source to have non-zero number of…
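A minimal Python sketch of the tagging pattern being described (the file name, tag name, and sample rows are hypothetical, not taken from the question):

import apache_beam as beam
from apache_beam import pvalue

SPECIAL_TAG = 'from_special_file'

class TagByFile(beam.DoFn):
    def process(self, element):
        filename, row = element                 # assumed (filename, csv_row) pairs
        if filename.endswith('special.csv'):    # hypothetical file of interest
            yield pvalue.TaggedOutput(SPECIAL_TAG, row)
        else:
            yield row

with beam.Pipeline() as p:
    rows = p | beam.Create([('special.csv', 'a,1'), ('other.csv', 'b,2')])
    tagged = rows | beam.ParDo(TagByFile()).with_outputs(SPECIAL_TAG, main='rest')
    special_rows = tagged[SPECIAL_TAG]   # gets the couple of additional steps
    other_rows = tagged.rest             # generic steps only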

Google-cloud-dataflow: Why does the pipeline run twice with DirectRunner?

六月ゝ 毕业季﹏ submitted on 2019-12-02 06:24:44
Question: Given the data set below

{"slot":"reward","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42544}
{"slot":"reward_dlg","result":1,"rank":1,"isLandscape":false,"p_type":"main","level":1276,"type":"ba","seqNum":42545}

I try to filter the JSON records on type:ba and insert them into BigQuery with the Python SDK:

ba_schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER'

class ParseJsonDoFn(beam.DoFn):
    B_TYPE = 'tag_B'

    def process(self,…
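For context, a hedged sketch of how the filter-and-insert step can look with the Python SDK; the input path and table name are placeholders, the field mapping just follows the schema string above, and this is not the poster's full pipeline:

import json
import apache_beam as beam

ba_schema = 'slot:STRING,result:INTEGER,play_type:STRING,level:INTEGER'

class ParseJsonDoFn(beam.DoFn):
    def process(self, element):
        row = json.loads(element)
        if row.get('type') == 'ba':
            # Map the raw fields onto the schema columns.
            yield {'slot': row['slot'], 'result': row['result'],
                   'play_type': row['p_type'], 'level': row['level']}

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://some-bucket/events.json')   # placeholder input
     | beam.ParDo(ParseJsonDoFn())
     | beam.io.WriteToBigQuery(
           'some-project:some_dataset.ba_events',             # placeholder table
           schema=ba_schema,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))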

Does Dataflow templating support template input for BigQuery sink options?

﹥>﹥吖頭↗ submitted on 2019-12-02 05:39:21
As I have a working static Dataflow pipeline running, I'd like to create a template from it so I can easily reuse the Dataflow job without any command-line typing. Following the Creating Templates tutorial from the official documentation doesn't provide a sample for a templatable output. My Dataflow job ends with a BigQuery sink, which takes a few arguments such as the target table for storage. This exact parameter is the one I'd like to make available in my template, allowing me to choose the target storage after running the flow. But I'm not able to get this working. Below I paste some code snippets which could help…
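The usual mechanism for run-time template parameters is a ValueProvider. Below is a hedged Python-SDK sketch (the option name, input, and schema are illustrative, the poster's own snippets are not reproduced here, and it assumes an SDK version whose WriteToBigQuery accepts a ValueProvider table spec):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved at template execution time, so each run can pick its own table.
        parser.add_value_provider_argument(
            '--output_table',
            help='BigQuery table spec, e.g. project:dataset.table')

options = TemplateOptions()
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText('gs://some-bucket/input.txt')   # placeholder input
     | beam.Map(lambda line: {'line': line})
     | beam.io.WriteToBigQuery(
           options.output_table,      # ValueProvider passed straight to the sink
           schema='line:STRING'))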