Google Cloud Dataflow: Read from a file with a dynamic filename

Posted by 我怕爱的太早我们不能终老 on 2019-12-08 04:18:05

Question


I am trying to build a pipeline on Google Cloud Dataflow that would do the following:

  • Listen for events on a Pub/Sub subscription
  • Extract the filename from the event text
  • Read the file (from a Google Cloud Storage bucket)
  • Store the records in BigQuery

Following is the code:

Pipeline pipeline = /* create pipeline */;
pipeline.apply("Read events", PubsubIO.readStrings().fromSubscription("sub"))
        .apply("Deserialise events", /* code that produces ParDo.SingleOutput<String, KV<String, byte[]>> */)
        .apply(TextIO.read().from("")); // ??? the filename is only known at runtime

I am struggling with the 3rd step: I am not sure how to access the output of the second step and use it in the third. I have tried writing code along the following lines:

private ParDo.SingleOutput<KV<String, byte[]>, TextIO.Read> readFile(){
    //A class that extends DoFn<KV<String, byte[]>, TextIO.Read> and has TextIO.read wrapped into processElement method
}

However, I am not able to read the file content in the subsequent step.

Could anyone please let me know what I need to write in the 3rd and 4th steps so that I can consume the file line by line and store the output in BigQuery (or just log it)?


Answer 1:


The natural way to express your read is the TextIO.readAll() method, which reads text files from an input PCollection of file names. This method has been added to the Beam codebase but is not yet in a released version. It will be included in the Beam 2.2.0 release and the corresponding Dataflow 2.2.0 release.
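
As a rough illustration of that approach, here is a minimal sketch, assuming Beam 2.2.0 or later and an upstream step that already emits complete file paths as plain strings. The class name DynamicFileRead and the helper readLines are made up for this example; to keep it runnable locally, the paths come from the command line rather than from Pub/Sub.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical class name, used only for this sketch.
class DynamicFileRead {

    // Takes a collection of file paths and emits every line of every file.
    static PCollection<String> readLines(PCollection<String> filePaths) {
        return filePaths.apply("Read files", TextIO.readAll());
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

        // Assumption: in the real pipeline this collection would be the output of
        // the Pub/Sub read and deserialisation steps, reduced to String paths.
        PCollection<String> filePaths = pipeline.apply("File paths",
                Create.of(Arrays.asList(args)).withCoder(StringUtf8Coder.of()));

        // Log each line; a BigQuery write would take the place of this step.
        readLines(filePaths).apply("Log lines", ParDo.of(new DoFn<String, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println(c.element());
            }
        }));

        pipeline.run().waitUntilFinish();
    }
}
```

In the question's pipeline, readLines would follow the deserialisation step once that step outputs a PCollection<String> of paths such as gs://<bucket>/<object>. Later Beam releases mark TextIO.readAll() as deprecated in favour of FileIO.matchAll(), FileIO.readMatches() and TextIO.readFiles(), so the exact call may depend on the SDK version in use.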




Answer 2:


You can do this with a SerializableFunction:

pipeline.apply(TextIO.read().from(new FileNameFn()));

public class FileNameFn implements SerializableFunction<inputFileNameString, outputQualifiedFileNameStringWithBucket>

Obviously, you can pass the bucket name and other parameters statically through constructor arguments when creating an instance of this class.
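
The constructor-argument idea might look roughly like the sketch below. It is written against plain java.util.function.Function so that it runs without the Beam SDK; inside a pipeline the class would presumably implement Beam's SerializableFunction<String, String> instead, which has the same apply shape. The gs://<bucket>/<name> layout is an assumption about how the files are stored.

```java
import java.io.Serializable;
import java.util.function.Function;

// Maps a bare file name to a fully qualified Cloud Storage path.
class FileNameFn implements Function<String, String>, Serializable {

    // Fixed when the instance is created, as suggested above.
    private final String bucket;

    FileNameFn(String bucket) {
        this.bucket = bucket;
    }

    @Override
    public String apply(String fileName) {
        // Assumed layout: gs://<bucket>/<fileName>
        return "gs://" + bucket + "/" + fileName;
    }

    public static void main(String[] args) {
        FileNameFn fn = new FileNameFn("my-bucket");
        System.out.println(fn.apply("data/a.csv"));
    }
}
```

A mapping like this could also run in a MapElements step ahead of the file read, so that the read only ever sees complete paths.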

Hope this helps.



Source: https://stackoverflow.com/questions/46345738/google-cloud-dataflow-read-from-a-file-with-dynamic-filename
