skip header while reading a CSV file in Apache Beam

花落未央 2021-01-06 05:52

I want to skip the header line of a CSV file. As of now I'm removing the header manually before uploading the file to Google Cloud Storage.

Below is my code:

PCo         


        
3 Answers
  •  青春惊慌失措
    2021-01-06 06:15

    The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.

    This option is currently not available in the Apache Beam SDK, although there is an open feature request for it in the Apache Beam JIRA issue tracker, BEAM-123. As of writing, that request has been open and unresolved for two years. There does seem to be some activity on it: the latest update on the issue is from February 2018, and it was most recently moved to the sdk-java-core component, where it may get more attention. I would advise you to keep an eye on that JIRA issue.

    With that in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, though, as you can easily script and automate the "remove header, then upload file" process.
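
The header-removal step described above can be scripted in a few lines. What follows is only a minimal sketch in plain Java, not code from the original answer: the class name `StripCsvHeader`, the method `stripHeader`, and the two file arguments are hypothetical, and it assumes the header occupies exactly the first line of a UTF-8 file. The upload to GCS itself is left out.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: copies a CSV file while dropping its first line.
final class StripCsvHeader {

  /**
   * Copies every line except the first from input to output.
   * Assumes the header is exactly one line; returns the number of data lines written.
   */
  static long stripHeader(Path input, Path output) throws IOException {
    long written = 0;
    try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
         BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      reader.readLine(); // read and discard the header line
      String line;
      while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.write('\n'); // always write LF, regardless of platform
        written++;
      }
    }
    return written;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: java StripCsvHeader <input.csv> <output.csv>");
      System.exit(1);
    }
    long rows = stripHeader(Paths.get(args[0]), Paths.get(args[1]));
    System.out.println("Wrote " + rows + " data rows");
  }
}
```

Something like `java StripCsvHeader input.csv output.csv` would then produce the header-free file, which can be uploaded with whatever tool you already use. Note that a header containing a quoted line break would not be handled by this sketch.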


    EDIT:

    I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know beforehand the header of the CSV files being uploaded (as it will be filtering by element content), but again, take this just as a template that you may be able to modify to your needs:

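    import java.io.IOException;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class RemoveCSVHeader {
      // The filter: drops every element that is exactly equal to the header line
      static class FilterCSVHeaderFn extends DoFn<String, String> {
        private final String headerFilter;

        public FilterCSVHeaderFn(String headerFilter) {
          this.headerFilter = headerFilter;
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
          String row = c.element();
          // Emit only the rows that do not match the header
          if (!row.equals(this.headerFilter)) {
            c.output(row);
          }
        }
      }

      // The main class
      public static void main(String[] args) throws IOException {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        // Each element of the PCollection is one line of the CSV file
        PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));

        // The header must be known in advance, since filtering is by content
        String header = "col1,col2,col3,col4";

        vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
            .apply(TextIO.write().to("out"));

        p.run().waitUntilFinish();
      }
    }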
    
