How to read a zip containing multiple files in Apache Spark

星月不相逢 2020-12-06 18:41

I have a zipped file containing multiple text files. I want to read each file and build a list of RDDs, each containing the content of one file.

val         


        
5 answers
  •  囚心锁ツ
    2020-12-06 19:35

    This reads only the first line of each file. Can anyone share some insight? I am trying to read a zipped CSV file and create a JavaRDD for further processing.

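    // Needs: org.apache.spark.api.java.*, org.apache.spark.api.java.function.FlatMapFunction,
    // org.apache.spark.input.PortableDataStream, scala.Tuple2, java.io.*, java.util.*, java.util.zip.*
    JavaPairRDD<String, PortableDataStream> zipData =
            sc.binaryFiles("hdfs://temp.zip");
    JavaRDD<Record> newRDDRecord = zipData.flatMap(
        new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
            public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
                List<Record> records = new ArrayList<>();
                ZipInputStream zin = new ZipInputStream(content._2.open());
                ZipEntry zipEntry;
                // Walk through every entry (file or directory) in the archive
                while ((zipEntry = zin.getNextEntry()) != null) {
                    if (!zipEntry.isDirectory()) {
                        InputStreamReader streamReader = new InputStreamReader(zin);
                        BufferedReader bufferedReader = new BufferedReader(streamReader);
                        // readLine() is called once per entry, so only the
                        // first line of each file is parsed
                        String line = bufferedReader.readLine();
                        String[] fields = new CSVParser().parseLineMulti(line);
                        Record sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                                getDefaultValue(fields[1]),
                                getDefaultValue(fields[22]));
                        records.add(sd);
                    }
                }

                return records.iterator();
            }
        });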
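
The loop above calls readLine() once per zip entry, which would explain why only the first line of each file comes through. Below is a minimal sketch of reading every line of every entry. It uses only java.util.zip, so it can be tried without a Spark cluster. The class name ZipEntryLines and the method readAllLines are hypothetical names chosen for illustration, and UTF-8 is an assumption about the file encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not part of Spark): collects all lines of each file in a zip stream.
final class ZipEntryLines {

    // Returns entry name -> lines of that entry, in archive order.
    // UTF-8 is an assumption about how the files are encoded.
    static Map<String, List<String>> readAllLines(InputStream in) throws IOException {
        Map<String, List<String>> linesByEntry = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Do not close this reader: closing it would also close zin.
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(zin, StandardCharsets.UTF_8));
                List<String> lines = new ArrayList<>();
                String line;
                // zin reports end-of-stream at the end of the current entry,
                // so this loop stops at the entry boundary.
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
                linesByEntry.put(entry.getName(), lines);
            }
        }
        return linesByEntry;
    }
}
```

Inside call(), the stream from content._2.open() could be passed to a loop like this one, with each line parsed into a Record instead of being collected as a string.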
    
