Fast CSV parsing

名媛妹妹 2020-11-28 10:25

I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes and happens every hour. This method is a bottleneck of the app, so it'

9 Answers
  •  遥遥无期
    2020-11-28 11:26

    For speed, you do not want to use replaceAll, and you do not want regex either. In performance-critical cases like this, what you basically always want is a character-by-character state-machine parser. I've rolled the whole thing into a function that returns an Iterable. It reads from the stream and parses as it goes, without saving the data out or caching it, so if you can abort early, that is likely to work fine as well. It should also be short enough and clear enough that it is obvious how it works.

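    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Iterator;

    // Streams rows lazily from the input; each row is one String[] of fields.
    public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
        return new Iterable<String[]>() {
            @Override
            public Iterator<String[]> iterator() {
                return new Iterator<String[]>() {
                    static final int UNCALCULATED = 0; // next row not parsed yet
                    static final int READY = 1;        // return_value holds a parsed row
                    static final int FINISHED = 2;     // end of stream reached
                    int state = UNCALCULATED;
                    ArrayList<String> value_list = new ArrayList<>();
                    StringBuilder sb = new StringBuilder();
                    String[] return_value;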
    
                    public void end() {
                        end_part();
                        return_value = new String[value_list.size()];
                        value_list.toArray(return_value);
                        value_list.clear();
                    }
    
                    public void end_part() {
                        value_list.add(sb.toString());
                        sb.setLength(0);
                    }
    
                    public void append(int ch) {
                        sb.append((char) ch);
                    }
    
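                    boolean skip_lf = false; // set after a row ends on '\r' so the '\n' of "\r\n" is dropped

                    public void calculate() throws IOException {
                        boolean inquote = false;
                        boolean quote_closed = false; // previous character closed a quoted section
                        while (true) {
                            int ch = stream.read();
                            if (skip_lf) {
                                skip_lf = false;
                                if (ch == '\n') {
                                    continue; //second half of a "\r\n" line ending.
                                }
                            }
                            boolean after_quote = quote_closed;
                            quote_closed = false;
                            switch (ch) {
                                default: //regular character.
                                    append(ch);
                                    break;
                                case -1: //read has reached the end.
                                    if ((sb.length() == 0) && (value_list.isEmpty())) {
                                        state = FINISHED;
                                    } else {
                                        end();
                                        state = READY;
                                    }
                                    return;
                                case '\r':
                                case '\n': //end of line.
                                    if (inquote) {
                                        append(ch);
                                    } else {
                                        skip_lf = (ch == '\r');
                                        end();
                                        state = READY;
                                        return;
                                    }
                                    break;
                                case ',': //comma
                                    if (inquote) {
                                        append(ch);
                                    } else {
                                        end_part();
                                    }
                                    break;
                                case '"': //quote.
                                    if (inquote) {
                                        inquote = false;
                                        quote_closed = true;
                                    } else {
                                        if (after_quote) {
                                            append(ch); //"" inside a quoted field is an escaped quote.
                                        }
                                        inquote = true;
                                    }
                                    break;
                            }
                        }
                    }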
    
                    @Override
                    public boolean hasNext() {
                        if (state == UNCALCULATED) {
                            try {
                                calculate();
                            } catch (IOException ex) {
                                // Iterator methods cannot throw checked exceptions;
                                // rethrow unchecked instead of silently hiding the failure.
                                throw new java.io.UncheckedIOException(ex);
                            }
                        }
                        return state == READY;
                    }
    
                    @Override
                    public String[] next() {
                        if (!hasNext()) {
                            throw new java.util.NoSuchElementException();
                        }
                        state = UNCALCULATED;
                        return return_value;
                    }
                };
            }
        };
    }
    

    You would typically consume it like this:

    for (String[] csv : parseCSV(stream)) {
        //
    }
    

    The clean API at the call site is worth the rather cryptic-looking function.
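
    To see the state machine in isolation, here is a minimal sketch reduced to a single line of text that is already in memory. The class name CsvLineSketch and the method splitLine are hypothetical names made up for this illustration and are not part of the code above. It assumes the common CSV convention that a doubled quote inside a quoted field stands for one literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the answer's parseCSV.
final class CsvLineSketch {

    /** Splits one CSV line into fields with a two-state (quoted / unquoted) machine. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (inQuote) {
                if (ch != '"') {
                    field.append(ch);          // commas are literal inside quotes
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');         // assumed convention: "" is an escaped quote
                    i++;                       // skip the second quote of the pair
                } else {
                    inQuote = false;           // closing quote
                }
            } else if (ch == '"') {
                inQuote = true;                // opening quote
            } else if (ch == ',') {
                fields.add(field.toString());  // separator ends the current field
                field.setLength(0);
            } else {
                field.append(ch);
            }
        }
        fields.add(field.toString());          // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Prints each field of a sample line on its own line.
        for (String f : splitLine("a,\"b,c\",\"d\"\"e\"")) {
            System.out.println(f);
        }
    }
}
```

    Every character is inspected exactly once and no regex or replaceAll is involved, which is the property the streaming version relies on for speed.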
