I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens each hour. This method is a bottleneck of the app so it'
For speed, you do not want to use replaceAll, and you do not want to use regex either. What you basically always want in performance-critical cases like this is a character-by-character state-machine parser. I've done that here, rolled into a single function that returns an Iterable. It takes the stream and parses it as it reads, without saving it out or caching it, so if you can abort early, that should work fine as well. It should also be short and clear enough to make it obvious how it works.
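    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
        // Decode and buffer once: reading an unbuffered stream one byte at a time is slow,
        // and casting raw bytes to char breaks multi-byte characters.
        // UTF-8 is an assumption; change the charset if the file uses another encoding.
        final Reader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        return new Iterable<String[]>() {
            @Override
            public Iterator<String[]> iterator() {
                return new Iterator<String[]>() {
                    static final int UNCALCULATED = 0; // next row not parsed yet
                    static final int READY = 1;        // return_value holds a parsed row
                    static final int FINISHED = 2;     // end of stream reached
                    int state = UNCALCULATED;
                    int prev = -1;                     // previous character read
                    ArrayList<String> value_list = new ArrayList<>();
                    StringBuilder sb = new StringBuilder();
                    String[] return_value;

                    // Finishes the current row and stores it in return_value.
                    public void end() {
                        end_part();
                        return_value = new String[value_list.size()];
                        value_list.toArray(return_value);
                        value_list.clear();
                    }

                    // Finishes the current field.
                    public void end_part() {
                        value_list.add(sb.toString());
                        sb.setLength(0);
                    }

                    public void append(int ch) {
                        sb.append((char) ch);
                    }

                    // Reads characters until one full row is parsed or the stream ends.
                    public void calculate() throws IOException {
                        boolean inquote = false;
                        while (true) {
                            int ch = reader.read();
                            int before = prev;
                            prev = ch;
                            switch (ch) {
                                default: // regular character
                                    append(ch);
                                    break;
                                case -1: // end of stream
                                    if ((sb.length() == 0) && (value_list.isEmpty())) {
                                        state = FINISHED;
                                    } else {
                                        end();
                                        state = READY;
                                    }
                                    return;
                                case '\r':
                                case '\n': // end of line
                                    if (inquote) {
                                        append(ch); // line breaks inside quotes belong to the field
                                    } else if (ch == '\n' && before == '\r') {
                                        // second half of a \r\n pair; the row already ended at \r
                                    } else {
                                        end();
                                        state = READY;
                                        return;
                                    }
                                    break;
                                case ',': // field separator
                                    if (inquote) {
                                        append(ch);
                                    } else {
                                        end_part();
                                    }
                                    break;
                                case '"': // quote
                                    if (!inquote && before == '"') {
                                        append(ch); // a doubled quote inside a quoted field is a literal quote
                                    }
                                    inquote = !inquote;
                                    break;
                            }
                        }
                    }

                    @Override
                    public boolean hasNext() {
                        if (state == UNCALCULATED) {
                            try {
                                calculate();
                            } catch (IOException ex) {
                                // Do not swallow read errors; surface them to the caller.
                                throw new UncheckedIOException(ex);
                            }
                        }
                        return state == READY;
                    }

                    @Override
                    public String[] next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        state = UNCALCULATED;
                        return return_value;
                    }
                };
            }
        };
    }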
You would typically consume it like this:

    for (String[] csv : parseCSV(stream)) {
        // process one row; break out of the loop to stop reading early
    }

The clean API is worth the rather cryptic-looking function.
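To see the state-machine idea in isolation, here is a minimal sketch that applies it to a single line already held in memory. The names `CsvLineSketch` and `splitLine` are hypothetical and not part of the function above. It assumes comma separators and double-quote quoting, with a doubled quote inside a quoted field standing for one literal quote. No regex or replaceAll is involved: each character is looked at exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the parseCSV function above.
class CsvLineSketch {

    // Splits one CSV line into fields with a two-state machine:
    // either we are inside a quoted section or we are not.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean inQuote = false;
        char prev = 0; // previous character, used to spot doubled quotes

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                // A quote right after a closing quote is an escaped literal quote.
                if (!inQuote && prev == '"') {
                    sb.append('"');
                }
                inQuote = !inQuote;
            } else if (ch == ',' && !inQuote) {
                // An unquoted comma ends the current field.
                fields.add(sb.toString());
                sb.setLength(0);
            } else {
                // Everything else, including commas inside quotes, is field content.
                sb.append(ch);
            }
            prev = ch;
        }
        fields.add(sb.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        for (String field : splitLine("a,\"b,c\",\"say \"\"hi\"\"\"")) {
            System.out.println(field);
        }
    }
}
```

The whole parser state is one boolean plus the previous character, which is why this approach stays cheap even on very large inputs.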