Best method for parsing date formats during data import

僤鯓⒐⒋嵵緔 submitted on 2020-01-04 06:33:12

Question


I created a method for parsing a few different date formats during data import (400K records). My method catches ParseException and tries the next format whenever parsing with the current one fails.

Question: Is there a better (and faster) way to determine the correct date format during data import?

private static final String DMY_DASH_FORMAT = "dd-MM-yyyy";
private static final String DMY_DOT_FORMAT = "dd.MM.yyyy";
private static final String YMD_DASH_FORMAT = "yyyy-MM-dd";
private static final String YMD_DOT_FORMAT = "yyyy.MM.dd";
private static final String SIMPLE_YEAR_FORMAT = "yyyy";
private final List<String> dateFormats = Arrays.asList(YMD_DASH_FORMAT, DMY_DASH_FORMAT,
        DMY_DOT_FORMAT, YMD_DOT_FORMAT);

private Date parseDateFromString(String date) throws ParseException {
    if (date.equals("0")) {
        return null;
    }
    if (date.length() == 4) {
        SimpleDateFormat simpleDF = new SimpleDateFormat(SIMPLE_YEAR_FORMAT);
        simpleDF.setLenient(false);
        return new Date(simpleDF.parse(date).getTime());
    }
    for (String format : dateFormats) {
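        SimpleDateFormat simpleDF = new SimpleDateFormat(format);
        // Strict parsing: a lenient "yyyy-MM-dd" would accept "11-09-2001" and roll it into a wrong date
        simpleDF.setLenient(false);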
        try {
            return new Date(simpleDF.parse(date).getTime());
        } catch (ParseException exception) {
        }
    }
    throw new ParseException("Unknown date format", 0);
} 

Answer 1:


Talking about 400K records, it might be reasonable to do some "bare hands" optimization here.

For example: if your incoming string has a "-" at position 5 (index 4), then you know that the only (potentially) matching format is "yyyy-MM-dd". If it is a ".", you know that it is the other format that starts with yyyy.

So, if you really want to optimize, you could fetch that character and see what it is. That could save up to 3 attempts at parsing with the wrong format!

Beyond that: I am not sure whether "dd" means that your other dates always start with two digits, such as "01", or whether "1.1.2016" would be possible, too. If all your dates always use two digits for dd/MM, then you can repeat that game: fetch the character at position 3 to choose between "dd.MM.yyyy" and "dd-MM-yyyy".

Of course, there is one disadvantage: if you follow that idea, you are very much hard-coding the expected formats into your code, so adding other formats will become harder. On the other hand, you would save a lot of failed parse attempts.
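The separator check described above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class and method names are made up, it assumes every full date is exactly 10 characters long (two-digit day and month, four-digit year), and it returns java.util.Date for simplicity.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper: picks the one plausible pattern from the separator position
final class SeparatorDateParser {

    // Assumes two-digit day/month and a four-digit year (10 characters in total)
    static String choosePattern(String date) throws ParseException {
        if (date.length() == 4) {
            return "yyyy"; // year-only input, as in the question
        }
        if (date.length() != 10) {
            throw new ParseException("Unknown date format: " + date, 0);
        }
        char fifth = date.charAt(4); // separator position in year-first formats
        char third = date.charAt(2); // separator position in day-first formats
        if (fifth == '-') return "yyyy-MM-dd";
        if (fifth == '.') return "yyyy.MM.dd";
        if (third == '-') return "dd-MM-yyyy";
        if (third == '.') return "dd.MM.yyyy";
        throw new ParseException("Unknown date format: " + date, 0);
    }

    static Date parse(String date) throws ParseException {
        if (date.equals("0")) {
            return null; // same "no date" convention as in the question
        }
        // Exactly one parse attempt per input instead of up to four
        SimpleDateFormat format = new SimpleDateFormat(choosePattern(date));
        format.setLenient(false);
        return format.parse(date);
    }
}
```

With this shape, an input such as "09.11.2001" goes straight to "dd.MM.yyyy" without any failed attempts. The sketch still creates a new SimpleDateFormat per call; caching the formatters is a separate optimization.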

Finally, the other thing that might greatly speed things up would be to use stream operations for reading/parsing that information, because then you could look into parallel streams and simply exploit the ability of modern hardware to process 4, 8, or 16 dates in parallel.
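A minimal sketch of the parallel-stream idea, assuming the raw date strings are already in a List. The class name and the single-format parseOne method are hypothetical stand-ins for the real multi-format parser. Because SimpleDateFormat is not thread-safe, this version creates a fresh instance per call.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class, not part of the question's code
final class ParallelDateImport {

    // Stand-in for the real parser; a new formatter per call keeps it thread-safe
    static Date parseOne(String text) {
        try {
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
            format.setLenient(false);
            return format.parse(text);
        } catch (ParseException e) {
            // Lambdas in map() cannot throw checked exceptions, so wrap it
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    static List<Date> parseAll(List<String> rawDates) {
        return rawDates.parallelStream()
                .map(ParallelDateImport::parseOne)
                .collect(Collectors.toList()); // result keeps the order of the input list
    }
}
```

Whether this actually helps for 400K records depends on how expensive each parse is compared with the overhead of splitting the work, so it is worth measuring before committing to it.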




Answer 2:


If you're running single-threaded, an obvious improvement is to create the SimpleDateFormat objects only once. In a multithreaded situation, a ThreadLocal<SimpleDateFormat> would be required, because SimpleDateFormat is not thread-safe.
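One way the reuse could look, as a sketch only: the class name is made up, the patterns are copied from the question, and each thread gets its own list of formatters through ThreadLocal.withInitial (Java 8 or later). Strict (non-lenient) parsing is an added assumption so that a wrong pattern fails instead of producing a shifted date.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds the formatters once per thread instead of once per record
final class CachedDateFormats {

    private static final List<String> PATTERNS =
            Arrays.asList("yyyy-MM-dd", "dd-MM-yyyy", "dd.MM.yyyy", "yyyy.MM.dd");

    // Created lazily the first time a thread calls FORMATS.get()
    private static final ThreadLocal<List<SimpleDateFormat>> FORMATS =
            ThreadLocal.withInitial(() -> PATTERNS.stream()
                    .map(pattern -> {
                        SimpleDateFormat format = new SimpleDateFormat(pattern);
                        format.setLenient(false);
                        return format;
                    })
                    .collect(Collectors.toList()));

    static Date parse(String date) throws ParseException {
        for (SimpleDateFormat format : FORMATS.get()) {
            try {
                return format.parse(date);
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one
            }
        }
        throw new ParseException("Unknown date format: " + date, 0);
    }
}
```

In a purely single-threaded import, plain static final SimpleDateFormat fields would do the same job without the ThreadLocal.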

Also fix your exception handling. It looks like it's written by someone who shouldn't be trusted to import any data.




Answer 3:


For a similar problem statement, I have used the time4j library in the past. Here is an example. It uses the dependencies listed below the code.

import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import net.time4j.PlainDate;
import net.time4j.format.expert.ChronoFormatter;
import net.time4j.format.expert.MultiFormatParser;
import net.time4j.format.expert.ParseLog;
import net.time4j.format.expert.PatternType;

public class MultiDateParser {

    static final MultiFormatParser<PlainDate> MULTI_FORMAT_PARSER;

    static {

        ChronoFormatter<PlainDate> style1 = ChronoFormatter.ofDatePattern("dd-MM-yyyy", PatternType.CLDR,
                Locale.GERMAN);
        ChronoFormatter<PlainDate> style2 = ChronoFormatter.ofDatePattern("dd.MM.yyyy", PatternType.CLDR, Locale.US);

        ChronoFormatter<PlainDate> style3 = ChronoFormatter.ofDatePattern("yyyy-MM-dd", PatternType.CLDR, Locale.US);

        ChronoFormatter<PlainDate> style4 = ChronoFormatter.ofDatePattern("yyyy.MM.dd", PatternType.CLDR, Locale.US);

        //this is not supported
        //ChronoFormatter<PlainDate> style5 = ChronoFormatter.ofDatePattern("yyyy", PatternType.CLDR, Locale.US);

        MULTI_FORMAT_PARSER = MultiFormatParser.of(style1, style2, style3, style4);
    }

    public List<PlainDate> parse() throws ParseException {
        String[] input = { "11-09-2001", "09.11.2001", "2011-11-01", "2011.11.01", "2012" };
        List<PlainDate> dates = new ArrayList<>();
        ParseLog plog = new ParseLog();

        for (String s : input) {
            plog.reset(); // initialization
            PlainDate date = MULTI_FORMAT_PARSER.parse(s, plog);

            if (date == null || plog.isError()) {
                System.out.println("Wrong entry found: " + s + " at position " + dates.size() + ", error-message="
                        + plog.getErrorMessage());
            } else {
                dates.add(date);
            }
        }
        System.out.println(dates);
        return dates;
    }

    public static void main(String[] args) throws ParseException {
        MultiDateParser mdp = new MultiDateParser();
        mdp.parse();
    }

}

<dependency>
    <groupId>net.time4j</groupId>
    <artifactId>time4j-core</artifactId>
    <version>4.19</version>
</dependency>

<dependency>
    <groupId>net.time4j</groupId>
    <artifactId>time4j-misc</artifactId>
    <version>4.19</version>
</dependency>

The yyyy case will have to be handled differently, as it is not a date. Logic similar to what you already use (length == 4) may be an option.

The above code prints the output below. You can do a quick performance run to see whether this scales to the 400K records you have.

Wrong entry found: 2012 at position 4, error-message=Not matched by any format: 2012
[2001-09-11, 2001-11-09, 2011-11-01, 2011-11-01]


Source: https://stackoverflow.com/questions/39597959/best-method-for-parsing-date-formats-during-import-datas
