How to retrieve all kinds of dates and temporal values from text

Submitted by 浪尽此生 on 2019-12-12 08:13:01

Question


I want to retrieve dates and other temporal entities from a set of strings. Can this be done in Java without hand-writing a parser for every date format? Most parsers only handle a limited set of input patterns, but the input here is entered manually and is therefore ambiguous.

Inputs can be like:

12th Sep |mid-March |12.September.2013

Sep 12th |12th September| 2013

Sept 13 |12th, September |12th,Feb,2013

I've gone through many answers on finding dates in Java, but most of them don't deal with such a wide range of input patterns.

I've tried the SimpleDateFormat class, calling parse() and treating a failed parse as a sign that the string is not a date. I've tried regex, but I'm not sure it is a good fit for this scenario. I've also used ClearNLP to annotate the dates, but it doesn't give a reliable annotation set.

The closest approach to getting these values could be a chain of responsibility, as mentioned below. Is there a library that ships a set of date patterns that I could use?


Answer 1:


A clean and modular approach to this problem is to use a chain in which every element tries to match the input string against a regex. If the regex matches, the element converts the input string into something that can be fed to a SimpleDateFormat, converts it to the data structure you prefer (Date, or a different temporal representation that better suits your needs) and returns it. If the regex doesn't match, the element just delegates to the next element in the chain.

The responsibility of every element of the chain is just to test the regex against the string, give a result or ask the next element of the chain to give it a try.

The chain can be created and composed easily without having to change the implementation of every element of the chain.
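As a rough illustration of the idea, here is a minimal sketch of such a chain. The names DateChain, Link and monthOf are made up for this example, it only covers three of the many possible formats, and it uses java.time.LocalDate instead of SimpleDateFormat to keep the conversion independent of locale and time zone, so treat it as a starting point rather than a finished solution.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Month;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: DateChain, Link and monthOf are illustrative names.
final class DateChain {

    /** One chain element: test a regex, convert on a match, otherwise delegate. */
    static final class Link {
        private final Pattern pattern;
        private final Function<Matcher, LocalDate> converter;
        private final Link next; // null marks the end of the chain

        Link(String regex, Function<Matcher, LocalDate> converter, Link next) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.converter = converter;
            this.next = next;
        }

        Optional<LocalDate> handle(String input) {
            Matcher m = pattern.matcher(input.trim());
            if (m.matches()) {
                try {
                    return Optional.of(converter.apply(m));
                } catch (DateTimeException e) {
                    // The shape matched but the values are not a real date
                    // (e.g. 31.Feb.2013), so fall through to the next link.
                }
            }
            return next == null ? Optional.empty() : next.handle(input);
        }
    }

    /** Resolves "Sep", "Sept" or "September" by prefix, without locale data. */
    static Month monthOf(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        for (Month month : Month.values()) {
            if (month.name().startsWith(upper)) {
                return month;
            }
        }
        throw new DateTimeException("Unknown month: " + name);
    }

    // Shared converter for patterns whose groups are (day, month name, year).
    private static final Function<Matcher, LocalDate> DAY_MONTH_YEAR = m ->
            LocalDate.of(Integer.parseInt(m.group(3)), monthOf(m.group(2)),
                    Integer.parseInt(m.group(1)));

    // The chain is composed here; adding a format means adding one more link.
    private static final Link CHAIN =
            new Link("(\\d{1,2})\\.([a-z]{3,9})\\.(\\d{4})", DAY_MONTH_YEAR,           // 12.September.2013
            new Link("(\\d{1,2})(?:st|nd|rd|th)?\\s*,\\s*([a-z]{3,9})\\s*,\\s*(\\d{4})",
                    DAY_MONTH_YEAR,                                                      // 12th,Feb,2013
            new Link("(\\d{4})-(\\d{2})-(\\d{2})",                                       // 2013-09-12
                    m -> LocalDate.of(Integer.parseInt(m.group(1)),
                            Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))),
                    null)));

    /** Returns the parsed date, or an empty Optional if no link recognises the input. */
    static Optional<LocalDate> parse(String input) {
        return CHAIN.handle(input);
    }
}
```

Calling DateChain.parse("12.September.2013") walks the chain from the first link; inputs such as "mid-March" that no link recognises come back as an empty Optional instead of an exception.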

In the end the result is the same as in @KirkoR's response, with a 'bit' (:D) more code but a modular approach. (I prefer the regex approach to the try/catch one.)

Some reference: https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern




Answer 2:


You could just implement support for all the patterns you can think of, then document them: "OK, these are all the patterns my module supports." For every other input you could then throw some RuntimeException.

Then, iteratively, you can keep running your module over the input data and keep adding support for more date formats until it stops raising any RuntimeException.
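A minimal sketch of that loop might look like the following. The class and method names (SupportedDateFormats, parseOrThrow, findUnsupported) and the two example patterns are assumptions made for illustration; it uses java.time and IllegalArgumentException as the RuntimeException.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and the supported patterns are illustrative.
final class SupportedDateFormats {

    // The documented list of supported patterns; extend it as new formats show up.
    private static final List<DateTimeFormatter> SUPPORTED = Arrays.asList(
            DateTimeFormatter.ISO_LOCAL_DATE,                       // 2013-09-12
            DateTimeFormatter.ofPattern("d/M/uuuu")
                    .withResolverStyle(ResolverStyle.STRICT));      // 12/9/2013

    /** Parses with the first matching pattern, or throws for unsupported input. */
    static LocalDate parseOrThrow(String input) {
        for (DateTimeFormatter formatter : SUPPORTED) {
            try {
                return LocalDate.parse(input.trim(), formatter);
            } catch (DateTimeParseException e) {
                // Not this pattern; try the next one.
            }
        }
        throw new IllegalArgumentException("Unsupported date format: " + input);
    }

    /** Runs the parser over the data and collects every input it cannot handle yet. */
    static List<String> findUnsupported(List<String> inputs) {
        List<String> unsupported = new ArrayList<>();
        for (String input : inputs) {
            try {
                parseOrThrow(input);
            } catch (IllegalArgumentException e) {
                unsupported.add(input);
            }
        }
        return unsupported;
    }
}
```

Running findUnsupported over a sample of the real input gives the list of strings that still need a pattern; once it comes back empty, the supported set covers that sample.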

I think that's the best you can do here if you want to keep it reasonably simple.




Answer 3:


Yes! I've finally extracted all sorts of dates/temporal values, from as generic as:

mid-March | Last Month | 9/11

To as specific as:

11/11/11 11:11:11

This finally worked thanks to the awesome libraries from GATE and JAPE.

I've created a more lenient annotation rule in JAPE, say 'DateEnhanced', to include certain kinds of dates such as "9/11" or "11TH, February- 2001", and used a chain of Java regexes on the right-hand side (RHS) of the 'DateEnhanced' JAPE rule to filter out some unwanted outputs.




Answer 4:


I can recommend a very nice implementation that addresses your problem; unfortunately it is written in Polish: http://koziolekweb.pl/2015/04/15/throw-to-taki-inny-return/

You can use Google Translate:

https://translate.google.pl/translate?sl=pl&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fkoziolekweb.pl%2F2015%2F04%2F15%2Fthrow-to-taki-inny-return&edit-text=

The code there looks really nice:

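private static Date convertStringToDate(String s) {
    if (s == null || s.trim().isEmpty()) return null;
    // Patterns are tried in order, most specific first.
    // Lists.newArrayList comes from Guava; the YYYY_... constants are pattern strings defined elsewhere.
    ArrayList<String> patterns = Lists.newArrayList(YYYY_MM_DD_T_HH_MM_SS_SSS,
            YYYY_MM_DD_T_HH_MM_SS
            , YYYY_MM_DD_T_HH_MM
            , YYYY_MM_DD);
    for (String pattern : patterns) {
        try {
            return new SimpleDateFormat(pattern).parse(s);
        } catch (ParseException e) {
            // This pattern did not match; try the next one.
        }
    }
    // No pattern matched: treat the input as epoch milliseconds
    // (throws NumberFormatException if it is not a number).
    return new Date(Long.parseLong(s));
}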



Answer 5:


    mark.util.DateParser dp = new DateParser();
    ParsePositionEx parsePosition = new ParsePositionEx(0);
    Date startDate = dp.parse("12.September.2013", parsePosition);
    System.out.println(startDate);

Output: Thu Sep 12 17:18:18 IST 2013

mark.util.DateParser is part of the library used by the DateNormalizer PR (a GATE processing resource), so in a JAPE file we just have to import it.



Source: https://stackoverflow.com/questions/33098511/how-to-retrieve-all-kinds-of-dates-and-temporal-values-from-text
