Strip data from text file using regex

Question

I'm going to start by posting what the data from the text file looks like. This is just 4 lines of it; the actual file is a couple hundred lines long.

Friday, September  9 2011        5:00AM - 11:59PM       STH 1102                HOLD DO NOT BOOK                                                 
Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1 

Friday, September  9 2011        5:00AM - 11:00PM       STH 4155 (BOARDROOM)    HOLD - DO NOT BOOK                     
Hold - Do Not Book        Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1 

Friday, September  9 2011        5:00AM - 11:59PM       UC 2 (COMPUTER LAB)     HOLD DO NOT BOOK                       
do not book               Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1 

Friday, September  9 2011        5:00PM - 11:00PM       AH GYM                  USC ORIENTATION 2011 - REVISED         
USC Orientation 2011      Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1

Each little section of text is on one line in the text file, separated by many spaces which don't show up in the question format for some reason. I will use the first section of text as an example of the data I am trying to get.

Here is the data I'd like to get from the file: Friday, 5:00, 11:59, STH 1102, HOLD DO NOT BOOK. The rest of the line should be ignored; all the info on the 2nd line of each section of text is to be ignored, but in the text file itself it is all on one line. I would then like to save each piece of this data into a variable. Alternatively, the part of the data that says HOLD DO NOT BOOK is sometimes formatted as DO NOT BOOK or HOLD - DO NOT BOOK; if the regex finds any of these, it can ignore all the data in that line before and after.

Also, if I could, I would like to take the times that have PM in them and add 12 to them so they are in 24-hour format.

Here is how I am currently reading the lines. I call this function once the user has put the path in the scheduleTxt JTextField. It can read and print each line out fine.

public void readFile () throws IOException
    {
        try
        {
            FileInputStream fstream = new FileInputStream(scheduleTxt.getText());
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null)   
            {
                System.out.println(strLine);
            }
            in.close();
        }
        catch (Exception e){
            System.err.println("Error: " + e.getMessage());
        }
    }

I know there's a lot in this question; hopefully you understand what I'm asking. If something is unclear, just ask. Thanks! Beef.

Update: I just thought it might help to explain my intentions for this data. First, I will convert any PM times into 24-hour format; then, according to the 4th piece of data (STH 1102), I call an insert function that uses the ODBC driver to insert the other data from the line into a database.

Answer 1:

Those look like tabs between the fields. If I were you, I'd use non-regex text manipulation to split the first of every three lines on the \t character. That should give you STH 1102 and HOLD DO NOT BOOK without any further processing.

That leaves Friday, 5:00, and 11:59. You can still get those with text manipulation: split "Friday, September" on the comma and take the first segment, then split "5:00AM - 11:59PM" on the string " - " (a hyphen with spaces around it).
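The two splits described above can be sketched roughly as follows. The class and method names are made up for illustration, and the sketch assumes the date field and the time field have already been separated from the rest of the line.

```java
import java.util.Arrays;

final class SplitFieldsSketch {
    /** Returns the text before the first comma, i.e. the day of the week. */
    static String dayOfWeek(String dateField) {
        return dateField.split(",")[0].trim();
    }

    /** Splits a range such as "5:00AM - 11:59PM" into start and end times. */
    static String[] timeRange(String timeField) {
        return timeField.trim().split(" - ");
    }

    public static void main(String[] args) {
        System.out.println(dayOfWeek("Friday, September  9 2011"));
        System.out.println(Arrays.toString(timeRange("5:00AM - 11:59PM")));
    }
}
```

The AM/PM suffix stays attached to each time here, which is convenient if the times are later converted to 24-hour format.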

If you still want regexes for those, you can use "[A-Za-z]+(?=,)" and "(\\d{1,2}:\\d{2}[PA]M) - (\\d{1,2}:\\d{2}[PA]M)", respectively. The second pattern will return the times you want in capture groups 1 and 2.
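As a rough sketch, those two patterns could be applied with java.util.regex like this. The class name, the extract method, and the to24Hour helper are illustrative additions, not part of the question's code. The helper follows the asker's "add 12 to PM times" idea, with the assumption that 12AM should become 00 and 12PM should stay 12.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimeRegexSketch {
    private static final Pattern DAY = Pattern.compile("[A-Za-z]+(?=,)");
    private static final Pattern TIMES = Pattern.compile(
            "(\\d{1,2}:\\d{2}[PA]M) - (\\d{1,2}:\\d{2}[PA]M)");

    /** Returns {day, start, end}, or null when the line does not match. */
    static String[] extract(String line) {
        Matcher day = DAY.matcher(line);
        Matcher times = TIMES.matcher(line);
        if (!day.find() || !times.find()) {
            return null;
        }
        return new String[] { day.group(), times.group(1), times.group(2) };
    }

    /** Converts e.g. "5:00PM" to "17:00"; 12AM maps to 00, 12PM stays 12. */
    static String to24Hour(String time) {
        int colon = time.indexOf(':');
        int hour = Integer.parseInt(time.substring(0, colon)) % 12;
        if (time.endsWith("PM")) {
            hour += 12;
        }
        String minutes = time.substring(colon + 1, colon + 3);
        return String.format("%02d:%s", hour, minutes);
    }

    public static void main(String[] args) {
        String line = "Friday, September  9 2011        5:00PM - 11:00PM       AH GYM";
        String[] parts = extract(line);
        System.out.println(parts[0] + " " + to24Hour(parts[1]) + " " + to24Hour(parts[2]));
    }
}
```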

Regex for the whole thing is probably not the best way to do it, but this will probably work:

"^([^,]+),.*\\t(\\d{1,2}:\\d{2}[PA]M) - (\\d{1,2}:\\d{2}[PA]M)\\t([^\\t]+)\\t([^\\t]+)$"

Values you want will be in capture groups 1 - 5.
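A minimal sketch of using that pattern follows. It only applies if the fields really are separated by single tab characters, and the sample line in main is built by hand under that assumption. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeLineRegexSketch {
    private static final Pattern LINE = Pattern.compile(
            "^([^,]+),.*\\t(\\d{1,2}:\\d{2}[PA]M) - (\\d{1,2}:\\d{2}[PA]M)"
            + "\\t([^\\t]+)\\t([^\\t]+)$");

    /** Returns capture groups 1 to 5, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1).trim();
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed tab-delimited form of the first sample record.
        String line = "Friday, September  9 2011\t5:00AM - 11:59PM\tSTH 1102\tHOLD DO NOT BOOK";
        for (String group : parse(line)) {
            System.out.println(group);
        }
    }
}
```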

Edit:

Since you've indicated that those aren't tabs between the groups, the above regex won't work as-is. However, that probably means that the fields are at fixed positions. Find out at which index each group starts, then use String.substring to select everything from there to the next group and String.trim the result. You can then process the day-of-week and time portions as I described above: "[A-Za-z]+(?=,)" and "(\\d{1,2}:\\d{2}[PA]M) - (\\d{1,2}:\\d{2}[PA]M)", or non-regex string manipulation.
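The fixed-position approach might look roughly like the sketch below. The column offsets are placeholders chosen for illustration, not measured from the real file, so they would need to be replaced with the actual starting index of each field. The sample line is built with the same assumed widths so the example stays self-consistent.

```java
final class FixedWidthSketch {
    // Hypothetical column offsets; measure the real ones in the actual file.
    static final int TIME_START = 33;
    static final int ROOM_START = 56;
    static final int DESC_START = 80;
    static final int DESC_END = 119;

    /** Cuts [start, end) out of the line, tolerating short lines, and trims it. */
    static String field(String line, int start, int end) {
        if (start >= line.length()) {
            return "";
        }
        return line.substring(start, Math.min(end, line.length())).trim();
    }

    /** Returns {date, time range, room, description}. */
    static String[] parse(String line) {
        return new String[] {
            field(line, 0, TIME_START),
            field(line, TIME_START, ROOM_START),
            field(line, ROOM_START, DESC_START),
            field(line, DESC_START, DESC_END)
        };
    }

    /** Builds a sample record padded to the assumed column widths. */
    static String sampleLine() {
        return String.format("%-33s%-23s%-24s%-39s%s",
                "Friday, September  9 2011", "5:00AM - 11:59PM", "STH 1102",
                "HOLD DO NOT BOOK",
                "Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1");
    }

    public static void main(String[] args) {
        for (String f : parse(sampleLine())) {
            System.out.println(f);
        }
    }
}
```

Everything from DESC_END onward (the "Report Printed on ..." tail) is simply never read, which covers the requirement to ignore the rest of the line.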

Also, if there is in fact a tab before the first "time" value, that might mess up the positioning. Split the string on that tab and use the substring method I described on the right-hand portion. The left-hand portion can be split on the comma to find the day of the week.

Answer 2:

I think it's worth splitting the text using StringTokenizer or String.split() and accessing each section by its position in the string. A regex is going to be just as fragile and far more complicated to write.
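One possible sketch of the String.split() route is shown below. It assumes that fields are separated by runs of at least three whitespace characters and that no field contains such a run itself, which holds for the sample lines in the question (the date contains only a double space) but may not hold for the whole file. The isHold check is one reading of the asker's wish to skip the DO NOT BOOK variants. All names are illustrative.

```java
final class SplitByGapSketch {
    /** Splits a record on runs of three or more whitespace characters. */
    static String[] fields(String line) {
        return line.trim().split("\\s{3,}");
    }

    /** True for "HOLD DO NOT BOOK", "DO NOT BOOK", "HOLD - DO NOT BOOK", etc. */
    static boolean isHold(String description) {
        return description.toUpperCase().contains("DO NOT BOOK");
    }

    public static void main(String[] args) {
        String line = "Friday, September  9 2011        5:00PM - 11:00PM       "
                + "AH GYM                  USC ORIENTATION 2011 - REVISED         "
                + "USC Orientation 2011      Report Printed on 9/08/2011 at  2:37 PM   Page 1 of 1";
        String[] f = fields(line);
        // Positions 0..3 are date, time range, room, description; the rest is ignored.
        if (!isHold(f[3])) {
            System.out.println(f[0] + " | " + f[1] + " | " + f[2] + " | " + f[3]);
        }
    }
}
```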

Source: https://stackoverflow.com/questions/7432018/strip-data-from-text-file-using-regex

Tags

java

regex

text

fileinputstream

datainputstream