How to determine the delimiter in a CSV file

旧时难觅i 2020-12-16 03:26

I have a scenario in which I have to parse CSV files from different sources. The parsing code is very simple and straightforward.

        String csvFile = \         


        
5 answers
  •  庸人自扰
    2020-12-16 03:55

    Yes, but only if the delimiter characters are not allowed to appear as regular text.

    The simplest approach is to keep a list of all the candidate delimiter characters and try to identify which one is being used. Even so, you have to place some limitations on the files or on the people who create them. Look at the following two scenarios:

    Case 1 - Contents of file.csv

    test1,test2,test3
    

    Case 2 - Contents of file.csv

    test1|test2,3|test4
    

    If you know the delimiter in advance, you split the first string on , and the second on |, and both come out as intended. But if you try to identify the delimiter by parsing the file, both strings can be split on the , character, and you end up with this:

    Case 1 - Result of split using ,

    test1
    test2
    test3
    

    Case 2 - Result of split using ,

    test1|test2
    3|test4
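
    To make the ambiguity concrete, here is a minimal sketch that splits the Case 2 line on both candidates. The class name SplitAmbiguity and the helper splitOn() are made up for illustration. Neither split fails, so the code alone cannot tell which one was intended.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitAmbiguity {

    // Splits a line on a literal delimiter. Pattern.quote() is needed because
    // String.split() treats its argument as a regular expression, and "|" is
    // a regex metacharacter.
    static List<String> splitOn(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter)));
    }

    public static void main(String[] args) {
        String line = "test1|test2,3|test4"; // Case 2 from above
        System.out.println(splitOn(line, ",")); // [test1|test2, 3|test4]
        System.out.println(splitOn(line, "|")); // [test1, test2,3, test4]
    }
}
```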
    

    Without prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearances of a character will not save you.

    Worst case

    test1,2|test3,4|test5
    

    Looking at the text, a person would tokenize it using | as the delimiter. But , and | each appear twice, so their frequencies are the same. From an algorithm's perspective, both results are equally valid:

    Correct result

    test1,2
    test3,4
    test5
    

    Wrong result

    test1
    2|test3
    4|test5
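
    The tie can be checked with a small counting sketch. The class DelimiterCounts and its count() method are hypothetical names used only for this illustration. A frequency-based heuristic has nothing to choose between here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DelimiterCounts {

    // Counts how often each candidate delimiter character occurs in a line.
    // The map keeps the candidates in the order they were passed in.
    static Map<Character, Integer> count(String line, char... candidates) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : candidates) {
            int n = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == c) {
                    n++;
                }
            }
            counts.put(c, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Worst-case line from above: ',' and '|' both occur twice
        System.out.println(count("test1,2|test3,4|test5", ',', '|', ';')); // {,=2, |=2, ;=0}
    }
}
```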
    

    If you impose a set of guidelines, or you can somehow control how the CSV files are generated, then you can simply look for the delimiter with the String.contains() method, using the aforementioned list of characters. For example:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class MyClass {

        // Candidate delimiters, checked in this order. Static so that the
        // static determineDelimiter() method can use it.
        private static final List<String> delimiterList = Arrays.asList(
                ",",
                ";",
                "\t"
                // etc...
        );

        // Returns the first listed delimiter found in the text, or "" if none is present
        private static String determineDelimiter(String text) {
            for (String delimiter : delimiterList) {
                if (text.contains(delimiter)) {
                    return delimiter;
                }
            }
            return "";
        }

        public static void main(String[] args) {
            String csvFile = "/Users/csv/country.csv";
            String line;
            String delimiter = "";
            boolean firstLine = true;
            try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
                while ((line = br.readLine()) != null) {
                    if (firstLine) {
                        // detect the delimiter once, from the first line only
                        delimiter = determineDelimiter(line);
                        if (delimiter.isEmpty()) {
                            System.out.println("No supported delimiter found in: " + line);
                            return;
                        }
                        firstLine = false;
                    }
                    // split() takes a regex, so quote the delimiter (matters for e.g. "|")
                    String[] country = line.split(Pattern.quote(delimiter));
                    System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    

    Update

    For a more compact approach, you can replace the for-each loop in the determineDelimiter() method with a regular expression.
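
    A minimal sketch of what a regex-based determineDelimiter() might look like, assuming the supported delimiters are comma, semicolon, tab and pipe (that set, and the class name RegexDelimiter, are assumptions for illustration). Note one behavioral difference: the loop version returns the first delimiter in list order, while this version returns whichever supported delimiter occurs first in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDelimiter {

    // Character class of the supported delimiters (assumed set; adjust as needed).
    // Inside a character class, '|' is a literal character.
    private static final Pattern DELIMITERS = Pattern.compile("[,;\t|]");

    // Returns the first supported delimiter that occurs in the text,
    // or "" if none of them is present.
    static String determineDelimiter(String text) {
        Matcher m = DELIMITERS.matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(determineDelimiter("a;b;c"));         // ;
        System.out.println(determineDelimiter("abc").isEmpty()); // true
    }
}
```

    The same limitation applies as before: this only works if the delimiter characters never appear as regular text in the first line.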
