How to find whether a particular string has Unicode characters (esp. double-byte characters)

我在风中等你 2020-12-08 13:55

To be more precise, I need to know whether (and if possible, how) I can find whether a given string has double byte characters or not. Basically, I need to open a pop-up to

6 answers
  •  没有蜡笔的小新
    2020-12-08 14:26

    I have benchmarked the two functions in the top answers and thought I would share the results. Here is the test code I used:

    const text1 = `The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. 中文維基百科的副標題是「海納百川,有容乃大」,這是中国的清朝政治家林则徐(1785年-1850年)於1839年為`;
    
    const regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
    function containsNonLatinCodepoints(s) {
        return regex.test(s);
    }
    
    function isDoubleByte(str) {
        for (var i = 0, n = str.length; i < n; i++) {
            if (str.charCodeAt( i ) > 255) { return true; }
        }
        return false;
    }
    
    function benchmark(fn, str) {
        let startTime = new Date();
        for (let i = 0; i < 10000000; i++) {
            fn(str);
        }   
        let endTime = new Date();
    
        return endTime.getTime() - startTime.getTime();
    }
    
    console.info('isDoubleByte => ' + benchmark(isDoubleByte, text1));
    console.info('containsNonLatinCodepoints => ' + benchmark(containsNonLatinCodepoints, text1));
    

    When running this I got:

    isDoubleByte => 2421
    containsNonLatinCodepoints => 868
    

    So for this particular string, the regex solution is about 2.8 times faster (868 ms versus 2421 ms).

    However, note that for a string whose first character is outside the Latin-1 range (char code above 255), isDoubleByte() returns right away and so is much faster than the regex (which still pays the overhead of invoking the regular expression engine).

    For instance, for the string 中国, I got these results:

    isDoubleByte => 51
    containsNonLatinCodepoints => 288
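
    The two cases above (match found late versus match found at index 0) can be compared side by side. The following is only a sketch of a slightly more careful harness, not the code used for the numbers above: `benchmarkWithWarmup` and the `lateMatch` test string are hypothetical names introduced here, and it assumes an environment where `performance.now()` is available as a global (modern browsers and recent Node.js versions). Timings will differ by engine and machine.

    ```javascript
    const regex = /[^\u0000-\u00ff]/;

    function containsNonLatinCodepoints(s) {
        return regex.test(s);
    }

    function isDoubleByte(str) {
        for (let i = 0, n = str.length; i < n; i++) {
            if (str.charCodeAt(i) > 255) { return true; }
        }
        return false;
    }

    // Hypothetical helper (not from the original answer): runs a short
    // warm-up pass, then times `iterations` calls with performance.now().
    function benchmarkWithWarmup(fn, str, iterations = 1000000) {
        for (let i = 0; i < 10000; i++) {
            fn(str); // warm-up, result ignored
        }
        let hits = 0;
        const start = performance.now();
        for (let i = 0; i < iterations; i++) {
            if (fn(str)) { hits++; } // count results so each call's value is used
        }
        const elapsedMs = performance.now() - start;
        return { elapsedMs, hits };
    }

    // Assumed test inputs: one string whose only non-Latin-1 character is at
    // the end, and one that starts with a non-Latin-1 character.
    const lateMatch = 'a'.repeat(100) + '中';
    const earlyMatch = '中国';

    for (const [label, s] of [['late match', lateMatch], ['early match', earlyMatch]]) {
        console.info(label + ' isDoubleByte => ' + benchmarkWithWarmup(isDoubleByte, s).elapsedMs);
        console.info(label + ' containsNonLatinCodepoints => ' + benchmarkWithWarmup(containsNonLatinCodepoints, s).elapsedMs);
    }
    ```

    The `hits` counter is there so that the return value of each call is actually consumed; whether that changes the measurement depends on the engine.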
    

    To get the best of both worlds, it's probably better to combine the two:

    var regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
    function containsDoubleByte(str) {
        if (!str.length) return false;
        if (str.charCodeAt(0) > 255) return true;
        return regex.test(str);
    }
    

    In that case, if the first character is Chinese (which is likely if the whole text is Chinese), the function will be fast and return right away. If not, it will run the regex, which is still faster than checking each character individually.
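
    As a quick sanity check of the combined function, here is a small usage sketch. The sample strings are my own picks for illustration. Note that characters in the Latin-1 range, such as é (U+00E9), have a char code of 255 or less and are therefore reported as not double-byte by this test.

    ```javascript
    var regex = /[^\u0000-\u00ff]/;

    function containsDoubleByte(str) {
        if (!str.length) return false;              // empty string: nothing to check
        if (str.charCodeAt(0) > 255) return true;   // fast path: first char is outside Latin-1
        return regex.test(str);                     // otherwise let the regex scan the string
    }

    // Illustrative inputs (assumed, not from the original benchmark).
    var samples = ['', 'hello', 'caf\u00e9', '中国', 'abc中'];
    samples.forEach(function (s) {
        console.info(JSON.stringify(s) + ' => ' + containsDoubleByte(s));
    });
    // "" => false
    // "hello" => false
    // "café" => false
    // "中国" => true
    // "abc中" => true
    ```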
