In Python, how to check if a string only contains certain characters?

悲&欢浪女 2020-12-02 12:36

I need to check a string containing only a..z, 0..9, and . (period) and no other character.

7 answers
  •  鱼传尺愫
    2020-12-02 13:27

    This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

    Here are my test implementations:

    import re
    
    hex_digits = set("ABCDEF1234567890")
    hex_match = re.compile(r'^[A-F0-9]+\Z')
    hex_search = re.compile(r'[^A-F0-9]')
    
    def test_set(input):
        return set(input) <= hex_digits
    
    def test_not_any(input):
        return not any(c not in hex_digits for c in input)
    
    def test_re_match1(input):
        return bool(re.compile(r'^[A-F0-9]+\Z').match(input))
    
    def test_re_match2(input):
        return bool(hex_match.match(input))
    
    def test_re_match3(input):
        return bool(re.match(r'^[A-F0-9]+\Z', input))
    
    def test_re_search1(input):
        return not bool(re.compile(r'[^A-F0-9]').search(input))
    
    def test_re_search2(input):
        return not bool(hex_search.search(input))
    
    def test_re_search3(input):
        # must be re.search: re.match would only examine the first character
        return not bool(re.search(r'[^A-F0-9]', input))
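
    The implementations above are not strictly interchangeable at the edges, which matters when adapting them. The following is a minimal sketch (the function names and sample strings are made up for illustration) showing two cases: the empty string, and a string with a trailing newline, which is the reason the patterns above end in \Z rather than $.

```python
import re

hex_digits = set("ABCDEF1234567890")

def by_set(s):
    # Subset test: the empty string passes, since the empty set is a subset of any set.
    return set(s) <= hex_digits

def by_match(s):
    # '+' requires at least one character; \Z anchors at the very end of the string.
    return bool(re.match(r'[A-F0-9]+\Z', s))

def by_search(s):
    # Looks for any forbidden character; the empty string has none, so it passes.
    return not re.search(r'[^A-F0-9]', s)

def by_match_dollar(s):
    # Illustrative variant using '$', which also matches just before a trailing newline.
    return bool(re.match(r'[A-F0-9]+$', s))

for sample in ["DEADBEEF", "deadbeef", "", "AB\n"]:
    print(repr(sample), by_set(sample), by_match(sample),
          by_search(sample), by_match_dollar(sample))
```

    The set and search approaches accept the empty string while the match approach rejects it, and only the '$' variant accepts "AB\n". Which behavior is wanted depends on the application.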
    

    And the tests, in Python 3.4.0 on Mac OS X:

    import cProfile
    import pstats
    import random
    
    # generate a list of 10000 random hex strings between 10 and 10009 characters long
    # this takes a little time; be patient
    tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]
    
    # set up profiling, then start collecting stats
    test_pr = cProfile.Profile(timeunit=0.000001)
    test_pr.enable()
    
    # run the test functions against each item in tests. 
    # this takes a little time; be patient
    for t in tests:
        for tf in [test_set, test_not_any, 
                   test_re_match1, test_re_match2, test_re_match3,
                   test_re_search1, test_re_search2, test_re_search3]:
            _ = tf(t)
    
    # stop collecting stats
    test_pr.disable()
    
    # we create our own pstats.Stats object to filter 
    # out some stuff we don't care about seeing
    test_stats = pstats.Stats(test_pr)
    
    # normally, stats are printed with the format %8.3f, 
    # but I want more significant digits
    # so this monkey patch handles that
    def _f8(x):
        return "%11.6f" % x
    
    def _print_title(self):
        print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
        print('filename:lineno(function)', file=self.stream)
    
    pstats.f8 = _f8
    pstats.Stats.print_title = _print_title
    
    # sort by cumulative time (then secondary sort by name), ascending
    # then print only our test implementation function calls:
    test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")
    

    which gave the following results:

             50335004 function calls in 13.428 seconds
    
       Ordered by: cumulative time, function name
       List reduced from 20 to 8 due to restriction 
    
       ncalls     tottime     percall     cumtime     percall filename:lineno(function)
        10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
        10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
        10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
        10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
        10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
        10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
        10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
        10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)
    

    where:

    ncalls
    the number of times the function was called
    tottime
    the total time spent in the given function, excluding time spent in calls to sub-functions
    percall
    the quotient of tottime divided by ncalls
    cumtime
    the cumulative time spent in this and all subfunctions
    percall
    the quotient of cumtime divided by primitive calls

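    The columns we actually care about are cumtime and the second percall (cumtime per call), as they show the time taken from function entry to exit. As we can see, regex match and search are not massively different: the search variants took about 0.47 to 0.48 seconds in total against about 0.37 to 0.40 seconds for the match variants.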

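    If you would otherwise compile the regex on every call, it is faster not to compile it explicitly at all. Compiling once (test_re_match2) is about 7.5% faster than compiling on every call (test_re_match1), but only about 2.5% faster than passing the pattern string straight to re.match (test_re_match3), since the re module caches recently compiled patterns.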

    test_set was roughly twice as slow as the re.search variants (0.86 seconds against about 0.47) and more than twice as slow as the re.match variants (about 0.37 seconds)

    test_not_any was a full order of magnitude slower than test_set
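
    One caveat on that last figure: cProfile instruments every function call, and each step of the generator expression in test_not_any is counted as a call (most likely where the bulk of the 50,335,004 calls in the header comes from), so the gap probably includes a good deal of profiler overhead. A rough way to cross-check is timeit, which measures wall-clock time without per-call instrumentation. This is only a sketch: the function names, the fixed seed, and the smaller sample size are my own choices, not part of the run above.

```python
import random
import re
import timeit

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'[A-F0-9]+\Z')

def check_set(s):
    return set(s) <= hex_digits

def check_not_any(s):
    return not any(c not in hex_digits for c in s)

def check_re_match(s):
    return bool(hex_match.match(s))

# A smaller sample than the original run, so that this finishes quickly.
random.seed(0)
samples = [''.join(random.choice("ABCDEF1234567890") for _ in range(n))
           for n in range(10, 1010, 10)]

def time_it(fn, repeat=3):
    # Best of several runs, in seconds, over the whole sample list.
    return min(timeit.repeat(lambda: [fn(s) for s in samples],
                             number=1, repeat=repeat))

timings = {fn.__name__: time_it(fn)
           for fn in (check_set, check_not_any, check_re_match)}

# Print fastest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-16s %.6f s" % (name, seconds))
```

    The absolute numbers will vary by machine and Python version, so treat the ordering, not the values, as the result.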

    TL;DR: Use re.match or re.search
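
    Adapting this to the character set in the question (a..z, 0..9 and the period) might look like the sketch below. The names allowed and only_allowed are made up for illustration, and the '+' means an empty string is rejected; use '*' instead if empty input should pass.

```python
import re

# Character set from the question: a-z, 0-9 and a literal period.
# Inside a character class the period needs no escaping.
allowed = re.compile(r'[a-z0-9.]+\Z')

def only_allowed(s):
    # True if s is non-empty and consists solely of a-z, 0-9 and '.'
    return bool(allowed.match(s))

print(only_allowed("file.name123"))  # True
print(only_allowed("File.name"))     # False (uppercase F)
print(only_allowed("a b"))           # False (space)
```

    On Python 3.4 and later, re.fullmatch(r'[a-z0-9.]+', s) expresses the same check without the explicit \Z anchor.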
