Fast transliteration for Arabic Text with Python

难免孤独 2021-02-06 09:14

I always work on Arabic text files, and to avoid encoding problems I transliterate Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/tra

5 answers
  •  星月不相逢
    2021-02-06 09:43

    Whenever you need to transliterate, str.translate is the method to use:

    >>> import timeit
    >>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
    >>> def repl(data, table):
    ...     # Baseline: one str.replace pass per table entry.
    ...     for k, v in table.items():
    ...         data = data.replace(k, v)
    ...     return data
    ... 
    >>> def trans(data, table):
    ...     # table must map code points (ints) to replacement strings.
    ...     return data.translate(table)
    ... 
    >>> transTable = str.maketrans(buckArab)  # Python 3: single-character keys become code points
    >>> T = 'This is a test to see how fast is transliteration'
    >>> timeit.timeit('trans(T, transTable)', 'from __main__ import trans, T, transTable', number=10**6)  # timings as originally reported; they vary by machine and Python version
    6.766200065612793
    >>> timeit.timeit('repl(T, buckArab)', 'from __main__ import repl, T, buckArab', number=10**6)
    12.668706893920898
    

    As you can see, even for a short string str.translate is roughly twice as fast as repeated str.replace calls.
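
For reference, here is a minimal Python 3 sketch of the same idea in both directions. It assumes a small illustrative subset of the Buckwalter table rather than the full mapping, and the names BUCK_TO_ARABIC, to_arabic and to_buckwalter are hypothetical, not part of the answer above. str.maketrans turns the single-character keys into the code points that str.translate expects.

```python
# Illustrative subset of the Buckwalter mapping (assumption: NOT the full table).
BUCK_TO_ARABIC = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062a",  # teh
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
}

# str.maketrans converts single-character string keys to integer code points.
TO_ARABIC = str.maketrans(BUCK_TO_ARABIC)
# Invert the dictionary to get the Arabic -> Buckwalter direction.
TO_BUCK = str.maketrans({v: k for k, v in BUCK_TO_ARABIC.items()})


def to_arabic(text):
    """Replace Buckwalter symbols with Arabic letters in a single pass."""
    return text.translate(TO_ARABIC)


def to_buckwalter(text):
    """Replace Arabic letters with Buckwalter symbols in a single pass."""
    return text.translate(TO_BUCK)


word = to_arabic("ktAb")
print(word == "\u0643\u062a\u0627\u0628")  # True
print(to_buckwalter(word))                 # ktAb
```

Characters missing from the table pass through unchanged, so a partial table degrades gracefully. Building both tables once at import time keeps the per-call cost down to a single translate pass.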
