Fast transliteration for Arabic Text with Python

难免孤独 2021-02-06 09:14

I always work on Arabic text files, and to avoid encoding problems I transliterate Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/tra

5 answers
  •  忘掉有多难
    2021-02-06 09:36

    Incidentally, someone already wrote a script that does this, so you might want to check that out before spending too much time on your own: buckwalter2unicode.py

    It probably does more than what you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and use that on my site.

    Edit: The script above is what I had been using, but I've just discovered that it is much slower than using replace, especially on a large corpus. This is the code I finally ended up with, which seems simpler and faster (it references a dictionary, buck2uni):

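    def transString(string, reverse=False):
        """Given a Unicode string, transliterate it into Buckwalter. To go from
        Buckwalter back to Unicode, set reverse=True."""

        # buck2uni maps each Buckwalter character to its Unicode counterpart
        # and must be defined before this function is called.
        for k, v in buck2uni.items():
            if not reverse:
                string = string.replace(v, k)  # Unicode -> Buckwalter
            else:
                string = string.replace(k, v)  # Buckwalter -> Unicode

        return string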
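
The loop above scans the whole string once per dictionary entry. Since each Buckwalter symbol is a single character, a one-pass variant with str.translate might be worth timing against it on a large corpus. The following is only a sketch under assumptions: buck2uni_sample is a hypothetical four-letter excerpt standing in for the full buck2uni dictionary, and trans_string_translate is an illustrative name, not part of the original script.

```python
# Hypothetical excerpt of a Buckwalter-to-Unicode table, for illustration
# only; the real buck2uni dictionary is assumed to cover the full scheme.
buck2uni_sample = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "k": "\u0643",  # kaf
}

# Build both translation tables once, outside the function, so that each
# call only has to do a single pass over the text.
BUCK_TO_UNI = str.maketrans(buck2uni_sample)
UNI_TO_BUCK = str.maketrans({v: k for k, v in buck2uni_sample.items()})


def trans_string_translate(text, reverse=False):
    """Transliterate Unicode Arabic into Buckwalter in one pass.

    Set reverse=True to go from Buckwalter back to Unicode. Characters
    that are not in the table are left unchanged.
    """
    table = BUCK_TO_UNI if reverse else UNI_TO_BUCK
    return text.translate(table)


if __name__ == "__main__":
    arabic = trans_string_translate("ktAb", reverse=True)
    print(arabic)
    print(trans_string_translate(arabic))
```

This relies on every key and value being exactly one character long, which str.maketrans requires for dictionary keys. Whether it actually beats the replace loop depends on the text and the table size, so it is best measured with timeit on real data.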
    
