How to split a Thai sentence, which does not use spaces, into words?

不思量自难忘° 2021-02-19 06:46

How can I split a Thai sentence into words? In English, we can split words on spaces.

Example: "I go to school" splits into ['I', 'go', 'to', 'school'].

4 Answers
  • 2021-02-19 07:12

    There are multiple ways to tokenize Thai words. One is a dictionary-based or pattern-based approach: the algorithm scans through the characters, and whenever the sequence it has read so far appears in the dictionary, that sequence is counted as a word.
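
    As a rough illustration of the dictionary scan described above, here is a minimal Python sketch. The function name `first_match_segment` and the five-word toy dictionary are assumptions made up for this example; a real dictionary would hold tens of thousands of entries, and this is not how any particular library implements it.

```python
def first_match_segment(text, dictionary):
    """Scan left to right and emit a word as soon as the
    characters read so far form a dictionary entry."""
    words = []
    start = 0
    for end in range(1, len(text) + 1):
        candidate = text[start:end]
        if candidate in dictionary:
            words.append(candidate)
            start = end  # the next word begins where this one ended
    if start < len(text):
        # Trailing characters that never matched a dictionary entry
        words.append(text[start:])
    return words


# Hypothetical toy dictionary, for illustration only
toy_dictionary = {"ฉัน", "จะ", "ไป", "โรง", "เรียน"}
print(first_match_segment("ฉันจะไปโรงเรียน", toy_dictionary))
# ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
```

    Because it commits to the first match it sees, this scan can never produce a word that has a shorter dictionary word as its prefix, which is one reason practical tokenizers use more careful matching or trained models.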

    There are also recent libraries that tokenize Thai text with deep learning models trained on the BEST corpus, including rkcosmos/deepcut, pucktada/cutkum, and others.

    Example usage of deepcut:

    import deepcut
    deepcut.tokenize('ฉันจะไปโรงเรียน')
    # output as ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
    
  • 2021-02-19 07:14

    Here's how to split Thai text into words using Kotlin and ICU4J. ICU4J is a better choice than Lucene's version (last updated 6/2011) because it is constantly updated and comes with additional related tools. Search for icu4j at mvnrepository.com to see them all.

    import com.ibm.icu.text.BreakIterator
    import java.util.Locale

    fun splitIntoWords(s: String): List<String> {
        // Word-boundary iterator for the Thai locale
        val wordBreaker = BreakIterator.getWordInstance(Locale("th"))
        wordBreaker.setText(s)
        var startPos = wordBreaker.first()
        var endPos = wordBreaker.next()

        val words = mutableListOf<String>()

        // Each pair of consecutive boundaries delimits one segment
        while (endPos != BreakIterator.DONE) {
            words.add(s.substring(startPos, endPos))
            startPos = endPos
            endPos = wordBreaker.next()
        }

        return words
    }
    
  • 2021-02-19 07:16

    In 2006, someone contributed code to the Apache Lucene project to make this work.

    Their approach (written in Java) was to use the BreakIterator class, calling getWordInstance() to get a dictionary-based word iterator for the Thai language. Note also that there is a stated dependency on the ICU4J project. I have pasted the relevant section of their code below:

      private BreakIterator breaker = null;
      private Token thaiToken = null;
    
      public ThaiWordFilter(TokenStream input) {
        super(input);
        breaker = BreakIterator.getWordInstance(new Locale("th"));
      }
    
      public Token next() throws IOException {
        if (thaiToken != null) {
          String text = thaiToken.termText();
          int start = breaker.current();
          int end = breaker.next();
          if (end != BreakIterator.DONE) {
            return new Token(text.substring(start, end), 
                thaiToken.startOffset()+start,
                thaiToken.startOffset()+end, thaiToken.type());
          }
          thaiToken = null;
        }
        Token tk = input.next();
        if (tk == null) {
          return null;
        }
        String text = tk.termText();
        if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
          return new Token(text.toLowerCase(), 
                           tk.startOffset(), 
                           tk.endOffset(), 
                           tk.type());
        }
        thaiToken = tk;
        breaker.setText(text);
        int end = breaker.next();
        if (end != BreakIterator.DONE) {
          return new Token(text.substring(0, end), 
              thaiToken.startOffset(), 
              thaiToken.startOffset()+end,
              thaiToken.type());
        }
        return null;
      }
    
  • 2021-02-19 07:23

    The simplest segmenter for Chinese and Japanese uses a greedy dictionary-based scheme. This should work just as well for Thai: get a dictionary of Thai words and, starting at the current character, match the longest string from that position that exists in the dictionary. This gets you a pretty decent segmenter, at least for Chinese and Japanese.
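
    A minimal Python sketch of that greedy longest-match scheme might look like the following. The function name `longest_match_segment`, the single-character fallback for unknown text, and the six-word toy dictionary are all assumptions for illustration, not part of any library.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation: at each position, take the
    longest substring that is a dictionary entry."""
    # No candidate needs to be longer than the longest dictionary word.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # Assumption: an unknown character is emitted on its own.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Hypothetical toy dictionary, for illustration only
toy_dictionary = {"ฉัน", "จะ", "ไป", "โรง", "เรียน", "โรงเรียน"}
print(longest_match_segment("ฉันจะไปโรงเรียน", toy_dictionary))
# ['ฉัน', 'จะ', 'ไป', 'โรงเรียน']
```

    With both โรง and โรงเรียน in the dictionary, the greedy rule picks the longer compound. The same rule can also pick a wrong split when a long entry swallows the start of the following word, so treat this as a baseline rather than a finished tokenizer.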
