What exactly does target_vocab_size mean in the method tfds.features.text.SubwordTextEncoder.build_from_corpus?


Question


According to this link, target_vocab_size is described as: "int, approximate size of the vocabulary to create." That statement is ambiguous to me. As far as I understand, the encoder maps each vocabulary entry to a unique ID. What happens if the corpus contains more vocabulary items than target_vocab_size?


Answer 1:


The documentation says:

Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded

This means unknown word pieces will be encoded one character at a time. It's best understood through an example. Suppose you build a SubwordTextEncoder from a very large corpus of English text, so that most common words end up in the vocabulary.

import tensorflow_datasets as tfds

# corpus_sentences: an iterable (or generator) of strings from a large English corpus.
vocab_size = 10000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_sentences, vocab_size)
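
Since target_vocab_size is only approximate, you can inspect what was actually learned once the tokenizer is built. A minimal sketch using the vocab_size and subwords attributes of SubwordTextEncoder (the printed values are illustrative and depend on your corpus):

# The realized vocabulary size is close to, but usually not exactly, the target.
print(tokenizer.vocab_size)
# The learned subword pieces themselves.
print(tokenizer.subwords[:20])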

Let's say you try to tokenize the following sentence.

tokenizer.encode("good badwords badxyz")

It will be tokenized as:

  1. good
  2. bad
  3. words
  4. bad
  5. x
  6. y
  7. z

As you can see, "bad" and "words" are in-vocabulary subword pieces, but "xyz" is not, so it is tokenized character by character. In other words, if the corpus contains more distinct words than target_vocab_size allows, the less frequent ones are simply represented by smaller subword pieces or individual characters/bytes, so the encoding stays fully invertible.
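
To see this decomposition yourself, you can decode each returned ID individually. A small sketch, assuming the tokenizer built above (the exact IDs will depend on your corpus):

ids = tokenizer.encode("good badwords badxyz")
# Decode one ID at a time to see which subword piece each ID maps back to.
for token_id in ids:
    print(token_id, "->", tokenizer.decode([token_id]))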



Source: https://stackoverflow.com/questions/56308612/what-exactly-does-target-vocab-size-mean-in-the-method-tfds-features-text-subwor
