Question
According to this link, target_vocab_size is described as:
int, approximate size of the vocabulary to create.
This statement is ambiguous to me. As far as I understand, the encoder maps each vocabulary entry to a unique ID. What happens if the corpus contains more distinct words than target_vocab_size?
Answer 1:
The documentation says:
Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded
This means unknown word pieces are encoded one character at a time. It's best understood through an example. Suppose you build a SubwordTextEncoder
using a very large corpus of English text such that most of the common words are in vocabulary.
import tensorflow_datasets as tfds

# corpus_sentences: an iterable of strings from a large English corpus
vocab_size = 10000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_sentences, vocab_size)
Let's say you try to tokenize the following sentence:
tokenizer.encode("good badwords badxyz")
It will be tokenized as:
- good
- bad
- words
- bad
- x
- y
- z
As you can see, since the word piece "xyz" is not in the vocabulary, it is tokenized character by character.
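To see this end to end, here is a minimal, self-contained sketch of the round trip. It assumes tensorflow_datasets is installed and uses a hypothetical toy corpus, so the exact subword splits will differ from the example above (in recent versions of tensorflow_datasets the class lives under tfds.deprecated.text rather than tfds.features.text).

import tensorflow_datasets as tfds

# Hypothetical toy corpus; with a real, large English corpus most common
# words would end up in the vocabulary as whole subwords.
corpus_sentences = ["good words", "bad words", "good bad words"]

tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_sentences, target_vocab_size=1000)

ids = tokenizer.encode("good badwords badxyz")
print(ids)                                    # list of integer subword ids
print([tokenizer.decode([i]) for i in ids])   # the individual subword/byte pieces

# Encoding is fully invertible: out-of-vocab pieces fall back to byte
# encoding, so decode() reconstructs the original string exactly.
assert tokenizer.decode(ids) == "good badwords badxyz"

The assertion illustrates the documented invertibility guarantee: no matter how rare a word is, decoding the ids always recovers the original text.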
Source: https://stackoverflow.com/questions/56308612/what-exactly-does-target-vocab-size-mean-in-the-method-tfds-features-text-subwor