Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

庸人自扰 2020-12-06 06:05

I am starting on a Python task and ran into a problem while using gensim. I am trying to load files from my disk and process them (split them and lower-case them).

3 Answers
  • 2020-12-06 06:41

    Dictionary needs tokenized strings (a list of token lists) as its input:

    dataset = ['driving car ',
               'drive car carefully',
               'student and university']
    
    # be sure to split each sentence before feeding it into Dictionary
    dataset = [d.split() for d in dataset]
    
    vocab = Dictionary(dataset)
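
    The need for the split step can be illustrated with a small pure-Python sketch of what Dictionary does internally (a simplified stand-in for illustration, not gensim's actual code): it iterates over the documents and assigns an id to each token, which only works when every document is already a list of tokens rather than a raw string.

    ```python
    # Simplified stand-in for gensim's Dictionary (illustration only,
    # not the real implementation).
    def build_vocab(documents):
        token2id = {}
        for document in documents:
            if isinstance(document, str):
                # This mirrors the gensim error from the question title.
                raise TypeError("doc2bow expects an array of unicode tokens "
                                "on input, not a single string")
            for token in document:
                # Assign the next free id to each previously unseen token.
                token2id.setdefault(token, len(token2id))
        return token2id

    dataset = ['driving car ',
               'drive car carefully',
               'student and university']

    vocab = build_vocab([d.split() for d in dataset])  # 7 unique tokens
    ```

    Passing the raw strings instead (`build_vocab(dataset)`) raises the same TypeError, because each element would be a single string rather than a token list.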
    
  • 2020-12-06 06:44

    I ran into the same problem. This is what worked for me:

        from gensim import corpora

        # Tokenize the sentence into words
        tokens = sentence.split()

        # Create the dictionary from a list of documents
        # (each document is a list of tokens)
        dictionary = corpora.Dictionary([tokens])
        print(dictionary)
    
  • 2020-12-06 06:53

    In gensim's dictionary.py, the initializer is:

    def __init__(self, documents=None):
        self.token2id = {} # token -> tokenId
        self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
        self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared
    
        self.num_docs = 0 # number of documents processed
        self.num_pos = 0 # total number of corpus positions
        self.num_nnz = 0 # total number of non-zeroes in the BOW matrix
    
        if documents is not None:
            self.add_documents(documents)
    

    The add_documents function builds the dictionary from a collection of documents, where each document is a list of tokens:

    def add_documents(self, documents):
    
        for docno, document in enumerate(documents):
            if docno % 10000 == 0:
                logger.info("adding document #%i to %s" % (docno, self))
            _ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
        logger.info("built %s from %i documents (total %i corpus positions)" %
                     (self, self.num_docs, self.num_pos))
    

    So if you initialize Dictionary this way, you must pass a collection of documents, not a single document. For example,

    dic = corpora.Dictionary([a.split()])
    

    is OK.
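
    To see why the wrapping list matters, here is a hypothetical stand-in for doc2bow (a sketch, not gensim's real code, which is why it is named doc2bow_sketch). add_documents iterates its argument and hands each element to doc2bow; without the wrapping list, each element of `a.split()` is a single token string, which triggers the TypeError.

    ```python
    def doc2bow_sketch(document):
        # Mimics gensim's type check (illustration only): a bare string is
        # rejected because iterating it would yield characters, not tokens.
        if isinstance(document, str):
            raise TypeError("doc2bow expects an array of unicode tokens "
                            "on input, not a single string")
        counts = {}
        for token in document:
            counts[token] = counts.get(token, 0) + 1
        return sorted(counts.items())

    a = 'drive car carefully'

    # Correct: a list containing one document (one token list), as in
    # corpora.Dictionary([a.split()]).
    ok = [doc2bow_sketch(doc) for doc in [a.split()]]
    # ok == [[('car', 1), ('carefully', 1), ('drive', 1)]]
    ```

    Iterating `a.split()` directly instead (`[doc2bow_sketch(doc) for doc in a.split()]`) passes each token, a plain string, to doc2bow and raises the TypeError from the question.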
