Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

庸人自扰 2020-12-06 06:05

I am just starting out with Python and have run into a problem while using gensim. I am trying to load files from my disk and process them (split them and lowercase them).

3 answers
  •  -上瘾入骨i
    2020-12-06 06:53

    In gensim's dictionary.py, the constructor (__init__) is:

    def __init__(self, documents=None):
        self.token2id = {} # token -> tokenId
        self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
        self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared
    
        self.num_docs = 0 # number of documents processed
        self.num_pos = 0 # total number of corpus positions
        self.num_nnz = 0 # total number of non-zeroes in the BOW matrix
    
        if documents is not None:
            self.add_documents(documents)
    

    The add_documents function builds the dictionary from a collection of documents, where each document is a list of tokens:

    def add_documents(self, documents):
    
        for docno, document in enumerate(documents):
            if docno % 10000 == 0:
                logger.info("adding document #%i to %s" % (docno, self))
            _ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
        logger.info("built %s from %i documents (total %i corpus positions)" %
                     (self, self.num_docs, self.num_pos))
    

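    So, if you initialize Dictionary this way, you must pass a collection of documents, not a single document. If you pass a flat list of tokens (or a single string), each element is itself a string, so doc2bow receives a string instead of a list of tokens and raises the TypeError. For example,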

    dic = corpora.Dictionary([a.split()])
    

    works, because the outer list holds one document whose tokens come from a.split().
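
The shape difference can be sketched in plain Python, without gensim installed. This is only an illustration: check_documents is a hypothetical stand-in for the type check that doc2bow performs, not gensim's own code, and the sample strings are made up.

```python
def check_documents(documents):
    """Hypothetical stand-in for the check in doc2bow: every document
    must be a sequence of tokens, not a bare string."""
    for document in documents:
        if isinstance(document, str):
            raise TypeError(
                "doc2bow expects an array of unicode tokens on input, "
                "not a single string"
            )
    return True


a = "Human machine interface for lab abc computer applications"

tokens = a.lower().split()   # one document: a flat list of tokens
documents = [tokens]         # a collection holding that one document

# Passing the flat token list makes every token look like a "document",
# and each of those is a bare string, so the check fails.
try:
    check_documents(tokens)
    flat_list_ok = True
except TypeError:
    flat_list_ok = False

# Wrapping the token list in another list gives the expected shape.
nested_list_ok = check_documents(documents)

# Several texts (for example, the contents of several files) become
# one token list per text.
texts = ["First file contents", "Second file contents here"]
corpus_tokens = [text.lower().split() for text in texts]
```

With gensim available, documents or corpus_tokens is the shape that corpora.Dictionary expects as its argument.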
