Doc2Vec: TaggedLineDocument()

Question

So, I'm trying to learn and understand Doc2Vec. I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:

    input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...] 

    documents = TaggedLineDocument(input)

    model = doc2vec.Doc2Vec(documents,size = 50, window = 10, min_count = 2, workers=2)

But I am getting an error (I tried googling it, with no luck):

    TypeError('don\'t know how to handle uri %s' % repr(uri))

Can somebody please help me understand where I am going wrong? Thank you!

Answer 1:

TaggedLineDocument should be instantiated with a file path, not with a list. Make sure the file is set up with one document per line, as whitespace-separated tokens.

    documents = TaggedLineDocument('myfile.txt')
    documents = TaggedLineDocument('compressed_text.txt.gz')

From the source code:

The uri (the thing you are instantiating TaggedLineDocument with) can be either:

1. a URI for the local filesystem (compressed ``.gz`` or ``.bz2`` files handled automatically):
   `./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
   `s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
4. an instance of the boto.s3.key.Key class.
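
If your documents are already in memory as lists of words, one way to follow this answer is to write them to a text file first, one document per line. Here is a minimal sketch of that; the helper name `write_line_corpus`, the toy data, and the temporary file path are made up for illustration, and the gensim part is guarded so the file-writing step runs even where gensim is not installed.

```python
import os
import tempfile


def write_line_corpus(docs, path):
    """Write one document per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as fh:
        for tokens in docs:
            fh.write(" ".join(tokens) + "\n")


# Toy data standing in for the asker's list of token lists.
docs = [["human", "machine", "interface"], ["graph", "of", "trees"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "lines.txt")
write_line_corpus(docs, corpus_path)

try:
    from gensim.models.doc2vec import TaggedLineDocument
except ImportError:
    # gensim is not installed; the corpus file has still been written.
    TaggedLineDocument = None

if TaggedLineDocument is not None:
    # Each line is split on whitespace and tagged with its line number.
    for doc in TaggedLineDocument(corpus_path):
        print(doc.tags, doc.words)
```

The same `TaggedLineDocument(corpus_path)` object can then be passed to `Doc2Vec` in place of the list used in the question.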

Answer 2:

For the data, I have the same formatted list as yours:

    [['aw', 'wb', 'ce', 'uw', 'qqg'], ['g', 'e', 'ent', 'va'], ['a'], ...]

For the labels, I have a list such as [1, 0, 0, ...]. Each entry is the class of the corresponding sentence above; you can use any class (tag) here, not only 1 or 0.

Since we already have the data as a list like the one above, we can use TaggedDocument directly instead of TaggedLineDocument:

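    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def myDataFlow(data, labels):
        # Pair each token list with its label, which becomes the document tag.
        for words, label in zip(data, labels):
            yield TaggedDocument(words, [label])

    # Doc2Vec needs more than one pass over the corpus, so pass a list
    # rather than a one-shot generator.
    model = Doc2Vec(list(myDataFlow(data, labels)))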
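
A generator can only be consumed once, while Doc2Vec goes over its corpus several times (once to build the vocabulary, then once per training epoch). If the data is too large to copy into a list, a small class with `__iter__` gives a fresh pass each time. This is only a sketch: the class name `TaggedCorpus` is made up, and it tags each document with its position instead of a class label (with class labels, all documents of the same class share a single vector). The namedtuple fallback mirrors the fields of gensim's `TaggedDocument` so the snippet also runs without gensim.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


class TaggedCorpus:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Tag each document with its index in the input list.
        for i, words in enumerate(self.data):
            yield TaggedDocument(words, [i])


data = [["aw", "wb", "ce", "uw", "qqg"], ["g", "e", "ent", "va"], ["a"]]
corpus = TaggedCorpus(data)

# Unlike a generator, the corpus can be iterated more than once.
first_pass = list(corpus)
second_pass = list(corpus)
```

With gensim installed, `corpus` can be handed to `Doc2Vec` directly, since it can be iterated as many times as training requires.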

Source: https://stackoverflow.com/questions/36780138/doc2vec-taggedlinedocument

Tags

python

nlp

gensim