Question
I'm trying to learn and understand Doc2Vec, and I'm following this tutorial. My input is a list of documents, i.e. a list of lists of words. This is what my code looks like:
input = [["word1","word2",..."wordn"],["word1","word2",..."wordn"],...]
documents = TaggedLineDocument(input)
model = doc2vec.Doc2Vec(documents,size = 50, window = 10, min_count = 2, workers=2)
But I am getting what looks like a unicode error (I tried googling it, with no luck):
TypeError('don\'t know how to handle uri %s' % repr(uri))
Can somebody please help me understand where I am going wrong? Thank you!
Answer 1:
TaggedLineDocument should be instantiated with a file path, not with a list. Make sure the file is set up with one document per line.
documents = TaggedLineDocument('myfile.txt')
documents = TaggedLineDocument('compressed_text.txt.gz')
From the source code:
The uri (the thing you are instantiating TaggedLineDocument with) can be either:
1. a URI for the local filesystem (compressed ``.gz`` or ``.bz2`` files handled automatically):
`./lines.txt`, `/home/joe/lines.txt.gz`, `file:///home/joe/lines.txt.bz2`
2. a URI for HDFS: `hdfs:///some/path/lines.txt`
3. a URI for Amazon's S3 (can also supply credentials inside the URI):
`s3://my_bucket/lines.txt`, `s3://my_aws_key_id:key_secret@my_bucket/lines.txt`
4. an instance of the boto.s3.key.Key class.
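Since the question starts from a list of token lists rather than a file, one way to follow this answer is to write that list to disk first, one document per line, and then point TaggedLineDocument at the path. The sketch below is only an illustration: the helper names `write_line_corpus` and `load_tagged_lines`, the sample tokens, and the temporary file location are all made up for the example, and the gensim import is deferred so the file-writing part runs on its own.

```python
import os
import tempfile


def write_line_corpus(docs, path):
    """Write each tokenized document as one space-joined line."""
    with open(path, "w", encoding="utf-8") as fh:
        for tokens in docs:
            fh.write(" ".join(tokens) + "\n")


def load_tagged_lines(path):
    # Imported lazily so the helper above works without gensim installed.
    from gensim.models.doc2vec import TaggedLineDocument
    # Each line becomes one document, tagged with its line number.
    return TaggedLineDocument(path)


# Hypothetical sample data standing in for the question's `input` list.
docs = [["aw", "wb", "ce"], ["g", "e", "ent", "va"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "lines.txt")
write_line_corpus(docs, corpus_path)
```

With gensim installed, `load_tagged_lines(corpus_path)` should give an object that can be passed to Doc2Vec in place of the raw list. Note that this route only works if no token contains whitespace, because the line is split on whitespace again when it is read back.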
Answer 2:
For the data, I have the same formatted list as yours:
[['aw', 'wb', 'ce', 'uw', 'qqg'], ['g', 'e', 'ent', 'va'],['a']...]
For the labels, I have a list: [1, 0, 0 ...]. It gives the class of each sentence above; you can use any class (tag) here, not only 1 or 0.
Since we already have the data as a list like the one above, we can use TaggedDocument directly instead of TaggedLineDocument:
model = gensim.models.Doc2Vec(self.myDataFlow(data, labels))

def myDataFlow(self, data, labels):
    # Pair each token list with its label and wrap it as a TaggedDocument.
    for i, j in zip(data, labels):
        yield TaggedDocument(i, [j])
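A standalone variant of the same idea, written outside a class, might look like the sketch below. It builds a list rather than a generator, because a generator can be consumed only once while Doc2Vec needs one pass to build the vocabulary and further passes to train. The function names `build_corpus` and `train` are made up for illustration, and the namedtuple fallback is an assumption added only so the snippet can be inspected without gensim installed. The parameter values are copied from the question, with `size` written as `vector_size`, the name used in gensim 4.x.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Assumption: a stand-in with the same two fields, used only when
    # gensim is not installed. Real training needs the gensim class.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def build_corpus(data, labels):
    """Pair each token list with its label, as a re-iterable list."""
    return [TaggedDocument(words, [label])
            for words, label in zip(data, labels)]


def train(corpus):
    # Requires gensim. Older releases call the dimensionality `size`.
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec(documents=corpus, vector_size=50, window=10,
                   min_count=2, workers=2)


# Sample data in the same shape as the lists shown in this answer.
data = [["aw", "wb", "ce", "uw", "qqg"], ["g", "e", "ent", "va"], ["a"]]
labels = [1, 0, 0]
corpus = build_corpus(data, labels)
```

On a corpus this small, `min_count = 2` would discard every word, so `train(corpus)` is only meaningful with real data or a lower `min_count`.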
Source: https://stackoverflow.com/questions/36780138/doc2vec-taggedlinedocument