Generating random sentences from custom text in Python's NLTK?

借酒劲吻你

I'm having trouble with NLTK under Python, specifically the .generate() method.

generate(self, length=100)

Print random text, generat

5 answers

醉话见心

2020-12-24 04:28

To generate random text, you need to use Markov chains.

Code to do that (from here):

import random

class Markov(object):

  def __init__(self, open_file):
    self.cache = {}
    self.open_file = open_file
    self.words = self.file_to_words()
    self.word_size = len(self.words)
    self.database()


  def file_to_words(self):
    self.open_file.seek(0)
    data = self.open_file.read()
    words = data.split()
    return words


  def triples(self):
    """ Generates triples from the given data string. So if our string were
    "What a lovely day", we'd generate (What, a, lovely) and then
    (a, lovely, day).
    """

    if len(self.words) < 3:
      return

    for i in range(len(self.words) - 2):
      yield (self.words[i], self.words[i+1], self.words[i+2])

  def database(self):
    for w1, w2, w3 in self.triples():
      key = (w1, w2)
      if key in self.cache:
        self.cache[key].append(w3)
      else:
        self.cache[key] = [w3]

  def generate_markov_text(self, size=25):
    seed = random.randint(0, self.word_size-3)
    seed_word, next_word = self.words[seed], self.words[seed+1]
    w1, w2 = seed_word, next_word
    gen_words = []
    for i in range(size):
      gen_words.append(w1)
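      # The corpus's final word pair may have no recorded successor; fall back to any word
      w1, w2 = w2, random.choice(self.cache.get((w1, w2), self.words))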
    gen_words.append(w2)
    return ' '.join(gen_words)

Explanation: Generating pseudo random text with Markov chains using Python


Happy的楠姐

2020-12-24 04:35

You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.

In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
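As a rough sketch of that idea, the code below trains an order-2 Markov model on one token list per sentence and records every observed sentence opening as a possible start state. The names `split_sentences` and `SentenceMarkov` are made up for this illustration, and the regex splitter is only a naive stand-in for a real sentence tokenizer (NLTK's `sent_tokenize` and `word_tokenize` would be the natural replacements). The sample text is a toy corpus, not 1984.

```python
import random
import re
from collections import defaultdict


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class SentenceMarkov:
    """Order-2 Markov model trained on one token list per sentence."""

    def __init__(self, sentences, rng=None):
        self.rng = rng or random.Random()
        self.starts = []                      # observed sentence openings ("pi")
        self.transitions = defaultdict(list)  # (w1, w2) -> words seen next
        for tokens in sentences:
            if len(tokens) < 2:
                continue  # too short to provide a starting pair
            self.starts.append((tokens[0], tokens[1]))
            for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
                self.transitions[(w1, w2)].append(w3)

    def generate(self, max_words=25):
        # Sample the opening pair from the observed sentence starts
        w1, w2 = self.rng.choice(self.starts)
        words = [w1, w2]
        while len(words) < max_words:
            followers = self.transitions.get((w1, w2))
            if not followers:  # pair was only ever seen at a sentence end
                break
            w1, w2 = w2, self.rng.choice(followers)
            words.append(w2)
        return " ".join(words)


# Toy corpus (an assumption for the example)
text = "The cat sat down. The dog ran off. A bird sang. It was loud."
sentences = [s.split() for s in split_sentences(text)]
model = SentenceMarkov(sentences)
print(model.generate())
```

Because `starts` holds one entry per training sentence, repeated calls to `generate()` can open in different ways, and more frequent openings are sampled more often.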

甜味超标

2020-12-24 04:36

Maybe you can shuffle the tokens array before generating a sentence.

南笙

2020-12-24 04:38

Your sample corpus is most likely too small. I don't know exactly how nltk builds its trigram model, but it is common practice to handle the beginnings and ends of sentences in some way. Since your corpus contains only one beginning of a sentence, this might be why every generated sentence has the same beginning.
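One common way to handle sentence boundaries is to pad each sentence with start and end markers before collecting trigrams. The sketch below illustrates that technique only; it is not a description of what nltk does internally, and the marker strings and function names are hypothetical.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def padded_trigrams(tokens):
    # Two start markers so the first real word has a full two-word context
    padded = [START, START] + list(tokens) + [END]
    return zip(padded, padded[1:], padded[2:])


def build_model(sentences):
    model = defaultdict(list)  # (w1, w2) -> words seen next
    for tokens in sentences:
        for w1, w2, w3 in padded_trigrams(tokens):
            model[(w1, w2)].append(w3)
    return model


one_sentence = [["the", "cat", "sat", "on", "the", "mat"]]
three_sentences = one_sentence + [["a", "dog", "barked"], ["birds", "sing"]]

# Words that can open a generated sentence under each corpus
print(sorted(set(build_model(one_sentence)[(START, START)])))     # ['the']
print(sorted(set(build_model(three_sentences)[(START, START)])))  # ['a', 'birds', 'the']
```

With a single training sentence, the start context has exactly one possible successor, so every generated sentence has to begin the same way.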

孤城傲影

2020-12-24 04:43
Are you sure that using word_tokenize is the right approach?

This Google groups page has the example:
```
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate() 
```
But I've never used nltk, so I can't say whether that works the way you want.