Generating random sentences from custom text in Python's NLTK?

Asked 2020-12-24 03:53

I'm having trouble with NLTK under Python, specifically the .generate() method.

generate(self, length=100)

Print random text, generated using a trigram language model.

5 Answers
  • 2020-12-24 04:28

    To generate random text, you need to use Markov chains.

    Code to do that, from here:

    import random


    class Markov(object):

        def __init__(self, open_file):
            self.cache = {}
            self.open_file = open_file
            self.words = self.file_to_words()
            self.word_size = len(self.words)
            self.database()

        def file_to_words(self):
            self.open_file.seek(0)
            data = self.open_file.read()
            words = data.split()
            return words

        def triples(self):
            """Generates triples from the given data string. So if our string
            were "What a lovely day", we'd generate (What, a, lovely) and then
            (a, lovely, day).
            """
            if len(self.words) < 3:
                return

            for i in range(len(self.words) - 2):
                yield (self.words[i], self.words[i + 1], self.words[i + 2])

        def database(self):
            # Map each pair of consecutive words to the list of words that
            # have been seen following that pair.
            for w1, w2, w3 in self.triples():
                key = (w1, w2)
                if key in self.cache:
                    self.cache[key].append(w3)
                else:
                    self.cache[key] = [w3]

        def generate_markov_text(self, size=25):
            # Start from a random word pair, then repeatedly pick a random
            # successor of the current pair until `size` words are emitted.
            seed = random.randint(0, self.word_size - 3)
            seed_word, next_word = self.words[seed], self.words[seed + 1]
            w1, w2 = seed_word, next_word
            gen_words = []
            for i in range(size):  # range, not Python 2's xrange
                gen_words.append(w1)
                w1, w2 = w2, random.choice(self.cache[(w1, w2)])
            gen_words.append(w2)
            return ' '.join(gen_words)
    

    Explanation: Generating pseudo random text with Markov chains using Python
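
    A minimal usage sketch (the file name 1984.txt is just an example; any
    plain-text corpus will do):

    with open('1984.txt') as f:  # hypothetical corpus file
        markov = Markov(f)
        print(markov.generate_markov_text(size=25))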

  • 2020-12-24 04:35

    You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.

    In the case of Orwell's 1984, you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens), and then feed each sentence separately to the Markov model, as sketched below. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
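
    A short sketch of that preprocessing (the corpus path is hypothetical,
    and nltk.download('punkt') may be needed once for the tokenizers):

    import nltk

    raw = open('1984.txt').read()  # hypothetical path to the corpus

    # Sentence-tokenize first, then word-tokenize each sentence, yielding
    # a list of lists of tokens -- one inner list per sentence.
    sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]

    # Each inner list can now be fed to the Markov model as a separate
    # training sequence, so every sentence start in the corpus gets sampled.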

  • 2020-12-24 04:36

    Maybe you can shuffle the token list randomly before generating a sentence.
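
    For example (a sketch; the corpus path is hypothetical, and note that
    shuffling changes the trigram statistics, not just the starting point):

    import random
    import nltk

    tokens = nltk.word_tokenize(open('1984.txt').read())  # hypothetical corpus
    random.shuffle(tokens)  # randomize the token order in place
    nltk.Text(tokens).generate()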

  • 2020-12-24 04:38

    Your sample corpus is most likely too small. I don't know exactly how NLTK builds its trigram model, but it is common practice to handle the beginning and end of sentences somehow. Since there is only one sentence beginning in your corpus, this might be why every generated sentence has the same beginning.
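
    For illustration, NLTK's ngram helper shows the kind of boundary padding
    trigram models commonly use (a sketch assuming a reasonably recent NLTK);
    with a single sentence start in the corpus, the padded start context is
    always the same:

    from nltk.util import ngrams

    tokens = "It was a bright cold day in April".split()
    # Pad with sentence-boundary symbols, as trigram models commonly do.
    trigrams = list(ngrams(tokens, 3,
                           pad_left=True, pad_right=True,
                           left_pad_symbol='<s>', right_pad_symbol='</s>'))
    print(trigrams[0])  # ('<s>', '<s>', 'It') -- the only start context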

  • 2020-12-24 04:43

    Are you sure that using word_tokenize is the right approach?

    This Google groups page has the example:

    >>> import nltk
    >>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
    >>> text.generate() 
    

    But I've never used nltk, so I can't say whether that works the way you want.
