Quick implementation of character n-grams for word

花落未央 2020-12-01 12:13

I wrote the following code for computing character bigrams, and the output is right below. My question is: how do I get an output that excludes the last character (i.e. t)?

3 answers
  •  被撕碎了的回忆
    2020-12-01 13:00

    Try zip:

    >>> def word2ngrams(text, n=3, exact=True):
    ...   """ Convert text into character ngrams. """
    ...   return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]
    ... 
    >>> word2ngrams('foobarbarblacksheep')
    ['foo', 'oob', 'oba', 'bar', 'arb', 'rba', 'bar', 'arb', 'rbl', 'bla', 'lac', 'ack', 'cks', 'ksh', 'she', 'hee', 'eep']
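
To see why the zip version never emits a trailing short fragment, the sketch below walks through its intermediate steps for bigrams. The word "student" is only an assumed example, chosen because it ends in t; it is not taken from the question's code.

```python
text = "student"  # hypothetical input word, not from the original question
n = 2             # bigrams

# Build n shifted copies of the text: offsets 0, 1, ..., n-1.
shifted = [text[i:] for i in range(n)]
print(shifted)  # ['student', 'tudent']

# zip pairs characters position by position and stops at the shortest copy,
# so a lone trailing 't' is never produced.
columns = list(zip(*shifted))

ngrams = ["".join(chars) for chars in columns]
print(ngrams)  # ['st', 'tu', 'ud', 'de', 'en', 'nt']
```

Because zip stops at the shortest shifted copy, the result has len(text) - n + 1 items, the same count the slicing approach produces.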
    

    But do note that the zip version is slower than plain slicing:

    import random
    import string
    import time
    
    def zip_ngrams(text, n=3, exact=True):
        # zip over n shifted copies of the text (the exact parameter is unused).
        return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]
    
    def nozip_ngrams(text, n=3):
        # Slice every window of length n directly.
        return [text[i:i+n] for i in range(len(text)-n+1)]
    
    # Generate 10000 random strings of length 100.
    words = [''.join(random.choice(string.ascii_uppercase) for j in range(100)) for i in range(10000)]
    
    start = time.time()
    x = [zip_ngrams(w) for w in words]
    print(time.time() - start)
    
    start = time.time()
    y = [nozip_ngrams(w) for w in words]
    print(time.time() - start)
    
    print(x == y)
    

    [out]:

    0.314492940903
    0.197558879852
    True
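
Single time.time() measurements like the ones above can be noisy. A possibly more stable way to compare the two approaches is the standard-library timeit module. The following is only a sketch: the name slice_ngrams, the sample size, and the repeat counts are assumptions made for illustration, and the timings it prints will differ from machine to machine.

```python
import random
import string
import timeit

def zip_ngrams(text, n=3):
    # zip over n shifted copies of the text.
    return ["".join(chars) for chars in zip(*[text[i:] for i in range(n)])]

def slice_ngrams(text, n=3):
    # Slice every window of length n; yields [] when the text is shorter than n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 1000 random uppercase strings of length 100 (smaller sample than the answer's).
random.seed(0)
words = ["".join(random.choices(string.ascii_uppercase, k=100)) for _ in range(1000)]

# Both approaches should agree on every word.
assert all(zip_ngrams(w) == slice_ngrams(w) for w in words)

for func in (zip_ngrams, slice_ngrams):
    # Take the best of 3 runs, each making 5 passes over all words.
    best = min(timeit.repeat(lambda: [func(w) for w in words], number=5, repeat=3))
    print(f"{func.__name__}: {best:.4f} s for 5 passes")
```

Taking the minimum over several repeats reduces the influence of background load on the comparison.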
    
