Building a Transition Matrix using words in Python/Numpy

抹茶落季 2020-12-09 22:32

I'm trying to build a 3x3 transition matrix with this data:

days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
  'rai
        
6 Answers
  •  感情败类
    2020-12-09 23:12

    Here is a "pure" NumPy solution. It creates 3x3 tables in which the zeroth dimension (row index) corresponds to today and the last dimension (column index) corresponds to tomorrow.

    The conversion from words to indices is done by truncating after the first letter and then using a lookup table.
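    To make the truncation and lookup step concrete, here is a minimal sketch on a five-word list. The list `words` is made up for illustration, and the lookup table is filled with -1 so that unexpected letters are easy to spot. The last lines show `np.unique(..., return_inverse=True)` as an alternative that does not rely on distinct first letters; it is a swapped-in technique, not the one used in this answer.

```python
import numpy as np

words = ['rain', 'sun', 'clouds', 'sun', 'rain']  # hypothetical sample

# dtype 'S1' keeps a single byte per word, i.e. the first letter
first = np.array(words, dtype='S1')
# reinterpret those bytes as their ASCII codes
codes = first.view(np.uint8)
print(codes.tolist())  # [114, 115, 99, 115, 114]

# lookup table indexed by ASCII code; only the c, r, s entries are filled
lookup = np.full(128, -1, dtype=int)
lookup[[ord('c'), ord('r'), ord('s')]] = np.arange(3)
idx = lookup[codes]
print(idx.tolist())  # [1, 2, 0, 2, 1]

# alternative: let NumPy assign indices in sorted label order
states, inverse = np.unique(words, return_inverse=True)
print(states.tolist())   # ['clouds', 'rain', 'sun']
print(inverse.tolist())  # [1, 2, 0, 2, 1]
```

    Both routes give the same indices here because the sorted label order (clouds, rain, sun) matches the c, r, s order of the lookup table.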

    For counting, numpy.add.at is used.
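    The reason numpy.add.at is needed rather than a plain `+=` with index arrays is that the same (today, tomorrow) pair occurs many times. A small sketch with made-up index arrays `today` and `tomorrow` shows the difference:

```python
import numpy as np

# hypothetical index pairs; the pair (0, 1) occurs three times
today = np.array([0, 0, 1, 0])
tomorrow = np.array([1, 1, 2, 1])

# fancy-index += is buffered: each distinct cell is incremented only once
buffered = np.zeros((3, 3), dtype=int)
buffered[today, tomorrow] += 1

# np.add.at is unbuffered: every occurrence of a pair is accumulated
unbuffered = np.zeros((3, 3), dtype=int)
np.add.at(unbuffered, (today, tomorrow), 1)

print(buffered[0, 1], unbuffered[0, 1])  # 1 3
```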

    This was written with efficiency in mind. It does a million words in less than a second.

    import numpy as np
    
    report = [
      'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
      'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
      'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
      'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
      'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
      'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
      'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
      'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
      'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
      'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
      'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
      'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
      'sun', 'sun', 'rain']
    
    # create np array, keeping only the first letter (by forcing dtype 'S1');
    # this only works because rain, sun, clouds start with different letters
    # reinterpret the bytes as uint8 so they can be used for indexing
    ri = np.array(report, dtype='|S1').view(np.uint8)
    # create lookup
    c, r, s = 99, 114, 115 # you can verify this using chr and ord
    lookup = np.empty((s+1,), dtype=int)
    lookup[[c, r, s]] = np.arange(3)
    # translate c, r, s to 0, 1, 2
    rc = lookup[ri]
    # get counts (of pairs (today, tomorrow))
    cnts = np.zeros((3, 3), dtype=int)
    np.add.at(cnts, (rc[:-1], rc[1:]), 1)
    # or as probs
    probs = cnts / cnts.sum()
    # or as conditional probs (if today is sun, how probable is rain tomorrow, etc.)
    cond = cnts / cnts.sum(axis=-1, keepdims=True)
    
    print(cnts)
    print(probs)
    print(cond)
    
    # output (rows and columns are ordered clouds, rain, sun):
    # [[13  9 10]
    #  [ 6 11  9]
    #  [13  6 23]]
    # [[ 0.13  0.09  0.1 ]
    #  [ 0.06  0.11  0.09]
    #  [ 0.13  0.06  0.23]]
    # [[ 0.40625     0.28125     0.3125    ]
    #  [ 0.23076923  0.42307692  0.34615385]
    #  [ 0.30952381  0.14285714  0.54761905]]
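    As a hedged sketch of how to read the conditional table, the block below re-enters the counts from the printed output and looks up one entry by name. The `state` dictionary and the guarded division via the `out` and `where` arguments of `np.divide` are additions for illustration; the guard only matters if some state never occurs as "today", which is not the case for this data.

```python
import numpy as np

# counts copied from the printed output; rows = today, columns = tomorrow
cnts = np.array([[13,  9, 10],
                 [ 6, 11,  9],
                 [13,  6, 23]])
# index order follows the lookup table (c, r, s)
state = {'clouds': 0, 'rain': 1, 'sun': 2}

row_sums = cnts.sum(axis=-1, keepdims=True)
# guarded division: a state with no outgoing transitions gets a zero row, not NaN
cond = np.divide(cnts, row_sums, out=np.zeros(cnts.shape), where=row_sums > 0)

# P(rain tomorrow | sun today) = 6 / 42
p = cond[state['sun'], state['rain']]
print(round(p, 4))  # 0.1429

# each row of a conditional transition matrix should sum to 1
print(np.allclose(cond.sum(axis=-1), 1.0))  # True
```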
    
