Im trying to build a 3x3 transition matrix with this data
days=[\'rain\', \'rain\', \'rain\', \'clouds\', \'rain\', \'sun\', \'clouds\', \'clouds\',
\'rai
Here is a "pure" numpy solution it creates 3x3 tables where the zeroth dim (row number) corresponds to today and the last dim (column number) corresponds to tomorrow.
The conversion from words to indices is done by truncating after the first letter and then using a lookup table.
For counting numpy.add.at
is used.
This was written with efficiency in mind. It does a million words in less than a second.
import numpy as np
report = [
'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun',
'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain',
'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain',
'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun',
'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun',
'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds',
'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun',
'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain',
'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain',
'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
'sun', 'sun', 'rain']
# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as condional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(cnts)
print(probs)
print(cond)
# [13 9 10]
# [ 6 11 9]
# [13 6 23]]
# [[ 0.13 0.09 0.1 ]
# [ 0.06 0.11 0.09]
# [ 0.13 0.06 0.23]]
# [[ 0.40625 0.28125 0.3125 ]
# [ 0.23076923 0.42307692 0.34615385]
# [ 0.30952381 0.14285714 0.54761905]]