Convert nl string to vector or some numeric equivalent

好久不见. Posted on 2020-01-16 12:09:08

Question


I'm trying to convert a string to a numeric equivalent so I can train a neural network to classify the strings. I tried summing the ASCII values, but the result mostly reflects string length: longer strings give larger numbers and shorter strings give smaller ones.

For example, a short German string gets put into the English class, because the English words the network was trained on are short and therefore numerically small.

I was looking into Google's word2vec, which seems like it should work, but I want to do this on the client side. I found a Node.js implementation, but it just runs the command-line tool.

How can I convert a string to something numeric, perhaps a vector, in JavaScript?


Answer 1:


I'm sure you've considered assigning each new word you encounter an integer. You'll have to keep track somewhere, but that's one option.
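As a rough illustration of the integer-per-word idea, here is a minimal sketch. The names `createVocabulary`, `idFor`, and `encode` are made up for this example, and lowercasing and splitting on whitespace are assumptions rather than anything the question specifies.

```javascript
// Hypothetical helper: hands out the next unused integer to each new word.
function createVocabulary() {
  const ids = new Map(); // word -> integer id

  return {
    // Return the id for a word, assigning a fresh one the first time it is seen.
    idFor(word) {
      const key = word.toLowerCase(); // assumption: case does not matter
      if (!ids.has(key)) {
        ids.set(key, ids.size);
      }
      return ids.get(key);
    },
    // Convert a whole string to an array of ids, splitting on whitespace.
    encode(text) {
      return text
        .trim()
        .split(/\s+/)
        .filter(Boolean)
        .map((w) => this.idFor(w));
    },
  };
}

const vocab = createVocabulary();
console.log(vocab.encode("the cat saw the dog")); // [0, 1, 2, 0, 3]
```

The map is the "keep track somewhere" part: it has to be saved along with the trained network, or the same word will get a different id next time.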

You could also hash the string. JavaScript has no built-in string hash method, so you would need a small hash function of your own or a library.
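For the hashing route, here is a minimal sketch using 32-bit FNV-1a. That algorithm is my pick for illustration, not something the question or this answer prescribes. The function name `fnv1a32` is hypothetical, and it hashes UTF-16 code units instead of bytes, which is a simplification.

```javascript
// 32-bit FNV-1a over UTF-16 code units (a simplification: the reference
// algorithm works on bytes). Returns an unsigned 32-bit integer.
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as unsigned
}

console.log(fnv1a32("value"));
console.log(fnv1a32("corresponding"));
```

The outputs are deterministic but otherwise arbitrary, so two similar words will generally not get nearby numbers.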

If you don't mind a few hash collisions, and the size of the resulting integers doesn't matter, may I recommend a trick I've used a few times before.

  • Assign each letter a prime number based on its frequency in English, with the most frequent letter getting the smallest prime:

So e = 2, t = 3, a = 5, and so on, which gives us:

2       e
3       t
5       a
7       o
11      i
13      n
17      s
19      h
23      r
29      d
31      l
37      c
41      u
43      m
47      w
53      f
59      g
61      y
67      p
71      b
73      v   
79      k
83      j
89      x
97      q
101     z
  • Multiply together the primes corresponding to each letter in a word

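So "value" is 73*5*31*41*2, and "corresponding" is 37*7*23*23.... Because prime factorization is unique, each distinct collection of letters (counting repeats) gives a distinct product. The product ignores letter order, though, so anagrams collide: we've accidentally built an anagram detector.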
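The prime trick can be sketched in JavaScript like this. The names `LETTERS`, `PRIMES`, `PRIME_FOR`, and `primeProduct` are hypothetical. Skipping characters outside a-z is an assumption, and `BigInt` (ES2020) is used because the product of a longer word quickly exceeds what a regular Number can hold exactly.

```javascript
// Letters in the frequency order of the table above, paired with the
// first 26 primes in ascending order.
const LETTERS = "etaoinshrdlcumwfgypbvkjxqz";
const PRIMES = [
  2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
  43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101,
];

// Lookup table: letter -> prime (as BigInt so products stay exact).
const PRIME_FOR = new Map(
  [...LETTERS].map((ch, i) => [ch, BigInt(PRIMES[i])])
);

// Multiply the primes of every letter in the word.
function primeProduct(word) {
  let product = 1n;
  for (const ch of word.toLowerCase()) {
    const p = PRIME_FOR.get(ch);
    if (p !== undefined) {
      product *= p; // assumption: characters outside a-z are ignored
    }
  }
  return product;
}

console.log(primeProduct("value")); // 927830n
console.log(primeProduct("listen") === primeProduct("silent")); // true
```

To feed the result to a network you would still have to convert or scale the BigInt, and the magnitude grows with word length, much like the ASCII sum in the question.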

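There isn't really a linguistically sound way to get a meaningful number from a word's spelling alone, though. word2vec, for what it's worth, doesn't look at spelling: it learns a dense vector for each word from the contexts that word appears in across a training corpus.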



Source: https://stackoverflow.com/questions/29880071/convert-nl-string-to-vector-or-some-numeric-equivalent
