Question
I'm trying to convert a string to a numeric equivalent so I can train a neural network to classify the strings. I tried summing the ASCII values, but that mostly just separates long strings (large sums) from short ones (small sums).
For example, a short German string gets put into the English class, because the English words the network was trained on are short and therefore numerically small.
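To make the problem concrete, here is a minimal sketch of the character-sum approach. The function name `charCodeSum` is made up for illustration, and it sums UTF-16 code units, which match ASCII values for plain ASCII text.

```javascript
// Hypothetical helper: adds up the character codes of a string.
// For plain ASCII text, charCodeAt returns the ASCII value.
function charCodeSum(text) {
  let sum = 0;
  for (let i = 0; i < text.length; i++) {
    sum += text.charCodeAt(i);
  }
  return sum;
}

console.log(charCodeSum("cat"));   // 312
console.log(charCodeSum("act"));   // 312 (same letters, same sum)
console.log(charCodeSum("Katze")); // 511 (longer word, larger sum)
```

The sum throws away letter order and grows with length, so the network ends up seeing little more than how long the string is.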
I was looking into Google's word2vec, which seems like it should work, but I want to do this on the client side. I found a node.js implementation, but it just runs the command-line tool.
How can I convert a string to something numeric, perhaps a vector, in JS?
Answer 1:
I'm sure you've considered assigning each new word you encounter an integer. You'll have to keep track somewhere, but that's one option.
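A rough sketch of that bookkeeping might look like the following. The `Vocabulary` class and its `idFor` and `encode` methods are names invented here, and splitting on whitespace after lowercasing is just one assumed way of finding words.

```javascript
// Hypothetical helper: hands out the next unused integer to each new word.
class Vocabulary {
  constructor() {
    this.ids = new Map(); // word -> integer ID
  }

  // Returns the existing ID for a word, or assigns the next free one.
  idFor(word) {
    if (!this.ids.has(word)) {
      this.ids.set(word, this.ids.size);
    }
    return this.ids.get(word);
  }

  // Assumption: words are lowercased and separated by whitespace.
  encode(sentence) {
    return sentence
      .toLowerCase()
      .split(/\s+/)
      .filter(Boolean)
      .map((word) => this.idFor(word));
  }
}

const vocab = new Vocabulary();
console.log(vocab.encode("the cat saw the dog")); // [ 0, 1, 2, 0, 3 ]
```

The IDs depend only on the order in which words are first seen, so the same `Vocabulary` instance has to be kept around and reused for both training and classification.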
You could also hash the string. Note that JavaScript has no built-in string hash method, so you would need a small hand-written function or a library.
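For the hashing option, one possible hand-written function is 32-bit FNV-1a. This is a swapped-in choice for illustration, not something the answer prescribes, and the name `fnv1a32` is made up here. It hashes UTF-16 code units, so it only agrees with byte-based FNV-1a on plain ASCII input.

```javascript
// Sketch of 32-bit FNV-1a over a string's UTF-16 code units.
// Assumption: input is mostly ASCII; non-ASCII text is still hashed,
// but not byte-for-byte like a UTF-8 based FNV-1a would.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, keeping 32 bits
  }
  return hash >>> 0; // convert to an unsigned 32-bit integer
}

console.log(fnv1a32("hello"));
console.log(fnv1a32("hallo"));
```

Unlike the character sum, the result depends on letter order and does not grow with string length, but it is still an arbitrary number with no linguistic meaning.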
If you don't mind a few hash collisions, and the size of the resulting integers doesn't matter, may I recommend a trick I've used a few times before.
- Assign each letter a prime number based on its frequency, so e = 2, t = 3, a = 5, etc., which gives us:
2 e
3 t
5 a
7 o
11 i
13 n
17 s
19 h
23 r
29 d
31 l
37 c
41 u
43 m
47 w
53 f
59 g
61 y
67 p
71 b
73 v
79 k
83 j
89 x
97 q
101 z
- Multiply the values corresponding to each letter in a word.
So, "value" is 73*5*31*41*2, and "corresponding" is 37*7*23*23... Each distinct multiset of letters gives a distinct product, because prime factorizations are unique. Anagrams collide, so we've accidentally built an anagram detector.
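The two steps above can be sketched as follows. The names `LETTERS_BY_FREQUENCY`, `PRIMES`, `PRIME_FOR`, and `primeProduct` are invented for this example, and two choices are assumptions rather than part of the trick: BigInt is used because the product of a long word quickly exceeds the safe integer range of a regular number, and characters outside a-z are skipped.

```javascript
// Letters in the same order as the table above, paired with the first 26 primes.
const LETTERS_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz";
const PRIMES = [
  2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
  43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101,
];

// Lookup table: letter -> prime (as BigInt).
const PRIME_FOR = new Map(
  [...LETTERS_BY_FREQUENCY].map((letter, i) => [letter, BigInt(PRIMES[i])])
);

// Multiplies the primes of all letters in the word.
// Assumption: input is lowercased and anything outside a-z is ignored.
function primeProduct(word) {
  let product = 1n;
  for (const letter of word.toLowerCase()) {
    const prime = PRIME_FOR.get(letter);
    if (prime !== undefined) {
      product *= prime;
    }
  }
  return product;
}

console.log(primeProduct("value")); // 927830n
console.log(primeProduct("listen") === primeProduct("silent")); // true
```

A neural network would most likely need the BigInt converted or scaled to a regular number first, which loses precision for long words.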
There isn't really a linguistically sound way to do this, though. I suspect word2vec just assigns arbitrary integers to strings.
Source: https://stackoverflow.com/questions/29880071/convert-nl-string-to-vector-or-some-numeric-equivalent