Tensorflow.js tokenizer

南笙 2020-12-10 17:32

I'm new to machine learning and TensorFlow. Since I don't know Python, I decided to use their JavaScript version (which may be more of a wrapper).

The problem

3 answers
  •  北荒 (original poster)
    2020-12-10 17:48

    There are many ways to transform text into vectors, and the right one depends on the use case. The most intuitive approach uses term frequency: given the vocabulary of the corpus (all possible words), each text document is represented as a vector in which each entry is the number of times the corresponding vocabulary word occurs in that document.

    With this vocabulary:

    ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]
    

    the following text:

    ["machine", "is", "a", "field", "machine", "is", "is"] 
    

    will be transformed into this vector:

    [2, 0, 3, 1, 0, 1, 0, 0, 0] 
    

    One disadvantage of this technique is that the vector has the same size as the vocabulary of the corpus, so it may contain many zeros. That is why other techniques exist. Nevertheless, this approach, known as bag of words, is frequently referred to, and a slightly different version of it uses tf-idf.

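    // Vocabulary of the corpus and one tokenized document
    const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
    const text = ["machine", "is", "a", "field", "machine", "is", "is"];
    // For each vocabulary word, count how many times it occurs in the tokens
    const parse = (t) => vocabulary.map((w) => t.reduce((a, b) => (b === w ? a + 1 : a), 0));
    console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]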
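
The tf-idf variant mentioned above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the three-document corpus and the helper names `termCounts`, `inverseDocumentFrequency` and `tfidf` are assumptions made up for the example, and it uses the plain `log(N / df)` weighting, whereas libraries often apply smoothed variants.

```javascript
// Same vocabulary as in the answer above.
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];

// Raw term counts for one tokenized document (the bag-of-words vector).
const termCounts = (tokens) =>
  vocabulary.map((word) => tokens.filter((t) => t === word).length);

// Inverse document frequency: log(N / df), where N is the number of documents
// and df is the number of documents containing the word.
// Words that appear in no document get weight 0 (an assumption of this sketch).
const inverseDocumentFrequency = (docs) =>
  vocabulary.map((word) => {
    const df = docs.filter((tokens) => tokens.includes(word)).length;
    return df === 0 ? 0 : Math.log(docs.length / df);
  });

// tf-idf vector: each term count multiplied by that word's idf weight.
const tfidf = (tokens, idf) =>
  termCounts(tokens).map((count, i) => count * idf[i]);

// Hypothetical corpus, made up for illustration.
const corpus = [
  ["machine", "is", "a", "field", "machine", "is", "is"],
  ["machine", "learning", "is", "new"],
  ["computer", "science", "is", "a", "field"],
];

const idf = inverseDocumentFrequency(corpus);
const vectors = corpus.map((doc) => tfidf(doc, idf));
console.log(vectors[0]);
```

In this sketch, a word that occurs in every document (here "is") ends up with weight 0, while words that occur in only some documents keep a positive weight. That down-weighting of very common words is what distinguishes tf-idf from raw counts.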

    There is also the following module that might help to achieve what you want
